<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Thu, Aug 2, 2018 at 1:42 PM Kotresh Hiremath Ravishankar <<a href="mailto:khiremat@redhat.com">khiremat@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Aug 2, 2018 at 5:05 PM, Atin Mukherjee <span dir="ltr"><<a href="mailto:atin.mukherjee83@gmail.com" target="_blank">atin.mukherjee83@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br><br><div class="gmail_quote"><span class="m_-8916643001973143584gmail-"><div dir="ltr">On Thu, Aug 2, 2018 at 4:37 PM Kotresh Hiremath Ravishankar <<a href="mailto:khiremat@redhat.com" target="_blank">khiremat@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez <span dir="ltr"><<a href="mailto:xhernandez@redhat.com" target="_blank">xhernandez@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><span><div dir="ltr">On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee <<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>> wrote:<br></div></span><span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee <<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I just went through the nightly regression report of brick mux runs and here's what I can summarize.<br><br>=========================================================================================================================================================================<br>Fails only with brick-mux<br>=========================================================================================================================================================================<br>tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after 400 secs. Refer <a href="https://fstat.gluster.org/failure/209?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all" target="_blank">https://fstat.gluster.org/failure/209?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all</a>, specifically the latest report <a href="https://build.gluster.org/job/regression-test-burn-in/4051/consoleText" target="_blank">https://build.gluster.org/job/regression-test-burn-in/4051/consoleText</a> . Wasn't timing out as frequently as it was till 12 July. But since 27 July, it has timed out twice. Beginning to believe commit 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400 secs isn't sufficient enough (Mohit?)<br><br>tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (Ref - <a href="https://build.gluster.org/job/regression-test-with-multiplex/814/console" target="_blank">https://build.gluster.org/job/regression-test-with-multiplex/814/console</a>) - Test fails only in brick-mux mode, AI on Atin to look at and get back.<br><br>tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (<a href="https://build.gluster.org/job/regression-test-with-multiplex/813/console" target="_blank">https://build.gluster.org/job/regression-test-with-multiplex/813/console</a>) - Seems like failed just twice in last 30 days as per <a href="https://fstat.gluster.org/failure/251?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all" target="_blank">https://fstat.gluster.org/failure/251?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all</a>. Need help from AFR team.<br><br>tests/bugs/quota/bug-1293601.t (<a href="https://build.gluster.org/job/regression-test-with-multiplex/812/console" target="_blank">https://build.gluster.org/job/regression-test-with-multiplex/812/console</a>) - Hasn't failed after 26 July and earlier it was failing regularly. Did we fix this test through any patch (Mohit?)<br><br>tests/bitrot/bug-1373520.t - (<a href="https://build.gluster.org/job/regression-test-with-multiplex/811/console" target="_blank">https://build.gluster.org/job/regression-test-with-multiplex/811/console</a>) - Hasn't failed after 27 July and earlier it was failing regularly. Did we fix this test through any patch (Mohit?)<br></div></blockquote><div><br></div><div>I see this has failed in day before yesterday's regression run as well (and I could reproduce it locally with brick mux enabled). The test fails in healing a file within a particular time period.</div><div><br></div><div><pre class="m_-8916643001973143584gmail-m_3899595027729854932m_-7168429344193105316m_5479165422519518122m_-6008408460366831088gmail-console-output"><span class="m_-8916643001973143584gmail-m_3899595027729854932m_-7168429344193105316m_5479165422519518122m_-6008408460366831088gmail-timestamp"><b>15:55:19</b> </span>not ok 25 Got "0" instead of "512", LINENUM:55
<span class="m_-8916643001973143584gmail-m_3899595027729854932m_-7168429344193105316m_5479165422519518122m_-6008408460366831088gmail-timestamp"><b>15:55:19</b> </span>FAILED COMMAND: 512 path_size /d/backends/patchy5/FILE1</pre></div><div>Need EC dev's help here.<br></div></div></div></blockquote><div><br></div></span><div>I'm not sure where the problem is exactly. I've seen that when the test fails, self-heal is attempting to heal the file, but when the file is accessed, an Input/Output error is returned, aborting heal. I've checked that a heal is attempted every time the file is accessed, but it fails always. This error seems to come from bit-rot stub xlator.</div><div><br></div><div>When in this situation, if I stop and start the volume, self-heal immediately heals the files. It seems like an stale state that is kept by the stub xlator, preventing the file from being healed.</div><div><br></div><div>Adding bit-rot maintainers for help on this one.</div></div></div></blockquote><div><br></div><div>Bitrot-stub marks the file as corrupted in inode_ctx. But when the file and it's hardlink are deleted from that brick and a lookup is done</div><div>on the file, it cleans up the marker on getting ENOENT. This is part of recovery steps, and only md-cache is disabled during the process.<br></div><div>Is there any other perf xlators that needs to be disabled for this scenario to expect a lookup/revalidate on the brick where</div><div>the back end file is deleted?<br></div></div></div></div></blockquote><div><br></div></span><div>But the same test doesn't fail with brick multiplexing not enabled. Do we know why?</div></div></div></blockquote><div>Don't know, something to do with perf xlators I suppose. It's not repdroduced on my local system with brick-mux enabled as well. But it's happening on Xavis' system.</div><div><br></div><div>Xavi,</div><div>Could you try with the patch [1] and let me know whether it fixes the issue.<br></div></div></div></div></blockquote><div><br></div><div>With the additional performance xlators disabled still happens.</div><div><br></div><div>The only thing that I've observed is that if I add a sleep just before stopping the volume, the test seems to pass always. Maybe there are some background updates going on ? (ec does background updates, but I'm not sure how this can be related with the Input/Output error accessing the brick file).</div><div><br></div><div>Xavi</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div></div><div><br></div><div>[1] <a href="https://review.gluster.org/#/c/20619/1" target="_blank">https://review.gluster.org/#/c/20619/1</a><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="m_-8916643001973143584gmail-"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><span class="m_-8916643001973143584gmail-m_3899595027729854932m_-7168429344193105316HOEnZb"><font color="#888888"><div><br></div><div>Xavi</div></font></span><span><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br>tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core, not sure if related to brick mux or not, so not sure if brick mux is culprit here or not. Ref - <a href="https://build.gluster.org/job/regression-test-with-multiplex/806/console" target="_blank">https://build.gluster.org/job/regression-test-with-multiplex/806/console</a> . Seems to be a glustershd crash. Need help from AFR folks.<br><br>=========================================================================================================================================================================<br>Fails for non-brick mux case too<br>=========================================================================================================================================================================<br>tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup very often, with out brick mux as well. Refer <a href="https://build.gluster.org/job/regression-test-burn-in/4050/consoleText" target="_blank">https://build.gluster.org/job/regression-test-burn-in/4050/consoleText</a> . There's an email in gluster-devel and a BZ 1610240 for the same. <br><br>tests/bugs/bug-1368312.t - Seems to be recent failures (<a href="https://build.gluster.org/job/regression-test-with-multiplex/815/console" target="_blank">https://build.gluster.org/job/regression-test-with-multiplex/815/console</a>) - seems to be a new failure, however seen this for a non-brick-mux case too - <a href="https://build.gluster.org/job/regression-test-burn-in/4039/consoleText" target="_blank">https://build.gluster.org/job/regression-test-burn-in/4039/consoleText</a> . Need some eyes from AFR folks.<br><br>tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick mux, have seen this failing at multiple default regression runs. Refer <a href="https://fstat.gluster.org/failure/392?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all" target="_blank">https://fstat.gluster.org/failure/392?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all</a> . We need help from geo-rep dev to root cause this earlier than later<br><br>tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick mux, have seen this failing at multiple default regression runs. Refer <a href="https://fstat.gluster.org/failure/393?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all" target="_blank">https://fstat.gluster.org/failure/393?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all</a> . We need help from geo-rep dev to root cause this earlier than later<br><br>tests/bugs/glusterd/validating-server-quorum.t (<a href="https://build.gluster.org/job/regression-test-with-multiplex/810/console" target="_blank">https://build.gluster.org/job/regression-test-with-multiplex/810/console</a>) - Fails for non-brick-mux cases too, <a href="https://fstat.gluster.org/failure/580?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all" target="_blank">https://fstat.gluster.org/failure/580?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all</a> . Atin has a patch <a href="https://review.gluster.org/20584" target="_blank">https://review.gluster.org/20584</a> which resolves it but patch is failing regression for a different test which is unrelated.<br><br>tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t (Ref - <a href="https://build.gluster.org/job/regression-test-with-multiplex/809/console" target="_blank">https://build.gluster.org/job/regression-test-with-multiplex/809/console</a>) - fails for non brick mux case too - <a href="https://build.gluster.org/job/regression-test-burn-in/4049/consoleText" target="_blank">https://build.gluster.org/job/regression-test-burn-in/4049/consoleText</a> - Need some eyes from AFR folks.<br></div>
</blockquote></div></div>
</blockquote></span></div></div>
</blockquote></div><br><br clear="all"><br>-- <br><div class="m_-8916643001973143584gmail-m_3899595027729854932m_-7168429344193105316gmail_signature"><div dir="ltr"><div>Thanks and Regards,<br></div>Kotresh H R<br></div></div>
</div></div></span><span class="m_-8916643001973143584gmail-">
_______________________________________________<br>
Gluster-devel mailing list<br>
<a href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a><br>
<a href="https://lists.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-devel</a></span></blockquote></div></div>
</blockquote></div><br><br clear="all"><br>-- <br><div class="m_-8916643001973143584gmail_signature"><div dir="ltr"><div>Thanks and Regards,<br></div>Kotresh H R<br></div></div>
</div></div>
</blockquote></div></div>