<div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div>Hi Du/Poornima,<br><br></div>I was analysing bitrot and geo-rep failures and I suspect there is a bug in some perf xlator<br></div>that was one of the cause. I was seeing following behaviour in few runs.<br><br></div>1. Geo-rep synced data to slave. It creats empty file and then rsync syncs data.<br></div>    But test does &quot;stat --format &quot;%F&quot; &lt;file&gt;&quot; to confirm. If it&#39;s empty, it returns<br></div>    &quot;regular empty file&quot; else &quot;regular file&quot;. I believe it did get the &quot;regular empty file&quot;</div><div>    instead of &quot;regular file&quot; until timeout.<br>    <br></div>2. Other behaviour is with bitrot, with brick-mux. If a file is deleted on the back end on one brick<br></div>    and the look up is done. What all performance xlators needs to be disabled to get the lookup/revalidate<br></div>    on the brick where the file was deleted. Earlier, only md-cache was disable and it used to work.<br></div>    No it&#39;s failing intermittently.</div><div><br></div><div>Are there any pending patches around these areas that needs to be merged ?</div><div>If there are, then it could be affecting other tests as well.</div><div><br></div>Thanks,<br></div>Kotresh HR<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Aug 3, 2018 at 3:07 PM, Karthik Subrahmanya <span dir="ltr">&lt;<a href="mailto:ksubrahm@redhat.com" target="_blank">ksubrahm@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><br><div class="gmail_quote"><div><div class="h5"><div dir="ltr">On Fri, Aug 3, 2018 at 2:12 PM Karthik Subrahmanya &lt;<a href="mailto:ksubrahm@redhat.com" target="_blank">ksubrahm@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Thu, Aug 2, 2018 at 11:00 PM Karthik Subrahmanya &lt;<a href="mailto:ksubrahm@redhat.com" target="_blank">ksubrahm@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr">On Tue 31 Jul, 2018, 10:17 PM Atin Mukherjee, &lt;<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I just went through the nightly regression report of brick mux runs and here&#39;s what I can summarize.<br><br>==============================<wbr>==============================<wbr>==============================<wbr>==============================<wbr>==============================<wbr>===================<br>Fails only with brick-mux<br>==============================<wbr>==============================<wbr>==============================<wbr>==============================<wbr>==============================<wbr>===================<br>tests/bugs/core/bug-1432542-<wbr>mpx-restart-crash.t - Times out even after 400 secs. 
Refer <a href="https://fstat.gluster.org/failure/209?state=2&amp;start_date=2018-06-30&amp;end_date=2018-07-31&amp;branch=all" rel="noreferrer" target="_blank">https://fstat.gluster.org/<wbr>failure/209?state=2&amp;start_<wbr>date=2018-06-30&amp;end_date=2018-<wbr>07-31&amp;branch=all</a>, specifically the latest report <a href="https://build.gluster.org/job/regression-test-burn-in/4051/consoleText" rel="noreferrer" target="_blank">https://build.gluster.org/job/<wbr>regression-test-burn-in/4051/<wbr>consoleText</a> . Wasn&#39;t timing out as frequently as it was till 12 July. But since 27 July, it has timed out twice. Beginning to believe commit 9400b6f2c8aa219a493961e0ab9770<wbr>b7f12e80d2 has added the delay and now 400 secs isn&#39;t sufficient enough (Mohit?)<br><br>tests/bugs/glusterd/add-brick-<wbr>and-validate-replicated-<wbr>volume-options.t (Ref - <a href="https://build.gluster.org/job/regression-test-with-multiplex/814/console" rel="noreferrer" target="_blank">https://build.gluster.org/job/<wbr>regression-test-with-<wbr>multiplex/814/console</a>) -  Test fails only in brick-mux mode, AI on Atin to look at and get back.<br><br>tests/bugs/replicate/bug-<wbr>1433571-undo-pending-only-on-<wbr>up-bricks.t (<a href="https://build.gluster.org/job/regression-test-with-multiplex/813/console" rel="noreferrer" target="_blank">https://build.gluster.org/<wbr>job/regression-test-with-<wbr>multiplex/813/console</a>) - Seems like failed just twice in last 30 days as per <a href="https://fstat.gluster.org/failure/251?state=2&amp;start_date=2018-06-30&amp;end_date=2018-07-31&amp;branch=all" rel="noreferrer" target="_blank">https://fstat.gluster.org/<wbr>failure/251?state=2&amp;start_<wbr>date=2018-06-30&amp;end_date=2018-<wbr>07-31&amp;branch=all</a>. Need help from AFR team.<br><br>tests/bugs/quota/bug-1293601.t (<a href="https://build.gluster.org/job/regression-test-with-multiplex/812/console" rel="noreferrer" target="_blank">https://build.gluster.org/<wbr>job/regression-test-with-<wbr>multiplex/812/console</a>) - Hasn&#39;t failed after 26 July and earlier it was failing regularly. Did we fix this test through any patch (Mohit?)<br><br>tests/bitrot/bug-1373520.t - (<a href="https://build.gluster.org/job/regression-test-with-multiplex/811/console" rel="noreferrer" target="_blank">https://build.gluster.org/<wbr>job/regression-test-with-<wbr>multiplex/811/console</a>)  - Hasn&#39;t failed after 27 July and earlier it was failing regularly. Did we fix this test through any patch (Mohit?)<br><br>tests/bugs/glusterd/remove-<wbr>brick-testcases.t - Failed once with a core, not sure if related to brick mux or not, so not sure if brick mux is culprit here or not. Ref - <a href="https://build.gluster.org/job/regression-test-with-multiplex/806/console" rel="noreferrer" target="_blank">https://build.gluster.org/job/<wbr>regression-test-with-<wbr>multiplex/806/console</a> . Seems to be a glustershd crash. Need help from AFR folks.<br><br>==============================<wbr>==============================<wbr>==============================<wbr>==============================<wbr>==============================<wbr>===================<br>Fails for non-brick mux case too<br>==============================<wbr>==============================<wbr>==============================<wbr>==============================<wbr>==============================<wbr>===================<br>tests/bugs/distribute/bug-<wbr>1122443.t 0 Seems to be failing at my setup very often, with out brick mux as well. 
Refer <a href="https://build.gluster.org/job/regression-test-burn-in/4050/consoleText" rel="noreferrer" target="_blank">https://build.gluster.org/job/<wbr>regression-test-burn-in/4050/<wbr>consoleText</a> . There&#39;s an email in gluster-devel and a BZ 1610240 for the same. <br><br>tests/bugs/bug-1368312.t - Seems to be recent failures (<a href="https://build.gluster.org/job/regression-test-with-multiplex/815/console" rel="noreferrer" target="_blank">https://build.gluster.org/<wbr>job/regression-test-with-<wbr>multiplex/815/console</a>) - seems to be a new failure, however seen this for a non-brick-mux case too - <a href="https://build.gluster.org/job/regression-test-burn-in/4039/consoleText" rel="noreferrer" target="_blank">https://build.gluster.org/job/<wbr>regression-test-burn-in/4039/<wbr>consoleText</a> . Need some eyes from AFR folks.<br><br>tests/00-geo-rep/georep-basic-<wbr>dr-tarssh.t - this isn&#39;t specific to brick mux, have seen this failing at multiple default regression runs. Refer <a href="https://fstat.gluster.org/failure/392?state=2&amp;start_date=2018-06-30&amp;end_date=2018-07-31&amp;branch=all" rel="noreferrer" target="_blank">https://fstat.gluster.org/<wbr>failure/392?state=2&amp;start_<wbr>date=2018-06-30&amp;end_date=2018-<wbr>07-31&amp;branch=all</a> . We need help from geo-rep dev to root cause this earlier than later<br><br>tests/00-geo-rep/georep-basic-<wbr>dr-rsync.t - this isn&#39;t specific to brick mux, have seen this failing at multiple default regression runs. Refer <a href="https://fstat.gluster.org/failure/393?state=2&amp;start_date=2018-06-30&amp;end_date=2018-07-31&amp;branch=all" rel="noreferrer" target="_blank">https://fstat.gluster.org/<wbr>failure/393?state=2&amp;start_<wbr>date=2018-06-30&amp;end_date=2018-<wbr>07-31&amp;branch=all</a> . We need help from geo-rep dev to root cause this earlier than later<br><br>tests/bugs/glusterd/<wbr>validating-server-quorum.t (<a href="https://build.gluster.org/job/regression-test-with-multiplex/810/console" rel="noreferrer" target="_blank">https://build.gluster.org/<wbr>job/regression-test-with-<wbr>multiplex/810/console</a>) - Fails for non-brick-mux cases too, <a href="https://fstat.gluster.org/failure/580?state=2&amp;start_date=2018-06-30&amp;end_date=2018-07-31&amp;branch=all" rel="noreferrer" target="_blank">https://fstat.gluster.org/<wbr>failure/580?state=2&amp;start_<wbr>date=2018-06-30&amp;end_date=2018-<wbr>07-31&amp;branch=all</a> .  Atin has a patch <a href="https://review.gluster.org/20584" rel="noreferrer" target="_blank">https://review.gluster.org/<wbr>20584</a> which resolves it but patch is failing regression for a different test which is unrelated.<br><br>tests/bugs/replicate/bug-<wbr>1586020-mark-dirty-for-entry-<wbr>txn-on-quorum-failure.t (Ref - <a href="https://build.gluster.org/job/regression-test-with-multiplex/809/console" rel="noreferrer" target="_blank">https://build.gluster.org/job/<wbr>regression-test-with-<wbr>multiplex/809/console</a>) - fails for non brick mux case too - <a href="https://build.gluster.org/job/regression-test-burn-in/4049/consoleText" rel="noreferrer" target="_blank">https://build.gluster.org/job/<wbr>regression-test-burn-in/4049/<wbr>consoleText</a> - Need some eyes from AFR folks.<br></div></blockquote></div></div><div dir="auto">I am looking at this. It is not reproducible locally. Trying to do this on soft serve.</div></div></blockquote><div><br></div><div>In soft serve machine also it is not failing where the regression has failed. 
>> But I found some other problem in the script. I will fix that and add some extra logs
>> so that it is easier to debug the next time it fails.
>
> RCA for the tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
> failure:
>
> This test case completely fills 2 out of 3 bricks and provisions one brick with some
> extra space, so that entry creation succeeds only on one brick and fails on the other
> bricks. Since 2 of the bricks get filled, on those bricks only the entry creation
> succeeds, but the creation of the gfid hard link inside ".glusterfs" fails. This is a
> bug in the "posix" code with entry transactions. If the gfid link creation fails we just
> log an error message and continue; since we depend on that gfid, the entry should be
> deleted when the link creation fails. When shd tries to heal those files it sees that
> the gfid link is not present for them, and it fails to heal.
>
> I will send a fix for this, which deletes the entry if it fails to create the link
> inside .glusterfs.
>
> Regards,
> Karthik
>
>>> Regards,
>>> Karthik
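A note for anyone reproducing the RCA above: the gfid link Karthik refers to is the hard
link the posix xlator keeps under .glusterfs/ on each brick. A rough way to inspect it on a
brick (the brick path and file name below are made up for illustration) is:

# Sketch only: every regular file on a brick should have a second hard link at
# .glusterfs/<aa>/<bb>/<full-gfid>, where aa/bb are the first two bytes of its gfid.
BRICK=/bricks/brick0            # assumed brick path
F=$BRICK/dir/file1              # assumed file on that brick

getfattr -n trusted.gfid -e hex --absolute-names "$F"
# e.g. trusted.gfid=0x6b5347f40a4744e8be0cbbf42b44ab5f
# -> link expected at $BRICK/.glusterfs/6b/53/6b5347f4-0a47-44e8-be0c-bbf42b44ab5f

stat -c '%h' "$F"               # healthy: 2 (the file plus its gfid link);
                                # 1 means the .glusterfs link was never created

In the failing case described in the RCA, the file exists on the full bricks but its link
count stays at 1 because the .glusterfs link was never created, which is why shd cannot
heal it.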
<a href="https://lists.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">https://lists.gluster.org/<wbr>mailman/listinfo/gluster-devel</a><br></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Thanks and Regards,<br></div>Kotresh H R<br></div></div>