<div dir="ltr"><div dir="ltr"><div dir="ltr"><br><div>More analysis:</div><div><br></div><div>It looks like in the 1st iteration, the testcase is stuck at the test (TEST ! stat $M0/.history/snap6/aaa) from line 385 of uss.t</div><div><br></div><div>Before, it was the last test to be executed from uss.t. So the assumption was that after the completion of that test (i.e. test from line 385), cleanup function was</div><div>either getting blocked or taking more time to do cleanups.</div><div><br></div><div>Hence the patch [1] was submitted to reduce the amount of work done by cleanup function. The patch ensured that, the snapshots, volume etc created in the test</div><div>are deleted before cleanup function is executed. But even with that, we observed uss.t to fail sometimes (mainly with brick-mux regressions). To get more infomration</div><div>regarding the failure, another patch [2] was sent. From that patch some more information is received.</div><div><br></div><div>1) Everytime uss.t times out, the script (uss.t) is stuck in executing the particular test from line 385 (TEST ! stat $M0/.history/snap6/aaa)</div><div> </div><div> - This test's purpose is to ensure that, looking for a file that does not exist in a snapshot should fail.</div><div><br></div><div>2) Adding TRACE logs via [2] indicates that:</div><div> - the stat request sent by the test reaches snapshot daemon and later the gfapi client instance that the snapshot daemon spawns</div><div> to communicate with the snapshot volume.</div><div> - The stat request is served by the md-cache xlator in the gfapi client instance (and hence successful).</div><div> </div><div> "[2019-05-16 18:31:18.607521]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: 392 ! stat /mnt/glusterfs/0/.history/snap6/aaa ++++++++++</div><div> [2019-05-16 18:31:18.617104] T [MSGID: 0] [syncop.c:2424:syncop_stat] 0-stack-trace: stack-address: 0x7fc63405dba8, winding from gfapi to meta-autoload </div><div> [2019-05-16 18:31:18.617119] T [MSGID: 0] [defaults.c:2841:default_stat] 0-stack-trace: stack-address: 0x7fc63405dba8, winding from meta-autoload to 0e69605de2974f1b887deee5b3f63b52 </div><div> [2019-05-16 18:31:18.617130] T [MSGID: 0] [io-stats.c:2709:io_stats_stat] 0-stack-trace: stack-address: 0x7fc63405dba8, winding from 0e69605de2974f1b887deee5b3f63b52 to 0e69605de2974f1b887deee5b3f63b52-io-threads </div><div> [2019-05-16 18:31:18.617142] D [MSGID: 0] [io-threads.c:376:iot_schedule] 0-0e69605de2974f1b887deee5b3f63b52-io-threads: STAT scheduled as fast priority fop </div><div> [2019-05-16 18:31:18.617162] T [MSGID: 0] [defaults.c:2068:default_stat_resume] 0-stack-trace: stack-address: 0x7fc63405dba8, winding from 0e69605de2974f1b887deee5b3f63b52-io-threads to 0e69605de2974f1b887deee5b3f63b52-md-cache </div><div> [2019-05-16 18:31:18.617176] T [MSGID: 0] [md-cache.c:1359:mdc_stat] 0-stack-trace: stack-address: 0x7fc63405dba8, 0e69605de2974f1b887deee5b3f63b52-md-cache returned 0 =========> SUCCESSFUL HERE</div><div> [2019-05-16 18:31:18.617186] T [MSGID: 0] [defaults.c:1406:default_stat_cbk] 0-stack-trace: stack-address: 0x7fc63405dba8, 0e69605de2974f1b887deee5b3f63b52-io-threads returned 0 </div><div> [2019-05-16 18:31:18.617195] T [MSGID: 0] [io-stats.c:2059:io_stats_stat_cbk] 0-stack-trace: stack-address: 0x7fc63405dba8, 0e69605de2974f1b887deee5b3f63b52 returned 0 </div><div> "</div><div> - The stat response does not reach snapshot daemon. So snapshot daemon is not able to send any response back to the gluster client which initiated this stat request. 
[1] https://review.gluster.org/#/c/glusterfs/+/22649/
[2] https://review.gluster.org/#/c/glusterfs/+/22728/

Regards,
Raghavendra


On Wed, May 1, 2019 at 11:11 AM Sanju Rakonde <srakonde@redhat.com> wrote:

Thank you Raghavendra.


On Tue, Apr 30, 2019 at 11:46 PM FNU Raghavendra Manjunath <rabhat@redhat.com> wrote:

To make things relatively easy for the cleanup () function in the test framework, I think it would be better to ensure that uss.t itself deletes the snapshots and the volume once the tests are done. Patch [1] has been submitted for review.

[1] https://review.gluster.org/#/c/glusterfs/+/22649/

Regards,
Raghavendra
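For illustration, a minimal sketch of the kind of explicit teardown described in the message above. This is an assumption about the shape of the change; the actual content of [1] may differ.

    # hedged sketch: remove test artifacts explicitly so the framework-wide
    # cleanup () has far less work to do afterwards
    TEST $CLI snapshot delete all     # drop every snapshot created by the test
    TEST $CLI volume stop $V0
    TEST $CLI volume delete $V0

    cleanup                           # framework cleanup now only handles leftovers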
On Tue, Apr 30, 2019 at 10:42 AM FNU Raghavendra Manjunath <rabhat@redhat.com> wrote:

The failure looks similar to the issue I had mentioned in [1].

In short, for some reason the cleanup (the cleanup function that we call in our .t files) seems to be taking more time and also not cleaning up properly. This leads to problems in the 2nd iteration, where basic operations such as volume creation or volume start fail with ENODATA or ENOENT errors.

The 2nd iteration of the uss.t run had the following errors:

[2019-04-29 09:08:15.275773]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: 39 gluster --mode=script --wignore volume set patchy nfs.disable false ++++++++++
[2019-04-29 09:08:15.390550] : volume set patchy nfs.disable false : SUCCESS
[2019-04-29 09:08:15.404624]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: 42 gluster --mode=script --wignore volume start patchy ++++++++++
[2019-04-29 09:08:15.468780] : volume start patchy : FAILED : Failed to get extended attribute trusted.glusterfs.volume-id for brick dir /d/backends/3/patchy_snap_mnt. Reason : No data available

These are the initial steps that create and start the volume. Why the trusted.glusterfs.volume-id extended attribute is absent is not clear. The analysis in [1] had ENOENT errors (i.e. the export directory itself was absent). I suspect this is due to some issue with the cleanup mechanism at the end of the tests.

[1] https://lists.gluster.org/pipermail/gluster-devel/2019-April/056104.html
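As a quick way to tell the two failure modes apart on a regression node, one could inspect the brick directory from the log above directly. This is only a diagnostic sketch (paths taken from the log), not part of the test itself.

    # does the export directory still exist, and does it still carry the volume-id xattr?
    ls -ld /d/backends/3/patchy_snap_mnt
    getfattr -n trusted.glusterfs.volume-id -e hex /d/backends/3/patchy_snap_mnt
    # ENOENT from ls                    -> the export directory itself was wiped (the case analysed in [1])
    # missing-attribute error reported  -> the directory survived but lost its xattr (the ENODATA case above)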
On Tue, Apr 30, 2019 at 8:37 AM Sanju Rakonde <srakonde@redhat.com> wrote:

Hi Raghavendra,

./tests/basic/uss.t is timing out in the release-6 branch consistently. One such instance is https://review.gluster.org/#/c/glusterfs/+/22641/. Can you please look into this?

--
Thanks,
Sanju