[Gluster-devel] ./tests/basic/uss.t is timing out in release-6 branch

FNU Raghavendra Manjunath rabhat at redhat.com
Wed May 22 14:18:03 UTC 2019


More analysis:

It looks like in the 1st iteration, the test case is stuck at the test
(TEST ! stat $M0/.history/snap6/aaa) from line 385 of uss.t.

Earlier, this was the last test to be executed in uss.t. So the assumption
was that, after the completion of that test (i.e. the test from line 385),
the cleanup function was either getting blocked or taking more time to
finish the cleanup.

Hence patch [1] was submitted to reduce the amount of work done by the
cleanup function. The patch ensures that the snapshots, the volume, etc.
created by the test are deleted before the cleanup function is executed.
But even with that, uss.t was observed to fail sometimes (mainly in
brick-mux regressions). To get more information regarding the failure,
another patch [2] was sent, and it has provided the following information:

1) Every time uss.t times out, the script (uss.t) is stuck executing the
particular test from line 385 (TEST ! stat $M0/.history/snap6/aaa).

    - This test's purpose is to ensure that looking up a file that does
not exist in a snapshot fails.

2) Adding TRACE logs via [2] indicates that:
     - The stat request sent by the test reaches the snapshot daemon and
then the gfapi client instance that the snapshot daemon spawns to
communicate with the snapshot volume.
     - The stat request is served by the md-cache xlator in the gfapi
client instance (and hence is successful), as seen in the TRACE log below:

      "[2019-05-16 18:31:18.607521]:++++++++++ G_LOG:./tests/basic/uss.t:
TEST: 392 ! stat /mnt/glusterfs/0/.history/snap6/aaa ++++++++++
       [2019-05-16 18:31:18.617104] T [MSGID: 0]
[syncop.c:2424:syncop_stat] 0-stack-trace: stack-address: 0x7fc63405dba8,
winding from gfapi to meta-autoload
       [2019-05-16 18:31:18.617119] T [MSGID: 0]
[defaults.c:2841:default_stat] 0-stack-trace: stack-address:
0x7fc63405dba8, winding from meta-autoload to
0e69605de2974f1b887deee5b3f63b52
       [2019-05-16 18:31:18.617130] T [MSGID: 0]
[io-stats.c:2709:io_stats_stat] 0-stack-trace: stack-address:
0x7fc63405dba8, winding from 0e69605de2974f1b887deee5b3f63b52 to
0e69605de2974f1b887deee5b3f63b52-io-threads
       [2019-05-16 18:31:18.617142] D [MSGID: 0]
[io-threads.c:376:iot_schedule]
0-0e69605de2974f1b887deee5b3f63b52-io-threads: STAT scheduled as fast
priority fop
       [2019-05-16 18:31:18.617162] T [MSGID: 0]
[defaults.c:2068:default_stat_resume] 0-stack-trace: stack-address:
0x7fc63405dba8, winding from 0e69605de2974f1b887deee5b3f63b52-io-threads to
0e69605de2974f1b887deee5b3f63b52-md-cache
       [2019-05-16 18:31:18.617176] T [MSGID: 0] [md-cache.c:1359:mdc_stat]
0-stack-trace: stack-address: 0x7fc63405dba8,
0e69605de2974f1b887deee5b3f63b52-md-cache returned 0  =========> SUCCESSFUL
HERE
       [2019-05-16 18:31:18.617186] T [MSGID: 0]
[defaults.c:1406:default_stat_cbk] 0-stack-trace: stack-address:
0x7fc63405dba8, 0e69605de2974f1b887deee5b3f63b52-io-threads returned 0
       [2019-05-16 18:31:18.617195] T [MSGID: 0]
[io-stats.c:2059:io_stats_stat_cbk] 0-stack-trace: stack-address:
0x7fc63405dba8, 0e69605de2974f1b887deee5b3f63b52 returned 0
      "
    - The stat response, however, does not reach the snapshot daemon. So
the snapshot daemon is not able to send any response back to the gluster
client which initiated the stat request. This leaves the client waiting
for a response, which results in a timeout as per the regression test
infra (which sets a 200 second per-test timeout). One way to follow the
lost frame through the snapd log is sketched below.
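
A possible way to pinpoint where the frame goes missing is to follow the
same stack-address through the snapshot daemon's log (the log path below
is an assumption about a typical setup and may differ on the regression
nodes):

    # Every wind/unwind of the hung stat carries the same stack-address,
    # so grepping for it shows the last xlator that saw the frame.
    grep 'stack-address: 0x7fc63405dba8' /var/log/glusterfs/snaps/patchy/snapd.log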

Suspects:
=========

* First of all, the stat request from line 385 (TEST ! stat
$M0/.history/snap6/aaa) should not be successful: the test deletes the
snapshot "snap6", removes the file "aaa" from the mount point, takes the
snapshot "snap6" again, and then performs the stat operation on the
deleted file "aaa". So the stat should fail. (A reconstruction of this
sequence is sketched after this list.)

* Patch [2] has been sent to collect more information about the failure
(it adds more logs to snapview-server and also changes the log level to
TRACE in the .t file; the idea behind the log-level change is also
sketched after this list).
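
Reconstruction of the sequence around line 385 of uss.t (a sketch based on
the description above, not a verbatim copy of the test; $CLI, $V0 and $M0
are the usual test-framework variables):

    TEST $CLI snapshot delete snap6       # drop the old snapshot
    TEST rm -f $M0/aaa                    # remove the file from the mount point
    TEST $CLI snapshot create snap6 $V0 no-timestamp  # retake the snapshot, now without "aaa"
    TEST $CLI snapshot activate snap6
    TEST ! stat $M0/.history/snap6/aaa    # must fail: "aaa" is not in the new snap6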
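
Sketch of the kind of log-level change made by [2] (the exact change in
that patch may differ; diagnostics.client-log-level and
diagnostics.brick-log-level are standard volume options):

    # Raise the log level so that wind/unwind TRACE messages like the
    # ones quoted above get logged.
    TEST $CLI volume set $V0 diagnostics.client-log-level TRACE
    TEST $CLI volume set $V0 diagnostics.brick-log-level TRACE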

[1] https://review.gluster.org/#/c/glusterfs/+/22649/
[2] https://review.gluster.org/#/c/glusterfs/+/22728/

Regards,
Raghavendra

On Wed, May 1, 2019 at 11:11 AM Sanju Rakonde <srakonde at redhat.com> wrote:

> Thank you Raghavendra.
>
> On Tue, Apr 30, 2019 at 11:46 PM FNU Raghavendra Manjunath <
> rabhat at redhat.com> wrote:
>
>>
>> To make things relatively easy for the cleanup() function in the test
>> framework, I think it would be better to ensure that uss.t itself deletes
>> the snapshots and the volume once the tests are done. Patch [1] has been
>> submitted for review; a sketch of the idea is below.
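>>
>> A minimal sketch of the idea (the exact commands in [1] may differ; $CLI
>> and $V0 are the usual test-framework variables):
>>
>>     TEST $CLI snapshot delete all    # remove every snapshot the test created
>>     TEST $CLI volume stop $V0
>>     TEST $CLI volume delete $V0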
>>
>> [1] https://review.gluster.org/#/c/glusterfs/+/22649/
>>
>> Regards,
>> Raghavendra
>>
>> On Tue, Apr 30, 2019 at 10:42 AM FNU Raghavendra Manjunath <
>> rabhat at redhat.com> wrote:
>>
>>>
>>> The failure looks similar to the issue I had mentioned in [1].
>>>
>>> In short, for some reason the cleanup (the cleanup function that we call
>>> in our .t files) seems to be taking more time and also not cleaning up
>>> properly. This leads to problems in the 2nd iteration, where basic things
>>> such as volume creation or volume start itself fail due to ENODATA or
>>> ENOENT errors.
>>>
>>> The 2nd iteration of the uss.t run had the following errors:
>>>
>>> "[2019-04-29 09:08:15.275773]:++++++++++ G_LOG:./tests/basic/uss.t:
>>> TEST: 39 gluster --mode=script --wignore volume set patchy nfs.disable
>>> false ++++++++++
>>> [2019-04-29 09:08:15.390550]  : volume set patchy nfs.disable false :
>>> SUCCESS
>>> [2019-04-29 09:08:15.404624]:++++++++++ G_LOG:./tests/basic/uss.t: TEST:
>>> 42 gluster --mode=script --wignore volume start patchy ++++++++++
>>> [2019-04-29 09:08:15.468780]  : volume start patchy : FAILED : Failed to
>>> get extended attribute trusted.glusterfs.volume-id for brick dir
>>> /d/backends/3/patchy_snap_mnt. Reason : No data available
>>> "
>>>
>>> These are the initial steps to create and start the volume. It is not
>>> clear why the trusted.glusterfs.volume-id extended attribute is absent
>>> (a quick manual check for it is sketched below). The analysis in [1] had
>>> ENOENT errors (i.e. the export directory itself was absent). I suspect
>>> this to be because of some issue with the cleanup mechanism at the end
>>> of the tests.
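>>>
>>> A quick manual check for the xattr on the brick directory from the log
>>> above (getfattr is part of the attr package):
>>>
>>>     getfattr -n trusted.glusterfs.volume-id -e hex /d/backends/3/patchy_snap_mnt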
>>>
>>> [1]
>>> https://lists.gluster.org/pipermail/gluster-devel/2019-April/056104.html
>>>
>>> On Tue, Apr 30, 2019 at 8:37 AM Sanju Rakonde <srakonde at redhat.com>
>>> wrote:
>>>
>>>> Hi Raghavendra,
>>>>
>>>> ./tests/basic/uss.t is timing out in release-6 branch consistently.
>>>> One such instance is https://review.gluster.org/#/c/glusterfs/+/22641/.
>>>> Can you please look into this?
>>>>
>>>> --
>>>> Thanks,
>>>> Sanju
>>>>
>>>
>
> --
> Thanks,
> Sanju
>