[Gluster-devel] Spurious failure of ./tests/bugs/glusterd/bug-913555.t

Atin Mukherjee amukherj at redhat.com
Wed Oct 19 12:45:35 UTC 2016


On Tue, Oct 18, 2016 at 11:34 PM, Atin Mukherjee <amukherj at redhat.com>
wrote:

> Thanks a lot Vijay for the insights, will test it out and post a patch.


Unfortunately this didn't work. Even replacing EXPECT with EXPECT_WITHIN
fails spuriously.

@Nigel - I'd like to see how often this test fails and, based on that,
decide whether to temporarily remove this check. Could you share the last
two weekly reports of the regression failures to help me figure this out?


>
> On Tuesday 18 October 2016, Vijay Bellur <vbellur at redhat.com> wrote:
>
>> On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <amukherj at redhat.com>
>> wrote:
>> > Final reminder before I take out the test case from the test file.
>> >
>> >
>> > On Thursday 13 October 2016, Atin Mukherjee <amukherj at redhat.com>
>> wrote:
>> >>
>> >>
>> >>
>> >> On Wednesday 12 October 2016, Atin Mukherjee <amukherj at redhat.com>
>> wrote:
>> >>>
>> >>> So the test fails (intermittently) in check_fs which tries to do a
>> >>> df on the mount point for a volume which is carved out of three
>> >>> bricks from 3 nodes and one node is completely down. A quick look
>> >>> at the mount log reveals the following:
>> >>>
>> >>> [2016-10-10 13:58:59.279446]:++++++++++ G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs /mnt/glusterfs/0 ++++++++++
>> >>> [2016-10-10 13:58:59.287973] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
>> >>> [2016-10-10 13:58:59.288326] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
>> >>> [2016-10-10 13:58:59.288352] W [MSGID: 109005] [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
>> >>> [2016-10-10 13:58:59.288643] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
>> >>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file handle)
>> >>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001) resolution failed
>> >>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file handle)
>> >>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume] 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001) resolution fail
>> >>>
>> >>> DHT team - are these anomalies expected here? I also see opendir
>> >>> and statfs failing.
>> >>
>> >>
>> >> Any luck with this? I don't see any relevance of having a check_fs
>> >> test w.r.t. the bug this test case is tagged to. If I don't hear back
>> >> on this in a few days, I'll go ahead and remove this check from the
>> >> test to avoid the spurious failure.
>> >>
>>
>>
>> Looks like dht was not aware of a subvolume being down. dht picks
>> first_up_subvolume for winding the lookup on the root gfid, and in this
>> case it picked the subvolume referring to the brick which was brought
>> down, hence the failure.
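As a rough illustration (a hypothetical shell sketch, not the actual dht code; the subvolume names and the cached-state list are made up for the example), first-up-subvolume selection over a stale cached child state can wind the lookup to a brick that is already dead:

```shell
#!/bin/sh
# Hypothetical sketch: dht keeps per-child up/down state and winds the
# root lookup to the first child it believes is up.
first_up_subvol() {
    for entry in $1; do
        # Each entry is "name:state"; return the first one marked up.
        if [ "${entry#*:}" = "up" ]; then
            echo "${entry%%:*}"
            return 0
        fi
    done
    return 1
}

# Just after the brick behind patchy-client-2 is killed, the cached
# state may still say "up" because CHILD_DOWN has not been processed:
stale_view="patchy-client-2:up patchy-client-0:up patchy-client-1:up"
first_up_subvol "$stale_view"    # picks patchy-client-2, a dead brick

# Once CHILD_DOWN percolates, the dead child is skipped:
fresh_view="patchy-client-2:down patchy-client-0:up patchy-client-1:up"
first_up_subvol "$fresh_view"    # picks patchy-client-0
```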
>>
>> The test has this snippet:
>>
>> <snippet>
>> # Kill one pseudo-node, make sure the others survive and volume stays up.
>> TEST kill_node 3;
>> EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
>> EXPECT 0 check_fs $M0;
>> </snippet>
>>
>> Maybe we should change EXPECT to an EXPECT_WITHIN to let CHILD_DOWN
>> percolate to dht?
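For reference, a minimal sketch of the retry semantics such a change relies on (this is not the framework source, and $PROCESS_UP_TIMEOUT below is an assumed variable name, not necessarily what the test would use):

```shell
#!/bin/sh
# Minimal sketch of an EXPECT_WITHIN-style check: poll the command every
# second until it prints the expected value or the timeout expires.
expect_within() {
    timeout=$1 expected=$2
    shift 2
    end=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -le "$end" ]; do
        if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
            return 0    # value matched before the deadline
        fi
        sleep 1
    done
    return 1            # timed out without seeing the expected value
}

# The proposed change would then be roughly:
#   EXPECT_WITHIN $PROCESS_UP_TIMEOUT 0 check_fs $M0;
# ($PROCESS_UP_TIMEOUT is an assumed name; the real test would use
# whichever timeout the framework defines for this situation.)
```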
>>
>> Logs indicate that dht was not aware of the subvolume being down for
>> at least 1 second after protocol/client sensed the disconnection.
>>
>> [2016-10-10 13:58:58.235700] I [MSGID: 114018]
>> [client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from
>> patchy-client-2. Client process will keep trying to connect to
>> glusterd until brick's port is available
>> [2016-10-10 13:58:58.245060]:++++++++++
>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3
>> online_brick_count ++++++++++
>> [2016-10-10 13:58:59.279446]:++++++++++
>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
>> /mnt/glusterfs/0 ++++++++++
>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]
>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
>> remote operation failed. Path: /
>> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
>> connected]
>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]
>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies
>> in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]
>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]
>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
>> remote operation failed. Path: /
>> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
>> connected]
>> [2016-10-10 13:58:59.288927] W
>> [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
>> 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
>> handle)
>> [2016-10-10 13:58:59.288949] W
>> [fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR
>> (00000000-0000-0000-0000-000000000001) resolution failed
>> [2016-10-10 13:58:59.289505] W
>> [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
>> 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
>> handle)
>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
>> resolution fail
>>
>> Regards,
>> Vijay
>>
>
>
> --
> --Atin
>



-- 

~ Atin (atinm)

