[Gluster-devel] Spurious failure of ./tests/bugs/glusterd/bug-913555.t

Raghavendra Gowdappa rgowdapp at redhat.com
Wed Oct 19 04:20:44 UTC 2016



----- Original Message -----
> From: "Vijay Bellur" <vbellur at redhat.com>
> To: "Atin Mukherjee" <amukherj at redhat.com>
> Cc: "Oleksandr Natalenko" <oleksandr at natalenko.name>, "Nithya Balachandran" <nbalacha at redhat.com>, "Raghavendra
> Gowdappa" <rgowdapp at redhat.com>, "Shyam Ranganathan" <srangana at redhat.com>, "Gluster Devel"
> <gluster-devel at gluster.org>
> Sent: Tuesday, October 18, 2016 11:07:39 PM
> Subject: Re: [Gluster-devel] Spurious failure of ./tests/bugs/glusterd/bug-913555.t
> 
> On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <amukherj at redhat.com> wrote:
> > Final reminder before I take out the test case from the test file.
> >
> >
> > On Thursday 13 October 2016, Atin Mukherjee <amukherj at redhat.com> wrote:
> >>
> >>
> >>
> >> On Wednesday 12 October 2016, Atin Mukherjee <amukherj at redhat.com> wrote:
> >>>
> >>> So the test fails (intermittently) in check_fs which tries to do a df on
> >>> the mount point for a volume which is carved out of three bricks from 3
> >>> nodes and one node is completely down. A quick look at the mount log
> >>> reveals
> >>> the following:
> >>>
> >>> [2016-10-10 13:58:59.279446]:++++++++++ G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs /mnt/glusterfs/0 ++++++++++
> >>> [2016-10-10 13:58:59.287973] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
> >>> [2016-10-10 13:58:59.288326] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
> >>> [2016-10-10 13:58:59.288352] W [MSGID: 109005] [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
> >>> [2016-10-10 13:58:59.288643] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
> >>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file handle)
> >>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001) resolution failed
> >>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file handle)
> >>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume] 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001) resolution fail
> >>>
> >>> DHT team - are these anomalies expected here? I also see opendir and
> >>> statfs failing.
> >>
> >>
> >> Any luck with this? I don't see any relevance of having a check_fs test
> >> w.r.t. the bug this test case is tagged to. If I don't get to hear on this
> >> in a few days, I'll go ahead and remove this check from the test to avoid
> >> the spurious failure.
> >>
> 
> 
> Looks like dht was not aware of the subvolume being down. In dht we pick
> first_up_subvolume for winding the lookup on the root gfid, and in this
> case we picked the subvolume referring to the brick which had been brought
> down, hence the failure.

I hadn't observed that DHT treats nameless lookups on root differently from other paths. Thanks for pointing it out. My initial code reading didn't reveal any reason why the lookup failed in DHT (since the other two subvols were up). I'll dig more into it and report back with my findings.

> 
> The test has this snippet:
> 
> <snippet>
> # Kill one pseudo-node, make sure the others survive and volume stays up.
> TEST kill_node 3;
> EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
> EXPECT 0 check_fs $M0;
> </snippet>
> 
> Maybe we should change EXPECT to an EXPECT_WITHIN to let CHILD_DOWN
> percolate to dht?
> 
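Agreed, an EXPECT_WITHIN there should close the window. For clarity, the retry semantics we would be relying on look roughly like the sketch below (illustrative only; expect_within is a hypothetical helper, not the actual implementation in the test framework's include.rc):

```shell
#!/bin/sh
# Sketch of EXPECT_WITHIN-style polling: re-run a check until its output
# matches the expected value or the timeout (in seconds) expires.
expect_within() {
    timeout=$1; expected=$2; shift 2
    end=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -le "$end" ]; do
        if [ "$("$@")" = "$expected" ]; then
            return 0        # check produced the expected value in time
        fi
        sleep 1             # give the event (e.g. CHILD_DOWN) time to land
    done
    return 1                # timed out without seeing the expected value
}

# Usage analogous to the proposed change in bug-913555.t:
#   expect_within "$PROBE_TIMEOUT" 0 check_fs "$M0"
expect_within 3 hello echo hello && echo matched
```

With polling like this, check_fs would be retried for up to the timeout instead of failing on its first attempt, which covers the ~1 second lag between protocol/client sensing the disconnect and dht acting on it.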
> Logs indicate that dht was not aware of the subvolume being down for
> at least 1 second after protocol/client sensed the disconnection.
> 
> [2016-10-10 13:58:58.235700] I [MSGID: 114018]
> [client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from
> patchy-client-2. Client process will keep trying to connect to
> glusterd until brick's port is available
> [2016-10-10 13:58:58.245060]:++++++++++
> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3
> online_brick_count ++++++++++
> [2016-10-10 13:58:59.279446]:++++++++++
> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
> /mnt/glusterfs/0 ++++++++++
> [2016-10-10 13:58:59.287973] W [MSGID: 114031]
> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
> remote operation failed. Path: /
> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
> connected]
> [2016-10-10 13:58:59.288326] I [MSGID: 109063]
> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies
> in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
> [2016-10-10 13:58:59.288352] W [MSGID: 109005]
> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
> [2016-10-10 13:58:59.288643] W [MSGID: 114031]
> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
> remote operation failed. Path: /
> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
> connected]
> [2016-10-10 13:58:59.288927] W
> [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
> 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
> handle)
> [2016-10-10 13:58:59.288949] W
> [fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR
> (00000000-0000-0000-0000-000000000001) resolution failed
> [2016-10-10 13:58:59.289505] W
> [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
> 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
> handle)
> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
> resolution fail
> 
> Regards,
> Vijay
> 

