[Gluster-devel] Spurious failure of ./tests/bugs/glusterd/bug-913555.t

Atin Mukherjee amukherj at redhat.com
Tue Oct 18 18:04:46 UTC 2016


Thanks a lot Vijay for the insights, will test it out and post a patch.

On Tuesday 18 October 2016, Vijay Bellur <vbellur at redhat.com> wrote:

> On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee
> <amukherj at redhat.com> wrote:
> > Final reminder before I take this test case out of the test file.
> >
> >
> > On Thursday 13 October 2016, Atin Mukherjee
> > <amukherj at redhat.com> wrote:
> >>
> >>
> >>
> >> On Wednesday 12 October 2016, Atin Mukherjee
> >> <amukherj at redhat.com> wrote:
> >>>
> >>> So the test fails (intermittently) in check_fs, which tries to do a df
> >>> on the mount point of a volume carved out of three bricks from three
> >>> nodes while one node is completely down. A quick look at the mount log
> >>> reveals the following:
> >>>
> >>> [2016-10-10 13:58:59.279446]:++++++++++
> >>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
> >>> /mnt/glusterfs/0 ++++++++++
> >>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]
> >>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote
> >>> operation failed. Path: / (00000000-0000-0000-0000-000000000001)
> >>> [Transport endpoint is not connected]
> >>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]
> >>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies
> >>> in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
> >>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]
> >>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
> >>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
> >>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]
> >>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote
> >>> operation failed. Path: / (00000000-0000-0000-0000-000000000001)
> >>> [Transport endpoint is not connected]
> >>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
> >>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
> >>> (Stale file handle)
> >>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume]
> >>> 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001)
> >>> resolution failed
> >>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
> >>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
> >>> (Stale file handle)
> >>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
> >>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
> >>> resolution fail
> >>>
> >>> DHT team - are these anomalies expected here? Opendir and statfs are
> >>> failing here as well.
> >>
> >>
> >> Any luck with this? I don't see any relevance of a check_fs test to the
> >> bug this test case is tagged to. If I don't hear back on this in a few
> >> days, I'll go ahead and remove this check from the test to avoid the
> >> spurious failure.
> >>
>
>
> Looks like dht was not aware of a subvolume being down. dht picks
> first_up_subvolume for winding the lookup on the root gfid, and in
> this case it picked the subvolume referring to the brick that was
> brought down, hence the failure.
>
> The test has this snippet:
>
> <snippet>
> # Kill one pseudo-node, make sure the others survive and volume stays up.
> TEST kill_node 3;
> EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
> EXPECT 0 check_fs $M0;
> </snippet>
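>
> For reference, check_fs just runs df against the mount point and
> echoes its exit status; a rough sketch of the helper, assuming the
> definition carried in the .t file itself:
>
> <snippet>
> # Sketch: echo 0 only if df can successfully stat the mount point.
> function check_fs {
>     df $1 &> /dev/null;
>     echo $?;
> }
> </snippet>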
>
> Maybe we should change EXPECT to an EXPECT_WITHIN to let CHILD_DOWN
> percolate to dht?
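>
> A minimal sketch of that change (the specific timeout variable is an
> assumption; any reasonable timeout from tests/include.rc would do):
>
> <snippet>
> # Poll until dht has processed CHILD_DOWN, instead of asserting
> # immediately after the node is killed.
> EXPECT_WITHIN $CHILD_UP_TIMEOUT 0 check_fs $M0;
> </snippet>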
>
> Logs indicate that dht was not aware of the subvolume being down for
> at least 1 second after protocol/client sensed the disconnection.
>
> [2016-10-10 13:58:58.235700] I [MSGID: 114018]
> [client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from
> patchy-client-2. Client process will keep trying to connect to
> glusterd until brick's port is available
> [2016-10-10 13:58:58.245060]:++++++++++
> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3
> online_brick_count ++++++++++
> [2016-10-10 13:58:59.279446]:++++++++++
> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
> /mnt/glusterfs/0 ++++++++++
> [2016-10-10 13:58:59.287973] W [MSGID: 114031]
> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
> remote operation failed. Path: /
> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
> connected]
> [2016-10-10 13:58:59.288326] I [MSGID: 109063]
> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies
> in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
> [2016-10-10 13:58:59.288352] W [MSGID: 109005]
> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
> [2016-10-10 13:58:59.288643] W [MSGID: 114031]
> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
> remote operation failed. Path: /
> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
> connected]
> [2016-10-10 13:58:59.288927] W
> [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
> 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
> handle)
> [2016-10-10 13:58:59.288949] W
> [fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR
> (00000000-0000-0000-0000-000000000001) resolution failed
> [2016-10-10 13:58:59.289505] W
> [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
> 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
> handle)
> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
> resolution fail
>
> Regards,
> Vijay
>


-- 
--Atin