[Gluster-devel] Spurious failure of ./tests/bugs/glusterd/bug-913555.t

Vijay Bellur vbellur at redhat.com
Tue Oct 18 17:37:39 UTC 2016


On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <amukherj at redhat.com> wrote:
> Final reminder before I take out the test case from the test file.
>
>
> On Thursday 13 October 2016, Atin Mukherjee <amukherj at redhat.com> wrote:
>>
>>
>>
>> On Wednesday 12 October 2016, Atin Mukherjee <amukherj at redhat.com> wrote:
>>>
>>> So the test fails (intermittently) in check_fs which tries to do a df on
>>> the mount point for a volume which is carved out of three bricks from 3
>>> nodes and one node is completely down. A quick look at the mount log reveals
>>> the following:
>>>
>>> [2016-10-10 13:58:59.279446]:++++++++++
>>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
>>> /mnt/glusterfs/0 ++++++++++
>>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport
>>> endpoint is not connected]
>>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]
>>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in /
>>> (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
>>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]
>>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
>>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
>>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]
>>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2: remote
>>> operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport
>>> endpoint is not connected]
>>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
>>> (Stale file handle)
>>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume]
>>> 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001)
>>> resolution failed
>>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
>>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
>>> (Stale file handle)
>>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
>>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
>>> resolution fail
>>>
>>> DHT team - are these anomalies expected here? I also see opendir and
>>> statfs failing here.
>>
>>
>> Any luck with this? I don't see how a check_fs test is relevant to the
>> bug this test case is tagged to. If I don't hear back on this in a few
>> days, I'll go ahead and remove this check from the test to avoid the
>> spurious failure.
>>


It looks like dht was not yet aware that a subvolume was down. dht winds
the lookup on the root gfid to first_up_subvolume, and in this case it
picked the subvolume backed by the brick that had been brought down,
hence the failure.

The test has this snippet:

<snippet>
# Kill one pseudo-node, make sure the others survive and volume stays up.
TEST kill_node 3;
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
EXPECT 0 check_fs $M0;
</snippet>

Maybe we should change EXPECT to an EXPECT_WITHIN to let CHILD_DOWN
percolate to dht?
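
To illustrate the retry idea, here is a minimal sketch of what an
EXPECT_WITHIN-style check does. This is not the framework's real
implementation: expect_within and flaky_check_fs are hypothetical
stand-ins written for this example. flaky_check_fs fails twice and then
succeeds, mimicking a mount that recovers once dht has seen CHILD_DOWN.

```shell
# Simplified stand-in for EXPECT_WITHIN (an assumption, not the real code):
# rerun a command until its output equals the expected value or the
# timeout in seconds runs out.
expect_within() {
    timeout=$1
    expected=$2
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    while :; do
        actual=$("$@" 2>/dev/null)
        if [ "$actual" = "$expected" ]; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 1
    done
}

# Hypothetical check: prints 1 (failure) on the first two calls and 0 after
# that. The call count lives in a file because each call runs in a subshell.
state_file=$(mktemp)
echo 0 > "$state_file"
flaky_check_fs() {
    calls=$(( $(cat "$state_file") + 1 ))
    echo "$calls" > "$state_file"
    if [ "$calls" -le 2 ]; then echo 1; else echo 0; fi
}

if expect_within 10 0 flaky_check_fs; then
    echo "ok after $(cat "$state_file") attempts"
else
    echo "not ok"
fi
rm -f "$state_file"
```

In the test itself the change would presumably look like
"EXPECT_WITHIN $PROBE_TIMEOUT 0 check_fs $M0;", reusing the timeout that
the previous line already uses. Whether $PROBE_TIMEOUT is the right
timeout for this check is an assumption.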

Logs indicate that dht was not aware of the subvolume being down for
at least 1 second after protocol/client sensed the disconnection.

[2016-10-10 13:58:58.235700] I [MSGID: 114018]
[client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from
patchy-client-2. Client process will keep trying to connect to
glusterd until brick's port is available
[2016-10-10 13:58:58.245060]:++++++++++
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3
online_brick_count ++++++++++
[2016-10-10 13:58:59.279446]:++++++++++
G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
/mnt/glusterfs/0 ++++++++++
[2016-10-10 13:58:59.287973] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not
connected]
[2016-10-10 13:58:59.288326] I [MSGID: 109063]
[dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies
in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
[2016-10-10 13:58:59.288352] W [MSGID: 109005]
[dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
[2016-10-10 13:58:59.288643] W [MSGID: 114031]
[client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001) [Transport endpoint is not
connected]
[2016-10-10 13:58:59.288927] W
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
handle)
[2016-10-10 13:58:59.288949] W
[fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR
(00000000-0000-0000-0000-000000000001) resolution failed
[2016-10-10 13:58:59.289505] W
[fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
handle)
[2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
resolution fail
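
As a quick check of the gap, the two timestamps from the log above (the
disconnect seen by protocol/client and the first failed lookup) can be
subtracted. The helper name to_seconds is made up for this sketch, and it
assumes both timestamps fall on the same day.

```shell
# Convert a log time such as "13:58:58.235700" to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.6f\n", $1 * 3600 + $2 * 60 + $3 }'
}

disconnect=$(to_seconds "13:58:58.235700")     # client_rpc_notify: disconnected
lookup_failed=$(to_seconds "13:58:59.287973")  # client3_3_lookup_cbk: failed

# Elapsed time between the two events, in seconds.
gap=$(awk -v a="$disconnect" -v b="$lookup_failed" 'BEGIN { printf "%.6f\n", b - a }')
echo "$gap"
```

This prints 1.052273, i.e. the lookup was still wound to the dead
subvolume a little over one second after the disconnect was logged.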

Regards,
Vijay

