[Gluster-devel] NetBSD's read-subvol-entry.t spurious failures explained

Fri Mar 6 11:01:31 UTC 2015

Hi

I tracked down the spurious failures of read-subvol-entry.t on NetBSD.

Here is what should happen: we have a volume with brick0 and brick1.
We disable self-heal, kill brick0, create a file in a directory, 
restart brick0, and we list directory content to check we find the file.

The mechanism under test is that on brick1, trusted.afr.patchy-client-0
accuses brick0 of being outdated, hence AFR should rule out brick0
for listing directory content and use brick1, which contains
the file we look for.
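
To illustrate what that xattr carries, here is a minimal sketch. It
assumes the usual AFR changelog layout: a 12-byte value read as three
big-endian 32-bit pending counters, in the order data, metadata, entry.
The helper name afr_pending is hypothetical and not part of the
GlusterFS test suite; it only decodes a value in the form printed by
getfattr -e hex.

```shell
#!/bin/bash
# Hypothetical helper (not from the GlusterFS tree).
# Assumption: the value is 12 bytes = three big-endian 32-bit counters,
# in the order data, metadata, entry.
afr_pending() {
    local hex=${1#0x}
    if [ "${#hex}" -ne 24 ]; then
        echo "expected 24 hex digits, got ${#hex}" >&2
        return 1
    fi
    # $((16#...)) converts a hex string to decimal in bash.
    local data=$((16#${hex:0:8}))
    local meta=$((16#${hex:8:8}))
    local entry=$((16#${hex:16:8}))
    echo "data=$data metadata=$meta entry=$entry"
}

# A non-zero entry counter read on brick1 would mean brick1 accuses
# brick0 of missing directory entries (sample value, not a real capture).
afr_pending 0x000000000000000000000002
```

On a live setup the input would come from something like
getfattr -n trusted.afr.patchy-client-0 -e hex run against the
directory on brick1 (the exact brick path is an assumption here).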

On NetBSD I can see that AFR never gets trusted.afr.patchy-client-0
and always thinks brick0 is fine. AFR randomly picks brick0 or brick1
to list directory content, and when it picks brick0 the test fails.

The reason why trusted.afr.patchy-client-0 is not there is that the
node is cached in kernel FUSE from an earlier lookup. The TTL obtained
at that time tells the kernel this node is still valid, hence the
kernel does not send the new lookup to GlusterFS. Since GlusterFS uses
lookups to refresh the client's view of xattrs, it sticks with the older
value, from when brick0 was not yet outdated and
trusted.afr.patchy-client-0 was unset.
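
The caching decision described above can be modelled roughly as
follows. This is only a sketch of the reasoning, not kernel code: the
function name needs_lookup and the timestamps are made up for
illustration.

```shell
#!/bin/bash
# Hypothetical model of the kernel-side decision; names are invented.
# Returns 0 (true) when the cached entry has expired and a fresh LOOKUP
# would be sent to the filesystem, 1 when the cache is still trusted.
needs_lookup() {
    local cached_at=$1 ttl=$2 now=$3
    [ $((now - cached_at)) -ge "$ttl" ]
}

# Entry cached at t=100 with a TTL of 10 seconds (made-up numbers).
if needs_lookup 100 10 103; then
    echo "t=103: lookup sent, xattrs refreshed"
else
    echo "t=103: served from cache, xattrs stale"
fi
```

In this model the listing at t=103 never reaches GlusterFS, so the
client keeps the xattr view it had before brick0 went down.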

Questions:

1) Is NetBSD behavior wrong here? It got a TTL for a node, so as I
understand it, it should not send lookups to the filesystem until the
TTL has expired.

2) How to fix it? If NetBSD behavior is correct, then I guess the test
only succeeds on Linux by chance and we only need to fix the test.
The change below flushes the kernel cache before looking for the file:

--- a/tests/basic/afr/read-subvol-entry.t
+++ b/tests/basic/afr/read-subvol-entry.t
@@ -26,6 +26,7 @@ TEST kill_brick $V0 $H0 $B0/brick0
 
 TEST touch $M0/abc/def/ghi
 TEST $CLI volume start $V0 force
+( cd $M0 && umount $M0 ) 
 EXPECT_WITHIN $PROCESS_UP_TIMEOUT "ghi" echo `ls $M0/abc/def/`
 
 #Cleanup



-- 
Emmanuel Dreyfus
manu at netbsd.org