[Gluster-devel] Parallel readdir from NFS clients causes incorrect data

Michael Brown michael at netdirect.ca
Wed Apr 3 21:37:39 UTC 2013

I'm seeing a problem on my fairly fresh RHEL gluster install. Smells to
me like a parallelism problem on the server.

If I mount a gluster volume via NFS (using glusterd's internal NFS
server, not nfs-kernel-server) and read a directory from multiple clients
*in parallel*, I get inconsistent results across the clients. Some files
are missing from the directory listing, and some show up twice!

Exactly which files (or directories!) are missing/duplicated varies each
time. But I can very consistently reproduce the behaviour.

You can see a screenshot here: http://imgur.com/JU8AFrt

The steps to reproduce are:
* clusterssh to each NFS client
* unmount /gv0 (to clear cache)
* mount /gv0 [1]
* ls -al /gv0/common/apache-jmeter-2.9/bin (which is where I first
noticed this)
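The comparison trick visible in the screenshot (piping the listing through md5sum, as discussed in the attached chatlog) can be sketched locally. This is only a demonstration of the fingerprinting technique against a scratch directory, not a reproduction of the NFS bug; all paths here are made up:

```shell
# Local sketch of the listing-fingerprint comparison. On the real
# setup each parallel "client" would be a separate NFS mount of gv0.
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"
for i in 1 2 3 4; do
  # each parallel reader fingerprints its view of the directory
  ls -a "$dir" | md5sum > "$dir.sum.$i" &
done
wait
# on a healthy mount every fingerprint is identical, so exactly one
# distinct checksum should remain:
sort -u "$dir".sum.* | wc -l
```

On the broken setup, differing checksums immediately show which clients got a divergent listing.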

Here's the rub: if, instead of doing the 'ls' in parallel, I do it in
series, it works just fine (consistent correct results everywhere). But
hitting the gluster server from multiple clients *at the same time*
causes problems.

I can still stat() and open() the files missing from the directory
listing; they just don't show up in an enumeration.

Mounting gv0 as a gluster client filesystem works just fine.

Details of my setup:
2 × gluster servers: 2×E5-2670, 128GB RAM, RHEL 6.4 64-bit,
glusterfs-server-3.3.1-1.el6.x86_64 (from EPEL)
4 × NFS clients: 2×E5-2660, 128GB RAM, RHEL 5.7 64-bit,
glusterfs-3.3.1-11.el5 (from kkeithley's repo, only used for testing)
gv0 volume information is below
bricks are 400GB SSDs with ext4[2]
common network is 10GbE, replication between servers happens over direct
10GbE link.

I will be testing on xfs/btrfs/zfs eventually, but for now I'm on ext4.

Also attached is my chatlog from asking about this in #gluster

[1]: fstab line is: fearless1:/gv0 /gv0 nfs
defaults,sync,tcp,wsize=8192,rsize=8192 0 0
[2]: yes, I've turned off dir_index to avoid That Bug. I've run the
d_off test, results are here: http://pastebin.com/zQt5gZnZ
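For reference, output like the pastebin above can be sanity-checked for cookie width with a one-liner. This assumes a hypothetical d_off_values.txt with one offset per line (the sample values below are made up for illustration; the real d_off test program is not shown here):

```shell
# Hypothetical one-offset-per-line dump like the d_off test produces;
# the two sample values are fabricated for illustration.
printf '%s\n' 12345 4294967295 > d_off_values.txt
# count cookies that do not fit in 32 bits (2^32 - 1 = 4294967295)
awk '$1 > 4294967295 { n++ } END { print n+0, "offsets exceed 32 bits" }' d_off_values.txt
```

A count of zero matches what I see on these bricks: all d_off values fit in 32 bits.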

gluster> volume info gv0
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 20117b48-7f88-4f16-9490-a0349afacf71
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Brick1: fearless1:/export/bricks/500117310007a6d8/glusterdata
Brick2: fearless2:/export/bricks/500117310007a674/glusterdata
Brick3: fearless1:/export/bricks/500117310007a714/glusterdata
Brick4: fearless2:/export/bricks/500117310007a684/glusterdata
Brick5: fearless1:/export/bricks/500117310007a7dc/glusterdata
Brick6: fearless2:/export/bricks/500117310007a694/glusterdata
Brick7: fearless1:/export/bricks/500117310007a7e4/glusterdata
Brick8: fearless2:/export/bricks/500117310007a720/glusterdata
Brick9: fearless1:/export/bricks/500117310007a7ec/glusterdata
Brick10: fearless2:/export/bricks/500117310007a74c/glusterdata
Brick11: fearless1:/export/bricks/500117310007a838/glusterdata
Brick12: fearless2:/export/bricks/500117310007a814/glusterdata
Brick13: fearless1:/export/bricks/500117310007a850/glusterdata
Brick14: fearless2:/export/bricks/500117310007a84c/glusterdata
Brick15: fearless1:/export/bricks/500117310007a858/glusterdata
Brick16: fearless2:/export/bricks/500117310007a8f8/glusterdata
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.disable: off

Michael Brown               | `One of the main causes of the fall of
Systems Consultant          | the Roman Empire was that, lacking zero,
Net Direct Inc.             | they had no way to indicate successful
Tel: +1 519 883 1172 x5106  | termination of their C programs.' - Firth

-------------- next part --------------
14:56 < Supermathie> I've stumbled across an odd problem with GlusterFS. My setup: 2 x RHEL server with a bunch of SSDs (each is a brick) replicating data between them. Single volume (gv0) exported via gluster's internal NFS.
14:57 < Supermathie> It's mounted on 4 identical servers, into which I regularly clusterssh. If I mount gv0 and do an ls -al inside a directory from all 4 clients at the same time, I get inconsistent results.
14:58 < Supermathie> Some clients see all the files, some are missing a file or two, some have duplicates (!) in the listing.
14:58 < Supermathie> stat() finds the missing files, but ls still doesn't see them.
14:58 < Supermathie> If I remount and list the directory in series instead of in parallel, everything looks good.
15:03 < samppah> Supermathie: what glusterfs version are you using, and are you using locking with nfs or do you mount with nolock?
15:06 < Supermathie> glusterfs-3.3.1-1.el6.x86_64 from EPEL, mount options are: rw,sync,tcp,wsize=8192,rsize=8192 (I added sync after I noticed the weird behaviour)
15:08 < samppah> ok.. all nodes are mounting remote directory and not using localhost to mount?
15:08 < samppah> @latest
15:08 <@glusterbot> samppah: The latest version is available at http://goo.gl/zO0Fa . There is a .repo file for yum or see @ppa for ubuntu.
15:08 < Supermathie> lockd is running on clients, I presume gluster has an internal lockd
15:09 < samppah> hmm
15:09 < samppah> @yum repo
15:09 <@glusterbot> samppah: kkeithley's fedorapeople.org yum repository has 32- and 64-bit glusterfs 3.3 packages for RHEL/Fedora/Centos distributions: http://goo.gl/EyoCw
15:09 < Supermathie> samppah: you mean mounting local glusterfs as a localhost nfs client? None of the 4 NFS clients in this setup are participating in the gluster (in the gluster cluster? :) )
15:10 < samppah> Supermathie: yes and ok, that's good :)
15:11 < samppah> i'm not very familiar with gluster nfs solution nor issues it may have
15:11 < samppah> however there are newer packages available at ,,(yum repo)
15:11 <@glusterbot> kkeithley's fedorapeople.org yum repository has 32- and 64-bit glusterfs 3.3 packages for RHEL/Fedora/Centos distributions: http://goo.gl/EyoCw
15:12 < samppah> brb
15:13 < Supermathie> samppah: Can reproduce it at-will: http://imgur.com/JU8AFrt
15:13 <@glusterbot> Title: Odd GlusterFS problem - Imgur (at imgur.com)
15:16 < Chiku|dc> Supermathie, when you do ls -al | md5sum, you do md5sum on the ls text output?
15:16 < Supermathie> Yeah
15:16 < Supermathie> Just gives me a quick way of noting which hosts are different.
15:16 < Supermathie> For instance, saveservice.properties is missing from fleming1
15:16 -!- zykure is now known as zyk|off
15:16 < Supermathie> [michael at fleming1 bin]$ ls -al saveservice.properties
15:16 < Supermathie> -rw-r--r-- 1 michael users 22186 Jan 24 06:21 saveservice.properties
15:16 < Supermathie> [michael at fleming1 bin]$ ls -al | grep saveservice.properties
15:16 < Supermathie> (no result)
15:17 < Supermathie> and mirror-server.sh missing from directory listing on fleming4
15:17 < Supermathie> (and httpclient.parameters)
15:18 -!- zyk|off is now known as zykure
15:20 < Chiku|dc> Supermathie, 4 servers with replica 4 ?
15:20 < Chiku|dc> or replica 2 ?
15:20 < Supermathie> These four servers are NFS clients only. NFS server is two servers with replica 2
15:21 < Chiku|dc> oh ok
15:21 < Supermathie> If I unmount (clear cache) and remount, and ls -al again, I get different results (different files missing on different servers).
15:21 < Supermathie> If I ls -al one at a time on each client, everything's OK.
15:21 < Chiku|dc> what about gluster client ?
15:22 < Chiku|dc> mount glusterfs client
15:27 < Supermathie> Chiku|dc: clients are RHEL5? is glusterfs-client.x86_64 0:2.0.9-2.el5 going to be happy with a 3.3.1 server? probably not? :)
15:27 < JoeJulian> Not quite... :D
15:28 < JoeJulian> Supermathie: Lol!!!!!
15:28 < JoeJulian> Supermathie: no.
15:28 < Supermathie> grabbing the 3.3.1 from kkeithle's repo :)
15:33 < Supermathie> Chiku|dc: mounting as glusterfs client yields consistent correct results
15:34 < JoeJulian> What filesystem are your bricks?
15:34 < Supermathie> ext4
15:35 < JoeJulian> bingo
15:35 < Supermathie> Oh wait, right, *that* problem? ext4 with dir_index turned off
15:35 < JoeJulian> I suspect the same "cookie" problem that's been the focus around the ,,(ext4) problem is what you're seeing with nfs.
15:35 <@glusterbot> Read about the ext4 problem at http://goo.gl/PEBQU
15:36 < JoeJulian> Something about the cookie being inconsistent between calls.
15:36 < stickyboy> I even knew about the ext4 problem, and I was still bit by it deploying GlusterFS last month. :D
15:37 < Supermathie> I encountered the dir_index problem right off the bat (NOTHING worked) and it was fine after turning off dir_index on each brick filesystem
15:37 < JoeJulian> And you thought I was going to just blindly point fingers... Granted, I don't fully understand the problem, but turning off the dir_index was a workaround to prevent the endless loop. I don't /think/ it solves the inconsistent cookie thing.
15:38 < Supermathie> First I heard about an inconsistent cookie... reading
15:39 < JoeJulian> Check the gluster-devel mailing list. Look for the threads with Bernd and Theodore.
15:40 < jdarcy> ... and me, and Avati, and Zach, and Eric, and ... ;)
15:41 < jdarcy> Long thread.
15:41 < Supermathie> Running the d_off test against the brick dir returns:
15:42 < JoeJulian> jdarcy: Do I have the essence of the problem correct, though? Inconsistent directory listings via nfs mount from an ext4 filesystem?
15:43 < Supermathie> http://pastebin.com/zQt5gZnZ
15:43 < Supermathie> (all 32-bit values)
15:45 < Supermathie> JoeJulian: The odd thing about it is that it *is* consistent... what tickles the bug seems to be doing it from 4 clients in parallel. Makes me think something about the request processing at the server is getting confused.
15:46 < JoeJulian> That's why I leaned toward that being the problem.
15:46 < JoeJulian> I could be wrong though.
15:46 < JoeJulian> Try a similar test with xfs and see if it's close.
15:47 < Supermathie> https://bugzilla.redhat.com/show_bug.cgi?id=838784#c14
15:47 <@glusterbot> <http://goo.gl/wBHbB> (at bugzilla.redhat.com)
15:47 <@glusterbot> Bug 838784: high, high, ---, sgowda, POST , DHT: readdirp goes into a infinite loop with ext4
15:48 < Supermathie> I will - I'm in the midst of testing out Oracle doing DNFS to Gluster. Going to be trying out a few different configs and brick filesystems.
15:49 < jdarcy> Inconsistent directory listings seems like it might be the hash-collision problem that the ext4 "fix" was trying to address.
15:57 < Supermathie> Whoah... this time, . and .. are missing from one server, and another has them listed twice. And the latter also has 'examples' twice.

More information about the Gluster-devel mailing list