[Gluster-devel] Parallel readdir from NFS clients causes incorrect data

Michael Brown michael at netdirect.ca
Thu Apr 4 16:31:49 UTC 2013


I'm not quite keen on trying HEAD on these servers yet, but I did grab
the source package from
http://repos.fedorapeople.org/repos/kkeithle/glusterfs/epel-6Server/SRPMS/
and apply the patch manually.

Much better! Looks like that did the trick.

M.

On 13-04-03 07:57 PM, Anand Avati wrote:
> Here's a patch on top of today's git HEAD, if you can try it:
> http://review.gluster.org/4774/
>
> Thanks!
> Avati
>
> On Wed, Apr 3, 2013 at 4:35 PM, Anand Avati <anand.avati at gmail.com> wrote:
>
>     Hmm, I was tempted to suggest that you were bitten by the
>     gluster/ext4 readdir d_off incompatibility issue (which was recently
>     fixed in http://review.gluster.org/4711/). But you say it works fine
>     when you run the ls from one client at a time, sequentially.
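>
>     (For context, the d_off problem is a different beast: the cluster
>     translators have to encode which brick a directory cursor belongs to
>     into the d_off they return, roughly along the lines of the toy
>     transform below, and ext4's 63-bit hashed directory offsets leave no
>     spare bits for that. This is only a simplified illustration of the
>     idea, not the actual glusterfs transform.)
>
>         #include <stdint.h>
>
>         /* Toy version of the idea only: steal the top 'bits' bits of the
>          * d_off for the subvolume index.  This breaks once the backend
>          * filesystem already uses (nearly) the whole 64-bit range for
>          * its own offsets, as ext4's hashed directories do. */
>         static inline uint64_t
>         toy_itransform (uint64_t backend_doff, uint64_t subvol_idx, int bits)
>         {
>                 return (subvol_idx << (64 - bits)) | backend_doff;
>         }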
>
>     After reading your email, I realized that, because glusterfs uses
>     the same anonymous fd to serve readdir queries from multiple
>     clients/applications, we have a race in the posix translator: two
>     threads can push/pull the same backend directory cursor in a chaotic
>     way, resulting in duplicate or lost entries. This might be the issue
>     you are seeing; just guessing.
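>
>     In case a picture helps, below is a minimal standalone sketch of
>     that shared-cursor problem. It is illustrative only, not the actual
>     posix translator code; the "request" struct, the batch size of 8,
>     and the use of the current directory are made up for the example.
>
>         /* cc -o readdir-race readdir-race.c -lpthread */
>         #include <dirent.h>
>         #include <pthread.h>
>         #include <stdio.h>
>
>         static DIR *shared_dir;   /* one backend stream, like one anonymous fd */
>         static pthread_mutex_t stream_lock = PTHREAD_MUTEX_INITIALIZER;
>
>         struct request {          /* one client paging through the directory */
>                 const char *name;
>                 long        resume_off;
>         };
>
>         static void *client_listing (void *arg)
>         {
>                 struct request *req = arg;
>                 struct dirent  *ent;
>
>                 for (;;) {
>                         int i, got = 0;
>
>                         /* With the mutex held, each seekdir+readdir batch is
>                          * atomic and both "clients" see every entry exactly
>                          * once.  Drop the mutex and the two loops move the
>                          * same cursor under each other, which is the kind of
>                          * duplicate/lost-entry chaos described above. */
>                         pthread_mutex_lock (&stream_lock);
>                         seekdir (shared_dir, req->resume_off);
>                         for (i = 0; i < 8 && (ent = readdir (shared_dir)); i++) {
>                                 printf ("%s: %s\n", req->name, ent->d_name);
>                                 got++;
>                         }
>                         req->resume_off = telldir (shared_dir);
>                         pthread_mutex_unlock (&stream_lock);
>
>                         if (got == 0)
>                                 break;
>                 }
>                 return NULL;
>         }
>
>         int main (void)
>         {
>                 pthread_t t1, t2;
>                 struct request r1 = { "client-1", 0 }, r2 = { "client-2", 0 };
>
>                 shared_dir = opendir (".");
>                 if (!shared_dir)
>                         return 1;
>                 pthread_create (&t1, NULL, client_listing, &r1);
>                 pthread_create (&t2, NULL, client_listing, &r2);
>                 pthread_join (t1, NULL);
>                 pthread_join (t2, NULL);
>                 closedir (shared_dir);
>                 return 0;
>         }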
>
>     Would you be willing to try out a source code patch on top of git
>     HEAD, rebuild your glusterfs, and verify whether it fixes the issue?
>     I would really appreciate it!
>
>     Thanks,
>     Avati
>
>     On Wed, Apr 3, 2013 at 2:37 PM, Michael Brown
>     <michael at netdirect.ca> wrote:
>
>         I'm seeing a problem on my fairly fresh RHEL gluster install.
>         Smells to me like a parallelism problem on the server.
>
>         If I mount a gluster volume via NFS (using glusterd's internal
>         NFS server, not nfs-kernel-server) and read a directory from
>         multiple clients *in parallel*, I get inconsistent results across
>         the clients: some files are missing from the directory listing,
>         and some may be present twice!
>
>         Exactly which files (or directories!) are missing/duplicated
>         varies each time. But I can very consistently reproduce the
>         behaviour.
>
>         You can see a screenshot here: http://imgur.com/JU8AFrt
>
>         The reproduction steps are (a small checker for comparing the
>         resulting listings is sketched after this list):
>         * clusterssh to each NFS client
>         * unmount /gv0 (to clear the cache)
>         * mount /gv0 [1]
>         * ls -al /gv0/common/apache-jmeter-2.9/bin (which is where I
>         first noticed this)
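>
>         A throwaway checker along these lines (just a sketch; the
>         compile line and path are illustrative) makes the comparison
>         concrete: it prints the entry count plus a hash of the sorted
>         names, so differing clusterssh windows stand out at a glance.
>
>             /* cc -o lscheck lscheck.c && ./lscheck /gv0/common/apache-jmeter-2.9/bin */
>             #include <dirent.h>
>             #include <stdint.h>
>             #include <stdio.h>
>             #include <stdlib.h>
>
>             int main (int argc, char **argv)
>             {
>                     struct dirent **ents;
>                     uint64_t        hash = 0xcbf29ce484222325ULL;  /* FNV-1a basis */
>                     int             n, i;
>
>                     n = scandir (argc > 1 ? argv[1] : ".", &ents, NULL, alphasort);
>                     if (n < 0) {
>                             perror ("scandir");
>                             return 1;
>                     }
>                     for (i = 0; i < n; i++) {
>                             const char *p = ents[i]->d_name;
>                             do {              /* hash each name incl. its NUL */
>                                     hash ^= (unsigned char) *p;
>                                     hash *= 0x100000001b3ULL;  /* FNV-1a prime */
>                             } while (*p++);
>                             free (ents[i]);
>                     }
>                     free (ents);
>                     printf ("%d entries, hash %016llx\n", n,
>                             (unsigned long long) hash);
>                     return 0;
>             }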
>
>         Here's the rub: if, instead of doing the 'ls' in parallel, I
>         do it in series, it works just fine (consistent correct
>         results everywhere). But hitting the gluster server from
>         multiple clients *at the same time* causes problems.
>
>         I can still stat() and open() the files missing from the
>         directory listing; they just don't show up in an enumeration.
>
>         Mounting gv0 as a gluster client filesystem works just fine.
>
>         Details of my setup:
>         2 × gluster servers: 2×E5-2670, 128GB RAM, RHEL 6.4 64-bit,
>         glusterfs-server-3.3.1-1.el6.x86_64 (from EPEL)
>         4 × NFS clients: 2×E5-2660, 128GB RAM, RHEL 5.7 64-bit,
>         glusterfs-3.3.1-11.el5 (from kkeithley's repo, only used for
>         testing)
>         gv0 volume information is below
>         bricks are 400GB SSDs with ext4[2]
>         common network is 10GbE, replication between servers happens
>         over direct 10GbE link.
>
>         I will be testing on xfs/btrfs/zfs eventually, but for now I'm
>         on ext4.
>
>         Also attached is my chatlog from asking about this in #gluster
>
>         [1]: fstab line is: fearless1:/gv0 /gv0 nfs
>         defaults,sync,tcp,wsize=8192,rsize=8192 0 0
>         [2]: yes, I've turned off dir_index to avoid That Bug. I've
>         run the d_off test, results are here: http://pastebin.com/zQt5gZnZ
>
>         ----
>         gluster> volume info gv0
>          
>         Volume Name: gv0
>         Type: Distributed-Replicate
>         Volume ID: 20117b48-7f88-4f16-9490-a0349afacf71
>         Status: Started
>         Number of Bricks: 8 x 2 = 16
>         Transport-type: tcp
>         Bricks:
>         Brick1: fearless1:/export/bricks/500117310007a6d8/glusterdata
>         Brick2: fearless2:/export/bricks/500117310007a674/glusterdata
>         Brick3: fearless1:/export/bricks/500117310007a714/glusterdata
>         Brick4: fearless2:/export/bricks/500117310007a684/glusterdata
>         Brick5: fearless1:/export/bricks/500117310007a7dc/glusterdata
>         Brick6: fearless2:/export/bricks/500117310007a694/glusterdata
>         Brick7: fearless1:/export/bricks/500117310007a7e4/glusterdata
>         Brick8: fearless2:/export/bricks/500117310007a720/glusterdata
>         Brick9: fearless1:/export/bricks/500117310007a7ec/glusterdata
>         Brick10: fearless2:/export/bricks/500117310007a74c/glusterdata
>         Brick11: fearless1:/export/bricks/500117310007a838/glusterdata
>         Brick12: fearless2:/export/bricks/500117310007a814/glusterdata
>         Brick13: fearless1:/export/bricks/500117310007a850/glusterdata
>         Brick14: fearless2:/export/bricks/500117310007a84c/glusterdata
>         Brick15: fearless1:/export/bricks/500117310007a858/glusterdata
>         Brick16: fearless2:/export/bricks/500117310007a8f8/glusterdata
>         Options Reconfigured:
>         diagnostics.count-fop-hits: on
>         diagnostics.latency-measurement: on
>         nfs.disable: off
>         ----
>
>         -- 
>         Michael Brown               | `One of the main causes of the fall of
>         Systems Consultant          | the Roman Empire was that, lacking zero,
>         Net Direct Inc.             | they had no way to indicate successful
>         ☎: +1 519 883 1172 x5106    | termination of their C programs.' - Firth
>
>
>         _______________________________________________
>         Gluster-devel mailing list
>         Gluster-devel at nongnu.org
>         https://lists.nongnu.org/mailman/listinfo/gluster-devel
>
>
>


-- 
Michael Brown               | `One of the main causes of the fall of
Systems Consultant          | the Roman Empire was that, lacking zero,
Net Direct Inc.             | they had no way to indicate successful
☎: +1 519 883 1172 x5106    | termination of their C programs.' - Firth
