[Gluster-devel] Parallel readdir from NFS clients causes incorrect data

Anand Avati anand.avati at gmail.com
Wed Apr 3 23:57:00 UTC 2013


Here's a patch on top of today's git HEAD, if you can try -
http://review.gluster.org/4774/

Thanks!
Avati

On Wed, Apr 3, 2013 at 4:35 PM, Anand Avati <anand.avati at gmail.com> wrote:

> Hmm, I would be tempted to suggest that you were bitten by the gluster/ext4
> readdir's d_off incompatibility issue (which got recently fixed
> http://review.gluster.org/4711/). But you say it works fine when you do
> ls one at a time sequentially.
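For context, a rough sketch of why large ext4 d_off values clash with a distributed translator that packs routing information into the offset it hands back to clients. The bit layout below is purely illustrative, not gluster's actual encoding:

```python
# Illustrative only: a DHT-style translator that reserves the high bits of
# the 64-bit d_off for a subvolume index (SUBVOL_BITS is a made-up value).
SUBVOL_BITS = 8

def encode_doff(subvol, d_off):
    """Pack a subvolume index into the high bits of a backend d_off."""
    if d_off >> (64 - SUBVOL_BITS):
        # ext4 with dir_index returns hash-based offsets that can use the
        # full 64-bit range, so the high bits are already occupied.
        raise ValueError("backend d_off too large to encode losslessly")
    return (subvol << (64 - SUBVOL_BITS)) | d_off

def decode_doff(encoded):
    """Recover (subvolume index, backend d_off) from an encoded offset."""
    return encoded >> (64 - SUBVOL_BITS), encoded & ((1 << (64 - SUBVOL_BITS)) - 1)
```

Small, monotonic offsets (as xfs returns) round-trip cleanly; a full-range ext4 hash offset cannot be encoded without losing bits, which is the shape of the incompatibility.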
>
> I just realized after reading your email that, because glusterfs uses the
> same anonymous fd for multiple clients'/applications' readdir queries, we
> have a race in the posix translator where two threads attempt to
> push/pull the same backend cursor in a chaotic way, resulting in
> duplicate/lost entries. This might be the issue you are seeing, just
> guessing.
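The race can be sketched with a toy model: two readers sharing a single directory cursor each believe they are enumerating the whole directory, but each only sees the entries it happened to pull. The interleaving below is deterministic for clarity; real threads would interleave chaotically, and the entry names are made up:

```python
import itertools

class SharedCursor:
    """Toy stand-in for the backend cursor behind one shared anonymous fd."""
    def __init__(self, entries):
        self.entries = entries
        self.pos = 0  # one cursor position shared by all readers

    def readdir(self):
        if self.pos >= len(self.entries):
            return None
        entry = self.entries[self.pos]
        self.pos += 1
        return entry

entries = [f"file{i}" for i in range(6)]
cursor = SharedCursor(entries)

# Interleave two clients' readdir loops, as two racing threads might.
seen = {"client_a": [], "client_b": []}
for client in itertools.cycle(["client_a", "client_b"]):
    entry = cursor.readdir()
    if entry is None:
        break
    seen[client].append(entry)

# Every entry was returned exactly once overall, yet each client's
# individual listing is missing roughly half the directory.
assert sorted(seen["client_a"] + seen["client_b"]) == entries
assert len(seen["client_a"]) < len(entries)
```

With per-reader cursors (or serialized access to the shared one), each client would see the complete listing, which matches the observation that sequential `ls` works fine.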
>
> Would you be willing to try out a source code patch on top of the git HEAD,
> rebuild your glusterfs, and verify whether it fixes the issue? I would really
> appreciate it!
>
> Thanks,
> Avati
>
> On Wed, Apr 3, 2013 at 2:37 PM, Michael Brown <michael at netdirect.ca> wrote:
>
>>  I'm seeing a problem on my fairly fresh RHEL gluster install. Smells to
>> me like a parallelism problem on the server.
>>
>> If I mount a gluster volume via NFS (using glusterd's internal NFS
>> server, not nfs-kernel-server) and read a directory from multiple clients *in
>> parallel*, I get inconsistent results across servers. Some files are
>> missing from the directory listing, some may be present twice!
>>
>> Exactly which files (or directories!) are missing/duplicated varies each
>> time. But I can very consistently reproduce the behaviour.
>>
>> You can see a screenshot here: http://imgur.com/JU8AFrt
>>
>> The reproduction steps are:
>> * clusterssh to each NFS client
>> * unmount /gv0 (to clear cache)
>> * mount /gv0 [1]
>> * ls -al /gv0/common/apache-jmeter-2.9/bin (which is where I first
>> noticed this)
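The divergence between clients can be quantified by diffing each client's captured listing against a known-good reference (e.g. from a sequential ls). A hypothetical helper, assuming each listing has been saved as a list of names:

```python
from collections import Counter

def diff_listings(reference, listing):
    """Report entries missing from, or duplicated in, one client's listing
    relative to a known-good reference listing."""
    ref, got = Counter(reference), Counter(listing)
    missing = sorted((ref - got).elements())
    duplicated = sorted(name for name, count in got.items() if count > 1)
    return missing, duplicated

# Example with made-up listings showing both failure modes at once:
reference = ["ApacheJMeter.jar", "jmeter", "jmeter.bat", "jmeter.sh"]
client1 = ["ApacheJMeter.jar", "jmeter", "jmeter"]  # lost two, duplicated one
missing, duplicated = diff_listings(reference, client1)
print(missing, duplicated)  # ['jmeter.bat', 'jmeter.sh'] ['jmeter']
```

Running this per client after each parallel mount/ls cycle would show which entries were lost or duplicated on which client for that run.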
>>
>> Here's the rub: if, instead of doing the 'ls' in parallel, I do it in
>> series, it works just fine (consistent, correct results everywhere). But
>> hitting the gluster server from multiple clients *at the same time* causes
>> problems.
>>
>> I can still stat() and open() the files missing from the directory
>> listing, they just don't show up in an enumeration.
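That symptom (entries that stat fine but are absent from enumeration) can be checked mechanically. A sketch, self-checked against a local temporary directory here since the actual NFS mount point is site-specific:

```python
import os
import tempfile

def invisible_but_statable(dirpath, expected_names):
    """Return names that stat() successfully but are missing from the
    readdir enumeration of dirpath."""
    listed = set(os.listdir(dirpath))
    return sorted(
        name for name in expected_names
        if name not in listed and os.path.exists(os.path.join(dirpath, name))
    )

# Self-check on a healthy local directory: nothing should be invisible.
with tempfile.TemporaryDirectory() as d:
    for name in ["a.txt", "b.txt"]:
        open(os.path.join(d, name), "w").close()
    print(invisible_but_statable(d, ["a.txt", "b.txt"]))  # []
```

On an affected NFS mount, passing the file names from a known-good sequential listing would flag exactly the entries dropped from the parallel enumeration.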
>>
>> Mounting gv0 as a gluster client filesystem works just fine.
>>
>> Details of my setup:
>> 2 × gluster servers: 2×E5-2670, 128GB RAM, RHEL 6.4 64-bit,
>> glusterfs-server-3.3.1-1.el6.x86_64 (from EPEL)
>> 4 × NFS clients: 2×E5-2660, 128GB RAM, RHEL 5.7 64-bit,
>> glusterfs-3.3.1-11.el5 (from kkeithley's repo, only used for testing)
>> gv0 volume information is below
>> bricks are 400GB SSDs with ext4[2]
>> common network is 10GbE, replication between servers happens over direct
>> 10GbE link.
>>
>> I will be testing on xfs/btrfs/zfs eventually, but for now I'm on ext4.
>>
>> Also attached is my chatlog from asking about this in #gluster
>>
>> [1]: fstab line is: fearless1:/gv0 /gv0 nfs
>> defaults,sync,tcp,wsize=8192,rsize=8192 0 0
>> [2]: yes, I've turned off dir_index to avoid That Bug. I've run the d_off
>> test, results are here: http://pastebin.com/zQt5gZnZ
>>
>> ----
>> gluster> volume info gv0
>>
>> Volume Name: gv0
>> Type: Distributed-Replicate
>> Volume ID: 20117b48-7f88-4f16-9490-a0349afacf71
>> Status: Started
>> Number of Bricks: 8 x 2 = 16
>> Transport-type: tcp
>> Bricks:
>> Brick1: fearless1:/export/bricks/500117310007a6d8/glusterdata
>> Brick2: fearless2:/export/bricks/500117310007a674/glusterdata
>> Brick3: fearless1:/export/bricks/500117310007a714/glusterdata
>> Brick4: fearless2:/export/bricks/500117310007a684/glusterdata
>> Brick5: fearless1:/export/bricks/500117310007a7dc/glusterdata
>> Brick6: fearless2:/export/bricks/500117310007a694/glusterdata
>> Brick7: fearless1:/export/bricks/500117310007a7e4/glusterdata
>> Brick8: fearless2:/export/bricks/500117310007a720/glusterdata
>> Brick9: fearless1:/export/bricks/500117310007a7ec/glusterdata
>> Brick10: fearless2:/export/bricks/500117310007a74c/glusterdata
>> Brick11: fearless1:/export/bricks/500117310007a838/glusterdata
>> Brick12: fearless2:/export/bricks/500117310007a814/glusterdata
>> Brick13: fearless1:/export/bricks/500117310007a850/glusterdata
>> Brick14: fearless2:/export/bricks/500117310007a84c/glusterdata
>> Brick15: fearless1:/export/bricks/500117310007a858/glusterdata
>> Brick16: fearless2:/export/bricks/500117310007a8f8/glusterdata
>> Options Reconfigured:
>> diagnostics.count-fop-hits: on
>> diagnostics.latency-measurement: on
>> nfs.disable: off
>> ----
>>
>> --
>> Michael Brown               | `One of the main causes of the fall of
>> Systems Consultant          | the Roman Empire was that, lacking zero,
>> Net Direct Inc.             | they had no way to indicate successful
>> ☎: +1 519 883 1172 x5106    | termination of their C programs.' - Firth
>>
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at nongnu.org
>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>
>>
>