[Gluster-devel] Eager-lock and nfs graph generation

Pranith Kumar K pkarampu at redhat.com
Tue Feb 26 06:50:27 UTC 2013

On 02/20/2013 11:53 AM, Anand Avati wrote:
> Please check http://review.gluster.org/4551. This should fix all the 
> known write-behind/eager-lock interaction gaps. On top of this patch, 
> you can now set a bit in the 'flags' of writev fop coming out of 
> write-behind, and look for it in AFR to be sure that you have the 
> 'protection layer' of write-behind offering coverage against 
> concurrent writes. With this you can actually eliminate all the 
> glusterd/volgen crud of implementing dependencies between the two 
> options.
> Avati
The flags parameter in writev comes from the fuse/nfs xlators. Is it OK if 
we use xdata instead of flags to convey that write-behind took care of 

> On Tue, Feb 19, 2013 at 7:20 PM, Anand Avati <anand.avati at gmail.com>
> wrote:
>     On Tue, Feb 19, 2013 at 6:11 PM, Pranith Kumar K
>     <pkarampu at redhat.com> wrote:
>         On 02/20/2013 07:03 AM, Anand Avati wrote:
>>         On Tue, Feb 19, 2013 at 5:12 PM, Anand Avati
>>         <anand.avati at gmail.com> wrote:
>>             On Tue, Feb 19, 2013 at 3:59 AM, Pranith Kumar K
>>             <pkarampu at redhat.com> wrote:
>>                 On 02/19/2013 11:26 AM, Anand Avati wrote:
>>>                 Thinking over this, looks like there is a problem!
>>>                 Write-behind guarantees: That a second write request
>>>                 arriving after the acknowledgement of a first
>>>                 overlapping request (whether written-behind or
>>>                 otherwise) will be guaranteed to be fulfilled in the
>>>                 backend in the same order (i.e., the second
>>>                 overlapping request will be "serialized" behind the
>>>                 first one in the fulfillment process).
>>>                 Eager-lock requirement: That write-behind will send
>>>                 no two write requests on an overlapping region at
>>>                 the same time.
>>>                 The requirement-set and guarantee-set have a big
>>>                 overlap, but the requirement-set is not a subset.
>>>                 This is because of O_SYNC writes. write-behind
>>>                 performs write-serialization at fulfillment only for
>>>                 written behind requests (which get covered under the
>>>                 conflict detection code during liability
>>>                 fulfillment). However, if two threads (or apps)
>>>                 issue overlapping O_SYNC writes to the same region
>>>                 at approximately the same time, then write-behind
>>>                 will let both of them go by without any kind of
>>>                 serialization, into eager lock, violating the
>>>                 assumptions!
>>>                 I'm wondering if it is a safer idea to implement
>>>                 overlap checks within eager-lock code itself rather
>>>                 than depend on write-behind :|
>>>                 Avati
>>>                 On Mon, Feb 11, 2013 at 10:07 PM, Anand Avati
>>>                 <anand.avati at gmail.com> wrote:
>>>                     On Mon, Feb 11, 2013 at 9:32 PM, Pranith Kumar K
>>>                     <pkarampu at redhat.com> wrote:
>>>                         hi,
>>>                         Please note that this is a case in theory
>>>                         and I did not run into such situation, but I
>>>                         feel it is important to address this.
>>>                         A configuration with "eager-lock on" and
>>>                         "write-behind off" should not be allowed, as
>>>                         it leads to lock synchronization problems
>>>                         which lead to data inconsistency among
>>>                         replicas in nfs.
>>>                         Let's say bricks b1 and b2 are in replication.
>>>                         Gluster Nfs server uses 1 anonymous fd to
>>>                         perform all write-fops. If eager-lock is
>>>                         enabled in afr, the fd's address is used as
>>>                         the lock-owner, which will be the same for
>>>                         all write-fops, so there will never be any
>>>                         inodelk contention. If write-behind is
>>>                         disabled, there can be writes that overlap.
>>>                         (Does nfs make sure that the ranges don't
>>>                         overlap?)
>>>                         Now imagine the following scenario:
>>>                         Let's say w1 and w2 are two write fops on the
>>>                         same offset and length, w1 with all '0's and
>>>                         w2 with all '1's. If these two write fops are
>>>                         executed in two different threads, the order
>>>                         of arrival of the write fops on b1 can be
>>>                         w1, w2, whereas on b2 it is w2, w1, leading
>>>                         to data inconsistency between the two
>>>                         replicas. Lock contention will not happen, as
>>>                         both the lk-owner and the transport are the
>>>                         same for these two fops.
>>>                     Write-behind has two functions: a) performing
>>>                     operations in the background and b) serializing
>>>                     overlapping operations.
>>>                     While the problem does exist, the specifics are
>>>                     different from what you describe. Since all
>>>                     writes coming in from NFS will always use the
>>>                     same anonymous FD, two near-in-time/overlapping
>>>                     writes will never contend with inodelk() but
>>>                     instead the second write will inherit the lock
>>>                     and changelog from the first. In either case, it
>>>                     is a problem.
>>>                         We can add a check in glusterd for volume
>>>                         set to disallow such configuration, BUT by
>>>                         default write-behind is off in nfs graph and
>>>                         by default eager-lock is on. So we should
>>>                         either turn on write-behind for nfs or turn
>>>                         off eager-lock by default.
>>>                         Could you please suggest how to proceed with
>>>                         this if you agree that I did not miss any
>>>                         important detail that makes this theory invalid.
>>>                     Loading the write-behind xlator in the NFS
>>>                     graph looks like a simpler solution.
>>>                     eager-locking is crucial for replicated NFS
>>>                     write performance.
>>>                     Avati
>>                 Shall we disable eager-lock for files opened with
>>                 O_SYNC, for now?
>>             Bad news: the problem is slightly worse than just this.
>>             Even with non-O_SYNC writes, there is a window in
>>             write-behind: if a second overlapping write request
>>             comes so close to the first request that wb_enqueue()
>>             of the second one happens after wb_enqueue() of the first
>>             write, but before any unwind() after the first
>>             wb_enqueue() (i.e. wb_inode->gen is not bumped), then the
>>             two write requests can be wound down together to eager lock.
>>         But this has a simple fix - http://review.gluster.org/4550.
>>         Disabling eager-locking for O_SYNC files is a bad idea. We
>>         absolutely want eager-locking for O_SYNC files. Thinking more..
>>         Avati
>         Why is disabling eager-lock for O_SYNC files a bad idea? It is
>         acceptable to sacrifice a bit of performance for O_SYNC, isn't it?
>     s/bit/quite a bit/. For O_SYNC writes, eager locking is the only
>     saving grace in performance as write-behind stays out of the way
>     completely. We would need overlap checks either in AFR or
>     write-behind for O_SYNC writes.
>     Avati
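
To make the "overlap checks" idea from the thread concrete, here is a
minimal sketch in Python rather than Gluster's C. It is not the AFR or
write-behind implementation. The names Write and OverlapGate are
hypothetical and exist only for this illustration. The sketch assumes
half-open byte ranges and shows one way to guarantee that no two
overlapping writes are in flight at the same time, while preserving
arrival order among writes that overlap.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Write:
    """Hypothetical stand-in for a write fop: a name and a byte range."""
    name: str
    offset: int
    length: int

    def overlaps(self, other: "Write") -> bool:
        # Half-open ranges [offset, offset + length) intersect.
        return (self.offset < other.offset + other.length
                and other.offset < self.offset + self.length)


@dataclass
class OverlapGate:
    """Hypothetical helper: holds back writes that overlap earlier ones."""
    in_flight: list = field(default_factory=list)
    waiting: list = field(default_factory=list)

    def submit(self, w: Write) -> bool:
        """Return True if w may be wound now, False if it was queued."""
        # Queued writes are checked too, so a later write cannot jump
        # ahead of an earlier overlapping write that is still waiting.
        if any(w.overlaps(o) for o in self.in_flight + self.waiting):
            self.waiting.append(w)
            return False
        self.in_flight.append(w)
        return True

    def complete(self, w: Write) -> list:
        """Mark w as done and return queued writes that can now be wound."""
        self.in_flight.remove(w)
        released, still_waiting = [], []
        for q in self.waiting:
            # q stays blocked by in-flight writes and by earlier queued
            # writes that are themselves still blocked.
            if any(q.overlaps(o) for o in self.in_flight + still_waiting):
                still_waiting.append(q)
            else:
                self.in_flight.append(q)
                released.append(q)
        self.waiting = still_waiting
        return released
```

With w1 and w2 covering the same offset and length, submit(w1) returns
True and submit(w2) returns False; w2 is only released by complete(w1),
so both replicas would see w1 before w2. A non-overlapping write is
wound immediately and is not delayed by either of them.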
