[Gluster-devel] Eager-lock and nfs graph generation

Pranith Kumar K pkarampu at redhat.com
Wed Feb 20 02:11:40 UTC 2013

On 02/20/2013 07:03 AM, Anand Avati wrote:
> On Tue, Feb 19, 2013 at 5:12 PM, Anand Avati <anand.avati at gmail.com> wrote:
>     On Tue, Feb 19, 2013 at 3:59 AM, Pranith Kumar K
>     <pkarampu at redhat.com> wrote:
>         On 02/19/2013 11:26 AM, Anand Avati wrote:
>>         Thinking over this, looks like there is a problem!
>>         Write-behind guarantees: That a second write request arriving
>>         after the acknowledgement of a first overlapping request
>>         (whether written-behind or otherwise) will be guaranteed to
>>         be fulfilled in the backend in the same order (i.e., the
>>         second overlapping request will be "serialized" behind the
>>         first one in the fulfillment process)
>>         Eager-lock requirement: That write-behind will send no two
>>         write requests on an overlapping region at the same time.
>>         The requirement-set and guarantee-set have a big overlap, but
>>         the requirement-set is not a subset.
>>         This is because of O_SYNC writes. Write-behind performs
>>         write serialization at fulfillment only for written-behind
>>         requests (which get covered under the conflict detection code
>>         during liability fulfillment). However, if two threads (or
>>         apps) issue overlapping O_SYNC writes to the same region at
>>         approximately the same time, then write-behind will let both
>>         of them through to eager-lock without any kind of
>>         serialization, violating the assumption!
>>         I'm wondering if it is a safer idea to implement overlap
>>         checks within eager-lock code itself rather than depend on
>>         write-behind :|
>>         Avati
>>         On Mon, Feb 11, 2013 at 10:07 PM, Anand Avati
>>         <anand.avati at gmail.com> wrote:
>>             On Mon, Feb 11, 2013 at 9:32 PM, Pranith Kumar K
>>             <pkarampu at redhat.com> wrote:
>>                 hi,
>>                 Please note that this is a case in theory and I did
>>                 not run into such situation, but I feel it is
>>                 important to address this.
>>                 A configuration with "eager-lock on" and "write-behind
>>                 off" should not be allowed, as it leads to lock
>>                 synchronization problems which lead to data
>>                 inconsistency among replicas in NFS.
>>                 Let's say bricks b1 and b2 are in replication.
>>                 The Gluster NFS server uses one anonymous fd to perform
>>                 all write-fops. If eager-lock is enabled in afr, the
>>                 fd's address is used as the lock-owner, which will be
>>                 the same for all write-fops, so there will never be any
>>                 inodelk contention. If write-behind is disabled,
>>                 there can be writes that overlap. (Does NFS make
>>                 sure that the ranges don't overlap?)
>>                 Now imagine the following scenario:
>>                 Let's say w1 and w2 are two write fops on the same offset
>>                 and length, w1 with all '0's and w2 with all '1's. If
>>                 these two write fops are executed in two different
>>                 threads, the order of arrival of the write fops on b1 can
>>                 be w1, w2 whereas on b2 it is w2, w1, leading to data
>>                 inconsistency between the two replicas. The lock
>>                 contention will not happen, as both the lk-owner and the
>>                 transport are the same for these two fops.
>>             Write-behind has two functions: a) performing operations
>>             in the background and b) serializing overlapping operations.
>>             While the problem does exist, the specifics are different
>>             from what you describe. Since all writes coming in from
>>             NFS will always use the same anonymous FD, two
>>             near-in-time/overlapping writes will never contend with
>>             inodelk() but instead the second write will inherit the
>>             lock and changelog from the first. In either case, it is
>>             a problem.
>>                 We can add a check in glusterd for volume set to
>>                 disallow such configuration, BUT by default
>>                 write-behind is off in nfs graph and by default
>>                 eager-lock is on. So we should either turn on
>>                 write-behind for nfs or turn off eager-lock by default.
>>                 Could you please suggest how to proceed with this if
>>                 you agree that I did not miss any important detail
>>                 that makes this theory invalid.
>>             Loading the write-behind xlator in the NFS graph looks
>>             like the simpler solution. Eager-locking is crucial for
>>             replicated NFS write performance.
>>             Avati
>         Shall we disable eager-lock for files opened with O_SYNC, for now?
>     Bad news: the problem is slightly worse than just this. Even with
>     non-O_SYNC writes, there is a window in write-behind: if a second
>     overlapping write request arrives so close to the first that its
>     wb_enqueue() happens after the first write's wb_enqueue() but
>     before any unwind() following it (i.e. wb_inode->gen is not
>     bumped), then the two write requests can be wound down together
>     to eager lock.
> But this has a simple fix - http://review.gluster.org/4550. Disabling 
> eager-locking for O_SYNC files is a bad idea. We absolutely want 
> eager-locking for O_SYNC files. Thinking more..
> Avati
Why is disabling eager-lock for O_SYNC files a bad idea? It is
acceptable to sacrifice a bit of performance for O_SYNC, isn't it?
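
For reference, the w1/w2 scenario quoted above can be sketched in a few
lines of Python. This is only an illustrative model, not GlusterFS code:
the Write class and the overlaps() and apply_in_order() helpers are names
made up for this sketch, and each brick is modelled as a plain byte buffer.
It shows that two overlapping writes in flight together can leave b1 and b2
with different contents, and the kind of range-overlap test that either
write-behind or eager-lock would need in order to serialize them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Write:
    """Hypothetical stand-in for a write fop: `data` lands at byte `offset`."""
    name: str
    offset: int
    data: bytes

    @property
    def end(self) -> int:
        # Exclusive end of the byte range touched by this write.
        return self.offset + len(self.data)


def overlaps(a: Write, b: Write) -> bool:
    """True when the half-open ranges [offset, end) of a and b intersect."""
    return a.offset < b.end and b.offset < a.end


def apply_in_order(size: int, writes: List[Write]) -> bytes:
    """Apply writes to a fresh brick of `size` bytes in arrival order."""
    brick = bytearray(b"." * size)
    for w in writes:
        brick[w.offset:w.end] = w.data
    return bytes(brick)


w1 = Write("w1", 0, b"0000")
w2 = Write("w2", 0, b"1111")

# Both writes are in flight together, so each brick may see its own order.
b1 = apply_in_order(4, [w1, w2])  # b1 receives w1, then w2
b2 = apply_in_order(4, [w2, w1])  # b2 receives w2, then w1
print("b1:", b1, "b2:", b2, "replicas agree:", b1 == b2)

# An overlap check flags the pair, so the second write could be held back
# until the first completes; every brick would then see the same order.
print("must serialize:", overlaps(w1, w2))
```

With the second write held back until the first has completed on both
bricks, b1 and b2 would both apply w1 and then w2 and end up identical.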
