[Gluster-devel] Eager-lock and nfs graph generation
Pranith Kumar K
pkarampu at redhat.com
Tue Feb 26 06:50:27 UTC 2013
On 02/20/2013 11:53 AM, Anand Avati wrote:
> Please check http://review.gluster.org/4551. This should fix all the
> known write-behind/eager-lock interaction gaps. On top of this patch,
> you can now set a bit in the 'flags' of writev fop coming out of
> write-behind, and look for it in AFR to be sure that you have the
> 'protection layer' of write-behind offering coverage against
> concurrent writes. With this you can actually eliminate all the
> glusterd/volgen crud of implementing dependencies between the two
> options.
>
> Avati
>
The flags parameter in writev comes from the fuse/nfs xlators. Is it
OK if we use xdata instead of flags to convey that write-behind took
care of overlaps?
Pranith
> On Tue, Feb 19, 2013 at 7:20 PM, Anand Avati <anand.avati at gmail.com
> <mailto:anand.avati at gmail.com>> wrote:
>
>
>
> On Tue, Feb 19, 2013 at 6:11 PM, Pranith Kumar K
> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote:
>
> On 02/20/2013 07:03 AM, Anand Avati wrote:
>>
>>
>> On Tue, Feb 19, 2013 at 5:12 PM, Anand Avati
>> <anand.avati at gmail.com <mailto:anand.avati at gmail.com>> wrote:
>>
>>
>>
>> On Tue, Feb 19, 2013 at 3:59 AM, Pranith Kumar K
>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote:
>>
>> On 02/19/2013 11:26 AM, Anand Avati wrote:
>>>
>>> Thinking over this, looks like there is a problem!
>>>
>>> Write-behind's guarantee: a second write request
>>> arriving after the acknowledgement of a first
>>> overlapping request (whether written behind or
>>> otherwise) will be fulfilled in the backend in the
>>> same order (i.e., the second overlapping request is
>>> "serialized" behind the first in the fulfillment
>>> process).
>>>
>>> Eager-lock's requirement: write-behind will send no
>>> two write requests on an overlapping region at the
>>> same time.
>>>
>>> The requirement-set and guarantee-set overlap
>>> heavily, but the requirement-set is not a subset of
>>> the guarantee-set.
>>>
>>> This is because of O_SYNC writes. Write-behind
>>> performs write serialization at fulfillment only for
>>> written-behind requests (which are covered by the
>>> conflict-detection code during liability
>>> fulfillment). However, if two threads (or apps)
>>> issue overlapping O_SYNC writes to the same region
>>> at approximately the same time, write-behind will
>>> let both of them pass through without any kind of
>>> serialization, into eager lock, violating the
>>> assumptions!
>>>
>>> I'm wondering if it is a safer idea to implement
>>> overlap checks within eager-lock code itself rather
>>> than depend on write-behind :|
>>>
>>> Avati
>>>
>>>
>>> On Mon, Feb 11, 2013 at 10:07 PM, Anand Avati
>>> <anand.avati at gmail.com
>>> <mailto:anand.avati at gmail.com>> wrote:
>>>
>>>
>>>
>>> On Mon, Feb 11, 2013 at 9:32 PM, Pranith Kumar K
>>> <pkarampu at redhat.com
>>> <mailto:pkarampu at redhat.com>> wrote:
>>>
>>> hi,
>>> Please note that this is a theoretical case
>>> and I have not run into such a situation, but I
>>> feel it is important to address it.
>>> The configuration "eager-lock on" with
>>> "write-behind off" should not be allowed, as
>>> it leads to lock synchronization problems
>>> which lead to data inconsistency among
>>> replicas in nfs.
>>> Let's say bricks b1, b2 are in replication.
>>> The Gluster NFS server uses one anonymous fd to
>>> perform all write fops. If eager-lock is
>>> enabled in afr, the fd's address is used as
>>> the lock-owner, which will be the same for all
>>> write fops, so there will never be any
>>> inodelk contention. If write-behind is
>>> disabled, there can be writes that overlap.
>>> (Does nfs make sure that the ranges don't
>>> overlap?)
>>>
>>> Now imagine the following scenario:
>>> let's say w1, w2 are two write fops on the same
>>> offset and length, w1 with all '0's and w2
>>> with all '1's. If these two write fops are
>>> executed in two different threads, the order
>>> of arrival of the write fops on b1 can be w1, w2
>>> whereas on b2 it is w2, w1, leading to data
>>> inconsistency between the two replicas. The
>>> lock contention will not happen, as both the
>>> lk-owner and transport are the same for these two fops.
>>>
>>>
>>> Write-behind has two functions: (a) performing
>>> operations in the background and (b) serializing
>>> overlapping operations.
>>>
>>> While the problem does exist, the specifics are
>>> different from what you describe. Since all
>>> writes coming in from NFS always use the
>>> same anonymous fd, two near-in-time overlapping
>>> writes will never contend via inodelk();
>>> instead, the second write will inherit the lock
>>> and changelog from the first. In either case, it
>>> is a problem.
>>>
>>> We can add a volume-set check in glusterd to
>>> disallow such a configuration, BUT by
>>> default write-behind is off in the nfs graph and
>>> by default eager-lock is on. So we should
>>> either turn on write-behind for nfs or turn
>>> off eager-lock by default.
>>>
>>> Could you please suggest how to proceed with
>>> this, if you agree that I have not missed any
>>> important detail that would invalidate this theory.
>>>
>>>
>>> Loading the write-behind xlator in the NFS
>>> graph looks like the simpler solution.
>>> Eager-locking is crucial for replicated NFS
>>> write performance.
>>>
>>> Avati
>>>
>>>
>> Shall we disable eager-lock for files opened with
>> O_SYNC, for now?
>>
>>
>> Bad news: the problem is slightly worse than just this.
>> Even with non-O_SYNC writes, there is a window in
>> write-behind: if a second overlapping write request
>> comes so close behind the first that wb_enqueue()
>> of the second happens after wb_enqueue() of the first
>> write but before any unwind() following the first
>> wb_enqueue() (i.e., wb_inode->gen is not bumped), then the
>> two write requests can be wound down together to eager lock.
>>
>>
>> But this has a simple fix - http://review.gluster.org/4550.
>> Disabling eager-locking for O_SYNC files is a bad idea. We
>> absolutely want eager-locking for O_SYNC files. Thinking more..
>>
>> Avati
> Why is disabling eager-lock for O_SYNC files a bad idea? It is
> acceptable to sacrifice a bit of performance for O_SYNC, isn't it?
>
>
> s/bit/quite a bit/. For O_SYNC writes, eager locking is the only
> saving grace in performance as write-behind stays out of the way
> completely. We would need overlap checks either in AFR or
> write-behind for O_SYNC writes.
>
> Avati
>
>