[Gluster-devel] Eager-lock and nfs graph generation

Anand Avati anand.avati at gmail.com
Wed Feb 20 06:23:14 UTC 2013


Please check http://review.gluster.org/4551. This should fix all the known
write-behind/eager-lock interaction gaps. On top of this patch, you can now
set a bit in the 'flags' of the writev fop coming out of write-behind, and
look for it in AFR to be sure that you have the 'protection layer' of
write-behind offering coverage against concurrent writes. With this you can
actually eliminate all the glusterd/volgen crud of implementing
dependencies between the two options.
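
Roughly, the idea is along these lines (all names here are invented for
illustration; the patch has the actual bit and checks):

/* write-behind, while winding a writev down: mark it as having passed
 * through write-behind's conflict detection (hypothetical flag) */
#define WB_WRITE_CHECKED (1 << 30)

        flags |= WB_WRITE_CHECKED;
        STACK_WIND (frame, wb_writev_cbk, FIRST_CHILD (this),
                    FIRST_CHILD (this)->fops->writev,
                    fd, vector, count, offset, flags, iobref, xdata);

/* AFR, while deciding whether eager-lock is safe for this write:
 * trust write-behind's protection only when the mark is present
 * (hypothetical field names) */
        local->eager_lock_on = priv->eager_lock &&
                               (flags & WB_WRITE_CHECKED);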

Avati

On Tue, Feb 19, 2013 at 7:20 PM, Anand Avati <anand.avati at gmail.com> wrote:

>
>
> On Tue, Feb 19, 2013 at 6:11 PM, Pranith Kumar K <pkarampu at redhat.com> wrote:
>
>>  On 02/20/2013 07:03 AM, Anand Avati wrote:
>>
>>
>>
>> On Tue, Feb 19, 2013 at 5:12 PM, Anand Avati <anand.avati at gmail.com> wrote:
>>
>>>
>>>
>>>  On Tue, Feb 19, 2013 at 3:59 AM, Pranith Kumar K <pkarampu at redhat.com> wrote:
>>>
>>>>   On 02/19/2013 11:26 AM, Anand Avati wrote:
>>>>
>>>> Thinking this over, it looks like there is a problem!
>>>>
>>>> Write-behind's guarantee: a second write request arriving after the
>>>> acknowledgement of a first overlapping request (whether written-behind or
>>>> otherwise) will be fulfilled in the backend in the same order (i.e., the
>>>> second overlapping request will be "serialized" behind the first one in
>>>> the fulfillment process).
>>>>
>>>> Eager-lock's requirement: write-behind will send no two write requests
>>>> on an overlapping region at the same time.
>>>>
>>>> The requirement-set and guarantee-set have a big overlap, but the
>>>> requirement-set is not a subset of the guarantee-set.
>>>>
>>>> This is because of O_SYNC writes. Write-behind performs
>>>> write-serialization at fulfillment only for written-behind requests
>>>> (which get covered under the conflict detection code during liability
>>>> fulfillment). However, if two threads (or apps) issue overlapping O_SYNC
>>>> writes to the same region at approximately the same time, then
>>>> write-behind will let both of them go by without any kind of
>>>> serialization, into eager-lock, violating the assumptions!
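>>>>
>>>> A self-contained sketch of the failing pattern (the mount path is
>>>> illustrative):
>>>>
>>>> #include <fcntl.h>
>>>> #include <pthread.h>
>>>> #include <string.h>
>>>> #include <unistd.h>
>>>>
>>>> static int fd;
>>>>
>>>> static void *
>>>> writer (void *arg)
>>>> {
>>>>         char buf[4096];
>>>>
>>>>         /* one thread writes all '0's, the other all '1's */
>>>>         memset (buf, *(char *) arg, sizeof (buf));
>>>>
>>>>         /* same offset, same length: overlapping O_SYNC writes.
>>>>          * Neither is written behind, so write-behind's conflict
>>>>          * detection never serializes them and both can reach
>>>>          * eager-lock concurrently. */
>>>>         pwrite (fd, buf, sizeof (buf), 0);
>>>>         return NULL;
>>>> }
>>>>
>>>> int
>>>> main (void)
>>>> {
>>>>         pthread_t t1, t2;
>>>>         char      c1 = '0', c2 = '1';
>>>>
>>>>         fd = open ("/mnt/gluster/file", O_WRONLY | O_CREAT | O_SYNC,
>>>>                    0644);
>>>>         pthread_create (&t1, NULL, writer, &c1);
>>>>         pthread_create (&t2, NULL, writer, &c2);
>>>>         pthread_join (t1, NULL);
>>>>         pthread_join (t2, NULL);
>>>>         close (fd);
>>>>         return 0;
>>>> }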
>>>>
>>>> I'm wondering if it is a safer idea to implement overlap checks within
>>>> the eager-lock code itself rather than depend on write-behind :|
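>>>>
>>>> If we go that route, the check itself is simple, something along these
>>>> lines (illustrative types, not actual AFR code):
>>>>
>>>> #include <sys/types.h>
>>>>
>>>> struct wr_region {
>>>>         off_t  off;
>>>>         size_t len;
>>>> };
>>>>
>>>> static int
>>>> regions_overlap (struct wr_region *a, struct wr_region *b)
>>>> {
>>>>         return (a->off < b->off + (off_t) b->len) &&
>>>>                (b->off < a->off + (off_t) a->len);
>>>> }
>>>>
>>>> /* before letting a write piggyback on the current eager lock, check
>>>>  * it against every write still in flight; on overlap, hold it back
>>>>  * until the earlier write unwinds */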
>>>>
>>>> Avati
>>>>
>>>> On Mon, Feb 11, 2013 at 10:07 PM, Anand Avati <anand.avati at gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>>  On Mon, Feb 11, 2013 at 9:32 PM, Pranith Kumar K <pkarampu at redhat.com> wrote:
>>>>>
>>>>>>  Hi,
>>>>>> Please note that this is a theoretical case and I did not run into such
>>>>>> a situation, but I feel it is important to address it.
>>>>>> A configuration with "eager-lock on" and "write-behind off" should not
>>>>>> be allowed, as it leads to lock synchronization problems which cause
>>>>>> data inconsistency among replicas in NFS.
>>>>>> Let's say bricks b1 and b2 are in replication.
>>>>>> The Gluster NFS server uses one anonymous fd to perform all write fops.
>>>>>> If eager-lock is enabled in AFR, the fd's address is used as the
>>>>>> lk-owner, which will be the same for all write fops, so there will
>>>>>> never be any inodelk contention. If write-behind is disabled, there can
>>>>>> be writes that overlap. (Does NFS make sure that the ranges don't overlap?)
>>>>>>
>>>>>> Now imagine the following scenario:
>>>>>> Let's say w1 and w2 are two write fops on the same offset and length,
>>>>>> w1 with all '0's and w2 with all '1's. If these two write fops are
>>>>>> executed in two different threads, the order of arrival on b1 can be
>>>>>> w1, w2 whereas on b2 it is w2, w1, leading to data inconsistency between
>>>>>> the two replicas. The lock contention will not happen, as both the
>>>>>> lk-owner and the transport are the same for these two fops.
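>>>>>>
>>>>>> (For illustration, with eager-lock AFR derives the owner from the fd
>>>>>> pointer, roughly:
>>>>>>
>>>>>>         set_lk_owner_from_ptr (&frame->root->lk_owner, fd);
>>>>>>
>>>>>> so every write through the one anonymous fd carries the same
>>>>>> lk-owner.)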
>>>>>>
>>>>>
>>>>>  Write-behind has two functions: a) performing operations in the
>>>>> background, and b) serializing overlapping operations.
>>>>>
>>>>>  While the problem does exist, the specifics are different from what
>>>>> you describe. Since all writes coming in from NFS will always use the
>>>>> same anonymous fd, two near-in-time/overlapping writes will never
>>>>> contend with inodelk(); instead, the second write will inherit the lock
>>>>> and changelog from the first. In either case, it is a problem.
>>>>>
>>>>>
>>>>>>  We can add a check in glusterd's volume-set handling to disallow such
>>>>>> a configuration, BUT by default write-behind is off in the NFS graph
>>>>>> and eager-lock is on. So we should either turn on write-behind for NFS
>>>>>> or turn off eager-lock by default.
>>>>>>
>>>>>> Could you please suggest how to proceed with this, if you agree that I
>>>>>> did not miss any important detail that would make this theory invalid?
>>>>>>
>>>>>
>>>>>  Loading the write-behind xlator in the NFS graph looks like the
>>>>> simpler solution; eager-locking is crucial for replicated NFS write
>>>>> performance.
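>>>>>
>>>>>  i.e., volgen would simply add a write-behind volume to the nfs
>>>>> volfile, something like (volume names illustrative):
>>>>>
>>>>> volume testvol-write-behind
>>>>>     type performance/write-behind
>>>>>     subvolumes testvol-replicate-0
>>>>> end-volume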
>>>>>
>>>>>  Avati
>>>>>
>>>>
>>>>   Shall we disable eager-lock for files opened with O_SYNC, for now?
>>>>
>>>
>>>   Bad news: the problem is slightly worse than just this. Even with
>>> non-O_SYNC writes, there is a window in write-behind: if a second
>>> overlapping write request comes so close to the first that wb_enqueue()
>>> of the second one happens after wb_enqueue() of the first write, but
>>> before any unwind() after the first wb_enqueue() (i.e., wb_inode->gen is
>>> not bumped), then the two write requests can be wound down together into
>>> eager-lock.
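>>>
>>>  In other words (simplified, using the same names as the code):
>>>
>>>         /* wb_enqueue() stamps each request with the inode's
>>>          * current generation number: */
>>>         req->gen = wb_inode->gen;
>>>
>>>         /* wb_inode->gen is bumped only on an unwind, and ordering
>>>          * is enforced only across generations; two overlapping
>>>          * writes enqueued before any intervening unwind share a
>>>          * gen and are wound down together. */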
>>>
>>>
>>  But this has a simple fix - http://review.gluster.org/4550. Disabling
>> eager-locking for O_SYNC files is a bad idea; we absolutely want
>> eager-locking for O_SYNC files. Thinking more...
>>
>>  Avati
>>
>> Why is disabling eager-lock for O_SYNC files a bad idea? It is acceptable
>> to sacrifice a bit of performance for O_SYNC, isn't it?
>>
>
>  s/bit/quite a bit/. For O_SYNC writes, eager-locking is the only saving
> grace in performance, as write-behind stays out of the way completely. We
> would need overlap checks in either AFR or write-behind for O_SYNC writes.
>
> Avati
>