[Gluster-devel] Brick synchronization on unlink/rmdir/rename

Fri May 25 09:10:25 UTC 2018

On Fri, May 25, 2018 at 1:58 PM, Amar Tumballi <atumball at redhat.com> wrote:

> Thanks for the detailed email Xavi!! Helps a lot to clarify about the
> intention of the patch.
>
> Raghavendra / Nithya / Susant,
>
> Considering the patch surely needs one of your review to make sure DHT
> locking is not impacted due to the patch, can you review the patch?
>
> As per the locking code itself, Kruthika actually did take a look. Xavi
> and me are waiting on one of your reviews to go forward on this. Others,
> more reviews on the patch is helpful too.
>

Me and Xavi have been interacting on this - mainly for me to understand the
issues he ran into. I'll do a review.

> -Amar
>
> On Fri, May 18, 2018 at 12:21 PM, Xavi Hernandez <xhernandez at redhat.com>
> wrote:
>
>> Hi all,
>>
>> tl;dr: to solve a bug [1] I've written a patch [2] that needs some hacks
>> to prevent deadlocks. I think the approach is quite good and maybe we
>> should check if the hacks can be eliminated by making use of the new
>> behavior in DHT and other xlators.
>>
>> Now a more detailed explanation:
>>
>> As part of debugging an issue [1] for EC where directories got in a bad
>> state when there were concurrent renames and mkdirs, I found that the
>> problem really came from the fact that DHT does additional operations after
>> creating a directory, and it were these operations what was conflicting
>> with renames. This is not really a problem in DHT, but it served to detect
>> the issue.
>>
>> In fact, any concurrent modification of an entry while it's being removed
>> can cause this bug. Normally, if you modify a file that is being deleted,
>> the modification could fail (depending on the state it can fail with
>> multiple errors, even EIO), but the file will be deleted anyway, so no big
>> problem here. The problem is more important in the scenario described in
>> the bug (though unlikely to be a real use case).
>>
>> The root cause of the problem is that entry modification operations take
>> locks on the entry itself, while entry creation/deletion take locks on the
>> parent directory. This allows that both operations arrive at brick level in
>> parallel, and we can have some bricks executing them in one order and other
>> bricks executing them in the reverse order.
>>
>> In the case of EC, having an entry disappearing in the middle of another
>> operation can be very bad. It can disappear in multiple stages:
>>
>> 1. While acquiring locks. It's not much bad, but can return unwanted
>> errors (EIO)
>> 2. While requesting size and version information. This is very bad and
>> will surely lead to EIO errors (instead of ESTALE that would be the most
>> useful error) or mark healthy bricks as bad.
>> 3. While the main operation is executed. In this case we can even have
>> data corruption (though the file will be deleted, but it's more important
>> for directories).
>> 4. While updating backend info. Many combinations are possible here
>> depending on what failed and on how many bricks.
>>
>> To prevent this I've taken the following approach [2]:
>>
>> 1. Take into account ESTALE errors when acquiring locks, so that EC is
>> able to report ESTALE instead of EIO on most cases.
>> 2. Once a lock is acquired, the locks xlator will prevent the entry from
>> being deleted until locks are released again, giving a lot of stability to
>> client side.
>>
>> I've tested this approach and it works pretty well. On heavy contention
>> of modifications and removals, the client side sees completely stable and
>> consistent updates, which is very good.
>>
>> The problem I've seen is that it's quite easy to get into deadlocks with
>> this approach. The main reason for the deadlocks is how DHT tries to solve
>> the same or similar problems from DHT level (I've found this document [3]
>> that explains them).
>>
>> To prevent deadlocks I've had to completely ignore locks coming from
>> special PIDs (< 0), but I've also had to ignore locks from domain
>> "dht.file.migrate". Otherwise I have no way to prevent deadlocks on
>> concurrent renames to the same destination from multiple clients.
>>
>> I don't like to use these hacks in the code because they tend to be hard
>> to maintain and prone to errors. Do you think it makes sense to see if some
>> of the DHT synchronization methods can be simplified by making use of this
>> feature ? with some changes, we would probably be able to remove the hacks.
>>
>> Other ideas on how to implement this without delaying removes (thus
>> preventing deadlocks) are also welcome.
>>
>> Xavi
>>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1578405
>> [2] https://review.gluster.org/20025
>> [3] https://review.gluster.org/16876
>>
>
>
>
> --
> Amar Tumballi (amarts)
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20180525/44103267/attachment-0001.html>