[Gluster-devel] Brick synchronization on unlink/rmdir/rename

Amar Tumballi atumball at redhat.com
Fri May 25 08:28:02 UTC 2018


Thanks for the detailed email Xavi!! Helps a lot to clarify about the
intention of the patch.

Raghavendra / Nithya / Susant,

Considering the patch surely needs one of your review to make sure DHT
locking is not impacted due to the patch, can you review the patch?

As per the locking code itself, Kruthika actually did take a look. Xavi and
me are waiting on one of your reviews to go forward on this. Others, more
reviews on the patch is helpful too.

-Amar

On Fri, May 18, 2018 at 12:21 PM, Xavi Hernandez <xhernandez at redhat.com>
wrote:

> Hi all,
>
> tl;dr: to solve a bug [1] I've written a patch [2] that needs some hacks
> to prevent deadlocks. I think the approach is quite good and maybe we
> should check if the hacks can be eliminated by making use of the new
> behavior in DHT and other xlators.
>
> Now a more detailed explanation:
>
> As part of debugging an issue [1] for EC where directories got in a bad
> state when there were concurrent renames and mkdirs, I found that the
> problem really came from the fact that DHT does additional operations after
> creating a directory, and it were these operations what was conflicting
> with renames. This is not really a problem in DHT, but it served to detect
> the issue.
>
> In fact, any concurrent modification of an entry while it's being removed
> can cause this bug. Normally, if you modify a file that is being deleted,
> the modification could fail (depending on the state it can fail with
> multiple errors, even EIO), but the file will be deleted anyway, so no big
> problem here. The problem is more important in the scenario described in
> the bug (though unlikely to be a real use case).
>
> The root cause of the problem is that entry modification operations take
> locks on the entry itself, while entry creation/deletion take locks on the
> parent directory. This allows that both operations arrive at brick level in
> parallel, and we can have some bricks executing them in one order and other
> bricks executing them in the reverse order.
>
> In the case of EC, having an entry disappearing in the middle of another
> operation can be very bad. It can disappear in multiple stages:
>
> 1. While acquiring locks. It's not much bad, but can return unwanted
> errors (EIO)
> 2. While requesting size and version information. This is very bad and
> will surely lead to EIO errors (instead of ESTALE that would be the most
> useful error) or mark healthy bricks as bad.
> 3. While the main operation is executed. In this case we can even have
> data corruption (though the file will be deleted, but it's more important
> for directories).
> 4. While updating backend info. Many combinations are possible here
> depending on what failed and on how many bricks.
>
> To prevent this I've taken the following approach [2]:
>
> 1. Take into account ESTALE errors when acquiring locks, so that EC is
> able to report ESTALE instead of EIO on most cases.
> 2. Once a lock is acquired, the locks xlator will prevent the entry from
> being deleted until locks are released again, giving a lot of stability to
> client side.
>
> I've tested this approach and it works pretty well. On heavy contention of
> modifications and removals, the client side sees completely stable and
> consistent updates, which is very good.
>
> The problem I've seen is that it's quite easy to get into deadlocks with
> this approach. The main reason for the deadlocks is how DHT tries to solve
> the same or similar problems from DHT level (I've found this document [3]
> that explains them).
>
> To prevent deadlocks I've had to completely ignore locks coming from
> special PIDs (< 0), but I've also had to ignore locks from domain
> "dht.file.migrate". Otherwise I have no way to prevent deadlocks on
> concurrent renames to the same destination from multiple clients.
>
> I don't like to use these hacks in the code because they tend to be hard
> to maintain and prone to errors. Do you think it makes sense to see if some
> of the DHT synchronization methods can be simplified by making use of this
> feature ? with some changes, we would probably be able to remove the hacks.
>
> Other ideas on how to implement this without delaying removes (thus
> preventing deadlocks) are also welcome.
>
> Xavi
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1578405
> [2] https://review.gluster.org/20025
> [3] https://review.gluster.org/16876
>



-- 
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20180525/4f8a6fde/attachment.html>


More information about the Gluster-devel mailing list