[Gluster-devel] Brick synchronization on unlink/rmdir/rename

Fri May 18 06:51:35 UTC 2018

Hi all,

tl;dr: to solve a bug [1] I've written a patch [2] that needs some hacks to
prevent deadlocks. I think the approach is quite good and maybe we should
check if the hacks can be eliminated by making use of the new behavior in
DHT and other xlators.

Now a more detailed explanation:

As part of debugging an issue [1] for EC where directories got in a bad
state when there were concurrent renames and mkdirs, I found that the
problem really came from the fact that DHT does additional operations after
creating a directory, and it were these operations what was conflicting
with renames. This is not really a problem in DHT, but it served to detect
the issue.

In fact, any concurrent modification of an entry while it's being removed
can cause this bug. Normally, if you modify a file that is being deleted,
the modification could fail (depending on the state it can fail with
multiple errors, even EIO), but the file will be deleted anyway, so no big
problem here. The problem is more important in the scenario described in
the bug (though unlikely to be a real use case).

The root cause of the problem is that entry modification operations take
locks on the entry itself, while entry creation/deletion take locks on the
parent directory. This allows that both operations arrive at brick level in
parallel, and we can have some bricks executing them in one order and other
bricks executing them in the reverse order.

In the case of EC, having an entry disappearing in the middle of another
operation can be very bad. It can disappear in multiple stages:

1. While acquiring locks. It's not much bad, but can return unwanted errors
(EIO)
2. While requesting size and version information. This is very bad and will
surely lead to EIO errors (instead of ESTALE that would be the most useful
error) or mark healthy bricks as bad.
3. While the main operation is executed. In this case we can even have data
corruption (though the file will be deleted, but it's more important for
directories).
4. While updating backend info. Many combinations are possible here
depending on what failed and on how many bricks.

To prevent this I've taken the following approach [2]:

1. Take into account ESTALE errors when acquiring locks, so that EC is able
to report ESTALE instead of EIO on most cases.
2. Once a lock is acquired, the locks xlator will prevent the entry from
being deleted until locks are released again, giving a lot of stability to
client side.

I've tested this approach and it works pretty well. On heavy contention of
modifications and removals, the client side sees completely stable and
consistent updates, which is very good.

The problem I've seen is that it's quite easy to get into deadlocks with
this approach. The main reason for the deadlocks is how DHT tries to solve
the same or similar problems from DHT level (I've found this document [3]
that explains them).

To prevent deadlocks I've had to completely ignore locks coming from
special PIDs (< 0), but I've also had to ignore locks from domain
"dht.file.migrate". Otherwise I have no way to prevent deadlocks on
concurrent renames to the same destination from multiple clients.

I don't like to use these hacks in the code because they tend to be hard to
maintain and prone to errors. Do you think it makes sense to see if some of
the DHT synchronization methods can be simplified by making use of this
feature ? with some changes, we would probably be able to remove the hacks.

Other ideas on how to implement this without delaying removes (thus
preventing deadlocks) are also welcome.

Xavi

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1578405
[2] https://review.gluster.org/20025
[3] https://review.gluster.org/16876
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20180518/bb89029a/attachment.html>