[Gluster-devel] Issue with posix locks

Fri Mar 29 18:25:52 UTC 2019

Hi all,

there is one potential problem with posix locks when used in a replicated
or dispersed volume.

Some background:

Posix locks allow any process to lock a region of a file multiple times,
but a single unlock on a given region will release all previous locks.
Locked regions can be different for each lock request and they can overlap.
The resulting lock will cover the union of all locked regions. A single
unlock (the region doesn't necessarily need to match any of the ranges used
for locking) will create a "hole" in the currently locked region,
independently of how many times a lock request covered that region.

For this reason, the locks xlator simply combines the locked regions that
are requested, but it doesn't track each individual lock range.

Under normal circumstances this works fine. But there are some cases where
this behavior is not sufficient. For example, suppose we have a replica 3
volume with quorum = 2. Given the special nature of posix locks, AFR sends
the lock request sequentially to each one of the bricks, to avoid that
conflicting lock requests from other clients could require to unlock an
already locked region on the client that has not got enough successful
locks (i.e. quorum). An unlock here not only would cancel the current lock
request. It would also cancel any previously acquired lock.

However, when something goes wrong (a brick dies during a lock request, or
there's a network partition or some other weird situation), it could happen
that even using sequential locking, only one brick succeeds the lock
request. In this case, AFR should cancel the previous lock (and it does),
but this also cancels any previously acquired lock on that region, which is
not good.

A similar thing can happen if we try to recover (heal) posix locks that
were active after a brick has been disconnected (for any reason) and then
reconnected.

To fix all these situations we need to change the way posix locks are
managed by locks xlator. One possibility would be to embed the lock request
inside an inode transaction using inodelk. Since inodelks do not suffer
this problem, the follwing posix lock could be sent safely. However this
implies an additional network request, which could cause some performance
impact. Eager-locking could minimize the impact in some cases. However this
approach won't work for lock recovery after a disconnect.

Another possibility is to send a special partial posix lock request which
won't be immediately merged with already existing locks once granted. An
additional confirmation request of the partial posix lock will be required
to fully grant the current lock and merge it with the existing ones. This
requires a new network request, which will add latency, and makes
everything more complex since there would be more combinations of states in
which something could fail.

So I think one possible solution would be the following:

1. Keep each posix lock as an independent object in locks xlator. This will
make it possible to "invalidate" any already granted lock without affecting
already established locks.

2. Additionally, we'll keep a sorted list of non-overlapping segments of
locked regions. And we'll count, for each region, how many locks are
referencing it. One lock can reference multiple segments, and each segment
can be referenced by multiple locks.

3. An additional lock request that overlaps with an existing segment, can
cause this segment to be split to satisfy the non-overlapping property.

4. When an unlock request is received, all segments intersecting with the
region are eliminated (it may require some segment splits on the edges),
and the unlocked region is subtracted from each lock associated to the
segment. If a lock gets an empty region, it's removed.

5. We'll create a special "remove lock" request that doesn't unlock a
region but removes an already granted lock. This will decrease the number
of references to each of the segments this lock was covering. If some
segment reaches 0, it's removed. Otherwise it remains there. This special
request will only be used internally to cancel already acquired locks that
cannot be fully granted due to quorum issues or any other problem.

In some weird cases, the list of segments can be huge (many locks
overlapping only on a single byte, so each segment represents only one
byte). We can try to find some smarter structure that minimizes this
problem or limit the number of segments (for example returning ENOLCK when
there are too many).

What do you think ?

Xavi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20190329/651e0fd0/attachment.html>