[Gluster-devel] Issue with posix locks
Xavi Hernandez
xhernandez at redhat.com
Mon Apr 1 07:54:22 UTC 2019
On Sun, Mar 31, 2019 at 7:59 PM Soumya Koduri <skoduri at redhat.com> wrote:
>
>
> On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> > Hi all,
> >
> > there is one potential problem with posix locks when used in a
> > replicated or dispersed volume.
> >
> > Some background:
> >
> > Posix locks allow a process to lock a region of a file multiple times,
> > but a single unlock on a given region releases all previous locks on it.
> > Locked regions can be different for each lock request and they can
> > overlap. The resulting lock covers the union of all locked regions.
> > A single unlock (whose region doesn't need to match any of the ranges
> > used for locking) creates a "hole" in the currently locked region,
> > independently of how many times a lock request covered that region.
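To illustrate this behaviour with plain fcntl() outside of Gluster, here is a
minimal standalone sketch in C (the file name and the lk() helper are made up
for the example): two overlapping write locks taken by the same process are
merged, and a single unlock of a sub-range punches a hole in the merged
region.

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Small helper around fcntl(F_SETLK) for byte-range locks. */
static int lk(int fd, short type, off_t start, off_t len)
{
    struct flock fl = {
        .l_type   = type,       /* F_WRLCK or F_UNLCK */
        .l_whence = SEEK_SET,
        .l_start  = start,
        .l_len    = len,
    };
    return fcntl(fd, F_SETLK, &fl);
}

int main(void)
{
    int fd = open("/tmp/lock-demo", O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return 1;

    lk(fd, F_WRLCK, 0, 100);    /* lock [0, 100) */
    lk(fd, F_WRLCK, 50, 100);   /* lock [50, 150); merged into [0, 150) */
    lk(fd, F_UNLCK, 40, 20);    /* one unlock of [40, 60) leaves only
                                 * [0, 40) and [60, 150) locked, even though
                                 * [50, 60) had been locked twice */
    close(fd);
    return 0;
}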
> >
> > For this reason, the locks xlator simply combines the locked regions
> > that are requested, but it doesn't track each individual lock range.
> >
> > Under normal circumstances this works fine, but there are some cases
> > where this behavior is not sufficient. For example, suppose we have a
> > replica 3 volume with quorum = 2. Given the special nature of posix
> > locks, AFR sends the lock request sequentially to each one of the
> > bricks, to avoid a situation where a conflicting lock request from
> > another client forces it to unlock an already locked region on a client
> > that has not obtained enough successful locks (i.e. quorum). Such an
> > unlock would not only cancel the current lock request; it would also
> > cancel any previously acquired lock on that region.
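A very simplified sketch in C of that failure mode, for clarity. This is not
AFR's actual code: lock_with_quorum, brick_lock and brick_unlock are
hypothetical names, and the stubs simulate only one brick granting the lock.
The point is that, as described above, the rollback is done with a normal
unlock, which on the bricks that did grant the lock also releases any lock
the same owner had previously acquired on an overlapping region.

#include <stdint.h>
#include <stdio.h>

#define NUM_BRICKS 3
#define QUORUM     2

/* Hypothetical stand-ins for the real client-side fops; here only
 * brick 0 grants the lock, so quorum is not reached. */
static int brick_lock(int brick, uint64_t start, uint64_t end)
{
    (void)start; (void)end;
    return (brick == 0) ? 0 : -1;
}

static int brick_unlock(int brick, uint64_t start, uint64_t end)
{
    (void)brick; (void)start; (void)end;
    return 0;
}

/* Try to lock [start, end) on every brick; roll back if quorum fails. */
static int lock_with_quorum(uint64_t start, uint64_t end)
{
    int granted[NUM_BRICKS] = {0};
    int success = 0;

    for (int i = 0; i < NUM_BRICKS; i++) {
        if (brick_lock(i, start, end) == 0) {
            granted[i] = 1;
            success++;
        }
    }

    if (success >= QUORUM)
        return 0;

    /* Rollback: a plain posix unlock is the only tool available, and on
     * each of these bricks it also destroys any previously acquired lock
     * of the same owner that overlaps [start, end). */
    for (int i = 0; i < NUM_BRICKS; i++) {
        if (granted[i])
            brick_unlock(i, start, end);
    }
    return -1;
}

int main(void)
{
    if (lock_with_quorum(0, 100) != 0)
        printf("quorum not reached: rolled back, older locks lost too\n");
    return 0;
}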
> >
>
> I may not have fully understood, please correct me. AFAIU, the lk xlator
> merges locks only if both the lk-owner and the client opaque match.
>
> In the case you have mentioned above, considering that clientA acquired
> locks on a majority of the quorum (say nodeA and nodeB) and clientB on
> nodeC alone, clientB now has to unlock/cancel the lock it acquired on
> nodeC.
>
> You are saying that it could pose a problem if there were already
> successful locks taken by clientB for the same region, which would get
> unlocked by this particular unlock request, right?
>
Yes
>
> Assuming the previous locks acquired by clientB are shared (otherwise
> clientA wouldn't have been granted a lock for the same region on nodeA &
> nodeB), they would still hold on nodeA & nodeB, as the unlock request
> was sent only to nodeC. Since a majority of the quorum nodes still hold
> clientB's locks, this isn't a serious issue IMO.
>
Partially true. But if one of nodeA or nodeB dies or gets disconnected,
there won't be a majority of bricks holding the correct locks, even though
two bricks are still alive. At this point, another client could successfully
acquire a lock that, in theory, is already held by someone else.
> I haven't looked into the heal part but would like to understand if this
> is really an issue in normal scenarios as well.
>
If we consider that a brick disconnection is a normal scenario (which I
think it should be on a large-scale distributed file system), then this
issue exists. But even without brick disconnections we can get incorrect
results, as Pranith has just explained.
Xavi
>
> Thanks,
> Soumya
>
> > However, when something goes wrong (a brick dies during a lock request,
> > or there's a network partition or some other weird situation), it could
> > happen that, even using sequential locking, only one brick grants the
> > lock request. In this case, AFR should cancel the lock it just acquired
> > (and it does), but this also cancels any previously acquired lock on
> > that region, which is not good.
> >
> > A similar thing can happen if we try to recover (heal) posix locks that
> > were active after a brick has been disconnected (for any reason) and
> > then reconnected.
> >
> > To fix all these situations we need to change the way posix locks are
> > managed by the locks xlator. One possibility would be to embed the lock
> > request inside an inode transaction using inodelk. Since inodelks do not
> > suffer this problem, the following posix lock could be sent safely.
> > However, this implies an additional network request, which could cause
> > some performance impact. Eager-locking could minimize the impact in some
> > cases. However, this approach won't work for lock recovery after a
> > disconnect.
> >
> > Another possibility is to send a special partial posix lock request
> > which won't be immediately merged with already existing locks once
> > granted. An additional confirmation request of the partial posix lock
> > will be required to fully grant the current lock and merge it with the
> > existing ones. This requires a new network request, which will add
> > latency, and makes everything more complex since there would be more
> > combinations of states in which something could fail.
> >
> > So I think one possible solution would be the following:
> >
> > 1. Keep each posix lock as an independent object in locks xlator. This
> > will make it possible to "invalidate" any already granted lock without
> > affecting already established locks.
> >
> > 2. Additionally, we'll keep a sorted list of non-overlapping segments of
> > locked regions, and we'll count, for each segment, how many locks are
> > referencing it. One lock can reference multiple segments, and each
> > segment can be referenced by multiple locks.
> >
> > 3. An additional lock request that overlaps with an existing segment
> > can cause this segment to be split to satisfy the non-overlapping
> > property.
> >
> > 4. When an unlock request is received, all segments intersecting with
> > the region are eliminated (this may require some segment splits at the
> > edges), and the unlocked region is subtracted from each lock associated
> > with the segment. If a lock ends up with an empty region, it's removed.
> >
> > 5. We'll create a special "remove lock" request that doesn't unlock a
> > region but removes an already granted lock. This will decrease the
> > reference count of each of the segments this lock was covering. If a
> > segment's count reaches 0, it's removed; otherwise it remains there.
> > This special request will only be used internally to cancel already
> > acquired locks that cannot be fully granted due to quorum issues or
> > any other problem.
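To make the proposal a bit more concrete, here is a rough, self-contained
sketch in C of the data layout and of the special "remove lock" operation
from point 5. All names (pl_segment, pl_lock, pl_remove_lock) are
hypothetical and heavily simplified (a real lock would also carry
lk-owner/client information and, after point 4, may end up covering several
disjoint ranges); it only illustrates the reference-counting idea, not the
actual locks xlator code.

#include <stdint.h>
#include <stdlib.h>

struct pl_segment {
    uint64_t start;               /* inclusive */
    uint64_t end;                 /* exclusive; segments never overlap */
    uint32_t refcount;            /* how many locks cover this segment */
    struct pl_segment *next;      /* list kept sorted by 'start' */
};

struct pl_lock {
    uint64_t start;               /* simplified: one contiguous range */
    uint64_t end;
};

/* Point 5: drop one lock's references on every segment it intersects,
 * without unlocking bytes that other locks still cover. A segment whose
 * refcount drops to zero is unlinked and freed, so a region becomes
 * unlocked only when no lock references it any more. */
static void pl_remove_lock(struct pl_segment **head, struct pl_lock *lock)
{
    struct pl_segment **pp = head;

    while (*pp != NULL) {
        struct pl_segment *seg = *pp;

        if (seg->end <= lock->start || seg->start >= lock->end) {
            pp = &seg->next;      /* no intersection with this lock */
            continue;
        }
        /* In this simplified sketch the lock's boundaries coincide with
         * segment boundaries (point 3 splits segments when locks are
         * added), so an intersecting segment is fully covered. */
        if (--seg->refcount == 0) {
            *pp = seg->next;      /* no other lock holds these bytes */
            free(seg);
        } else {
            pp = &seg->next;
        }
    }
    free(lock);
}

int main(void)
{
    /* Two locks covering [0, 100): after point 3 the segment list holds a
     * single segment with refcount 2. */
    struct pl_segment *head = calloc(1, sizeof(*head));
    head->start = 0;
    head->end = 100;
    head->refcount = 2;

    struct pl_lock *lk = calloc(1, sizeof(*lk));
    lk->start = 0;
    lk->end = 100;

    pl_remove_lock(&head, lk);    /* refcount 2 -> 1: region stays locked */
    free(head);
    return 0;
}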
> >
> > In some weird cases, the list of segments can be huge (many locks
> > overlapping only on a single byte, so each segment represents only one
> > byte). We can try to find some smarter structure that minimizes this
> > problem or limit the number of segments (for example returning ENOLCK
> > when there are too many).
> >
> > What do you think ?
> >
> > Xavi
> >