[Gluster-devel] Issue with posix locks
Xavi Hernandez
xhernandez at redhat.com
Mon Apr 1 08:53:23 UTC 2019
On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri <skoduri at redhat.com> wrote:
>
>
> On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:
> >
> >
> > On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <skoduri at redhat.com> wrote:
> >
> >
> >
> > On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> > > Hi all,
> > >
> > > there is one potential problem with posix locks when used in a
> > > replicated or dispersed volume.
> > >
> > > Some background:
> > >
> > > Posix locks allow any process to lock a region of a file multiple
> > > times, but a single unlock on a given region will release all
> > > previous locks. Locked regions can be different for each lock
> > > request and they can overlap. The resulting lock will cover the
> > > union of all locked regions. A single unlock (the region doesn't
> > > necessarily need to match any of the ranges used for locking) will
> > > create a "hole" in the currently locked region, independently of
> > > how many times a lock request covered that region.
> > >
> > > For this reason, the locks xlator simply combines the locked
> > > regions that are requested, but it doesn't track each individual
> > > lock range.
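As an aside, a minimal sketch of the semantics described above, using
the standard fcntl() interface (the helper and the offsets are just an
illustration, not code from the locks xlator):

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Take or release a POSIX byte-range lock on [start, start+len). */
    static int lock_range(int fd, short type, off_t start, off_t len)
    {
        struct flock fl = {
            .l_type   = type,      /* F_WRLCK, F_RDLCK or F_UNLCK */
            .l_whence = SEEK_SET,
            .l_start  = start,
            .l_len    = len,
        };
        return fcntl(fd, F_SETLK, &fl);
    }

    /* lock_range(fd, F_WRLCK, 0, 10);  locks bytes 0-9               */
    /* lock_range(fd, F_WRLCK, 5, 10);  kernel merges it into 0-14    */
    /* lock_range(fd, F_UNLCK, 4, 3);   a single unlock leaves 0-3    */
    /*                                  and 7-14; both earlier locks  */
    /*                                  are gone in the 4-6 "hole"    */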
> > >
> > > Under normal circumstances this works fine. But there are some
> > > cases where this behavior is not sufficient. For example, suppose
> > > we have a replica 3 volume with quorum = 2. Given the special
> > > nature of posix locks, AFR sends the lock request sequentially to
> > > each one of the bricks, to avoid the situation where conflicting
> > > lock requests from other clients would force unlocking an already
> > > locked region on a client that has not yet got enough successful
> > > locks (i.e. quorum). Such an unlock would not only cancel the
> > > current lock request; it would also cancel any previously acquired
> > > lock.
> > >
> >
> > I may not have fully understood, please correct me. AFAIU, lk xlator
> > merges locks only if both the lk-owner and the client opaque match.
> >
> > In the case you have mentioned above, suppose clientA acquired locks
> > on a majority of the quorum (say nodeA and nodeB) and clientB on
> > nodeC alone; clientB now has to unlock/cancel the lock it acquired
> > on nodeC.
> >
> > You are saying that it could pose a problem if there were already
> > successful locks taken by clientB for the same region which would
> > get unlocked by this particular unlock request... right?
> >
> > Assuming the previous locks acquired by clientB are shared (otherwise
> > clientA wouldn't have been granted a lock for the same region on
> > nodeA & nodeB), they would still hold true on nodeA & nodeB as the
> > unlock request was sent only to nodeC. Since the majority of quorum
> > nodes still hold the locks by clientB, this isn't a serious issue
> > IMO.
> >
> > I haven't looked into the heal part but would like to understand if
> > this is really an issue in normal scenarios as well.
> >
> >
> > This is how I understood the code. Consider the following case:
> > nodes A, B and C have locks with start and end offsets 5-15 from
> > mount-1 and lock-range 2-3 from mount-2.
> > Suppose mount-1 requests a nonblocking lock with lock-range 1-7 and,
> > in parallel, mount-2 issues an unlock of 2-3 as well.
> >
> > nodeA got the unlock from mount-2 with range 2-3, then the lock from
> > mount-1 with range 1-7, so the lock is granted and merged to give
> > 1-15.
> > nodeB got the lock from mount-1 with range 1-7 before the unlock of
> > 2-3, which leads to EAGAIN. That triggers unlocks of the granted
> > locks for mount-1, which ends up doing an unlock of 1-7 on nodeA,
> > leaving lock-range 8-15 instead of the original 5-15 on nodeA,
> > whereas nodeB and nodeC will have range 5-15.
> >
> > Let me know if my understanding is wrong.
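As an aside, a hypothetical helper that makes the arithmetic above
explicit: subtracting an unlock range from a lock range that has
already been merged (a sketch only; the locks xlator works on its own
structures, not on this type):

    #include <sys/types.h>

    struct range { off_t start, end; };   /* inclusive, as in the example */

    /* Subtract an overlapping unlock 'u' from a merged lock 'r'.
     * Returns how many pieces remain (0, 1 or 2). */
    static int range_subtract(struct range r, struct range u,
                              struct range out[2])
    {
        int n = 0;
        if (u.start > r.start)
            out[n++] = (struct range){ r.start, u.start - 1 };
        if (u.end < r.end)
            out[n++] = (struct range){ u.end + 1, r.end };
        return n;
    }

    /* nodeA:   merged 1-15, then unlock 1-7 -> a single piece 8-15      */
    /* nodeB/C: the 1-7 lock failed (EAGAIN), so the original 5-15 stays */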
>
> Both of us mentioned the same points. So in the example you gave,
> mount-1 lost its previous lock on nodeA but the majority of the quorum
> (nodeB and nodeC) still have the previous lock (range 5-15) intact. So
> ideally this shouldn't lead to any issues, as other conflicting locks
> are blocked or failed by the majority of the nodes (provided there are
> no brick dis/re-connects).
>
But brick disconnects will happen (upgrades, disk failures, server
maintenance, ...). Anyway, even without brick disconnects, in the previous
example we have nodeA with range 8-15, and nodes B and C with range 5-15.
If another lock from mount-2 comes for range 5-7, it will succeed on nodeA,
but it will block on nodeB. At this point, mount-1 could attempt a lock on
the same range. It will block on nodeA, so we have a deadlock.
In general, having discrepancies between bricks is not good because sooner
or later it will cause some bad inconsistency.
> Wrt brick disconnects/re-connects, if we can get lock healing support
> in general (not getting into implementation details atm), that should
> take care of correcting the lock range on nodeA as well, right?
>
The problem we have seen is that to be able to correctly heal currently
acquired locks on brick reconnect, there are cases where we need to release
a lock that has already been granted (because the current owner doesn't
have enough quorum and a just recovered connection tries to claim/heal it).
In this case we need to deal with locks that have already been merged, but
without interfering with other existing locks that already have quorum.
> That said, I am not suggesting that we should stick to the existing
> behavior; I am just trying to get clarification to check if we can
> avoid any overhead/side effects of maintaining multiple locks.
>
Right now this is the only way we have found to provide a correct solution,
both for some cases of concurrent lock/unlock requests and for lock healing.
Regards,
Xavi
> Thanks,
> Soumya
>
>
> >
> >
> > Thanks,
> > Soumya
> >
> > > However, when something goes wrong (a brick dies during a lock
> > > request, or there's a network partition or some other weird
> > > situation), it could happen that, even using sequential locking,
> > > only one brick succeeds the lock request. In this case, AFR should
> > > cancel the previous lock (and it does), but this also cancels any
> > > previously acquired lock on that region, which is not good.
> > >
> > > A similar thing can happen if we try to recover (heal) posix locks
> > > that were active after a brick has been disconnected (for any
> > > reason) and then reconnected.
> > >
> > > To fix all these situations we need to change the way posix locks
> > > are managed by the locks xlator. One possibility would be to embed
> > > the lock request inside an inode transaction using inodelk. Since
> > > inodelks do not suffer this problem, the following posix lock could
> > > be sent safely. However, this implies an additional network
> > > request, which could cause some performance impact. Eager-locking
> > > could minimize the impact in some cases. However, this approach
> > > won't work for lock recovery after a disconnect.
> > >
> > > Another possibility is to send a special partial posix lock request
> > > which won't be immediately merged with already existing locks once
> > > granted. An additional confirmation request of the partial posix
> > > lock will be required to fully grant the current lock and merge it
> > > with the existing ones. This requires a new network request, which
> > > will add latency, and makes everything more complex since there
> > > would be more combinations of states in which something could fail.
> > >
> > > So I think one possible solution would be the following:
> > >
> > > 1. Keep each posix lock as an independent object in the locks
> > > xlator. This will make it possible to "invalidate" any already
> > > granted lock without affecting already established locks.
> > >
> > > 2. Additionally, we'll keep a sorted list of non-overlapping
> > > segments of locked regions. And we'll count, for each segment, how
> > > many locks are referencing it. One lock can reference multiple
> > > segments, and each segment can be referenced by multiple locks.
> > >
> > > 3. An additional lock request that overlaps with an existing
> > > segment can cause this segment to be split to satisfy the
> > > non-overlapping property.
> > >
> > > 4. When an unlock request is received, all segments intersecting
> > > with the region are eliminated (it may require some segment splits
> > > on the edges), and the unlocked region is subtracted from each lock
> > > associated with the segment. If a lock ends up with an empty
> > > region, it's removed.
> > >
> > > 5. We'll create a special "remove lock" request that doesn't unlock
> > > a region but removes an already granted lock. This will decrease
> > > the number of references to each of the segments this lock was
> > > covering. If some segment's count reaches 0, it's removed;
> > > otherwise it remains there. This special request will only be used
> > > internally to cancel already acquired locks that cannot be fully
> > > granted due to quorum issues or any other problem.
> > >
> > > In some weird cases, the list of segments can be huge (many locks
> > > overlapping only on a single byte, so each segment represents only
> > > one byte). We can try to find some smarter structure that minimizes
> > > this problem, or limit the number of segments (for example,
> > > returning ENOLCK when there are too many).
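As an aside, a rough sketch of the data structures this proposal
implies (hypothetical names, not actual locks xlator code):

    #include <sys/types.h>

    /* One object per granted lock request (point 1). */
    struct posix_lock {
        struct posix_lock *next;
        void              *lk_owner;   /* owner + client identify the lock */
        void              *client;
        off_t              start, end; /* region still covered by this lock */
    };

    /* Non-overlapping segments of locked regions, kept sorted by start
     * (point 2). */
    struct lock_segment {
        struct lock_segment *next;
        off_t                start, end;
        int                  refcount; /* number of posix_locks covering it */
    };

    /* - A new lock overlapping a segment splits it so that segments
     *   never overlap (point 3).
     * - An unlock drops every intersecting segment (splitting at the
     *   edges) and subtracts the region from each referencing lock;
     *   locks left with an empty region are freed (point 4).
     * - A "remove lock" request frees one posix_lock and decrements the
     *   refcount of the segments it covered; a segment is freed only
     *   when its refcount reaches 0, so locks from other owners on the
     *   same region are left untouched (point 5). */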
> > >
> > > What do you think ?
> > >
> > > Xavi
> > >
> >
> >
> >
> > --
> > Pranith
>