[Gluster-devel] Issue with posix locks

Xavi Hernandez xhernandez at redhat.com
Mon Apr 1 08:53:23 UTC 2019


On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri <skoduri at redhat.com> wrote:

>
>
> On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:
> >
> >
> > On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <skoduri at redhat.com> wrote:
> >
> >
> >
> >     On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> >      > Hi all,
> >      >
> >      > there is one potential problem with posix locks when used in a
> >      > replicated or dispersed volume.
> >      >
> >      > Some background:
> >      >
> >      > Posix locks allow any process to lock a region of a file multiple
> >      > times, but a single unlock on a given region will release all
> >      > previous locks. Locked regions can be different for each lock
> >      > request and they can overlap. The resulting lock will cover the
> >      > union of all locked regions. A single unlock (the region doesn't
> >      > necessarily need to match any of the ranges used for locking) will
> >      > create a "hole" in the currently locked region, independently of
> >      > how many times a lock request covered that region.
> >      >
> >      > For this reason, the locks xlator simply combines the locked
> >      > regions that are requested, but it doesn't track each individual
> >      > lock range.
> >      >
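> >      > As a quick illustration of this semantic (a self-contained
> >      > sketch using the standard fcntl() interface, nothing Gluster
> >      > specific):
> >      >
> >      >     #include <fcntl.h>
> >      >     #include <unistd.h>
> >      >
> >      >     static int set_lock(int fd, short type, off_t start, off_t len)
> >      >     {
> >      >         struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
> >      >                             .l_start = start, .l_len = len };
> >      >         return fcntl(fd, F_SETLK, &fl);
> >      >     }
> >      >
> >      >     int main(void)
> >      >     {
> >      >         int fd = open("testfile", O_RDWR | O_CREAT, 0644);
> >      >
> >      >         set_lock(fd, F_WRLCK, 5, 11); /* lock bytes 5-15 */
> >      >         set_lock(fd, F_WRLCK, 1, 7);  /* lock 1-7; merged: 1-15 */
> >      >
> >      >         /* A single unlock of 2-3 punches a hole in the merged
> >      >            region, no matter how many lock requests covered those
> >      >            bytes. The locked ranges left are 1-1 and 4-15. */
> >      >         set_lock(fd, F_UNLCK, 2, 2);
> >      >
> >      >         close(fd);
> >      >         return 0;
> >      >     }
> >      >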
> >      > Under normal circumstances this works fine. But there are some
> >      > cases where this behavior is not sufficient. For example, suppose
> >      > we have a replica 3 volume with quorum = 2. Given the special
> >      > nature of posix locks, AFR sends the lock request sequentially to
> >      > each one of the bricks, to avoid a situation where conflicting
> >      > lock requests from other clients force the client that hasn't got
> >      > enough successful locks (i.e. quorum) to unlock an already locked
> >      > region. An unlock here would not only cancel the current lock
> >      > request, it would also cancel any previously acquired lock.
> >      >
> >
> >     I may not have fully understood, please correct me. AFAIU, the lk
> >     xlator merges locks only if both the lk-owner and the client opaque
> >     match.
> >
> >     In the case which you have mentioned above, considering clientA
> >     acquired locks on a majority of the quorum (say nodeA and nodeB) and
> >     clientB on nodeC alone, clientB now has to unlock/cancel the lock it
> >     acquired on nodeC.
> >
> >     You are saying that it could pose a problem if there were already
> >     successful locks taken by clientB for the same region, which would
> >     get unlocked by this particular unlock request... right?
> >
> >     Assuming the previous locks acquired by clientB are shared (otherwise
> >     clientA wouldn't have been granted a lock for the same region on
> >     nodeA & nodeB), they would still hold true on nodeA & nodeB, as the
> >     unlock request was sent only to nodeC. Since the majority of quorum
> >     nodes still hold clientB's locks, this isn't a serious issue IMO.
> >
> >     I haven't looked into the heal part, but would like to understand
> >     whether this is really an issue in normal scenarios as well.
> >
> >
> > This is how I understood the code. Consider the following case:
> > nodes A, B and C have locks with start and end offsets 5-15 from
> > mount-1 and 2-3 from mount-2. Now suppose mount-1 requests a
> > nonblocking lock with range 1-7 and, in parallel, mount-2 issues an
> > unlock of 2-3.
> >
> > nodeA gets the unlock from mount-2 (range 2-3) first and then the lock
> > from mount-1 (range 1-7), so the lock is granted and merged to give
> > 1-15. nodeB gets the lock from mount-1 (range 1-7) before the unlock
> > of 2-3, which leads to EAGAIN. That triggers unlocks of the already
> > granted locks in mount-1, which ends up doing an unlock of 1-7 on
> > nodeA, leaving range 8-15 instead of the original 5-15 on nodeA,
> > whereas nodeB and nodeC will have range 5-15.
> >
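> > (As a side note, the EAGAIN part is plain posix behavior; a minimal
> > two-process sketch, nothing Gluster specific: a nonblocking F_SETLK
> > on a range still held by another owner fails immediately.)
> >
> >     #include <errno.h>
> >     #include <fcntl.h>
> >     #include <stdio.h>
> >     #include <string.h>
> >     #include <sys/wait.h>
> >     #include <unistd.h>
> >
> >     int main(void)
> >     {
> >         struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
> >                             .l_start = 2, .l_len = 2 };
> >         int fd = open("testfile", O_RDWR | O_CREAT, 0644);
> >
> >         if (fork() == 0) {      /* child plays mount-2: locks 2-3 */
> >             int cfd = open("testfile", O_RDWR);
> >             fcntl(cfd, F_SETLK, &fl);
> >             sleep(2);           /* holds the lock for a while */
> >             _exit(0);
> >         }
> >
> >         sleep(1);               /* parent plays mount-1: tries 1-7 */
> >         fl.l_start = 1;
> >         fl.l_len = 7;
> >         if (fcntl(fd, F_SETLK, &fl) < 0)
> >             printf("lock failed: %s\n", strerror(errno)); /* EAGAIN/EACCES */
> >
> >         wait(NULL);
> >         return 0;
> >     }
> >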
> > Let me know if my understanding is wrong.
>
> Both of us mentioned the same points. So in the example you gave,
> mount-1 lost its previous lock on nodeA, but the majority of the quorum
> (nodeB and nodeC) still has the previous lock (range 5-15) intact. So
> this shouldn't ideally lead to any issues, as other conflicting locks
> are blocked or failed by the majority of the nodes (provided there are
> no brick dis/re-connects).
>

But brick disconnects will happen (upgrades, disk failures, server
maintenance, ...). Anyway, even without brick disconnects, in the previous
example we have nodeA with range 8-15, and nodes B and C with range 5-15.
If another lock from mount-2 comes for range 5-7, it will succeed on nodeA,
but it will block on nodeB. At this point, mount-1 could attempt a lock on
the same range. It will block on nodeA, so we have a deadlock.

In general, having discrepancies between bricks is not good because sooner
or later it will cause some bad inconsistency.


> Wrt brick disconnects/re-connects, if we can get general lock healing
> support (not getting into implementation details atm), that should take
> care of correcting the lock range on nodeA as well, right?
>

The problem we have seen is that, to be able to correctly heal currently
acquired locks on brick reconnect, there are cases where we need to release
a lock that has already been granted (because the current owner doesn't
have enough quorum and a just-recovered connection tries to claim/heal it).
In this case we need to deal with locks that have already been merged, but
without interfering with other existing locks that already have quorum.


> That said, I am not suggesting that we should stick to the existing
> behavior; I am just trying to get clarification to check whether we can
> avoid any overhead/side-effects of maintaining multiple locks.
>

Right now this is the only way we have found to provide a correct solution,
both for some cases of concurrent lock/unlock requests and for lock healing.

Regards,

Xavi


> Thanks,
> Soumya
>
>
> >
> >
> >     Thanks,
> >     Soumya
> >
> >      > However, when something goes wrong (a brick dies during a lock
> >      > request, or there's a network partition or some other weird
> >      > situation), it could happen that, even using sequential locking,
> >      > the lock request succeeds on only one brick. In this case, AFR
> >      > should cancel the previous lock (and it does), but this also
> >      > cancels any previously acquired lock on that region, which is
> >      > not good.
> >      >
> >      > A similar thing can happen if we try to recover (heal) posix
> >      > locks that were active after a brick has been disconnected (for
> >      > any reason) and then reconnected.
> >      >
> >      > To fix all these situations we need to change the way posix
> >      > locks are managed by the locks xlator. One possibility would be
> >      > to embed the lock request inside an inode transaction using
> >      > inodelk. Since inodelks do not suffer this problem, the following
> >      > posix lock could be sent safely. However, this implies an
> >      > additional network request, which could cause some performance
> >      > impact. Eager-locking could minimize the impact in some cases,
> >      > but this approach won't work for lock recovery after a
> >      > disconnect.
> >      >
> >      > Another possibility is to send a special partial posix lock
> >      > request which won't be immediately merged with already existing
> >      > locks once granted. An additional confirmation request of the
> >      > partial posix lock will be required to fully grant the current
> >      > lock and merge it with the existing ones. This requires a new
> >      > network request, which will add latency, and makes everything
> >      > more complex since there would be more combinations of states in
> >      > which something could fail.
> >      >
> >      > So I think one possible solution would be the following:
> >      >
> >      > 1. Keep each posix lock as an independent object in locks xlator.
> >      > This will make it possible to "invalidate" any already granted
> >      > lock without affecting already established locks.
> >      >
> >      > 2. Additionally, we'll keep a sorted list of non-overlapping
> >      > segments of locked regions, and we'll count, for each segment,
> >      > how many locks are referencing it. One lock can reference
> >      > multiple segments, and each segment can be referenced by multiple
> >      > locks (see the sketch after this list).
> >      >
> >      > 3. An additional lock request that overlaps with an existing
> >      > segment can cause this segment to be split to satisfy the
> >      > non-overlapping property.
> >      >
> >      > 4. When an unlock request is received, all segments intersecting
> >      > with the region are eliminated (it may require some segment
> >      > splits on the edges), and the unlocked region is subtracted from
> >      > each lock associated to the segment. If a lock gets an empty
> >      > region, it's removed.
> >      >
> >      > 5. We'll create a special "remove lock" request that doesn't
> >      > unlock a region but removes an already granted lock. This will
> >      > decrease the number of references to each of the segments this
> >      > lock was covering. If some segment reaches 0, it's removed;
> >      > otherwise it remains there. This special request will only be
> >      > used internally, to cancel already acquired locks that cannot be
> >      > fully granted due to quorum issues or any other problem.
> >      >
> >      > In some weird cases, the list of segments can be huge (many
> >      > locks overlapping only on a single byte, so each segment
> >      > represents only one byte). We can try to find some smarter
> >      > structure that minimizes this problem, or limit the number of
> >      > segments (for example returning ENOLCK when there are too many).
> >      >
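> >      > To make points 2-5 a bit more concrete, here is a rough sketch
> >      > of the segment list (hypothetical names and shapes, not a
> >      > working patch):
> >      >
> >      >     #include <stdint.h>
> >      >     #include <stdlib.h>
> >      >     #include <sys/types.h>
> >      >
> >      >     /* Point 2: sorted list of non-overlapping segments, each
> >      >        counting how many locks reference it. */
> >      >     struct segment {
> >      >         off_t start;          /* first byte (inclusive) */
> >      >         off_t end;            /* last byte (inclusive) */
> >      >         uint32_t refs;        /* locks referencing this segment */
> >      >         struct segment *next; /* next segment, sorted by offset */
> >      >     };
> >      >
> >      >     /* Point 3: split 'seg' at offset 'at' so that a new lock
> >      >        boundary can fall between segments. Both halves inherit
> >      >        the refcount. NULL check elided for brevity. */
> >      >     static struct segment *split(struct segment *seg, off_t at)
> >      >     {
> >      >         struct segment *right = calloc(1, sizeof(*right));
> >      >
> >      >         right->start = at;
> >      >         right->end = seg->end;
> >      >         right->refs = seg->refs;
> >      >         right->next = seg->next;
> >      >         seg->end = at - 1;
> >      >         seg->next = right;
> >      >         return right;
> >      >     }
> >      >
> >      >     /* Point 5: a "remove lock" request only drops references;
> >      >        a segment disappears when its count reaches 0. */
> >      >     static void segment_unref(struct segment *seg)
> >      >     {
> >      >         if (--seg->refs == 0) {
> >      >             /* unlink from the list and free (elided) */
> >      >         }
> >      >     }
> >      >
> >      > Having both halves of a split inherit the refcount keeps the
> >      > invariant that every lock always covers a set of whole segments.
> >      >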
> >      > What do you think?
> >      >
> >      > Xavi
> >      >
> >
> >
> >
> > --
> > Pranith
>

