[Gluster-devel] Issue with posix locks

Mon Apr 1 08:15:21 UTC 2019

On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:
> 
> 
> On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <skoduri at redhat.com 
> <mailto:skoduri at redhat.com>> wrote:
> 
> 
> 
>     On 3/29/19 11:55 PM, Xavi Hernandez wrote:
>      > Hi all,
>      >
>      > there is one potential problem with posix locks when used in a
>      > replicated or dispersed volume.
>      >
>      > Some background:
>      >
>      > Posix locks allow any process to lock a region of a file multiple
>     times,
>      > but a single unlock on a given region will release all previous
>     locks.
>      > Locked regions can be different for each lock request and they can
>      > overlap. The resulting lock will cover the union of all locked
>     regions.
>      > A single unlock (the region doesn't necessarily need to match any
>     of the
>      > ranges used for locking) will create a "hole" in the currently
>     locked
>      > region, independently of how many times a lock request covered
>     that region.
>      >
>      > For this reason, the locks xlator simply combines the locked regions
>      > that are requested, but it doesn't track each individual lock range.
>      >
>      > Under normal circumstances this works fine. But there are some cases
>      > where this behavior is not sufficient. For example, suppose we
>     have a
>      > replica 3 volume with quorum = 2. Given the special nature of posix
>      > locks, AFR sends the lock request sequentially to each one of the
>      > bricks, to avoid that conflicting lock requests from other
>     clients could
>      > require to unlock an already locked region on the client that has
>     not
>      > got enough successful locks (i.e. quorum). An unlock here not
>     only would
>      > cancel the current lock request. It would also cancel any previously
>      > acquired lock.
>      >
> 
>     I may not have fully understood, please correct me. AFAIU, lk xlator
>     merges locks only if both the lk-owner and the client opaque matches.
> 
>     In the case which you have mentioned above, considering clientA
>     acquired
>     locks on majority of quorum (say nodeA and nodeB) and clientB on nodeC
>     alone- clientB now has to unlock/cancel the lock it acquired on nodeC.
> 
>     You are saying the it could pose a problem if there were already
>     successful locks taken by clientB for the same region which would get
>     unlocked by this particular unlock request..right?
> 
>     Assuming the previous locks acquired by clientB are shared (otherwise
>     clientA wouldn't have got granted lock for the same region on nodeA &
>     nodeB), they would still hold true on nodeA & nodeB  as the unlock
>     request was sent to only nodeC. Since the majority of quorum nodes
>     still
>     hold the locks by clientB, this isn't serious issue IMO.
> 
>     I haven't looked into heal part but would like to understand if this is
>     really an issue in normal scenarios as well.
> 
> 
> This is how I understood the code. Consider the following case:
> Nodes A, B, C have locks with start and end offsets: 5-15 from mount-1 
> and lock-range 2-3 from mount-2.
> If mount-1 requests nonblocking lock with lock-range 1-7 and in parallel 
> lets say mount-2 issued unlock of 2-3 as well.
> 
> nodeA got unlock from mount-2 with range 2-3 then lock from mount-1 with 
> range 1-7, so the lock is granted and merged to give 1-15
> nodeB got lock from mount-1 with range 1-7 before unlock of 2-3 which 
> leads to EAGAIN which will trigger unlocks on granted lock in mount-1 
> which will end up doing unlock of 1-7 on nodeA leading to lock-range 
> 8-15 instead of the original 5-15 on nodeA. Whereas nodeB and nodeC will 
> have range 5-15.
> 
> Let me know if my understanding is wrong.

Both of us mentioned the same points. So in the example you gave , 
mount-1 lost its previous lock on nodeA but majority of the quorum 
(nodeB and nodeC) still have the previous lock  (range: 5-15) intact. So 
this shouldn't ideally lead to any issues as other conflicting locks are 
blocked or failed by majority of the nodes (provided there are no brick 
dis/re-connects).

Wrt to brick disconnects/re-connects, if we can get in general lock 
healing (not getting into implementation details atm) support, that 
should take care of correcting lock range on nodeA as well right?

That said I am not suggesting that we should stick to existing behavior, 
just trying to get clarification to check if we can avoid any 
overhead/side-effects with maintaining multiple locks.

Thanks,
Soumya

> 
> 
>     Thanks,
>     Soumya
> 
>      > However, when something goes wrong (a brick dies during a lock
>     request,
>      > or there's a network partition or some other weird situation), it
>     could
>      > happen that even using sequential locking, only one brick
>     succeeds the
>      > lock request. In this case, AFR should cancel the previous lock
>     (and it
>      > does), but this also cancels any previously acquired lock on that
>      > region, which is not good.
>      >
>      > A similar thing can happen if we try to recover (heal) posix
>     locks that
>      > were active after a brick has been disconnected (for any reason) and
>      > then reconnected.
>      >
>      > To fix all these situations we need to change the way posix locks
>     are
>      > managed by locks xlator. One possibility would be to embed the lock
>      > request inside an inode transaction using inodelk. Since inodelks
>     do not
>      > suffer this problem, the follwing posix lock could be sent safely.
>      > However this implies an additional network request, which could
>     cause
>      > some performance impact. Eager-locking could minimize the impact
>     in some
>      > cases. However this approach won't work for lock recovery after a
>      > disconnect.
>      >
>      > Another possibility is to send a special partial posix lock request
>      > which won't be immediately merged with already existing locks once
>      > granted. An additional confirmation request of the partial posix
>     lock
>      > will be required to fully grant the current lock and merge it
>     with the
>      > existing ones. This requires a new network request, which will add
>      > latency, and makes everything more complex since there would be more
>      > combinations of states in which something could fail.
>      >
>      > So I think one possible solution would be the following:
>      >
>      > 1. Keep each posix lock as an independent object in locks xlator.
>     This
>      > will make it possible to "invalidate" any already granted lock
>     without
>      > affecting already established locks.
>      >
>      > 2. Additionally, we'll keep a sorted list of non-overlapping
>     segments of
>      > locked regions. And we'll count, for each region, how many locks are
>      > referencing it. One lock can reference multiple segments, and each
>      > segment can be referenced by multiple locks.
>      >
>      > 3. An additional lock request that overlaps with an existing
>     segment,
>      > can cause this segment to be split to satisfy the non-overlapping
>     property.
>      >
>      > 4. When an unlock request is received, all segments intersecting
>     with
>      > the region are eliminated (it may require some segment splits on the
>      > edges), and the unlocked region is subtracted from each lock
>     associated
>      > to the segment. If a lock gets an empty region, it's removed.
>      >
>      > 5. We'll create a special "remove lock" request that doesn't
>     unlock a
>      > region but removes an already granted lock. This will decrease the
>      > number of references to each of the segments this lock was
>     covering. If
>      > some segment reaches 0, it's removed. Otherwise it remains there.
>     This
>      > special request will only be used internally to cancel already
>     acquired
>      > locks that cannot be fully granted due to quorum issues or any other
>      > problem.
>      >
>      > In some weird cases, the list of segments can be huge (many locks
>      > overlapping only on a single byte, so each segment represents
>     only one
>      > byte). We can try to find some smarter structure that minimizes this
>      > problem or limit the number of segments (for example returning
>     ENOLCK
>      > when there are too many).
>      >
>      > What do you think ?
>      >
>      > Xavi
>      >
>      > _______________________________________________
>      > Gluster-devel mailing list
>      > Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
>      > https://lists.gluster.org/mailman/listinfo/gluster-devel
>      >
> 
> 
> 
> -- 
> Pranith