[Gluster-devel] Issue with posix locks
Soumya Koduri
skoduri at redhat.com
Mon Apr 1 10:10:21 UTC 2019
On 4/1/19 2:23 PM, Xavi Hernandez wrote:
> On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri <skoduri at redhat.com> wrote:
>
>
>
> On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:
> >
> >
> > On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <skoduri at redhat.com> wrote:
> >
> >
> >
> > On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> > > Hi all,
> > >
> > > there is one potential problem with posix locks when used in a
> > > replicated or dispersed volume.
> > >
> > > Some background:
> > >
> > > Posix locks allow any process to lock a region of a file multiple
> > > times, but a single unlock on a given region will release all
> > > previous locks. Locked regions can be different for each lock
> > > request and they can overlap. The resulting lock will cover the
> > > union of all locked regions. A single unlock (the region doesn't
> > > necessarily need to match any of the ranges used for locking) will
> > > create a "hole" in the currently locked region, independently of
> > > how many times a lock request covered that region.
> > >
> > > For this reason, the locks xlator simply combines the locked
> > > regions that are requested, but it doesn't track each individual
> > > lock range.
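The merge/unlock semantics described above can be demonstrated directly with POSIX (fcntl-style) byte-range locks. This is a minimal sketch, not GlusterFS code; the file, offsets, and the `is_free` helper are illustrative. Since POSIX locks never conflict within a single process, a forked child is used to probe the parent's lock state:

```python
import fcntl
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * 32)

# Two overlapping exclusive locks from the same process: the kernel keeps
# only the merged region, not the individual requests.
fcntl.lockf(fd, fcntl.LOCK_EX, 10, 0)   # lock bytes 0-9
fcntl.lockf(fd, fcntl.LOCK_EX, 10, 5)   # lock bytes 5-14 -> merged into 0-14

# A single unlock punches a hole, no matter how many lock calls covered it.
fcntl.lockf(fd, fcntl.LOCK_UN, 3, 5)    # unlock bytes 5-7 -> 0-4 and 8-14 remain

def is_free(start, length):
    """Probe from a child process whether a region can be locked."""
    pid = os.fork()
    if pid == 0:
        cfd = os.open(path, os.O_RDWR)
        try:
            fcntl.lockf(cfd, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start)
            os._exit(0)                 # region was free
        except OSError:
            os._exit(1)                 # region still locked by the parent
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0

hole_free = is_free(5, 3)    # True: the unlocked hole 5-7
left_free = is_free(0, 5)    # False: 0-4 is still locked
right_free = is_free(8, 7)   # False: 8-14 is still locked
print(hole_free, left_free, right_free)
os.close(fd)
os.unlink(path)
```

Note that the unlock of 5-7 releases that region even though two separate lock calls covered it, which is exactly why the locks xlator only keeps the combined region.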
> > >
> > > Under normal circumstances this works fine. But there are some
> > > cases where this behavior is not sufficient. For example, suppose
> > > we have a replica 3 volume with quorum = 2. Given the special
> > > nature of posix locks, AFR sends the lock request sequentially to
> > > each of the bricks, so that conflicting lock requests from other
> > > clients cannot force an unlock of an already locked region on a
> > > client that has not got enough successful locks (i.e. quorum). An
> > > unlock here would not only cancel the current lock request; it
> > > would also cancel any previously acquired lock.
> > >
> >
> > I may not have fully understood, please correct me. AFAIU, the lk
> > xlator merges locks only if both the lk-owner and the client opaque
> > match.
> >
> > In the case you have mentioned above, considering clientA acquired
> > locks on a majority of the quorum (say nodeA and nodeB) and clientB
> > on nodeC alone, clientB now has to unlock/cancel the lock it
> > acquired on nodeC.
> >
> > You are saying that it could pose a problem if there were already
> > successful locks taken by clientB for the same region, which would
> > get unlocked by this particular unlock request... right?
> >
> > Assuming the previous locks acquired by clientB are shared
> > (otherwise clientA wouldn't have been granted a lock for the same
> > region on nodeA & nodeB), they would still hold true on nodeA &
> > nodeB, as the unlock request was sent only to nodeC. Since the
> > majority of quorum nodes still hold clientB's locks, this isn't a
> > serious issue IMO.
> >
> > I haven't looked into the heal part, but would like to understand
> > whether this is really an issue in normal scenarios as well.
> >
> >
> > This is how I understood the code. Consider the following case:
> > nodes A, B, C have locks with start and end offsets 5-15 from
> > mount-1 and lock-range 2-3 from mount-2. Suppose mount-1 requests a
> > nonblocking lock with lock-range 1-7 and, in parallel, mount-2
> > issues an unlock of 2-3.
> >
> > nodeA gets the unlock from mount-2 with range 2-3, then the lock
> > from mount-1 with range 1-7, so the lock is granted and merged to
> > give 1-15.
> > nodeB gets the lock from mount-1 with range 1-7 before the unlock
> > of 2-3, which leads to EAGAIN. That triggers unlocks of the granted
> > locks in mount-1, which ends up unlocking 1-7 on nodeA, leaving
> > lock-range 8-15 instead of the original 5-15 on nodeA, whereas
> > nodeB and nodeC will have range 5-15.
> >
> > Let me know if my understanding is wrong.
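The per-brick divergence in this scenario can be traced with a toy model (not GlusterFS code) that, like the locks xlator, keeps only the merged union of granted ranges per brick. Ranges are inclusive (start, end) pairs; mount-2's 2-3 lock is elided for brevity since it only affects the ordering, not the final state:

```python
def merge(ranges, new):
    """Add `new` and coalesce overlapping or adjacent inclusive ranges."""
    ranges = sorted(ranges + [new])
    out = [ranges[0]]
    for s, e in ranges[1:]:
        ls, le = out[-1]
        if s <= le + 1:
            out[-1] = (ls, max(le, e))
        else:
            out.append((s, e))
    return out

def unlock(ranges, hole):
    """Remove `hole` from every range, possibly splitting ranges."""
    hs, he = hole
    out = []
    for s, e in ranges:
        if e < hs or s > he:          # no overlap with the hole
            out.append((s, e))
            continue
        if s < hs:                    # piece left of the hole survives
            out.append((s, hs - 1))
        if e > he:                    # piece right of the hole survives
            out.append((he + 1, e))
    return out

# All three bricks start with mount-1's lock 5-15.
nodeA = nodeB = nodeC = [(5, 15)]

# nodeA grants mount-1's new lock 1-7 and merges it into the existing range.
nodeA = merge(nodeA, (1, 7))          # [(1, 15)]

# nodeB fails the same lock with EAGAIN, so AFR rolls back by unlocking
# 1-7 wherever it succeeded, i.e. on nodeA only.
nodeA = unlock(nodeA, (1, 7))

print(nodeA, nodeB, nodeC)            # [(8, 15)] [(5, 15)] [(5, 15)]
```

The rollback unlock removes part of mount-1's original 5-15 on nodeA because the merge has already destroyed the information that 5-7 was covered twice, which is the discrepancy described above.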
>
> Both of us mentioned the same points. So in the example you gave,
> mount-1 lost its previous lock on nodeA, but the majority of the
> quorum (nodeB and nodeC) still have the previous lock (range: 5-15)
> intact. So this shouldn't ideally lead to any issues, as other
> conflicting locks are blocked or failed by the majority of the nodes
> (provided there are no brick dis/re-connects).
>
>
> But brick disconnects will happen (upgrades, disk failures, server
> maintenance, ...). Anyway, even without brick disconnects, in the
> previous example we have nodeA with range 8-15, and nodes B and C
> with range 5-15. If another lock from mount-2 comes for range 5-7,
> it will succeed on nodeA, but it will block on nodeB. At this point,
> mount-1 could attempt a lock on the same range. It will block on
> nodeA, so we have a deadlock.
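The circular wait in that sequence can be made explicit with an illustrative wait-for graph (the names are just labels, not GlusterFS structures): mount-2's lock on 5-7 is granted on nodeA but queues behind mount-1's 5-15 on nodeB, and mount-1's own request for 5-7 then queues behind mount-2's grant on nodeA:

```python
waits_for = {
    "mount-2": "mount-1",   # blocked on nodeB behind mount-1's 5-15
    "mount-1": "mount-2",   # blocked on nodeA behind mount-2's 5-7
}

def has_cycle(graph):
    """Follow each chain of waiters; revisiting a node means deadlock."""
    for start in graph:
        seen = {start}
        cur = graph.get(start)
        while cur is not None:
            if cur in seen:
                return True
            seen.add(cur)
            cur = graph.get(cur)
    return False

print(has_cycle(waits_for))   # True: the two mounts wait on each other
```

Neither mount can make progress because each holds a grant the other needs, and this only became possible because the bricks disagreed about the locked ranges in the first place.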
>
> In general, having discrepancies between bricks is not good because
> sooner or later it will cause some bad inconsistency.
>
>
> Wrt brick disconnects/re-connects, if we can get general lock
> healing support (not getting into implementation details atm), that
> should take care of correcting the lock range on nodeA as well,
> right?
>
>
> The problem we have seen is that to be able to correctly heal currently
> acquired locks on brick reconnect, there are cases where we need to
> release a lock that has already been granted (because the current owner
> doesn't have enough quorum and a just recovered connection tries to
> claim/heal it). In this case we need to deal with locks that have
> already been merged, but without interfering with other existing locks
> that already have quorum.
>
Okay. Thanks for the detailed explanation. That clears my doubts.
-Soumya