[Gluster-devel] Issue with posix locks
Soumya Koduri
skoduri at redhat.com
Mon Apr 1 10:10:21 UTC 2019
On 4/1/19 2:23 PM, Xavi Hernandez wrote:
> On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri <skoduri at redhat.com> wrote:
>
>
>
> On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:
> >
> >
> > On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <skoduri at redhat.com> wrote:
> >
> >
> >
> > On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> > > Hi all,
> > >
> > > there is one potential problem with posix locks when used in a
> > > replicated or dispersed volume.
> > >
> > > Some background:
> > >
> > > Posix locks allow any process to lock a region of a file multiple
> > > times, but a single unlock on a given region will release all
> > > previous locks. Locked regions can be different for each lock
> > > request and they can overlap. The resulting lock will cover the
> > > union of all locked regions. A single unlock (the region doesn't
> > > necessarily need to match any of the ranges used for locking) will
> > > create a "hole" in the currently locked region, independently of
> > > how many times a lock request covered that region.
> > >
> > > For this reason, the locks xlator simply combines the locked
> > > regions that are requested, but it doesn't track each individual
> > > lock range.
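The merge/unlock semantics described above can be demonstrated directly with POSIX (fcntl-style) byte-range locks. This is a minimal sketch, not GlusterFS code; the file, offsets, and the `is_free` helper are illustrative. Since POSIX locks never conflict within a single process, a forked child is used to probe the parent's lock state:

```python
import fcntl
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * 32)

# Two overlapping exclusive locks from the same process: the kernel keeps
# only the merged region, not the individual requests.
fcntl.lockf(fd, fcntl.LOCK_EX, 10, 0)   # lock bytes 0-9
fcntl.lockf(fd, fcntl.LOCK_EX, 10, 5)   # lock bytes 5-14 -> merged into 0-14

# A single unlock punches a hole, no matter how many lock calls covered it.
fcntl.lockf(fd, fcntl.LOCK_UN, 3, 5)    # unlock bytes 5-7 -> 0-4 and 8-14 remain

def is_free(start, length):
    """Probe from a child process whether a region can be locked."""
    pid = os.fork()
    if pid == 0:
        cfd = os.open(path, os.O_RDWR)
        try:
            fcntl.lockf(cfd, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start)
            os._exit(0)                 # region was free
        except OSError:
            os._exit(1)                 # region still locked by the parent
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0

hole_free = is_free(5, 3)    # True: the unlocked hole 5-7
left_free = is_free(0, 5)    # False: 0-4 is still locked
right_free = is_free(8, 7)   # False: 8-14 is still locked
print(hole_free, left_free, right_free)
os.close(fd)
os.unlink(path)
```

Note that the unlock of 5-7 releases that region even though two separate lock calls covered it, which is exactly why the locks xlator only keeps the combined region.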
> > >
> > > Under normal circumstances this works fine. But there are some
> > > cases where this behavior is not sufficient. For example, suppose
> > > we have a replica 3 volume with quorum = 2. Given the special
> > > nature of posix locks, AFR sends the lock request sequentially to
> > > each of the bricks, so that conflicting lock requests from other
> > > clients cannot force an unlock of an already locked region on a
> > > client that has not got enough successful locks (i.e. quorum). An
> > > unlock here would not only cancel the current lock request; it
> > > would also cancel any previously acquired lock.
> > >
> >
> > I may not have fully understood, please correct me. AFAIU, the lk
> > xlator merges locks only if both the lk-owner and the client opaque
> > match.
> >
> > In the case you have mentioned above, considering clientA acquired
> > locks on a majority of the quorum (say nodeA and nodeB) and clientB
> > on nodeC alone, clientB now has to unlock/cancel the lock it
> > acquired on nodeC.
> >
> > You are saying that it could pose a problem if there were already
> > successful locks taken by clientB for the same region, which would
> > get unlocked by this particular unlock request... right?
> >
> > Assuming the previous locks acquired by clientB are shared
> > (otherwise clientA wouldn't have been granted a lock for the same
> > region on nodeA & nodeB), they would still hold true on nodeA &
> > nodeB, as the unlock request was sent only to nodeC. Since the
> > majority of quorum nodes still hold clientB's locks, this isn't a
> > serious issue IMO.
> >
> > I haven't looked into the heal part, but would like to understand
> > whether this is really an issue in normal scenarios as well.
> >
> >
> > This is how I understood the code. Consider the following case:
> > nodes A, B, C have locks with start and end offsets 5-15 from
> > mount-1 and lock-range 2-3 from mount-2. Suppose mount-1 requests a
> > nonblocking lock with lock-range 1-7 and, in parallel, mount-2
> > issues an unlock of 2-3.
> >
> > nodeA gets the unlock from mount-2 with range 2-3, then the lock
> > from mount-1 with range 1-7, so the lock is granted and merged to
> > give 1-15.
> > nodeB gets the lock from mount-1 with range 1-7 before the unlock
> > of 2-3, which leads to EAGAIN. That triggers unlocks of the granted
> > locks in mount-1, which ends up unlocking 1-7 on nodeA, leaving
> > lock-range 8-15 instead of the original 5-15 on nodeA, whereas
> > nodeB and nodeC will have range 5-15.
> >
> > Let me know if my understanding is wrong.
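The per-brick divergence in this scenario can be traced with a toy model (not GlusterFS code) that, like the locks xlator, keeps only the merged union of granted ranges per brick. Ranges are inclusive (start, end) pairs; mount-2's 2-3 lock is elided for brevity since it only affects the ordering, not the final state:

```python
def merge(ranges, new):
    """Add `new` and coalesce overlapping or adjacent inclusive ranges."""
    ranges = sorted(ranges + [new])
    out = [ranges[0]]
    for s, e in ranges[1:]:
        ls, le = out[-1]
        if s <= le + 1:
            out[-1] = (ls, max(le, e))
        else:
            out.append((s, e))
    return out

def unlock(ranges, hole):
    """Remove `hole` from every range, possibly splitting ranges."""
    hs, he = hole
    out = []
    for s, e in ranges:
        if e < hs or s > he:          # no overlap with the hole
            out.append((s, e))
            continue
        if s < hs:                    # piece left of the hole survives
            out.append((s, hs - 1))
        if e > he:                    # piece right of the hole survives
            out.append((he + 1, e))
    return out

# All three bricks start with mount-1's lock 5-15.
nodeA = nodeB = nodeC = [(5, 15)]

# nodeA grants mount-1's new lock 1-7 and merges it into the existing range.
nodeA = merge(nodeA, (1, 7))          # [(1, 15)]

# nodeB fails the same lock with EAGAIN, so AFR rolls back by unlocking
# 1-7 wherever it succeeded, i.e. on nodeA only.
nodeA = unlock(nodeA, (1, 7))

print(nodeA, nodeB, nodeC)            # [(8, 15)] [(5, 15)] [(5, 15)]
```

The rollback unlock removes part of mount-1's original 5-15 on nodeA because the merge has already destroyed the information that 5-7 was covered twice, which is the discrepancy described above.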
>
> Both of us mentioned the same points. So in the example you gave,
> mount-1 lost its previous lock on nodeA, but the majority of the
> quorum (nodeB and nodeC) still have the previous lock (range: 5-15)
> intact. So this shouldn't ideally lead to any issues, as other
> conflicting locks are blocked or failed by the majority of the nodes
> (provided there are no brick dis/re-connects).
>
>
> But brick disconnects will happen (upgrades, disk failures, server
> maintenance, ...). Anyway, even without brick disconnects, in the
> previous example we have nodeA with range 8-15, and nodes B and C
> with range 5-15. If another lock from mount-2 comes for range 5-7,
> it will succeed on nodeA, but it will block on nodeB. At this point,
> mount-1 could attempt a lock on the same range. It will block on
> nodeA, so we have a deadlock.
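The circular wait in that sequence can be made explicit with an illustrative wait-for graph (the names are just labels, not GlusterFS structures): mount-2's lock on 5-7 is granted on nodeA but queues behind mount-1's 5-15 on nodeB, and mount-1's own request for 5-7 then queues behind mount-2's grant on nodeA:

```python
waits_for = {
    "mount-2": "mount-1",   # blocked on nodeB behind mount-1's 5-15
    "mount-1": "mount-2",   # blocked on nodeA behind mount-2's 5-7
}

def has_cycle(graph):
    """Follow each chain of waiters; revisiting a node means deadlock."""
    for start in graph:
        seen = {start}
        cur = graph.get(start)
        while cur is not None:
            if cur in seen:
                return True
            seen.add(cur)
            cur = graph.get(cur)
    return False

print(has_cycle(waits_for))   # True: the two mounts wait on each other
```

Neither mount can make progress because each holds a grant the other needs, and this only became possible because the bricks disagreed about the locked ranges in the first place.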
>
> In general, having discrepancies between bricks is not good because
> sooner or later it will cause some bad inconsistency.
>
>
> Wrt brick disconnects/re-connects, if we can get general lock
> healing support (not getting into implementation details atm), that
> should take care of correcting the lock range on nodeA as well,
> right?
>
>
> The problem we have seen is that to be able to correctly heal currently
> acquired locks on brick reconnect, there are cases where we need to
> release a lock that has already been granted (because the current owner
> doesn't have enough quorum and a just recovered connection tries to
> claim/heal it). In this case we need to deal with locks that have
> already been merged, but without interfering with other existing locks
> that already have quorum.
>
Okay. Thanks for the detailed explanation. That clears my doubts.
-Soumya