<div dir="ltr"><div dir="ltr">On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri <<a href="mailto:skoduri@redhat.com">skoduri@redhat.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
<br>
On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:<br>
> <br>
> <br>
> On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <<a href="mailto:skoduri@redhat.com" target="_blank">skoduri@redhat.com</a> <br>
> <mailto:<a href="mailto:skoduri@redhat.com" target="_blank">skoduri@redhat.com</a>>> wrote:<br>
> <br>
> <br>
> <br>
> On 3/29/19 11:55 PM, Xavi Hernandez wrote:<br>
> > > Hi all,
> > >
> > > there is one potential problem with posix locks when used in a
> > > replicated or dispersed volume.
> > >
> > > Some background:
> > >
> > > Posix locks allow any process to lock a region of a file multiple
> > > times, but a single unlock on a given region will release all
> > > previous locks. Locked regions can be different for each lock
> > > request and they can overlap. The resulting lock will cover the
> > > union of all locked regions. A single unlock (the region doesn't
> > > necessarily need to match any of the ranges used for locking) will
> > > create a "hole" in the currently locked region, independently of
> > > how many times a lock request covered that region.
> > >
> > > For this reason, the locks xlator simply combines the locked
> > > regions that are requested, but it doesn't track each individual
> > > lock range.
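
To illustrate the semantics described above, here is a minimal
standalone sketch using fcntl(); the path and ranges are arbitrary and
error handling is mostly omitted:

/* Minimal sketch: one F_UNLCK drops a region no matter how many
 * F_WRLCK requests covered it. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void setlk(int fd, short type, off_t start, off_t len)
{
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                        .l_start = start, .l_len = len };
    if (fcntl(fd, F_SETLK, &fl) == -1)
        perror("fcntl");
}

int main(void)
{
    int fd = open("/tmp/lock-demo", O_RDWR | O_CREAT, 0600);

    setlk(fd, F_WRLCK, 0, 10);  /* lock bytes 0-9 */
    setlk(fd, F_WRLCK, 5, 10);  /* overlapping lock: the result is the
                                   union, bytes 0-14 */
    setlk(fd, F_UNLCK, 3, 4);   /* ONE unlock punches a "hole": only 0-2
                                   and 7-14 stay locked, no matter how
                                   many locks covered bytes 3-6 */
    close(fd);
    return 0;
}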
> > >
> > > Under normal circumstances this works fine. But there are some
> > > cases where this behavior is not sufficient. For example, suppose
> > > we have a replica 3 volume with quorum = 2. Given the special
> > > nature of posix locks, AFR sends the lock request sequentially to
> > > each one of the bricks, to avoid a situation where a conflicting
> > > lock request from another client forces an unlock of an already
> > > locked region on the client that has not yet got enough successful
> > > locks (i.e. quorum). An unlock here would not only cancel the
> > > current lock request; it would also cancel any previously acquired
> > > lock.
> > >
> >
> > I may not have fully understood, please correct me. AFAIU, the lk
> > xlator merges locks only if both the lk-owner and the client opaque
> > match.
> >
> > In the case which you have mentioned above, consider that clientA
> > acquired locks on a majority of the quorum (say nodeA and nodeB) and
> > clientB on nodeC alone - clientB now has to unlock/cancel the lock
> > it acquired on nodeC.
> >
> > You are saying that it could pose a problem if there were already
> > successful locks taken by clientB for the same region, which would
> > get unlocked by this particular unlock request.. right?
> >
> > Assuming the previous locks acquired by clientB are shared
> > (otherwise clientA wouldn't have been granted a lock for the same
> > region on nodeA & nodeB), they would still hold true on nodeA &
> > nodeB, as the unlock request was sent only to nodeC. Since the
> > majority of quorum nodes still hold the locks from clientB, this
> > isn't a serious issue IMO.
> >
> > I haven't looked into the heal part, but I would like to understand
> > if this is really an issue in normal scenarios as well.
> >
> >
> > This is how I understood the code. Consider the following case:
> > nodes A, B and C have a lock with start-end offsets 5-15 from
> > mount-1 and a lock with range 2-3 from mount-2. Suppose mount-1
> > requests a non-blocking lock with range 1-7 and, in parallel,
> > mount-2 issues an unlock of 2-3.
> >
> > nodeA gets the unlock from mount-2 with range 2-3 and then the lock
> > from mount-1 with range 1-7, so the lock is granted and merged to
> > give 1-15.
> > nodeB gets the lock from mount-1 with range 1-7 before the unlock of
> > 2-3, which leads to EAGAIN. That triggers unlocks of the granted
> > locks in mount-1, which ends up doing an unlock of 1-7 on nodeA,
> > leaving lock range 8-15 instead of the original 5-15 on nodeA,
> > whereas nodeB and nodeC will have range 5-15.
> >
> > Let me know if my understanding is wrong.
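
Just to make the arithmetic above easier to follow, here is a minimal
standalone sketch of it (plain inclusive integer ranges, one region per
node; this is not the actual locks xlator code):

#include <stdio.h>

struct range { int start, end; };

/* Merge a granted lock into the existing region (POSIX merge). */
static struct range merge(struct range r, int start, int end)
{
    if (start < r.start) r.start = start;
    if (end > r.end)     r.end = end;
    return r;
}

/* Unlock start-end: here it only trims the region's edges, which is
 * all this example needs; real code must also handle holes. */
static struct range unlock(struct range r, int start, int end)
{
    if (start <= r.start && end >= r.start && end < r.end)
        r.start = end + 1;
    else if (start > r.start && start <= r.end && end >= r.end)
        r.end = start - 1;
    return r;
}

int main(void)
{
    struct range nodeA = {5, 15}, nodeB = {5, 15}, nodeC = {5, 15};

    /* nodeA: unlock 2-3 from mount-2 (no effect on 5-15), then the
     * lock 1-7 from mount-1 is granted and merged. */
    nodeA = merge(nodeA, 1, 7);              /* nodeA: 1-15 */

    /* nodeB: lock 1-7 arrives before mount-2's unlock of 2-3 ->
     * EAGAIN. AFR rolls back by unlocking 1-7 where it succeeded,
     * i.e. nodeA, which also eats part of the pre-existing lock. */
    nodeA = unlock(nodeA, 1, 7);             /* nodeA: 8-15 */

    printf("nodeA: %d-%d, nodeB: %d-%d, nodeC: %d-%d\n",
           nodeA.start, nodeA.end, nodeB.start, nodeB.end,
           nodeC.start, nodeC.end);          /* 8-15, 5-15, 5-15 */
    return 0;
}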
>
> Both of us mentioned the same points. So in the example you gave,
> mount-1 lost its previous lock on nodeA, but the majority of the
> quorum (nodeB and nodeC) still has the previous lock (range 5-15)
> intact. So this shouldn't ideally lead to any issues, as other
> conflicting locks are blocked or failed by the majority of the nodes
> (provided there are no brick dis/re-connects).

But brick disconnects will happen (upgrades, disk failures, server
maintenance, ...). Anyway, even without brick disconnects, in the
previous example we have nodeA with range 8-15, and nodes B and C with
range 5-15. If another lock from mount-2 comes for range 5-7, it will
succeed on nodeA, but it will block on nodeB. At this point, mount-1
could attempt a lock on the same range. It will block on nodeA, so we
have a deadlock.

In general, having discrepancies between bricks is not good because
sooner or later it will cause some bad inconsistency.
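To see why the discrepancy matters: each brick checks a request only
against its own granted ranges, so the bricks disagree about the same
request. A tiny hypothetical sketch (inclusive ranges, made-up names):

#include <stdio.h>

struct range { int start, end; };

static int conflicts(struct range held, struct range req)
{
    return req.start <= held.end && req.end >= held.start;
}

int main(void)
{
    struct range nodeA = {8, 15};  /* mount-1's lock after the bad unlock */
    struct range nodeB = {5, 15};  /* mount-1's original lock */
    struct range req   = {5, 7};   /* new blocking lock from mount-2 */

    /* mount-2's request is granted on nodeA but queued on nodeB
     * (behind mount-1's 5-15). If mount-1 now requests 5-7 too, it
     * queues on nodeA behind mount-2's granted 5-7: a circular wait,
     * i.e. a deadlock. */
    printf("nodeA: %s\n", conflicts(nodeA, req) ? "blocks" : "grants");
    printf("nodeB: %s\n", conflicts(nodeB, req) ? "blocks" : "grants");
    return 0;
}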
>
> Wrt brick disconnects/re-connects, if we can get lock healing support
> in general (not getting into implementation details atm), that should
> take care of correcting the lock range on nodeA as well, right?

The problem we have seen is that, to be able to correctly heal
currently acquired locks on brick reconnect, there are cases where we
need to release a lock that has already been granted (because the
current owner doesn't have enough quorum and a just recovered
connection tries to claim/heal it). In this case we need to deal with
locks that have already been merged, but without interfering with
other existing locks that already have quorum.
>
> That said, I am not suggesting that we should stick to the existing
> behavior, just trying to get clarification to check if we can avoid
> any overhead/side-effects of maintaining multiple locks.

Right now this is the only way we have found to provide a correct
solution both for some cases of concurrent lock/unlock requests and
for lock healing.

Regards,

Xavi
>
> Thanks,
> Soumya
>
> >
> > Thanks,
> > Soumya
> >
> > > However, when something goes wrong (a brick dies during a lock
> > > request, or there's a network partition or some other weird
> > > situation), it could happen that, even using sequential locking,
> > > only one brick succeeds in the lock request. In this case, AFR
> > > should cancel the previous lock (and it does), but this also
> > > cancels any previously acquired lock on that region, which is not
> > > good.
> > >
> > > A similar thing can happen if we try to recover (heal) posix locks
> > > that were active after a brick has been disconnected (for any
> > > reason) and then reconnected.
> > >
> > > To fix all these situations we need to change the way posix locks
> > > are managed by the locks xlator. One possibility would be to embed
> > > the lock request inside an inode transaction using inodelk. Since
> > > inodelks do not suffer this problem, the following posix lock
> > > could be sent safely. However this implies an additional network
> > > request, which could cause some performance impact. Eager-locking
> > > could minimize the impact in some cases. However this approach
> > > won't work for lock recovery after a disconnect.
> > >
> > > Another possibility is to send a special partial posix lock
> > > request which won't be immediately merged with already existing
> > > locks once granted. An additional confirmation request for the
> > > partial posix lock will be required to fully grant the current
> > > lock and merge it with the existing ones. This requires a new
> > > network request, which will add latency, and makes everything more
> > > complex since there would be more combinations of states in which
> > > something could fail.
> > >
> > > So I think one possible solution would be the following (see the
> > > sketch after the list):
> > >
> > > 1. Keep each posix lock as an independent object in the locks
> > > xlator. This will make it possible to "invalidate" any already
> > > granted lock without affecting already established locks.
> > >
> > > 2. Additionally, we'll keep a sorted list of non-overlapping
> > > segments of locked regions. And we'll count, for each segment, how
> > > many locks are referencing it. One lock can reference multiple
> > > segments, and each segment can be referenced by multiple locks.
> > >
> > > 3. An additional lock request that overlaps with an existing
> > > segment can cause this segment to be split to satisfy the
> > > non-overlapping property.
> > >
> > > 4. When an unlock request is received, all segments intersecting
> > > with the region are eliminated (it may require some segment splits
> > > on the edges), and the unlocked region is subtracted from each
> > > lock associated with the segment. If a lock gets an empty region,
> > > it's removed.
> > >
> > > 5. We'll create a special "remove lock" request that doesn't
> > > unlock a region but removes an already granted lock. This will
> > > decrease the number of references to each of the segments this
> > > lock was covering. If some segment reaches 0, it's removed.
> > > Otherwise it remains there. This special request will only be used
> > > internally to cancel already acquired locks that cannot be fully
> > > granted due to quorum issues or any other problem.
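
To make points 1-5 a bit more concrete, this is roughly the shape of
the structures I have in mind. This is a hypothetical sketch; the
names are made up and it is not code from the locks xlator:

#include <stdint.h>

/* 2. Sorted list of non-overlapping segments of locked regions, each
 *    counting how many locks reference it. */
struct pl_segment {
    uint64_t start;
    uint64_t end;
    uint32_t refs;               /* locks referencing this segment */
    struct pl_segment *next;     /* kept sorted by start offset */
};

/* 1. Each posix lock kept as an independent object, so one lock can
 *    be "invalidated" without affecting already established locks. */
struct pl_lock {
    void *client;                /* owner: client + lk-owner pair */
    uint64_t lk_owner;
    uint64_t start, end;         /* region still covered by this lock */
    struct pl_lock *next;
};

/* 3. Split a segment at 'offset' so that a new overlapping lock can
 *    reference only part of it (non-overlapping property preserved). */
struct pl_segment *pl_segment_split(struct pl_segment *seg,
                                    uint64_t offset);

/* 4. Unlock: remove all segments intersecting [start, end] (splitting
 *    at the edges if needed) and subtract the region from every lock
 *    referencing them; a lock whose region becomes empty is freed. */
void pl_unlock(struct pl_lock **locks, struct pl_segment **segments,
               uint64_t start, uint64_t end);

/* 5. "Remove lock": internal-only undo of one granted lock (e.g. on a
 *    quorum failure). It only drops this lock's reference on each
 *    covered segment; segments whose refs reach 0 are freed, the rest
 *    stay, so other locks keep their regions intact. */
void pl_remove_lock(struct pl_lock **locks, struct pl_segment **segments,
                    struct pl_lock *victim);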
> > >
> > > In some weird cases, the list of segments can be huge (many locks
> > > overlapping only on a single byte, so each segment represents only
> > > one byte). We can try to find some smarter structure that
> > > minimizes this problem, or limit the number of segments (for
> > > example returning ENOLCK when there are too many).
> > >
> > > What do you think?
> > >
> > > Xavi
> > >
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel@gluster.org
> > > https://lists.gluster.org/mailman/listinfo/gluster-devel
> > >
> >
> > --
> > Pranith