<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <<a href="mailto:skoduri@redhat.com">skoduri@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
<br>
> On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> > Hi all,
> >
> > there is one potential problem with posix locks when used in a
> > replicated or dispersed volume.
> >
> > Some background:
> >
> > Posix locks allow any process to lock a region of a file multiple
> > times, but a single unlock on a given region will release all previous
> > locks. Locked regions can be different for each lock request and they
> > can overlap. The resulting lock will cover the union of all locked
> > regions. A single unlock (the region doesn't necessarily need to match
> > any of the ranges used for locking) will create a "hole" in the
> > currently locked region, independently of how many times a lock
> > request covered that region.
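
(Inline note, not from Xavi's mail: a minimal standalone C illustration
of the semantics described above; "testfile" is just an arbitrary file
name and error checks are omitted.)

#include <fcntl.h>
#include <unistd.h>

static int set_lock(int fd, short type, off_t start, off_t len)
{
    struct flock fl = {
        .l_type   = type,           /* F_WRLCK or F_UNLCK               */
        .l_whence = SEEK_SET,
        .l_start  = start,
        .l_len    = len,
    };
    return fcntl(fd, F_SETLK, &fl);
}

int main(void)
{
    int fd = open("testfile", O_RDWR | O_CREAT, 0644);

    set_lock(fd, F_WRLCK, 0, 10);   /* lock bytes 0-9                    */
    set_lock(fd, F_WRLCK, 5, 10);   /* lock bytes 5-14; the kernel merges
                                       both into a single lock on 0-14   */
    set_lock(fd, F_UNLCK, 3, 4);    /* one unlock of bytes 3-6 punches a
                                       hole: only 0-2 and 7-14 remain
                                       locked, even though 5-6 had been
                                       locked twice                      */
    close(fd);
    return 0;
}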
> >
> > For this reason, the locks xlator simply combines the locked regions
> > that are requested, but it doesn't track each individual lock range.
> >
> > Under normal circumstances this works fine. But there are some cases
> > where this behavior is not sufficient. For example, suppose we have a
> > replica 3 volume with quorum = 2. Given the special nature of posix
> > locks, AFR sends the lock request sequentially to each one of the
> > bricks, to avoid a situation where conflicting lock requests from
> > other clients force a client that has not got enough successful locks
> > (i.e. quorum) to unlock an already locked region. An unlock here would
> > not only cancel the current lock request; it would also cancel any
> > previously acquired lock.
> >
>
> I may not have fully understood, please correct me. AFAIU, the lk xlator
> merges locks only if both the lk-owner and the client opaque match.
>
> In the case you have mentioned above, considering clientA acquired
> locks on the majority of the quorum (say nodeA and nodeB) and clientB on
> nodeC alone, clientB now has to unlock/cancel the lock it acquired on
> nodeC.
>
> You are saying that it could pose a problem if there were already
> successful locks taken by clientB for the same region which would get
> unlocked by this particular unlock request... right?
>
> Assuming the previous locks acquired by clientB are shared (otherwise
> clientA wouldn't have been granted a lock for the same region on nodeA &
> nodeB), they would still hold true on nodeA & nodeB, as the unlock
> request was sent only to nodeC. Since the majority of quorum nodes still
> hold the locks by clientB, this isn't a serious issue IMO.
>
> I haven't looked into the heal part but would like to understand if this
> is really an issue in normal scenarios as well.

This is how I understood the code. Consider the following case:

Nodes A, B and C have locks with start and end offsets 5-15 from mount-1
and lock-range 2-3 from mount-2. Now mount-1 requests a nonblocking lock
with lock-range 1-7 and, in parallel, let's say mount-2 issues an unlock
of 2-3 as well.

nodeA gets the unlock of 2-3 from mount-2 first and then the lock of 1-7
from mount-1, so the lock is granted and merged to give 1-15.

nodeB gets the lock of 1-7 from mount-1 before the unlock of 2-3, which
leads to EAGAIN. That triggers unlocks of the locks already granted to
mount-1, which ends up unlocking 1-7 on nodeA, leaving lock-range 8-15
instead of the original 5-15 on nodeA, whereas nodeB and nodeC will still
have range 5-15.

Let me know if my understanding is wrong.
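
To make the range arithmetic above concrete, here is a small
self-contained C model (a simplified sketch of "merge on lock, punch a
hole on unlock", not GlusterFS code) that replays mount-1's locks as
nodeA and nodeB would end up seeing them:

#include <stdio.h>

#define MAX_RANGES 16

struct range { long start, end; };              /* inclusive byte offsets */
struct range_set { struct range r[MAX_RANGES]; int n; };

/* Grant a lock: add the range, merging it with any range it touches,
 * the way locked regions of the same owner get combined.                */
static void rs_lock(struct range_set *s, long start, long end)
{
    struct range_set out = { .n = 0 };

    for (int i = 0; i < s->n; i++) {
        if (s->r[i].end + 1 < start || s->r[i].start > end + 1) {
            out.r[out.n++] = s->r[i];           /* disjoint: keep as is   */
        } else {                                /* overlapping/adjacent:
                                                   widen the new range    */
            if (s->r[i].start < start)
                start = s->r[i].start;
            if (s->r[i].end > end)
                end = s->r[i].end;
        }
    }
    out.r[out.n++] = (struct range){ start, end };
    *s = out;
}

/* Unlock: punch a hole, regardless of how many locks covered the range. */
static void rs_unlock(struct range_set *s, long start, long end)
{
    struct range_set out = { .n = 0 };

    for (int i = 0; i < s->n; i++) {
        if (s->r[i].end < start || s->r[i].start > end) {
            out.r[out.n++] = s->r[i];           /* untouched              */
            continue;
        }
        if (s->r[i].start < start)              /* piece left of the hole */
            out.r[out.n++] = (struct range){ s->r[i].start, start - 1 };
        if (s->r[i].end > end)                  /* piece right of the hole */
            out.r[out.n++] = (struct range){ end + 1, s->r[i].end };
    }
    *s = out;
}

static void rs_print(const char *node, const struct range_set *s)
{
    printf("%s:", node);
    for (int i = 0; i < s->n; i++)
        printf(" [%ld-%ld]", s->r[i].start, s->r[i].end);
    printf("\n");
}

int main(void)
{
    /* mount-1's locks as each brick sees them; 5-15 is held everywhere. */
    struct range_set nodeA = { .n = 0 }, nodeB = { .n = 0 };
    rs_lock(&nodeA, 5, 15);
    rs_lock(&nodeB, 5, 15);

    /* nodeA: mount-2's unlock of 2-3 arrives first, so mount-1's lock of
     * 1-7 doesn't conflict; it is granted and merged with 5-15.          */
    rs_lock(&nodeA, 1, 7);

    /* nodeB: the lock of 1-7 arrives while mount-2 still holds 2-3, so it
     * fails with EAGAIN; AFR unwinds by unlocking 1-7 where it had been
     * granted, i.e. on nodeA.                                             */
    rs_unlock(&nodeA, 1, 7);

    rs_print("nodeA", &nodeA);                  /* [8-15]                 */
    rs_print("nodeB", &nodeB);                  /* [5-15]                 */
    return 0;
}

Running it prints "nodeA: [8-15]" and "nodeB: [5-15]", which is the
divergence described above.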
>
> Thanks,
> Soumya
>
> > However, when something goes wrong (a brick dies during a lock
> > request, or there's a network partition or some other weird
> > situation), it could happen that, even using sequential locking, the
> > lock request succeeds on only one brick. In this case, AFR should
> > cancel the previous lock (and it does), but this also cancels any
> > previously acquired lock on that region, which is not good.
> >
> > A similar thing can happen if we try to recover (heal) posix locks
> > that were active after a brick has been disconnected (for any reason)
> > and then reconnected.
> >
> > To fix all these situations we need to change the way posix locks are
> > managed by the locks xlator. One possibility would be to embed the
> > lock request inside an inode transaction using inodelk. Since inodelks
> > do not suffer from this problem, the following posix lock could be
> > sent safely. However, this implies an additional network request,
> > which could cause some performance impact. Eager-locking could
> > minimize the impact in some cases. However, this approach won't work
> > for lock recovery after a disconnect.
> >
> > Another possibility is to send a special partial posix lock request
> > which won't be immediately merged with already existing locks once
> > granted. An additional confirmation request of the partial posix lock
> > will be required to fully grant the current lock and merge it with the
> > existing ones. This requires a new network request, which will add
> > latency, and makes everything more complex since there would be more
> > combinations of states in which something could fail.
> >
> > So I think one possible solution would be the following:
> >
> > 1. Keep each posix lock as an independent object in the locks xlator.
> > This will make it possible to "invalidate" any already granted lock
> > without affecting already established locks.
> >
> > 2. Additionally, we'll keep a sorted list of non-overlapping segments
> > of locked regions. And we'll count, for each region, how many locks
> > are referencing it. One lock can reference multiple segments, and each
> > segment can be referenced by multiple locks.
> >
> > 3. An additional lock request that overlaps with an existing segment
> > can cause this segment to be split to satisfy the non-overlapping
> > property.
> >
> > 4. When an unlock request is received, all segments intersecting with
> > the region are eliminated (it may require some segment splits on the
> > edges), and the unlocked region is subtracted from each lock
> > associated with the segment. If a lock gets an empty region, it's
> > removed.
> >
> > 5. We'll create a special "remove lock" request that doesn't unlock a
> > region but removes an already granted lock. This will decrease the
> > number of references to each of the segments this lock was covering.
> > If some segment reaches 0, it's removed. Otherwise it remains there.
> > This special request will only be used internally to cancel already
> > acquired locks that cannot be fully granted due to quorum issues or
> > any other problem.
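
(Inline note, not from Xavi's mail: a rough C sketch of the data layout
that points 1, 2 and 5 above would imply. This is my own reading with
hypothetical names, not actual locks xlator code.)

#include <sys/types.h>

/* One object per granted lock (point 1); owner/client identity and the
 * real locks xlator list linkage are omitted for brevity.               */
struct posix_lock {
    off_t               fl_start;   /* range this lock still covers      */
    off_t               fl_end;
    struct posix_lock  *next;
};

/* Sorted list of non-overlapping segments of locked regions (point 2).  */
struct lock_segment {
    off_t                start;
    off_t                end;
    unsigned int         refcount;  /* number of locks covering this
                                       segment; a new overlapping lock
                                       splits segments where needed
                                       (point 3) and bumps the counts    */
    struct lock_segment *next;
};

/* Point 4: a regular unlock drops every segment intersecting the region
 * (splitting at the edges) and subtracts the region from every lock that
 * referenced those segments; a lock left with an empty region is freed.
 *
 * Point 5: the special "remove lock" request only decrements refcount on
 * the segments the cancelled lock covered, freeing a segment when its
 * count reaches 0, so other locks covering the same region stay intact. */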
> >
> > In some weird cases, the list of segments can be huge (many locks
> > overlapping only on a single byte, so each segment represents only one
> > byte). We can try to find some smarter structure that minimizes this
> > problem, or limit the number of segments (for example, returning
> > ENOLCK when there are too many).
> >
> > What do you think?
> >
> > Xavi

--
Pranith