<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez <<a href="mailto:jahernan@redhat.com">jahernan@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez <<a href="mailto:jahernan@redhat.com" target="_blank">jahernan@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez <<a href="mailto:jahernan@redhat.com" target="_blank">jahernan@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <<a href="mailto:rgowdapp@redhat.com" target="_blank">rgowdapp@redhat.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez <<a href="mailto:jahernan@redhat.com" target="_blank">jahernan@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi Raghavendra,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <<a href="mailto:rgowdapp@redhat.com" target="_blank">rgowdapp@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>All,<br><br>Glusterfs cleans up POSIX locks held on an fd when the client/mount
through which those locks are held disconnects from bricks/server. This
helps Glusterfs avoid a stale lock problem later (e.g., if the
application unlocks while the connection is still down). However, this
means the lock is no longer exclusive as other applications/clients can
acquire the same lock. To communicate that locks are no longer valid, we
are planning to mark the fd (which has POSIX locks) bad on a disconnect
so that any future operations on that fd will fail, forcing the
application to re-open the fd and re-acquire locks it needs [1].<br></div></div></blockquote><div><br></div><div>Wouldn't it be better to retake the locks when the brick is reconnected, if the lock is still in use?</div></div></div></blockquote><div><br></div><div>There is also a possibility that clients may never reconnect. That's the primary reason why bricks assume the worst (the client will not reconnect) and clean up the locks.<br></div></div></div></blockquote><div><br></div><div>True, so it's fine to clean up the locks. I'm not saying that locks shouldn't be released on disconnect. The assumption is that if the client has really died, it will also disconnect from the other bricks, which will release the locks. So, eventually, another client will have enough quorum to attempt a lock that will succeed. In other words, if a client gets disconnected from too many bricks simultaneously (loses Quorum), then that client can be considered bad and can return errors to the application. This should also cause the locks on the remaining connected bricks to be released.</div><div><br></div><div>On the other hand, if the disconnection is very short and the client has not died, it will keep enough locked files (it has quorum) to prevent other clients from successfully acquiring a lock. In this case, if the brick is reconnected, all existing locks should be reacquired to recover the original state before the disconnection.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>BTW, the referenced bug is not public. Should we open another bug to track this?</div></div></div></blockquote><div><br></div><div>I've just opened up the comment to give enough context. 
I'll open a bug upstream too.<br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br>Note that with AFR/replicate in the picture we can prevent errors to the application as long as a Quorum number of children have "never ever" lost connection with bricks after locks were acquired. I am using the term "never ever" because locks are not healed back after re-connection, and hence the first disconnect would've marked the fd bad, and the fd remains so even after re-connection happens. So, it's not just Quorum number of children "currently online", but Quorum number of children "never having disconnected from bricks after locks are acquired".<br></div></div></blockquote><div><br></div><div>I think this requirement is not feasible. In a distributed file system, sooner or later all bricks will be disconnected. It could be because of failures or because an upgrade is done, but it will happen.</div><div><br></div><div>The difference here is how long fds are kept open. If applications open and close files frequently enough (i.e. the fd is not kept open longer than it takes to have more than Quorum bricks disconnected) then there's no problem. The problem can only appear in applications that keep files open for a long time and also use POSIX locks. In this case, the only good solution I see is to retake the locks on brick reconnection.</div></div></div></blockquote><div><br></div><div>Agreed. But lock-healing should be done only by HA layers like AFR/EC, as only they know whether there are enough online bricks to have prevented any conflicting lock. Protocol/client itself doesn't have enough information to do that. 
If it's a plain distribute, I don't see a way to heal locks without losing the exclusivity of locks.<br></div></div></div></blockquote><div><br></div><div>Lock-healing of locks acquired while a brick was disconnected needs to be handled by AFR/EC. However, locks already present at the moment of disconnection could be recovered by the client xlator itself as long as the file has not been closed (which the client xlator already knows).</div></div></div></blockquote><div><br></div><div>What if another client (say mount-2) took locks while mount-1 was disconnected, modified the file, and unlocked? Having the client xlator do the heal may not be a good idea.<br></div></div></div></blockquote><div><br></div><div>To avoid that, we should ensure that any lock/unlock requests are sent to the client, even if we know it's disconnected, so that the client xlator can track them. The alternative is to duplicate and maintain code in both AFR and EC (and I'm not sure whether in DHT too, depending on how we want to handle some cases). <br></div></div></div></blockquote><div><br></div><div>I didn't understand the solution. I wanted to highlight that the client xlator by itself can't make a decision about healing locks because it doesn't know what happened on the other replicas. Suppose we have a replica-3 volume and all 3 client xlators get disconnected from their respective bricks. Another mount process can then take a lock on that file, modify it, and unlock. Upon reconnection, the old mount process that had the locks would think it always had the lock if the client xlator independently tries to heal its own locks, because the file has not been closed on it so far. But that is wrong. Let me know if this makes sense.<br></div></div></div></blockquote><div><br></div><div>My point of view is that any configuration with these requirements will have an appropriate quorum value so that it's impossible to have two or more partitions of the nodes working at the same time. 
So, under these assumptions, mount-1 can be in two situations:</div><div><br></div><div>1. It has lost a single brick and it's still operational. The other bricks will remain locked and everything should work fine from the point of view of the application. Any other application trying to get a lock will fail due to lack of quorum. When the lost brick comes back and is reconnected, the client xlator will still have the fd reference and the locks taken (unless the application has released the lock or closed the fd, in which case the client xlator should get notified and clear that information), so it should be able to recover the previous state.</div><div><br></div><div>2. It has lost 2 or 3 bricks. In this case mount-1 has lost quorum and any operation going to that file should fail with EIO. AFR should send a special request to the client xlator so that it forgets any fds and locks for that file. If bricks reconnect after that, no fd reopen or lock recovery will happen. Eventually the application should close the fd and retry later. This may or may not succeed, depending on whether mount-2 has already taken the lock.</div><div><br></div><div>So, it's true that the client xlator doesn't know the state of the other bricks, but it doesn't need to as long as AFR/EC strictly enforces quorum and updates the client xlator when quorum is lost.</div></div></div></blockquote><div><br></div><div>Just curious. Is there any reason why you think delegating the actual responsibility of re-opening or forgetting the locks to protocol/client is better than having AFR/EC do the actual work of re-opening files and reacquiring locks? 
I'm asking this because, in the case of plain distribute, DHT will also have to indicate Quorum loss on every disconnect (as Quorum consists of just 1 brick).<br><br></div><div>From what I understand, the design is in essence the same one that Pranith, Anoop, Vijay and I had discussed, but it varies in implementation details.<br><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>I haven't worked out all the details of this approach, but I think it should work and it's simpler to maintain than trying to do the same for AFR and EC.</div><div><br></div><div>Xavi</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>A similar thing could be done for open fds, since the current solution duplicates code in AFR and EC, but this is another topic...</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>Xavi</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>What I proposed is a short-term solution. The mid- to long-term solution should be a lock-healing feature implemented in AFR/EC. 
In fact I had this conversation with <a class="gmail_plusreply" id="gmail-m_-7303584048976714520gmail-m_6767185538127833152gmail-m_-6925379864911849292gmail-m_2647883271274464408gmail-m_-3376516905399996294gmail-m_5087482179637597128plusReplyChip-0" href="mailto:pkarampu@redhat.com" target="_blank">+Karampuri, Pranith</a> before posting this message to the mailing list.<br><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br>However, this use case is not affected if the application doesn't
acquire any POSIX locks. So, I am interested in knowing:<br>* Do your use cases use POSIX locks?<br></div>* Is it feasible for your application to re-open fds and re-acquire locks on seeing EBADFD errors?<br></div></blockquote><div><br></div><div>I think that many applications are not prepared to handle that.</div></div></div></blockquote><div><br></div><div>I suspected that too, and in fact I'm not too happy with the solution. But I went ahead with this mail because I heard that implementing lock-heal in AFR will take time, so there are no alternative short-term solutions.</div></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>Xavi</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br>[1] <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7</a><br><div><br></div><div>regards,<br></div><div>Raghavendra<div class="gmail-m_-7303584048976714520gmail-m_6767185538127833152gmail-m_-6925379864911849292gmail-m_2647883271274464408gmail-m_-3376516905399996294gmail-m_5087482179637597128gmail-m_2829317191536857760gmail-m_-100542626364393863gmail-adL"><br></div></div></div></div>
_______________________________________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>
<a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a></blockquote></div></div>
</blockquote></div></div>
</blockquote></div></div>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail-m_-7303584048976714520gmail-m_6767185538127833152gmail-m_-6925379864911849292gmail-m_2647883271274464408gmail_signature"><div dir="ltr">Pranith<br></div></div></div>
</blockquote></div></div>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail-m_-7303584048976714520gmail-m_6767185538127833152gmail_signature"><div dir="ltr">Pranith<br></div></div></div>
</blockquote></div></div>
</blockquote></div></div>
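<div dir="ltr">For reference, the two situations discussed in the thread (quorum kept after a disconnect vs. quorum lost) can be sketched as a small state model. This is only an illustration under stated assumptions: the class and method names (MountLockState, on_disconnect, on_reconnect) are hypothetical and are not Gluster APIs, quorum is assumed to be a simple majority, and the locks are assumed to have been acquired while all bricks were connected. It models the decision only, not the actual lock healing.</div>

```python
from dataclasses import dataclass, field


@dataclass
class MountLockState:
    """Hypothetical model of one mount's view of the bricks holding its locks.

    Not Gluster code: it only models the decision discussed in the thread.
    """

    brick_count: int
    connected: set = field(default_factory=set)
    locks_valid: bool = True  # sticky: once False, it never becomes True again

    def __post_init__(self):
        # Assumption: the locks were acquired while every brick was connected.
        if not self.connected:
            self.connected = set(range(self.brick_count))

    def quorum(self):
        # Assumption: simple majority (2 of 3; 1 of 1 for plain distribute).
        return self.brick_count // 2 + 1

    def on_disconnect(self, brick):
        """Record a disconnect; return True while the locks are still valid."""
        self.connected.discard(brick)
        if self.quorum() > len(self.connected):
            # Quorum lost: another mount may now take the lock, so forget
            # ours and fail further operations on the fd (situation 2).
            self.locks_valid = False
        return self.locks_valid

    def on_reconnect(self, brick):
        """Record a reconnect; return True if locks may be re-acquired on it."""
        self.connected.add(brick)
        # Situation 1: quorum was never lost, so saved locks can be healed.
        # Situation 2: locks were forgotten; reconnecting does not revive them.
        return self.locks_valid
```

<div dir="ltr">With brick_count=3, losing and regaining a single brick leaves locks_valid set, so the saved locks could be re-acquired on that brick. Once two bricks are lost, locks_valid is cleared and stays cleared after reconnection, so the application has to re-open the fd. With brick_count=1 (plain distribute) the very first disconnect clears it, which is the DHT concern raised above.</div>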