[Gluster-users] POSIX locks and disconnections between clients and bricks

Thu Mar 28 09:18:12 UTC 2019

On Thu, Mar 28, 2019 at 2:37 PM Xavi Hernandez <jahernan at redhat.com> wrote:

> On Thu, Mar 28, 2019 at 3:05 AM Raghavendra Gowdappa <rgowdapp at redhat.com>
> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez <jahernan at redhat.com>
>> wrote:
>>
>>> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <
>>> pkarampu at redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez <jahernan at redhat.com>
>>>> wrote:
>>>>
>>>>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
>>>>> pkarampu at redhat.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez <jahernan at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez <
>>>>>>>> jahernan at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Raghavendra,
>>>>>>>>>
>>>>>>>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> All,
>>>>>>>>>>
>>>>>>>>>> Glusterfs cleans up POSIX locks held on an fd when the
>>>>>>>>>> client/mount through which those locks are held disconnects from
>>>>>>>>>> bricks/server. This helps Glusterfs avoid a stale lock problem
>>>>>>>>>> later (e.g., if the application unlocks while the connection is still
>>>>>>>>>> down). However, this means the lock is no longer exclusive, as other
>>>>>>>>>> applications/clients can acquire the same lock. To communicate that locks
>>>>>>>>>> are no longer valid, we are planning to mark the fd (which has POSIX locks)
>>>>>>>>>> bad on a disconnect so that any future operations on that fd will fail,
>>>>>>>>>> forcing the application to re-open the fd and re-acquire locks it needs [1].
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Wouldn't it be better to retake the locks when the brick is
>>>>>>>>> reconnected if the lock is still in use ?
>>>>>>>>>
>>>>>>>>
>>>>>>>> There is also a possibility that clients may never reconnect.
>>>>>>>> That's the primary reason why bricks assume the worst (client will not
>>>>>>>> reconnect) and clean up the locks.
>>>>>>>>
>>>>>>>
>>>>>>> True, so it's fine to clean up the locks. I'm not saying that locks
>>>>>>> shouldn't be released on disconnect. The assumption is that if the client
>>>>>>> has really died, it will also disconnect from the other bricks, which will
>>>>>>> release the locks. So, eventually, another client will have enough quorum
>>>>>>> to attempt a lock that will succeed. In other words, if a client gets
>>>>>>> disconnected from too many bricks simultaneously (loses Quorum), then that
>>>>>>> client can be considered bad and can return errors to the application.
>>>>>>> This should also cause the locks on the remaining connected bricks to be
>>>>>>> released.
>>>>>>>
>>>>>>> On the other hand, if the disconnection is very short and the client
>>>>>>> has not died, it will keep enough files locked (it has quorum) to prevent
>>>>>>> other clients from successfully acquiring a lock. In this case, if the brick
>>>>>>> is reconnected, all existing locks should be reacquired to recover the
>>>>>>> original state before the disconnection.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> BTW, the referenced bug is not public. Should we open another bug
>>>>>>>>> to track this ?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I've just opened up the comment to give enough context. I'll open a
>>>>>>>> bug upstream too.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Note that with AFR/replicate in the picture we can prevent errors to the
>>>>>>>>>> application as long as a Quorum number of children "never ever" lost
>>>>>>>>>> connection with bricks after locks have been acquired. I am using the term
>>>>>>>>>> "never ever" as locks are not healed back after re-connection and hence
>>>>>>>>>> the first disconnect would've marked the fd bad and the fd remains so even
>>>>>>>>>> after re-connection happens. So, it's not just a Quorum number of children
>>>>>>>>>> "currently online", but a Quorum number of children "never having
>>>>>>>>>> disconnected from bricks after locks are acquired".
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think this requirement is not feasible. In a distributed file
>>>>>>>>> system, sooner or later all bricks will be disconnected. It could be
>>>>>>>>> because of failures or because an upgrade is done, but it will happen.
>>>>>>>>>
>>>>>>>>> The difference here is how long fds are kept open. If
>>>>>>>>> applications open and close files frequently enough (i.e. the fd is not
>>>>>>>>> kept open longer than it takes to have more than Quorum bricks
>>>>>>>>> disconnected) then there's no problem. The problem can only appear in
>>>>>>>>> applications that keep files open for a long time and also use POSIX
>>>>>>>>> locks. In this case, the only good solution I see is to retake the locks
>>>>>>>>> on brick reconnection.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Agree. But lock-healing should be done only by HA layers like
>>>>>>>> AFR/EC as only they know whether there are enough online bricks to have
>>>>>>>> prevented any conflicting lock. Protocol/client itself doesn't have enough
>>>>>>>> information to do that. If it's a plain distribute, I don't see a way to
>>>>>>>> heal locks without losing the property of exclusivity of locks.
>>>>>>>>
>>>>>>>
>>>>>>> Lock-healing of locks acquired while a brick was disconnected needs
>>>>>>> to be handled by AFR/EC. However, locks already present at the moment of
>>>>>>> disconnection could be recovered by client xlator itself as long as the
>>>>>>> file has not been closed (which client xlator already knows).
>>>>>>>
>>>>>>
>>>>>> What if another client (say mount-2) took locks at the time of
>>>>>> disconnect from mount-1 and modified the file and unlocked? client xlator
>>>>>> doing the heal may not be a good idea.
>>>>>>
>>>>>
>>>>> To avoid that we should ensure that all lock/unlock requests are sent to
>>>>> the client xlator, even if we know it's disconnected, so that client xlator
>>>>> can track them. The alternative is to duplicate and maintain code in both
>>>>> AFR and EC (and possibly even in DHT, depending on how we want to handle
>>>>> some cases).
>>>>>
>>>>
>>>> I didn't understand the solution. I wanted to highlight that client
>>>> xlator by itself can't make a decision about healing locks because it
>>>> doesn't know what happened on other replicas. Suppose we have a replica-3
>>>> volume and all 3 client xlators get disconnected from their respective
>>>> bricks. Now another mount process can take a lock on that file, modify it
>>>> and unlock. Upon reconnection, the old mount process which had locks would
>>>> think it always had the lock if client xlator independently tries to heal
>>>> its own locks, because the file is not closed on it so far. But that is
>>>> wrong. Let me know if it makes sense....
>>>>
>>>
>>> My point of view is that any configuration with these requirements will
>>> have an appropriate quorum value so that it's impossible to have two or
>>> more partitions of the nodes working at the same time. So, under this
>>> assumption, mount-1 can be in two situations:
>>>
>>> 1. It has lost a single brick and it's still operational. The other
>>> bricks will remain locked and everything should work fine from the point
>>> of view of the application. Any other application trying to get a lock will
>>> fail due to lack of quorum. When the lost brick comes back and is
>>> reconnected, client xlator will still have the fd reference and locks taken
>>> (unless the application has released the lock or closed the fd, in which
>>> case client xlator should get notified and clear that information), so it
>>> should be able to recover the previous state.
>>>
>>> 2. It has lost 2 or 3 bricks. In this case mount-1 has lost quorum and
>>> any operation going to that file should fail with EIO. AFR should send a
>>> special request to client xlator so that it forgets any fd's and locks for
>>> that file. If bricks reconnect after that, no fd reopen or lock recovery
>>> will happen. Eventually the application should close the fd and retry
>>> later. This may succeed or not, depending on whether mount-2 has taken the
>>> lock already or not.
>>>
>>> So, it's true that client xlator doesn't know the state of the other
>>> bricks, but it doesn't need to as long as AFR/EC strictly enforces quorum
>>> and updates client xlator when quorum is lost.
>>>
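
The two situations above can be sketched roughly as follows. This is only
an illustration in Python, not Gluster code: the FdLockState class, its
fields and the on_disconnect/on_reconnect hooks are hypothetical names,
and it assumes a single fd on a replica set whose quorum value is given.

```python
from dataclasses import dataclass, field


@dataclass
class FdLockState:
    """Hypothetical per-fd bookkeeping for one mount (illustration only)."""
    quorum: int                                  # e.g. 2 for replica-3
    connected: set = field(default_factory=set)  # ids of reachable bricks
    locks: set = field(default_factory=set)      # (start, length) ranges held
    bad: bool = False                            # True once quorum was lost

    def on_disconnect(self, brick):
        self.connected.discard(brick)
        if len(self.connected) < self.quorum:
            # Situation 2: quorum lost. Forget the locks; the fd stays bad
            # even if bricks come back later.
            self.locks.clear()
            self.bad = True

    def on_reconnect(self, brick):
        """Return the lock ranges to re-acquire on this brick (maybe none)."""
        self.connected.add(brick)
        if self.bad:
            return set()          # no fd reopen, no lock recovery
        # Situation 1: quorum was never lost, so no other mount could have
        # taken a conflicting lock; recover the previous state.
        return set(self.locks)
```

With quorum=2 on three bricks, losing one brick and getting it back yields
the original lock ranges, while losing two bricks leaves the fd bad and
nothing is re-acquired on reconnect.
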
>>
>> Just curious. Is there any reason why you think delegating the actual
>> responsibility of re-opening or forgetting the locks to protocol/client is
>> better when compared to AFR/EC doing the actual work of re-opening files
>> and reacquiring locks? Asking this because, in the case of plain
>> distribute, DHT will also have to indicate Quorum loss on every disconnect
>> (as Quorum consists of just 1 brick).
>>
>
> The basic reason is that doing that in AFR and EC requires code
> duplication. The code is not expected to be simple either, so it can
> contain bugs or it could require improvements eventually. Every time we
> want to make a change, we would have to fix both AFR and EC, but this has
> not happened in many cases in the past for features that are already
> duplicated in AFR and EC, so it's quite unlikely that it will happen in
> the future.
>

That's a good reason. +1.

>
> Regarding the requirement of sending a quorum loss notification from DHT,
> I agree it's a new thing, but it's way simpler to do than the fd and lock
> heal logic.
>
> Xavi
>
>
>> From what I understand, the design is the same one that Pranith, Anoop,
>> Vijay and I had discussed (in essence), but it varies in implementation
>> details.
>>
>>
>>> I haven't worked out all the details of this approach, but I think it
>>> should work and it's simpler to maintain than trying to do the same for AFR
>>> and EC.
>>>
>>> Xavi
>>>
>>>
>>>>
>>>>> A similar thing could be done for open fd, since the current solution
>>>>> duplicates code in AFR and EC, but this is another topic...
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>
>>>>>>>> What I proposed is a short-term solution. The mid- to long-term solution
>>>>>>>> should be a lock-healing feature implemented in AFR/EC. In fact I had this
>>>>>>>> conversation with +Karampuri, Pranith <pkarampu at redhat.com> before
>>>>>>>> posting this msg to ML.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> However, this use case is not affected if the application doesn't
>>>>>>>>>> acquire any POSIX locks. So, I am interested in knowing
>>>>>>>>>> * whether your use cases use POSIX locks?
>>>>>>>>>> * whether it is feasible for your application to re-open fds and
>>>>>>>>>> re-acquire locks on seeing EBADFD errors?
>>>>>>>>>>
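
For reference, this is roughly what "re-open fds and re-acquire locks"
would mean on the application side. It is a minimal Python sketch under
assumptions, not a recommended pattern: open_and_lock and
write_with_relock are hypothetical helpers, the lock covers the whole
file, and EBADF is treated the same as EBADFD (which only exists on some
platforms).

```python
import errno
import fcntl
import os

# EBADFD is Linux-specific; fall back to EBADF where it is missing.
BAD_FD_ERRNOS = {errno.EBADF, getattr(errno, "EBADFD", errno.EBADF)}


def open_and_lock(path):
    """Open path and take an exclusive POSIX (fcntl) lock on the whole file."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)   # blocks until the lock is granted
    except OSError:
        os.close(fd)
        raise
    return fd


def write_with_relock(fd, path, data, retries=1):
    """Write data; if the fd went bad, re-open, re-lock and retry.

    Returns the fd that is valid afterwards (it may be a new one).
    """
    for attempt in range(retries + 1):
        try:
            os.write(fd, data)
            return fd
        except OSError as exc:
            if exc.errno not in BAD_FD_ERRNOS or attempt == retries:
                raise
            try:
                os.close(fd)             # the old fd is useless now
            except OSError:
                pass
            # The lock was lost in between, so another client may have
            # changed the file; a real application must revalidate here.
            fd = open_and_lock(path)
```

Note that the retry restores the lock but not exclusivity over the gap:
anything the application cached about the file before the error has to
be re-read after re-locking.
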
>>>>>>>>>
>>>>>>>>> I think that many applications are not prepared to handle that.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I too suspected that, and in fact I am not too happy with the solution.
>>>>>>>> But I went ahead with this mail as I heard implementing lock-heal in AFR
>>>>>>>> will take time and hence there are no alternative short-term solutions.
>>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>>>>>>>>>
>>>>>>>>>> regards,
>>>>>>>>>> Raghavendra
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Pranith
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Pranith
>>>>
>>>