[Gluster-users] POSIX locks and disconnections between clients and bricks

Xavi Hernandez jahernan at redhat.com
Wed Mar 27 18:24:57 UTC 2019


On Wed, 27 Mar 2019, 18:26 Pranith Kumar Karampuri, <pkarampu at redhat.com>
wrote:

>
>
> On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez <jahernan at redhat.com>
> wrote:
>
>> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <
>> pkarampu at redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez <jahernan at redhat.com>
>>> wrote:
>>>
>>>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
>>>> pkarampu at redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez <jahernan at redhat.com>
>>>>> wrote:
>>>>>
>>>>>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez <jahernan at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Raghavendra,
>>>>>>>>
>>>>>>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> All,
>>>>>>>>>
>>>>>>>>> Glusterfs cleans up POSIX locks held on an fd when the
>>>>>>>>> client/mount through which those locks are held disconnects from the
>>>>>>>>> bricks/server. This helps Glusterfs avoid a stale lock problem later
>>>>>>>>> (for example, if the application unlocks while the connection is still
>>>>>>>>> down). However, this means the lock is no longer exclusive, as other
>>>>>>>>> applications/clients can acquire the same lock. To communicate that locks
>>>>>>>>> are no longer valid, we are planning to mark the fd (which has POSIX locks)
>>>>>>>>> bad on a disconnect so that any future operations on that fd will fail,
>>>>>>>>> forcing the application to re-open the fd and re-acquire the locks it needs [1].
>>>>>>>>>
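
For illustration only (this sketch is not part of the proposal itself): assuming
the failing call really does return EBADFD as described above, an application
would roughly have to re-open the file and re-acquire its POSIX lock before
retrying, along the lines below. The whole-file write lock and the retry-once
policy are invented for the example.

/*
 * Minimal sketch of application-side handling of a "bad" fd.
 * Assumption: operations on the marked fd fail with EBADFD.
 */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int reopen_and_relock(const char *path, int *fd)
{
    struct flock fl = {
        .l_type   = F_WRLCK,   /* exclusive write lock            */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 = lock the whole file         */
    };

    if (*fd >= 0)
        close(*fd);            /* drop the bad fd                 */

    *fd = open(path, O_RDWR);
    if (*fd < 0)
        return -1;

    /* Blocking lock request. Note the application must also revalidate
     * any state it protected with the old lock, since another client
     * may have held the lock in between. */
    return fcntl(*fd, F_SETLKW, &fl);
}

static ssize_t safe_write(const char *path, int *fd, const void *buf,
                          size_t len)
{
    ssize_t ret = write(*fd, buf, len);

    if (ret < 0 && errno == EBADFD) {
        /* fd was marked bad after a disconnect: re-open, re-lock, retry. */
        if (reopen_and_relock(path, fd) < 0)
            return -1;
        ret = write(*fd, buf, len);
    }
    return ret;
}
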
>>>>>>>>
>>>>>>>> Wouldn't it be better to retake the locks when the brick is
>>>>>>>> reconnected, if the lock is still in use?
>>>>>>>>
>>>>>>>
>>>>>>> There is also a possibility that clients may never reconnect.
>>>>>>> That's the primary reason why bricks assume the worst (the client will
>>>>>>> not reconnect) and clean up the locks.
>>>>>>>
>>>>>>
>>>>>> True, so it's fine to clean up the locks. I'm not saying that locks
>>>>>> shouldn't be released on disconnect. The assumption is that if the client
>>>>>> has really died, it will also disconnect from the other bricks, which will
>>>>>> release the locks. So, eventually, another client will have enough quorum
>>>>>> to attempt a lock that will succeed. In other words, if a client gets
>>>>>> disconnected from too many bricks simultaneously (loses quorum), then that
>>>>>> client can be considered bad and can return errors to the application.
>>>>>> This should also cause the locks on the remaining connected bricks to be
>>>>>> released.
>>>>>>
>>>>>> On the other hand, if the disconnection is very short and the client
>>>>>> has not died, it will keep enough locked files (it has quorum) to prevent
>>>>>> other clients from successfully acquiring a lock. In this case, if the
>>>>>> brick is reconnected, all existing locks should be reacquired to recover
>>>>>> the original state before the disconnection.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> BTW, the referenced bug is not public. Should we open another bug
>>>>>>>> to track this?
>>>>>>>>
>>>>>>>
>>>>>>> I've just opened up the comment to give enough context. I'll open a
>>>>>>> bug upstream too.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Note that with AFR/replicate in the picture we can prevent errors to
>>>>>>>>> the application as long as a quorum number of children have "never ever"
>>>>>>>>> lost their connection to the bricks after the locks have been acquired.
>>>>>>>>> I am using the term "never ever" because locks are not healed back after
>>>>>>>>> re-connection, and hence the first disconnect would've marked the fd bad
>>>>>>>>> and the fd remains bad even after re-connection happens. So, it's not
>>>>>>>>> just a quorum number of children "currently online", but a quorum number
>>>>>>>>> of children "never having disconnected from their bricks after the locks
>>>>>>>>> were acquired".
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think this requirement is not feasible. In a distributed file
>>>>>>>> system, sooner or later all bricks will be disconnected. It could be
>>>>>>>> because of failures or because an upgrade is done, but it will happen.
>>>>>>>>
>>>>>>>> The difference here is how long fds are kept open. If applications
>>>>>>>> open and close files frequently enough (i.e. the fd is not kept open
>>>>>>>> longer than it takes for more than a quorum of bricks to disconnect),
>>>>>>>> then there's no problem. The problem can only appear with applications
>>>>>>>> that keep files open for a long time and also use POSIX locks. In this
>>>>>>>> case, the only good solution I see is to retake the locks on brick
>>>>>>>> reconnection.
>>>>>>>>
>>>>>>>
>>>>>>> Agree. But lock-healing should be done only by HA layers like AFR/EC,
>>>>>>> as only they know whether there are enough online bricks to have prevented
>>>>>>> any conflicting lock. Protocol/client itself doesn't have enough
>>>>>>> information to do that. If it's a plain distribute volume, I don't see a
>>>>>>> way to heal locks without losing the exclusivity property of locks.
>>>>>>>
>>>>>>
>>>>>> Lock-healing of locks acquired while a brick was disconnected needs to
>>>>>> be handled by AFR/EC. However, locks already present at the moment of the
>>>>>> disconnection could be recovered by the client xlator itself, as long as
>>>>>> the file has not been closed (which the client xlator already knows).
>>>>>>
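
To make this suggested division of work concrete, here is a purely hypothetical
sketch; none of these names exist in the GlusterFS sources. The idea is only
that the client xlator already sees every successful lock/unlock on an fd, so
it could keep a per-fd list of granted locks and replay it when the brick comes
back, provided the fd was never closed in the meantime.

/*
 * Hypothetical bookkeeping in the client xlator; all names are invented
 * for illustration.
 */
#include <stdlib.h>

struct held_lock {
    short             type;       /* e.g. F_RDLCK or F_WRLCK     */
    long long         start;
    long long         len;
    struct held_lock *next;
};

struct tracked_fd {
    int               closed;     /* set when the fd is released  */
    struct held_lock *locks;      /* locks granted on this fd     */
};

/* Called when a lock request succeeds on the brick. */
static void fd_record_lock(struct tracked_fd *tfd, short type,
                           long long start, long long len)
{
    struct held_lock *l = calloc(1, sizeof(*l));
    if (!l)
        return;
    l->type  = type;
    l->start = start;
    l->len   = len;
    l->next  = tfd->locks;
    tfd->locks = l;
}

/* Called on reconnection: re-send every recorded lock, but only if the
 * fd is still open.  send_lock_request() stands in for whatever RPC the
 * real implementation would use. */
static int fd_replay_locks(struct tracked_fd *tfd,
                           int (*send_lock_request)(short, long long,
                                                    long long))
{
    struct held_lock *l;

    if (tfd->closed)
        return 0;               /* nothing to recover            */

    for (l = tfd->locks; l; l = l->next)
        if (send_lock_request(l->type, l->start, l->len) != 0)
            return -1;          /* caller decides: mark fd bad   */
    return 0;
}
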
>>>>>
>>>>> What if another client (say mount-2) took locks while mount-1 was
>>>>> disconnected, modified the file and unlocked? Having the client xlator do
>>>>> the heal may not be a good idea.
>>>>>
>>>>
>>>> To avoid that we should ensure that any lock/unlock requests are sent to
>>>> the client xlator, even if we know it's disconnected, so that the client
>>>> xlator can track them. The alternative is to duplicate and maintain code
>>>> in both AFR and EC (and possibly even in DHT, depending on how we want to
>>>> handle some cases).
>>>>
>>>
>>> I didn't understand the solution. I wanted to highlight that the client
>>> xlator by itself can't make a decision about healing locks because it
>>> doesn't know what happened on the other replicas. Suppose we have a
>>> replica-3 volume and all 3 client xlators get disconnected from their
>>> respective bricks. Now another mount process can take a lock on that file,
>>> modify it and unlock. Upon reconnection, the old mount process which had
>>> the locks would think it always had the lock if the client xlator
>>> independently tries to heal its own locks, because the file has not been
>>> closed on it so far. But that is wrong. Let me know if it makes sense...
>>>
>>
>> My point of view is that any configuration with these requirements will
>> have an appropriate quorum value, so that it's impossible to have two or
>> more partitions of the nodes working at the same time. So, under this
>> assumption, mount-1 can be in one of two situations:
>>
>> 1. It has lost a single brick and is still operational. The other bricks
>> will remain locked and everything should work fine from the point of view
>> of the application. Any other application trying to get a lock will fail
>> due to lack of quorum. When the lost brick comes back and is reconnected,
>> the client xlator will still have the fd reference and the locks taken
>> (unless the application has released the lock or closed the fd, in which
>> case the client xlator should get notified and clear that information), so
>> it should be able to recover the previous state.
>>
>
> The application could also be in a blocked state if it tried to get a
> blocking lock. So as soon as a disconnect happens, the lock on that brick
> will be granted to one of the blocked requests, while on the other two
> bricks it would still be blocked. Healing that will require a new operation,
> not present in the locks code today, that can tell the client either to
> change the lock state back to blocked on that brick or to retry the lock
> operation.
>

Yes, but this problem exists even if the lock-heal is done by AFR/EC. This
is something that needs to be solved anyway, but it's independent of who
does the lock-heal.


>
>>
>> 2. It has lost 2 or 3 bricks. In this case mount-1 has lost quorum and
>> any operation going to that file should fail with EIO. AFR should send a
>> special request to the client xlator so that it forgets any fds and locks
>> for that file. If bricks reconnect after that, no fd reopen or lock
>> recovery will happen. Eventually the application should close the fd and
>> retry later. This may succeed or not, depending on whether mount-2 has
>> already taken the lock.
>>
>> So, it's true that the client xlator doesn't know the state of the other
>> bricks, but it doesn't need to, as long as AFR/EC strictly enforces quorum
>> and updates the client xlator when quorum is lost.
>>
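
Sketching the interaction described above (again hypothetical; the names and
the notification mechanism are invented for illustration): AFR/EC tracks how
many children still back the locks and, on losing quorum, tells the client
xlators to forget the fd's lock state, so that no recovery is attempted on a
later reconnect.

/*
 * Hypothetical quorum-driven decision for the two scenarios above.
 * None of these identifiers correspond to real GlusterFS code.
 */
enum fd_lock_state {
    FD_LOCKS_RECOVERABLE,   /* scenario 1: replay locks on reconnect  */
    FD_LOCKS_FORGOTTEN,     /* scenario 2: fd is bad, fail with EIO   */
};

struct replica_fd {
    int                connected_children;  /* bricks still connected  */
    int                quorum;              /* e.g. 2 for replica-3    */
    enum fd_lock_state state;
};

/* Called by AFR/EC whenever a child brick goes down. */
static void afr_child_down(struct replica_fd *rfd)
{
    if (rfd->connected_children > 0)
        rfd->connected_children--;

    if (rfd->connected_children < rfd->quorum)
        /* Quorum lost: tell the client xlators to drop fd/lock state.
         * From here on, operations on this fd fail with EIO and no
         * lock recovery happens on reconnection. */
        rfd->state = FD_LOCKS_FORGOTTEN;
}

/* Called when a child brick reconnects: returns whether the client
 * xlator is allowed to replay its recorded locks. */
static int client_child_up(struct replica_fd *rfd)
{
    rfd->connected_children++;

    /* Only replay locks if quorum was never lost while disconnected. */
    return rfd->state == FD_LOCKS_RECOVERABLE;
}
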
>
> This part seems good.
>
>
>>
>> I haven't worked out all the details of this approach, but I think it
>> should work, and it's simpler to maintain than implementing the same logic
>> in both AFR and EC.
>>
>
> Let us spend some time on this in #gluster-dev when you get some time
> tomorrow, to figure out a complete solution that handles the corner cases
> too.
>
>
>>
>> Xavi
>>
>>
>>>
>>>> A similar thing could be done for open fds, since the current solution
>>>> duplicates code in AFR and EC, but that is another topic...
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>>
>>>>>>> What I proposed is a short-term solution. The mid- to long-term
>>>>>>> solution should be a lock-healing feature implemented in AFR/EC. In fact
>>>>>>> I had this conversation with +Karampuri, Pranith <pkarampu at redhat.com>
>>>>>>> before posting this message to the ML.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> However, this use case is not affected if the application doesn't
>>>>>>>>> acquire any POSIX locks. So, I am interested in knowing:
>>>>>>>>> * whether your use cases use POSIX locks?
>>>>>>>>> * whether it is feasible for your application to re-open fds and
>>>>>>>>> re-acquire locks on seeing EBADFD errors?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think that many applications are not prepared to handle that.
>>>>>>>>
>>>>>>>
>>>>>>> I suspected that too, and in fact I'm not too happy with the solution.
>>>>>>> But I went ahead with this mail as I heard that implementing lock-heal
>>>>>>> in AFR will take time, and hence there is no alternative short-term
>>>>>>> solution.
>>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Xavi
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> Raghavendra
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Gluster-users mailing list
>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>> Pranith
>>>>>
>>>>
>>>
>>> --
>>> Pranith
>>>
>>
>
> --
> Pranith
>