[Gluster-users] POSIX locks and disconnections between clients and bricks

Pranith Kumar Karampuri pkarampu at redhat.com
Wed Mar 27 12:13:06 UTC 2019


On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez <jahernan at redhat.com> wrote:

> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <rgowdapp at redhat.com>
> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez <jahernan at redhat.com>
>> wrote:
>>
>>> Hi Raghavendra,
>>>
>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>> rgowdapp at redhat.com> wrote:
>>>
>>>> All,
>>>>
>>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>>>> through which those locks are held disconnects from bricks/server. This
>>>> helps Glusterfs to not run into a stale lock problem later (e.g., if the
>>>> application unlocks while the connection is still down). However, this
>>>> means the lock is no longer exclusive as other applications/clients can
>>>> acquire the same lock. To communicate that locks are no longer valid, we
>>>> are planning to mark the fd (which has POSIX locks) bad on a disconnect so
>>>> that any future operations on that fd will fail, forcing the application to
>>>> re-open the fd and re-acquire locks it needs [1].
>>>>
>>>
>>> Wouldn't it be better to retake the locks when the brick is reconnected
>>> if the lock is still in use?
>>>
>>
>> There is also a possibility that clients may never reconnect. That's the
>> primary reason why bricks assume the worst (client will not reconnect) and
>> clean up the locks.
>>
>
> True, so it's fine to clean up the locks. I'm not saying that locks
> shouldn't be released on disconnect. The assumption is that if the client
> has really died, it will also disconnect from the other bricks, which will
> release their locks. So, eventually, another client will have enough quorum
> to attempt a lock that will succeed. In other words, if a client gets
> disconnected from too many bricks simultaneously (loses quorum), then that
> client can be considered bad and can return errors to the application.
> This should also cause the locks on the remaining connected bricks to be
> released.
>
> On the other hand, if the disconnection is very short and the client has
> not died, it will keep enough files locked (it has quorum) to prevent other
> clients from successfully acquiring a lock. In this case, if the brick
> reconnects, all existing locks should be reacquired to recover the
> original state before the disconnection.
>
>
>>
>>> BTW, the referenced bug is not public. Should we open another bug to
>>> track this?
>>>
>>
>> I've just opened up the comment to give enough context. I'll open a bug
>> upstream too.
>>
>>
>>>
>>>
>>>>
>>>> Note that with AFR/replicate in the picture, we can avoid returning errors
>>>> to the application as long as a quorum of children has "never ever" lost
>>>> its connection to the bricks after the locks were acquired. I am using the
>>>> term "never ever" because locks are not healed after reconnection, so the
>>>> first disconnect marks the fd bad and the fd remains bad even after
>>>> reconnection. So, it's not just a quorum of children "currently online",
>>>> but a quorum of children "never having disconnected from the bricks after
>>>> the locks were acquired".
>>>>
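
To make the difference between "currently online" and "never having
disconnected" concrete, here is a minimal Python sketch. It is only an
illustration: the function names and the event format are hypothetical,
and it assumes a simple-majority quorum, which is not necessarily how AFR
quorum is configured on a given volume.

```python
def has_quorum(count, total):
    """Simple-majority quorum (an assumption made for this sketch)."""
    return count > total // 2


def quorum_views(events, total_bricks):
    """Return (online_quorum, lock_quorum) after replaying events.

    events is a list of (brick_id, "disconnect" | "reconnect") tuples
    observed after the locks were acquired. All names are hypothetical.
    """
    online = set(range(total_bricks))
    never_disconnected = set(range(total_bricks))
    for brick_id, event in events:
        if event == "disconnect":
            online.discard(brick_id)
            # Locks are not healed, so this brick never counts again.
            never_disconnected.discard(brick_id)
        elif event == "reconnect":
            online.add(brick_id)
    return (has_quorum(len(online), total_bricks),
            has_quorum(len(never_disconnected), total_bricks))
```

With three bricks, two of which briefly disconnect and come back one after
the other, the first value is true (all bricks are online again) while the
second is false (only one brick still holds the original lock), so the fd
would stay bad.
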
>>>
>>> I think this requirement is not feasible. In a distributed file system,
>>> sooner or later all bricks will be disconnected. It could be because of
>>> failures or because an upgrade is done, but it will happen.
>>>
>>> The difference here is how long fds are kept open. If applications open
>>> and close files frequently enough (i.e. the fd is not kept open longer
>>> than it takes for more than a quorum of bricks to disconnect), then there's
>>> no problem. The problem can only appear in applications that keep files
>>> open for a long time and also use POSIX locks. In this case, the only good
>>> solution I see is to retake the locks on brick reconnection.
>>>
>>
>> Agree. But lock-healing should be done only by HA layers like AFR/EC, as
>> only they know whether there are enough online bricks to have prevented any
>> conflicting lock. Protocol/client itself doesn't have enough information to
>> do that. If it's a plain distribute, I don't see a way to heal locks without
>> losing the exclusivity of the locks.
>>
>
> Lock-healing of locks acquired while a brick was disconnected needs to be
> handled by AFR/EC. However, locks already present at the moment of
> disconnection could be recovered by the client xlator itself as long as the
> file has not been closed (which the client xlator already knows).
>

What if another client (say mount-2) took the locks while mount-1 was
disconnected, modified the file, and unlocked? Having the client xlator do
the heal may not be a good idea.
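
A toy Python model of that sequence, just to illustrate the concern. The
Brick class and run_scenario() are hypothetical and model a single lock on
a single brick; this is not Gluster code.

```python
class Brick:
    """Toy brick: at most one client holds the (single) lock at a time."""

    def __init__(self):
        self.owner = None   # client currently holding the lock
        self.data = "v0"    # file contents guarded by the lock

    def lock(self, client):
        if self.owner not in (None, client):
            return False
        self.owner = client
        return True

    def unlock(self, client):
        if self.owner == client:
            self.owner = None

    def disconnect(self, client):
        # The brick assumes the client may never return and drops its lock.
        self.unlock(client)


def run_scenario():
    brick = Brick()

    # mount-1 locks the file and caches what it read under the lock.
    assert brick.lock("mount-1")
    cached_by_mount1 = brick.data

    # mount-1 loses its connection; the brick cleans up the lock.
    brick.disconnect("mount-1")

    # mount-2 takes the now-free lock, modifies the file, and unlocks.
    assert brick.lock("mount-2")
    brick.data = "v1"
    brick.unlock("mount-2")

    # mount-1 reconnects and the client layer blindly re-takes the lock.
    healed = brick.lock("mount-1")
    return healed, cached_by_mount1, brick.data
```

In this model the re-acquisition succeeds because nobody holds the lock at
reconnect time, yet mount-1 still believes the contents are "v0" while the
brick has "v1". Nothing at the client layer tells the application that its
exclusivity was interrupted.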


>
> Xavi
>
>
>> What I proposed is a short-term solution. The mid- to long-term solution
>> should be a lock-healing feature implemented in AFR/EC. In fact I had this
>> conversation with +Karampuri, Pranith <pkarampu at redhat.com> before
>> posting this msg to ML.
>>
>>
>>>
>>>> However, this use case is not affected if the application doesn't acquire
>>>> any POSIX locks. So, I am interested in knowing:
>>>> * Do your use cases use POSIX locks?
>>>> * Is it feasible for your application to re-open fds and re-acquire
>>>> locks on seeing EBADFD errors?
>>>>
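
For reference, this is roughly what the re-open and re-acquire handling
asked about above could look like on the application side. It is a sketch
under assumptions: run_locked() and open_locked() are hypothetical
helpers, the lock is a whole-file exclusive lock taken with fcntl.lockf(),
and EBADFD is looked up with a fallback because it is Linux-specific.

```python
import errno
import fcntl
import os

# EBADFD exists on Linux; falling back to EBADF elsewhere is an assumption.
BAD_FD = getattr(errno, "EBADFD", errno.EBADF)


def open_locked(path):
    """Open path read-write and take an exclusive POSIX lock on it."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)
    except OSError:
        os.close(fd)
        raise
    return fd


def run_locked(path, operation, max_attempts=3):
    """Run operation(fd) under the lock, re-opening if the fd went bad."""
    last_error = None
    for _ in range(max_attempts):
        fd = open_locked(path)
        try:
            return operation(fd)
        except OSError as exc:
            if exc.errno != BAD_FD:
                raise
            last_error = exc  # fd was marked bad: re-open and re-lock
        finally:
            os.close(fd)  # closing also drops this process's POSIX locks
    raise last_error
```

Note that operation() has to start from scratch on every attempt: once the
lock has been lost, another client may have changed the file, so anything
read under the old lock can no longer be trusted.
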
>>>
>>> I think that many applications are not prepared to handle that.
>>>
>>
>> I too suspected that, and in fact I'm not too happy with the solution. But
>> I went ahead with this mail as I heard implementing lock-heal in AFR will
>> take time and hence there are no alternative short-term solutions.
>>
>
>>
>>> Xavi
>>>
>>>
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>>>
>>>> regards,
>>>> Raghavendra
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>>

-- 
Pranith