[Gluster-devel] Replace cluster wide gluster locks with volume wide locks

Avra Sengupta asengupt at redhat.com
Thu Sep 12 19:00:08 UTC 2013


Hi,

After further discussion, we revisited the requirements, and it looks possible to improve both the requirements
and the design.

1. We classify all gluster operations into three classes: volume create, volume delete, and volume-specific
    operations.
2. At any given point of time, we should allow two simultaneous operations (create, delete, or volume-specific), as
    long as both operations are not targeting the same volume.
3. If two simultaneous operations are performed on the same volume, the operation that manages to acquire the volume
    lock will succeed, while the other will fail.

In order to achieve this, we propose a locking engine, which will receive lock requests from these three types of
operations. Each such request for a particular volume will contest for the same volume lock, keyed on the volume name
and recording the node-uuid of the owner. For example, a delete volume command for volume1 and a volume status command
for volume1 will contest for the same lock (comprising the volume name and the uuid of the node winning the lock), in
which case one of these commands will succeed and the other will fail, being unable to acquire the lock.

If, on the other hand, two operations are performed simultaneously on two different volumes, they should proceed
smoothly, as the two operations would request two different locks from the locking engine and would acquire them
in parallel.
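
To make the idea concrete, here is a minimal, self-contained sketch of such a locking engine. This is illustrative
only and not actual glusterd code: the names vol_lock_table, volume_trylock() and volume_unlock() are placeholders,
the node uuid is kept as a plain string rather than a uuid_t, and the real engine would of course have to keep this
state consistent across all the peers.

/* Hypothetical sketch of the proposed locking engine: a table of
 * per-volume locks, each recording the uuid of the node that
 * currently owns it. */
#include <stdio.h>
#include <string.h>

#define MAX_VOLUMES 256
#define NAME_LEN    256
#define UUID_LEN    64   /* a real implementation would use uuid_t from libuuid */

struct vol_lock {
    char volname[NAME_LEN];
    char owner_uuid[UUID_LEN];  /* empty string => unlocked */
};

static struct vol_lock vol_lock_table[MAX_VOLUMES];

/* Find (or create) the slot for a volume name. */
static struct vol_lock *lookup_slot(const char *volname)
{
    for (int i = 0; i < MAX_VOLUMES; i++) {
        if (strcmp(vol_lock_table[i].volname, volname) == 0)
            return &vol_lock_table[i];
    }
    for (int i = 0; i < MAX_VOLUMES; i++) {
        if (vol_lock_table[i].volname[0] == '\0') {
            snprintf(vol_lock_table[i].volname, NAME_LEN, "%s", volname);
            return &vol_lock_table[i];
        }
    }
    return NULL; /* table full */
}

/* Try to acquire the lock for 'volname' on behalf of 'node_uuid'.
 * Returns 0 on success, -1 if another operation already holds it. */
int volume_trylock(const char *volname, const char *node_uuid)
{
    struct vol_lock *slot = lookup_slot(volname);
    if (!slot)
        return -1;
    if (slot->owner_uuid[0] != '\0')
        return -1;                 /* contested: the caller must fail the operation */
    snprintf(slot->owner_uuid, UUID_LEN, "%s", node_uuid);
    return 0;
}

/* Release the lock, but only if the caller actually owns it. */
int volume_unlock(const char *volname, const char *node_uuid)
{
    struct vol_lock *slot = lookup_slot(volname);
    if (!slot || strcmp(slot->owner_uuid, node_uuid) != 0)
        return -1;
    slot->owner_uuid[0] = '\0';
    return 0;
}

int main(void)
{
    /* "volume delete volume1" and "volume status volume1" contend for
     * the same lock: the second request fails. */
    printf("%d\n", volume_trylock("volume1", "node-a"));  /* 0  */
    printf("%d\n", volume_trylock("volume1", "node-b"));  /* -1 */
    /* An operation on a different volume succeeds in parallel. */
    printf("%d\n", volume_trylock("volume2", "node-b"));  /* 0  */
    volume_unlock("volume1", "node-a");
    return 0;
}

The main() above just walks through the two scenarios described earlier: two commands on volume1 contend for one
lock, while an operation on volume2 proceeds in parallel.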

Regards,
Avra

On 09/12/2013 01:18 PM, Kaushal M wrote:
> On Thu, Sep 12, 2013 at 12:10 PM, Varun Shastry <vshastry at redhat.com> wrote:
>> On Thursday 12 September 2013 08:41 AM, Krishnan Parthasarathi wrote:
>>>
>>> ----- Original Message -----
>>>> Hi,
>>>>
>>>> As of today most gluster commands take a cluster wide lock, before performing their respective operations.
>>>> As a result any two gluster commands, which have no interdependency with each other, can't be executed
>>>> simultaneously. To remove this interdependency we propose to replace this cluster wide lock with a volume
>>>> specific lock, so that two operations on two different volumes can be performed simultaneously.
>>>>
>>>> By implementing this volume wide lock, our agenda is to achieve the following:
>>>> 1. Allowing simultaneous volume operations on different volumes. Performing two operations simultaneously
>>>>      on the same volume should not be allowed.
>>>> 2. While a volume is being created, or deleted, operations (like rebalance, geo-rep) should be permitted
>>>>      on other volumes.
>>> You can't meaningfully interleave a volume-delete and a volume-rebalance
>>> on the
>>> same volume.
>>> The locking domains must be arranged such that create/delete operations
>>> take a lock which conflicts with all locks held on any volume. This would
>>> handle the scenario mentioned above.
>>>
>>> The volume operations performed in the cluster can be classified into the following categories:
>>> - Create/Delete - Need a cluster-wide lock. This lock 'trumps' all other locks.
>>>
>>> - Add-brick, remove-brick, replace-brick, rebalance, volume-set/reset and a whole bunch of features
>>>     associated with a volume like quota, geo-rep etc. - Need a volume-level lock.
>> Since these operations (add-brick, remove-brick, set/reset) modify the nfs
>> volfile, which is a single volfile shared by all the volumes, don't we need to
>> take the cluster-wide lock for the above set of operations as well?
>>
> We did discuss this while working out the design, before it was
> posted to the list, and we don't think it'll be a problem. As the
> generation of these volfiles takes all the volumes into consideration,
> the last operation, among concurrent volume operations, to generate
> these volfiles will generate the correct volfile. If another operation
> which requires volfile regeneration happens while these are being
> generated, it won't be a problem either, as that operation will
> regenerate the volfiles.
>
>> - Varun Shastry
>>
>>> - Volume-info, volume-status - Need no locking.
>>>
>>>
>>>> 3. Two simultaneous volume create or volume delete operations should not be permitted.
>>>>
>>>> We propose to do this in two steps:
>>>>
>>>> 1. Implementing the volume wide lock: In order to implement this, we will add a lock consisting of the uuid
>>>>      of the originator to the in-memory volinfo (of that particular volume) in all the nodes of the cluster.
>>>>      Once this lock is taken, any other command for the same volume will not be able to acquire this lock
>>>>      from that particular volume's volinfo. Meanwhile other operations on other volumes can still be
>>>>      executed.
>>> There is one caveat with storing the volume-level lock in the volinfo object.
>>> Not all the nodes are guaranteed to have an up-to-date version of the volinfo
>>> object. We don't have the necessary mechanism to select the peers based
>>> on the recency of their volume information. Worse still, the volinfo object
>>> could be modified by incoming updates from other peers in the cluster,
>>> if and when they rejoin (the available partition of) the cluster, in the
>>> middle of a transaction. I agree this lack of guarantee is part of a different
>>> problem, but it is a runtime reality :(
>>>
>>> This prompts me to think that we should keep all the lock-related book-keeping
>>> independent of things that are not guaranteed to stick around for the lifetime
>>> of a command. This way we can keep the policy of locking (i.e. volume-wide
>>> vs cluster-wide) separate from the mechanism.
>>>
>>> cheers,
>>> krish
>>>
>>>> 2. Stop using the cluster lock for existing commands: Port existing commands to use this framework. We will
>>>>      use op-version to take care of backward compatibility for the existing commands. While implementing this,
>>>>      we need to take care of commands like volume create, volume delete, rebalance callbacks, implicit volume
>>>>      syncs (when a node comes up), the volume sync command, which modify priv->volumes, and also other
>>>>      non-volume operations which work within the ambit of the cluster lock today.
>>>>
>>>> Please feel free to provide feedback.
>>>>
>>>> Regards,
>>>> Avra
>>>>