[Gluster-devel] Lock migration as a part of rebalance
Soumya Koduri
skoduri at redhat.com
Tue Apr 28 02:13:27 UTC 2015
On 01/03/2015 12:37 AM, Shyam wrote:
> On 12/17/2014 02:15 AM, Raghavendra G wrote:
>>
>>
>> On Wed, Dec 17, 2014 at 1:25 AM, Shyam <srangana at redhat.com> wrote:
>>
>> This mail intends to present the lock migration across subvolumes
>> problem and seek solutions/thoughts around the same, so any
>> feedback/corrections are appreciated.
>>
>> # Current state of file locks post file migration during rebalance
>> Currently when a file is migrated during rebalance, its lock
>> information is not transferred over from the old subvol to the new
>> subvol, that the file now resides on.
>>
>> Since further lock requests, post migration of the file, are now
>> sent to the new subvol, any potential lock conflicts would go
>> undetected until the locks are migrated over.
>>
>> The term locks above can refer to the POSIX locks acquired using the
>> FOP lk by consumers of the volume, or to the gluster internal(?)
>> inode/dentry locks. For now we limit the discussion to the POSIX
>> locks supported by the FOP lk.
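>>
>> For context, these are the byte-range locks a client takes with a
>> plain fcntl() on the mount. A minimal, standard-POSIX example of the
>> state that would have to move with the file (nothing below is
>> gluster specific):
>>
>>     #include <fcntl.h>
>>     #include <stdio.h>
>>     #include <unistd.h>
>>
>>     int main (int argc, char **argv)
>>     {
>>             int          fd;
>>             struct flock fl = {
>>                     .l_type   = F_WRLCK,   /* write lock      */
>>                     .l_whence = SEEK_SET,
>>                     .l_start  = 0,
>>                     .l_len    = 100,       /* first 100 bytes */
>>             };
>>
>>             if (argc < 2)
>>                     return 1;
>>
>>             fd = open (argv[1], O_RDWR);
>>             if (fd < 0) {
>>                     perror ("open");
>>                     return 1;
>>             }
>>
>>             /* blocking acquisition; this range/type/owner state is
>>              * what lives in the posix-locks xlator on the brick the
>>              * file currently resides on */
>>             if (fcntl (fd, F_SETLKW, &fl) < 0) {
>>                     perror ("fcntl (F_SETLKW)");
>>                     return 1;
>>             }
>>
>>             sleep (30);     /* hold the lock for a while */
>>             close (fd);     /* closing the fd releases the lock */
>>             return 0;
>>     }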
>>
>> # Other areas in gluster that migrate locks
>> The current scheme of migrating locks in gluster on graph switches
>> triggers an fd migration process that migrates the lock information
>> from the old fd to the new fd. This is driven by the gluster client
>> stack, at the protocol layer (FUSE, gfapi).
>>
>> This is done using the (set/get)xattr call with the attr name
>> "trusted.glusterfs.lockinfo", which in turn fetches the required key
>> for the old fd and migrates the lock from the old fd to the new fd.
>> IOW, there is very little information transferred as the locks are
>> migrated across fds on the same subvolume and not across subvolumes.
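>>
>> As a rough sketch, expressed at the level of the equivalent xattr
>> calls on the two fds, the flow looks like the below (the real path
>> goes through the client stack's getxattr/setxattr FOPs; the helper
>> name and the fixed-size buffer are illustrative only):
>>
>>     #include <sys/xattr.h>
>>     #include <errno.h>
>>
>>     #define LOCKINFO_KEY "trusted.glusterfs.lockinfo"
>>
>>     /* fetch the opaque lockinfo blob keyed on the old fd and hand it
>>      * to the new fd; illustrative of the fd-migration step only */
>>     static int
>>     migrate_lockinfo (int old_fd, int new_fd)
>>     {
>>             char    blob[4096];
>>             ssize_t len;
>>
>>             len = fgetxattr (old_fd, LOCKINFO_KEY, blob, sizeof (blob));
>>             if (len < 0)
>>                     return -errno;
>>
>>             if (fsetxattr (new_fd, LOCKINFO_KEY, blob, (size_t)len, 0) < 0)
>>                     return -errno;
>>
>>             return 0;
>>     }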
>>
>> Additionally, locks that are in the blocked state do not seem to be
>> migrated (at least the function to do so in FUSE,
>> fuse_handle_blocked_locks, is empty; a test case is needed to
>> confirm), nor are they responded to with an error.
>>
>> # High level solution requirements when migrating locks across
>> subvols
>> 1) Block/deny new lock acquisitions on the new subvol, till locks
>> are migrated (a rough sketch of this follows below)
>> - So that new locks that have overlapping ranges with the older
>> ones are not granted
>> - Potentially return EINTR on such requests?
>> 2) Ensure all _acquired_ locks from all clients are migrated first
>> - So that when blocked lock requests are placed later, they really
>> do block on the earlier conflicting locks and are not granted
>> prematurely
>> 3) Migrate blocked locks after the acquired locks are migrated (in
>> any order?)
>> - OR, send back EINTR for the blocked locks
>>
>> (When we have upcalls/delegations added as features, those would
>> have similar requirements for migration across subvolumes)
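>>
>> To make requirement (1) concrete, the admission check on the new
>> subvol could look roughly like the below; the in_migration/migrated
>> flags and structure name are hypothetical, meant only to show the
>> intended behaviour, not actual posix-locks xlator code:
>>
>>     #include <errno.h>
>>     #include <stdbool.h>
>>
>>     /* hypothetical per-inode lock context on the destination brick */
>>     struct lock_ctx {
>>             bool in_migration;  /* set before locks start arriving    */
>>             bool migrated;      /* set once all acquired locks landed */
>>     };
>>
>>     /* admission check for a NEW lock request that arrives on the
>>      * destination subvol while the file's locks are in transit */
>>     static int
>>     admit_new_lock (struct lock_ctx *ctx)
>>     {
>>             if (ctx->in_migration && !ctx->migrated) {
>>                     /* requirement (1): do not grant, or even queue,
>>                      * locks that might overlap with yet-to-arrive
>>                      * migrated locks; EINTR as suggested above */
>>                     return -EINTR;
>>             }
>>             return 0;  /* fall through to normal conflict checks */
>>     }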
>>
>> # Potential processes that could migrate the locks and issues thereof
>> 1) The rebalance process, which migrates the file, could also
>> migrate the locks; this would not involve any clients of the
>> gluster volume
>>
>> Issues:
>> - Lock information is fd specific; when migrating these locks,
>> the clients may not yet have detected that the file is migrated,
>> and hence may not have opened an fd against the new subvol; the
>> missing fd makes this form of migration a little more interesting
>> - Lock information also has a client-connection-specific pointer
>> (client_t) that needs to be reassigned on the new subvol
>> - Other subvol-specific information maintained in the lock that
>> needs to be migrated over will face the same
>> limitations/solutions
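>>
>> To illustrate what needs remapping, a deliberately simplified view
>> of the per-lock state that would have to be rewritten for the new
>> subvol; the field names are approximations for discussion, not the
>> actual posix-locks structures:
>>
>>     #include <stdint.h>
>>     #include <sys/types.h>
>>
>>     struct migrated_lock {
>>             /* byte range and type, as in struct flock */
>>             off_t    fl_start;
>>             off_t    fl_end;
>>             short    fl_type;
>>
>>             /* lock-owner and the fd number used by the client; both
>>              * are meaningful only relative to a client connection */
>>             uint64_t owner;
>>             int64_t  fd_num;
>>
>>             /* connection-specific identity: on the source brick this
>>              * is the client_t of the connection that took the lock,
>>              * and it is exactly what has to be re-resolved (or left
>>              * dangling) on the destination brick */
>>             void    *client;      /* client_t * on the src subvol */
>>             char    *client_uid;  /* textual connection-id        */
>>     };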
>>
>>
>> The tricky thing here is that the rebalance process has no control
>> over when
>> 1. the fd will be opened on the dst-node, since clients open fds on
>> the dst-node on demand, based on the I/O happening through them.
>
> (Read ** below; the remaining thoughts/responses are based on whether
> we could identify clients across nodes)
>
> We should _maybe_ have dangling fds opened on the dst-node, which can
> be mapped to the incoming requests from the clients (whenever they
> come). In case they never come, we still have the current problem that
> the fd on the src-node is leaked (or held till a client request comes
> its way).
>
> A lock migration should migrate the fd and its associated information,
> and leave it dangling till the client tries to establish the same fd
> (i.e. via the DHT xlator on the client). Thoughts?
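>
> A purely hypothetical sketch of what such a dangling-fd table on the
> destination brick might hold (names invented for discussion; the open
> path would look an entry up and adopt it instead of creating a fresh
> fd context):
>
>     #include <stdint.h>
>
>     /* entry created by lock migration before the client itself has
>      * opened the file on the destination subvol */
>     struct dangling_fd {
>             char                gfid[16];       /* file identity      */
>             char               *client_uid;     /* owning client      */
>             int64_t             remote_fd_num;  /* fd no. on the src  */
>             void               *locks;          /* migrated lock list */
>             struct dangling_fd *next;
>     };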
>
>> 2. the client establishes a connection to the dst-node (the client
>> might have been cut off from the dst-node).
>
> The first part would be whether we have a _static_ client mapping; if
> we do, then the client need not be connected when we migrate the file
> and leave a dangling fd at the destination. If clients do not have
> this, then we could deny file migration, as we are unable to map out
> the client relation on the other end. Would that be reasonable?
>
> Also, on reconnects do clients get different identification information?
>
Yes, on reconnects clients connect with a different ID.
>>
>> Unless we have a global mapping (i.e. a client can always be
>> identified using the same uuid irrespective of the brick we are
>> looking at), this seems like a difficult thing to achieve.
>
> (**) Do we have any such mapping at present? Meaning, if a client is
> connected to the src and dst subvolumes, would it have the same
> UUID/connection information? Or is there _any_ way to identify, with
> certainty, that they are the same client?
>
> In case the client is not connected to the dst-node, is there any way to
> identify the client as being the same as the one connected to the
> src-node, when it connects later to the dst-node?
>
Unless the server can parse the client_uid, IMO it may not be
straightforward. However, consider the below code snippet:
>>>>>>>
/* When lock-heal is enabled:
 * With multiple graphs possible in the same process, we need a field
 * to bring the uniqueness. Graph-ID should be enough to get the job
 * done.
 * When lock-heal is disabled, connection-id should always be unique
 * so that server never gets to reuse the previous connection
 * resources, so it cleans up the resources on every disconnect.
 * Otherwise it may lead to stale resources, i.e. leaked file
 * descriptors, inode/entry locks.
 */
if (!conf->lk_heal) {
        snprintf (counter_str, sizeof (counter_str),
                  "-%"PRIu64, conf->setvol_count);
        conf->setvol_count++;
}

ret = gf_asprintf (&process_uuid_xl, "%s-%s-%d%s",
                   this->ctx->process_uuid, this->name,
                   this->graph->id, counter_str);
if (-1 == ret) {
        gf_log (this->name, GF_LOG_ERROR,
                "asprintf failed while setting process_uuid");
        goto fail;
}
<<<<<<<<<
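
So, with lk_heal disabled, the connection-id built by the snippet comes
out roughly as (template derived from the gf_asprintf format string;
counter_str is presumably initialized empty above the snippet):

    <process-uuid>-<client-xlator-name>-<graph-id>-<setvol-count>

and since setvol_count is incremented on every setvolume, the server
sees a different ID each time the client reconnects. With lk_heal
enabled the counter is omitted, so the ID would stay stable across
reconnects of the same graph.
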
Looking at this, it seems there was an attempt to make the ID uniform
when lk_heal is enabled (it is disabled at the moment). We may need to
enable it and check/fix the issues; that may help us migrate locks
across bricks as you have proposed, along with self-heal of locks.
Thanks,
Soumya
>>
>>
>> Benefits:
>> - Can lock out/block newer lock requests effectively
>> - Need not _wait_ till all clients have registered that the file
>> is under migration and/or migrated their locks
>>
>> 2) The DHT xlator in each client could be held responsible for
>> migrating its locks to the new subvolume
>>
>> Issues:
>> - Somehow need to let every client know that locks need to be
>> migrated (upcall infrastructure?)
>> - What if some client is not reachable at the given time?
>> - Have to wait till all clients replay the locks
>>
>> Benefits:
>> - Hmmm... nothing really; if we could do it from the rebalance
>> process itself, the solution may be better.
>>
>> # Overall thoughts
>> - We could/should return EINTR for blocked locks, both in the case
>> of a graph switch and in the case of a file migration. This would
>> relieve the design of that particular complexity, and EINTR is a
>> legal error to return from a flock/fcntl operation (see the retry
>> sketch after this list)
>>
>> - If we can extract and map out all relevant lock information across
>> subvolumes, then having rebalance do this work seems like a good
>> fit. Additionally this could serve as a good way to migrate upcall
>> requests and state as well
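>>
>> On the EINTR point above, applications using blocking locks have to
>> cope with EINTR anyway, so returning it for a blocked lock dropped
>> during migration just pushes the retry to the caller; a minimal
>> retry wrapper (standard POSIX, the helper name is illustrative):
>>
>>     #include <errno.h>
>>     #include <fcntl.h>
>>
>>     /* retry a blocking byte-range lock until granted or a real error;
>>      * EINTR (e.g. a blocked lock dropped during a graph switch or a
>>      * file migration) simply triggers a re-issue of the request */
>>     static int
>>     lock_with_retry (int fd, struct flock *fl)
>>     {
>>             while (fcntl (fd, F_SETLKW, fl) < 0) {
>>                     if (errno != EINTR)
>>                             return -1;
>>             }
>>             return 0;
>>     }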
>
> Adding a further note here: we could deny migration of a file in case
> we are unable to map out all the relevant lock information. That way,
> some files would simply not be migrated, due to the inability to
> migrate all the relevant information along with them.
>
>>
>> Thoughts?
>>
>> Shyam
>>
>> --
>> Raghavendra G
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel