[Gluster-devel] Lock migration as a part of rebalance
Soumya Koduri
skoduri at redhat.com
Tue Apr 28 02:13:27 UTC 2015
On 01/03/2015 12:37 AM, Shyam wrote:
> On 12/17/2014 02:15 AM, Raghavendra G wrote:
>>
>>
>> On Wed, Dec 17, 2014 at 1:25 AM, Shyam <srangana at redhat.com> wrote:
>>
>> This mail intends to present the lock migration across subvolumes
>> problem and seek solutions/thoughts around the same, so any
>> feedback/corrections are appreciated.
>>
>> # Current state of file locks post file migration during rebalance
>> Currently when a file is migrated during rebalance, its lock
>> information is not transferred over from the old subvol to the new
>> subvol, that the file now resides on.
>>
>> Since further lock requests, post migration of the file, are now
>> sent to the new subvol, any potential lock conflicts would go
>> undetected until the locks are migrated over.
>>
>> The term locks above can refer to the POSIX locks acquired using the
>> FOP lk by consumers of the volume, or to the gluster internal(?)
>> inode/dentry locks. For now we limit the discussion to the POSIX
>> locks supported by the FOP lk.
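>>
>> For context, these are the byte-range locks a client takes with a
>> plain fcntl() on the mount. A minimal, standard-POSIX example of the
>> state that would have to move with the file (nothing below is
>> gluster specific):
>>
>>     #include <fcntl.h>
>>     #include <stdio.h>
>>     #include <unistd.h>
>>
>>     int main (int argc, char **argv)
>>     {
>>             int          fd;
>>             struct flock fl = {
>>                     .l_type   = F_WRLCK,   /* write lock      */
>>                     .l_whence = SEEK_SET,
>>                     .l_start  = 0,
>>                     .l_len    = 100,       /* first 100 bytes */
>>             };
>>
>>             if (argc < 2)
>>                     return 1;
>>
>>             fd = open (argv[1], O_RDWR);
>>             if (fd < 0) {
>>                     perror ("open");
>>                     return 1;
>>             }
>>
>>             /* blocking acquisition; this range/type/owner state is
>>              * what lives in the posix-locks xlator on the brick the
>>              * file currently resides on */
>>             if (fcntl (fd, F_SETLKW, &fl) < 0) {
>>                     perror ("fcntl (F_SETLKW)");
>>                     return 1;
>>             }
>>
>>             sleep (30);     /* hold the lock for a while */
>>             close (fd);     /* closing the fd releases the lock */
>>             return 0;
>>     }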
>>
>> # Other areas in gluster that migrate locks
>> The current scheme of migrating locks in gluster on graph switches
>> triggers an fd migration process that migrates the lock information
>> from the old fd to the new fd. This is driven by the gluster client
>> stack, at the protocol layer (FUSE, gfapi).
>>
>> This is done using the (set/get)xattr call with the attr name
>> "trusted.glusterfs.lockinfo", which in turn fetches the required key
>> for the old fd and migrates the lock from the old fd to the new fd.
>> IOW, there is very little information transferred as the locks are
>> migrated across fds on the same subvolume and not across subvolumes.
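>>
>> As a rough sketch, expressed at the level of the equivalent xattr
>> calls on the two fds, the flow looks like the below (the real path
>> goes through the client stack's getxattr/setxattr FOPs; the helper
>> name and the fixed-size buffer are illustrative only):
>>
>>     #include <sys/xattr.h>
>>     #include <errno.h>
>>
>>     #define LOCKINFO_KEY "trusted.glusterfs.lockinfo"
>>
>>     /* fetch the opaque lockinfo blob keyed on the old fd and hand it
>>      * to the new fd; illustrative of the fd-migration step only */
>>     static int
>>     migrate_lockinfo (int old_fd, int new_fd)
>>     {
>>             char    blob[4096];
>>             ssize_t len;
>>
>>             len = fgetxattr (old_fd, LOCKINFO_KEY, blob, sizeof (blob));
>>             if (len < 0)
>>                     return -errno;
>>
>>             if (fsetxattr (new_fd, LOCKINFO_KEY, blob, (size_t)len, 0) < 0)
>>                     return -errno;
>>
>>             return 0;
>>     }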
>>
>> Additionally, locks that are in the blocked state do not seem to be
>> migrated (at least the function to do so in FUSE,
>> fuse_handle_blocked_locks, is empty; a test case is needed to
>> confirm), nor are they responded to with an error.
>>
>> # High level solution requirements when migrating locks across
>> subvols
>> 1) Block/deny new lock acquisitions on the new subvol, till locks
>> are migrated (a rough sketch of this follows below)
>> - So that new locks that have overlapping ranges with the older
>> ones are not granted
>> - Potentially return EINTR on such requests?
>> 2) Ensure all _acquired_ locks from all clients are migrated first
>> - So that when blocked lock requests are placed later, they really
>> do block on the earlier conflicting locks and are not granted
>> prematurely
>> 3) Migrate blocked locks after the acquired locks are migrated (in
>> any order?)
>> - OR, send back EINTR for the blocked locks
>>
>> (When we have upcalls/delegations added as features, those would
>> have similar requirements for migration across subvolumes)
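>>
>> To make requirement (1) concrete, the admission check on the new
>> subvol could look roughly like the below; the in_migration/migrated
>> flags and structure name are hypothetical, meant only to show the
>> intended behaviour, not actual posix-locks xlator code:
>>
>>     #include <errno.h>
>>     #include <stdbool.h>
>>
>>     /* hypothetical per-inode lock context on the destination brick */
>>     struct lock_ctx {
>>             bool in_migration;  /* set before locks start arriving    */
>>             bool migrated;      /* set once all acquired locks landed */
>>     };
>>
>>     /* admission check for a NEW lock request that arrives on the
>>      * destination subvol while the file's locks are in transit */
>>     static int
>>     admit_new_lock (struct lock_ctx *ctx)
>>     {
>>             if (ctx->in_migration && !ctx->migrated) {
>>                     /* requirement (1): do not grant, or even queue,
>>                      * locks that might overlap with yet-to-arrive
>>                      * migrated locks; EINTR as suggested above */
>>                     return -EINTR;
>>             }
>>             return 0;  /* fall through to normal conflict checks */
>>     }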
>>
>> # Potential processes that could migrate the locks and issues thereof
>> 1) The rebalance process, which migrates the file, could also
>> migrate the locks; this would not involve any clients of the
>> gluster volume
>>
>> Issues:
>> - Lock information is fd specific; when migrating these locks,
>> the clients may not yet have detected that the file is migrated,
>> and hence may not have opened an fd against the new subvol; the
>> missing fd makes this form of migration a little more interesting
>> - Lock information also has a client-connection-specific pointer
>> (client_t) that needs to be reassigned on the new subvol
>> - Other subvol-specific information maintained in the lock that
>> needs to be migrated over will face the same
>> limitations/solutions
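>>
>> To illustrate what needs remapping, a deliberately simplified view
>> of the per-lock state that would have to be rewritten for the new
>> subvol; the field names are approximations for discussion, not the
>> actual posix-locks structures:
>>
>>     #include <stdint.h>
>>     #include <sys/types.h>
>>
>>     struct migrated_lock {
>>             /* byte range and type, as in struct flock */
>>             off_t    fl_start;
>>             off_t    fl_end;
>>             short    fl_type;
>>
>>             /* lock-owner and the fd number used by the client; both
>>              * are meaningful only relative to a client connection */
>>             uint64_t owner;
>>             int64_t  fd_num;
>>
>>             /* connection-specific identity: on the source brick this
>>              * is the client_t of the connection that took the lock,
>>              * and it is exactly what has to be re-resolved (or left
>>              * dangling) on the destination brick */
>>             void    *client;      /* client_t * on the src subvol */
>>             char    *client_uid;  /* textual connection-id        */
>>     };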
>>
>>
>> The tricky thing here is that the rebalance process has no control
>> over when
>> 1. the fd will be opened on the dst-node, since clients open fds on
>> the dst-node on demand, based on the I/O happening through them.
>
> (Read ** below; the remaining thoughts/responses are based on whether
> we could identify clients across nodes)
>
> We should _maybe_ have dangling fds opened on the dst-node, which can
> be mapped to the incoming requests from the clients (whenever they
> come). In case they never come, we still have the current problem that
> the fd on the src-node is leaked (or held till a client request comes
> its way).
>
> A lock migration should migrate the fd and its associated information,
> and leave it dangling till the client tries to establish the same fd
> (i.e. via the DHT xlator on the client). Thoughts?
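>
> A purely hypothetical sketch of what such a dangling-fd table on the
> destination brick might hold (names invented for discussion; the open
> path would look an entry up and adopt it instead of creating a fresh
> fd context):
>
>     #include <stdint.h>
>
>     /* entry created by lock migration before the client itself has
>      * opened the file on the destination subvol */
>     struct dangling_fd {
>             char                gfid[16];       /* file identity      */
>             char               *client_uid;     /* owning client      */
>             int64_t             remote_fd_num;  /* fd no. on the src  */
>             void               *locks;          /* migrated lock list */
>             struct dangling_fd *next;
>     };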
>
>> 2. the client establishes a connection to the dst-node (the client
>> might have been cut off from the dst-node).
>
> The first part would be whether we have a _static_ client mapping; if
> we do, then the client need not be connected when we migrate the file
> and leave a dangling fd at the destination. If clients do not have
> this, then we could deny file migration, as we are unable to map out
> the client relation on the other end. Would that be reasonable?
>
> Also, on reconnects do clients get different identification information?
>
Yes, on reconnects clients connect with a different ID.
>>
>> Unless we have a global mapping (i.e. a client can always be
>> identified using the same uuid irrespective of the brick we are
>> looking at), this seems like a difficult thing to achieve.
>
> (**) Do we have any such mapping at present? Meaning, if a client is
> connected to the src and dst subvolumes, would it have the same
> UUID/connection information? Or is there _any_ way to identify, with
> certainty, that they are the same client?
>
> In case the client is not connected to the dst-node, is there any way to
> identify the client as being the same as the one connected to the
> src-node, when it connects later to the dst-node?
>
Unless the server can parse the client_uid, IMO it may not be
straightforward. However, consider the below code snippet:
>>>>>>>
/* When lock-heal is enabled:
 * With multiple graphs possible in the same process, we need a field
 * to bring the uniqueness. Graph-ID should be enough to get the job
 * done.
 * When lock-heal is disabled, connection-id should always be unique
 * so that server never gets to reuse the previous connection
 * resources, so it cleans up the resources on every disconnect.
 * Otherwise it may lead to stale resources, i.e. leaked file
 * descriptors, inode/entry locks.
 */
if (!conf->lk_heal) {
        snprintf (counter_str, sizeof (counter_str),
                  "-%"PRIu64, conf->setvol_count);
        conf->setvol_count++;
}

ret = gf_asprintf (&process_uuid_xl, "%s-%s-%d%s",
                   this->ctx->process_uuid, this->name,
                   this->graph->id, counter_str);
if (-1 == ret) {
        gf_log (this->name, GF_LOG_ERROR,
                "asprintf failed while setting process_uuid");
        goto fail;
}
<<<<<<<<<
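
So, with lk_heal disabled, the connection-id built by the snippet comes
out roughly as (template derived from the gf_asprintf format string;
counter_str is presumably initialized empty above the snippet):

    <process-uuid>-<client-xlator-name>-<graph-id>-<setvol-count>

and since setvol_count is incremented on every setvolume, the server
sees a different ID each time the client reconnects. With lk_heal
enabled the counter is omitted, so the ID would stay stable across
reconnects of the same graph.
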
Looking at this, it seems there was an attempt to make the ID uniform
when lk_heal is enabled (it is disabled at the moment). We may need to
enable it and check/fix the issues; that may help us migrate locks
across bricks as you have proposed, along with self-heal of locks.
Thanks,
Soumya
>>
>>
>> Benefits:
>> - Can lock out/block newer lock requests effectively
>> - Need not _wait_ till all clients have registered that the file
>> is under migration and/or migrated their locks
>>
>> 2) The DHT xlator in each client could be held responsible for
>> migrating its locks to the new subvolume
>>
>> Issues:
>> - Somehow need to let every client know that locks need to be
>> migrated (upcall infrastructure?)
>> - What if some client is not reachable at the given time?
>> - Have to wait till all clients replay the locks
>>
>> Benefits:
>> - Hmmm... nothing really; if we could do it from the rebalance
>> process itself, the solution may be better.
>>
>> # Overall thoughts
>> - We could/should return EINTR for blocked locks, both in the case
>> of a graph switch and in the case of a file migration. This would
>> relieve the design of that particular complexity, and EINTR is a
>> legal error to return from a flock/fcntl operation (see the retry
>> sketch after this list)
>>
>> - If we can extract and map out all relevant lock information across
>> subvolumes, then having rebalance do this work seems like a good
>> fit. Additionally this could serve as a good way to migrate upcall
>> requests and state as well
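>>
>> On the EINTR point above, applications using blocking locks have to
>> cope with EINTR anyway, so returning it for a blocked lock dropped
>> during migration just pushes the retry to the caller; a minimal
>> retry wrapper (standard POSIX, the helper name is illustrative):
>>
>>     #include <errno.h>
>>     #include <fcntl.h>
>>
>>     /* retry a blocking byte-range lock until granted or a real error;
>>      * EINTR (e.g. a blocked lock dropped during a graph switch or a
>>      * file migration) simply triggers a re-issue of the request */
>>     static int
>>     lock_with_retry (int fd, struct flock *fl)
>>     {
>>             while (fcntl (fd, F_SETLKW, fl) < 0) {
>>                     if (errno != EINTR)
>>                             return -1;
>>             }
>>             return 0;
>>     }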
>
> Adding a further note here: we could deny migration of a file in case
> we are unable to map out all the relevant lock information. That way,
> some files would simply not be migrated, due to the inability to
> migrate all the relevant information along with them.
>
>>
>> Thoughts?
>>
>> Shyam
>>
>> --
>> Raghavendra G
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel