[Gluster-devel] Lock migration as a part of rebalance

Fri Sep 4 10:03:12 UTC 2015

----- Original Message -----
From: "Raghavendra G" <raghavendra at gluster.com>
To: "Shyam" <srangana at redhat.com>
Cc: "Gluster Devel" <gluster-devel at gluster.org>
Sent: Thursday, 2 July, 2015 10:21:38 AM
Subject: Re: [Gluster-devel] Lock migration as a part of rebalance

One solution I can think of is to split the responsibility for lock migration between the client and the rebalance process. A rough algorithm is outlined below: 

1. We should have a static identifier for the client process (something like the process-uuid of the mount process; let's call it client-uuid) in the lock structure. This identifier won't change across reconnects. 
2. rebalance just copies the entire lock-state verbatim to dst-node (fd-number and client-uuid values would be same as the values on src-node). 
3. rebalance process marks these half-migrated locks as "migration-in-progress" on dst-node. Any lock request which overlaps with "migration-in-progress" locks is considered conflicting and dealt with appropriately (for SETLK, unwind with EAGAIN; for SETLKW, block till these locks are released). The same approach is followed for mandatory locking too. 

4. whenever an fd based operation (like writev, release, lk, flush etc) happens on the fd, the client through which the lock was acquired "migrates" the lock. Migration is basically: 
* The client does an fgetxattr (fd, LOCKINFO_KEY, src-subvol). This fetches the fd number on the src subvol (lockinfo). 
* The client opens a new fd on dst-subvol, then does an fsetxattr (new-fd, LOCKINFO_KEY, lockinfo, dst-subvol). The brick, on receiving a setxattr on the virtual xattr LOCKINFO_KEY, looks for all the locks with ((fd == lockinfo) && (client-uuid == uuid-of-client-on-which-this-setxattr-came)) and then fills in appropriate values for client_t and fd (basically sets lock->fd = fd-num-of-the-fd-on-which-setxattr-came). 
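
To make steps 1-4 a bit more concrete, here is a minimal Python model of the lock table on the dst brick. It is only a sketch and not gluster code: the names (Lock, try_setlk, adopt_locks) are hypothetical, every lock is treated as exclusive, and the owner is reduced to client-uuid. Flipping the state to "complete" on adoption is also my assumption.

```python
import errno
from dataclasses import dataclass

MIGRATING = "migration-in-progress"
COMPLETE = "complete"

@dataclass
class Lock:
    client_uuid: str       # static per mount process, same across reconnects (step 1)
    fd_num: int            # fd number as recorded on the brick
    start: int             # first byte of the locked range
    end: int               # last byte of the locked range (inclusive)
    state: str = COMPLETE

def overlaps(a_start, a_end, b_start, b_end):
    return a_start <= b_end and b_start <= a_end

def try_setlk(locks, client_uuid, fd_num, start, end):
    """Non-blocking request (SETLK): 0 on grant, errno.EAGAIN on conflict."""
    for lk in locks:
        if not overlaps(lk.start, lk.end, start, end):
            continue
        # Step 3: any overlap with a half-migrated lock is a conflict,
        # even if the request comes from the same client.
        if lk.state == MIGRATING or lk.client_uuid != client_uuid:
            return errno.EAGAIN
    locks.append(Lock(client_uuid, fd_num, start, end))
    return 0

def adopt_locks(locks, lockinfo, client_uuid, new_fd_num):
    """Model of the brick handling fsetxattr(LOCKINFO_KEY) on dst (step 4)."""
    adopted = 0
    for lk in locks:
        if (lk.state == MIGRATING and lk.fd_num == lockinfo
                and lk.client_uuid == client_uuid):
            lk.fd_num = new_fd_num   # lock->fd = fd on which the setxattr came
            lk.state = COMPLETE      # assumption: adoption completes the migration
            adopted += 1
    return adopted
```

With a table holding one lock copied by rebalance, say Lock("c1", 7, 0, 99, MIGRATING), an overlapping try_setlk from any client returns EAGAIN until c1 calls adopt_locks with lockinfo 7.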

Some issues and solutions: 

1. What if client never connects to dst brick? 

We'll have a timeout within which "migration-in-progress" locks must be converted into "complete" locks. If DHT doesn't migrate within this timeout, the server will clean up these locks. This is similar to the current protocol/client implementation of lock-heal (this functionality is disabled on the client as of now, but upcall needs this feature too and we can get this functionality working). If DHT tries to migrate the locks after this timeout, it will have to re-acquire the locks on the destination (this has to be a non-blocking lock request, irrespective of the mode of the original lock). We get information about the current locks through the fd opened on src. If lock acquisition fails for some reason, DHT marks the fd bad, so that the application will be notified about lost locks. One problem left unsolved by this solution is another client (say c2) acquiring and releasing the lock in the window between the timeout and the point at which client c1 initiates lock migration. However, that problem is present even with the existing lock implementation and is not something new introduced by lock migration. 
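
A rough Python sketch of the two halves of this timeout handling, purely as an illustration. PendingLock, expire_pending, migrate_after_timeout and the try_lock callback are hypothetical names, and the timeout value is left to the caller since none has been decided here.

```python
from dataclasses import dataclass

@dataclass
class PendingLock:
    client_uuid: str
    start: int
    end: int
    copied_at: float   # time at which rebalance copied the lock to dst

def expire_pending(pending, now, timeout):
    """Server side: keep only half-migrated locks younger than `timeout` seconds."""
    return [lk for lk in pending if now - lk.copied_at < timeout]

def migrate_after_timeout(src_locks, try_lock):
    """Client side (DHT): re-acquire every lock known from src.

    `src_locks` is a list of (start, end) ranges read through the fd on src.
    `try_lock` must be a non-blocking request that returns True on grant.
    Returns False if any lock could not be re-acquired, in which case the
    caller is expected to mark the fd bad.
    """
    for start, end in src_locks:
        if not try_lock(start, end):
            return False
    return True
```

Note that the sketch does nothing about the c2 window described above; it only shows where the cleanup and the re-acquisition would sit.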

2. What if client connects but disconnects before it could've attempted to migrate "migration-in-progress" locks? 

The server can identify locks belonging to this client using client-uuid and clean them up. DHT trying to migrate locks after the first disconnect will try to re-acquire locks as outlined in 1. 

3. What if client disconnects from src subvol and cannot get lock information from src for handling issues 1 and 2? 

We'll mark the fd bad. We can optimize this to mark fd bad only if locks have been acquired. To do this client has to store some history in the fd on successful lock acquisition. 
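
The optimization could look roughly like the following Python sketch. FdCtx and its methods are hypothetical names, and the "history" is reduced to a single flag.

```python
class FdCtx:
    """Per-fd client-side context carrying a minimal lock history."""

    def __init__(self):
        self.lock_acquired = False   # set once any lk on this fd succeeds
        self.bad = False

    def on_lk_success(self):
        self.lock_acquired = True

    def on_src_unreachable(self):
        """Called when lock info cannot be fetched from src; returns fd badness."""
        if self.lock_acquired:
            self.bad = True          # locks may have been lost, notify application
        return self.bad
```

The flag is never cleared on unlock, so an fd that acquired and later released all its locks would still be marked bad. That is conservative; tracking the actual ranges would avoid it at the cost of more state on the client.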

regards, 

On Wed, Dec 17, 2014 at 12:45 PM, Raghavendra G < raghavendra at gluster.com > wrote: 

On Wed, Dec 17, 2014 at 1:25 AM, Shyam < srangana at redhat.com > wrote: 

This mail intends to present the lock migration across subvolumes problem and seek solutions/thoughts around the same, so any feedback/corrections are appreciated. 

# Current state of file locks post file migration during rebalance 
Currently, when a file is migrated during rebalance, its lock information is not transferred from the old subvol to the new subvol that the file now resides on. 

As further lock requests, post migration of the file, would now be sent to the new subvol, any potential lock conflicts would not be detected, until the locks are migrated over. 

The term locks above can refer to the POSIX locks acquired using the FOP lk by consumers of the volume, or to the gluster internal(?) inode/dentry locks. For now we limit the discussion to the POSIX locks supported by the FOP lk. 

# Other areas in gluster that migrate locks 
The current scheme for migrating locks in gluster on graph switches triggers an fd migration process that migrates the lock information from the old fd to the new fd. This is driven by the gluster client stack, protocol layer (FUSE, gfapi). 

This is done using the (set/get)xattr call with the attr name "trusted.glusterfs.lockinfo", which in turn fetches the required key for the old fd and migrates the lock from this old fd to the new fd. IOW, there is very little information transferred, as the locks are migrated across fds on the same subvolume and not across subvolumes. 

Additionally, locks that are in the blocked state do not seem to be migrated or responded to with an error (at least the function to do so in FUSE, fuse_handle_blocked_locks, is empty; need to run a test case to confirm). 

# High level solution requirements when migrating locks across subvols 
1) Block/deny new lock acquisitions on the new subvol, till locks are migrated 
- So that new locks that have overlapping ranges to the older ones are not granted 
- Potentially return EINTR on such requests? 
2) Ensure all _acquired_ locks from all clients are migrated first 
- So that if and when placing blocked lock requests, these really do block for previous reasons and are not granted now 
3) Migrate blocked locks post acquired locks are migrated (in any order?) 
- OR, send back EINTR for the blocked locks 
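
The ordering in requirements 2) and 3) can be sketched in Python as below. This is only an illustration: plan_migration and the dict layout of a lock are hypothetical, and the choice between replaying blocked locks and failing them with EINTR is left as a flag.

```python
import errno

def plan_migration(locks, replay_blocked=False):
    """Order migration work: all granted locks first, blocked requests after."""
    granted = [lk for lk in locks if lk["granted"]]
    blocked = [lk for lk in locks if not lk["granted"]]
    steps = [("migrate", lk["id"]) for lk in granted]
    if replay_blocked:
        # Blocked requests are placed only after every granted lock is in place,
        # so they still block for the same reasons as before.
        steps += [("replay", lk["id"]) for lk in blocked]
    else:
        steps += [("fail", lk["id"], errno.EINTR) for lk in blocked]
    return steps
```

Requirement 1) is not covered by this sketch; it assumes new lock requests on the new subvol are already being held off while the plan is executed.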

(When we have upcalls/delegations added as features, those would have similar requirements for migration across subvolumes) 

# Potential processes that could migrate the locks and issues thereof 
1) The rebalance process, that migrates the file can help with migrating the locks, which would not involve any clients to the gluster volume 

Issues: 
- Lock information is fd specific. When these locks are migrated, the clients need not have detected that the file has migrated, and hence need not have opened an fd against the new subvol; a missing fd would make this form of migration a little more interesting 
- Lock information also has client connection specific pointer (client_t) that needs to be reassigned on the new subvol 
- Other subvol specific information, maintained in the lock, that needs to be migrated over will suffer the same limitations/solutions 

The tricky thing here is that the rebalance process has no control over when 
1. an fd will be opened on dst-node, since clients open fds on dst-node on demand, based on the I/O happening through them. 

2. the client establishes a connection to dst-node (the client might've been cut off from dst-node). 

Unless we have a global mapping (for example, a client can always be identified by the same uuid irrespective of the brick we are looking at), this seems like a difficult thing to achieve. 

Benefits: 
- Can lock out/block newer lock requests effectively 
- Need not _wait_ till all clients have registered that the file is under migration and/or migrated their locks 

2) DHT xlator in each client could be held responsible to migrate its locks to the new subvolume 

Issues: 
- Somehow need to let every client know that locks need to be migrated (upcall infrastructure?) 
- What if some client is not reachable at the given time? 
- Have to wait till all clients replay the locks 

Benefits: 
- Hmmm... Nothing really, if we could do it by the rebalance process itself the solution may be better. 

# Overall thoughts 
- We could/should return EINTR for blocked locks, in the case of a graph switch, and the case of a file migration, this would relieve the design of that particular complexity, and is a legal error to return from a flock/fcntl operation 

- If we can extract and map out all relevant lock information across subvolumes, then having rebalance do this work seems like a good fit. Additionally this could serve as a good way to migrate upcall requests and state as well 

Thoughts? 

Shyam 
_______________________________________________ 
Gluster-devel mailing list 
Gluster-devel at gluster.org 
http://supercolony.gluster.org/mailman/listinfo/gluster-devel 

-- 

Raghavendra G 

_______________________________________________
Gluster-devel mailing list
Gluster-devel at gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel