[Gluster-devel] Lock migration as a part of rebalance

Raghavendra G raghavendra at gluster.com
Wed Oct 14 07:11:32 UTC 2015


The original design didn't address some areas (like blocked locks,
atomicity of getlkinfo and setlkinfo). Hence we came up with a newer design
(with slight changes to original design posted earlier in this thread):

* Active/Granted lock migration involves only the rebalance process. Clients
have no active role to play.

  As of now, the only state that changes from src to dst during migration
is the connection id. Also note that clients most likely have a connection
already established to the destination. So, if the rebalance process can
reconstruct the connection id (without involving the client), it can just
associate locks on dst with the correct connection. Based on this we decided
to change the way we construct connection identifiers. From now on,
connection ids will have two parts: per connection specific and per process
specific information. The per connection specific identifier will be
constant across different client processes (like mount process, rebalance
process etc). The connection identifiers will differ only because of a
different per process identifier. For example, if a mount process (say mnt1)
has two clients/transports (say c1 and c2) speaking to two different bricks
(say b1 and b2), then the connection identifiers,

  on b1 between b1 and c1 from mnt1 will be <mnt1:c1>
  on b2 between b2 and c2 from mnt1 will be <mnt1:c2>

  The connection identifiers from a rebalance process (rebal-process) to
the same bricks would be,

  on b1 between b1 and c1 from rebal-process will be <rebal-process:c1>
  on b2 between b2 and c2 from rebal-process will be <rebal-process:c2>

  Note that connection specific part of ids is constant for rebalance
process and mount process.

  So, if the rebalance process is migrating a file/lock from b1 to b2, all
it has to do is to replace the connection specific part of the id for b1
with the one for b2. So, if the connection identifier of a lock on b1 is
<mnt1:c1>, rebalance can reconstruct the id for b2 as <mnt1:c2> (Note that
it can derive the connection specific id to b2 from its own connection to
b2).

  If the client has not established connection to dst brick at the time of
migration, to keep things simple rebalance process fails migration of that
file.
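
  To make the id reconstruction concrete, here is a minimal sketch in
Python. It is only an illustration of the scheme described above, not
brick code: ConnId, reconstruct_dst_conn_id and the dst_connections set
are hypothetical names, and the id is modelled as a plain
(process, transport) pair.

```python
from typing import NamedTuple, Optional, Set


class ConnId(NamedTuple):
    """Hypothetical two-part connection id, written <process:transport>."""
    process: str    # per process part, e.g. "mnt1" or "rebal-process"
    transport: str  # per connection part, e.g. "c1" (to b1) or "c2" (to b2)


def reconstruct_dst_conn_id(
    src_lock_conn: ConnId,
    rebal_dst_conn: ConnId,
    dst_connections: Set[ConnId],
) -> Optional[ConnId]:
    """Build the id a lock should carry on dst.

    Keeps the per process part of the lock owner and borrows the per
    connection part from rebalance's own connection to dst. Returns None
    when the client has no connection to dst, in which case migration of
    the file is failed.
    """
    candidate = ConnId(src_lock_conn.process, rebal_dst_conn.transport)
    return candidate if candidate in dst_connections else None


# Lock held by mnt1 on b1; rebalance is connected to b2 over c2.
lock_on_b1 = ConnId("mnt1", "c1")
rebal_to_b2 = ConnId("rebal-process", "c2")
known_on_b2 = {ConnId("mnt1", "c2"), rebal_to_b2}
dst_id = reconstruct_dst_conn_id(lock_on_b1, rebal_to_b2, known_on_b2)
```

  With mnt1 connected to b2, dst_id is <mnt1:c2>; if mnt1's connection is
missing from known_on_b2 the function returns None, which corresponds to
failing the migration of that file.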

* Blocked locks:

  Previous design didn't address blocked locks. The thing with blocked
locks is that they have an additional state in the form of call-stack for
reply path which will be unwound when lock is granted. So, to migrate
blocked locks to dst, rebalance process asks brick to unwind all the
blocked locks with a special error (giving the information of destination
to which file is migrated). Dht in client/mnt process will interpret this
special error and will wind a lock request to dst. Active/granted locks are
migrated before blocked locks. So, new-lock requests (corresponding to
blocked locks) will block on dst too. However, note that here blocked lock
state is lost on src. So, post this point, migration to dst _has_ to
complete. We have two options to recover from this situation:

a. If there is a failure in migration, rebalance process has to migrate
back the blocked lock state from dst back to src in a similar way. Or ask
dst to fail the lock requests. But rebalance process itself might crash
before it gets an opportunity to do so. So, this is not a fool-proof
solution.

b. attempt blocked lock migration _after_ file migration is complete (when
file on destination is marked as data file by rebalance process). Since
these locks are blocked locks, unlike active locks it's not an issue if some
other client tries to acquire a conflicting lock post migration but before
blocked lock migration. If there are errors, lock requests are failed and
an error response will be sent to application.

I personally feel option b is simpler and better.
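
A rough sketch of the blocked lock path in Python, under the assumption
that the "special error" simply carries the destination. FileMigrated,
unwind_blocked_locks, dht_on_lock_reply and the wind_lock callback are
made-up names for illustration and do not correspond to existing
functions in posix/locks or dht.

```python
class FileMigrated(Exception):
    """Hypothetical special error: names the brick the file moved to."""

    def __init__(self, dst):
        super().__init__(f"file migrated to {dst}")
        self.dst = dst


def unwind_blocked_locks(src_blocked, dst):
    """Brick side: unwind every blocked lock on src with the special error.

    After this call the blocked lock state is gone from src, which is why
    option b runs it only once file migration is complete.
    """
    replies = [(req, FileMigrated(dst)) for req in src_blocked]
    src_blocked.clear()
    return replies


def dht_on_lock_reply(req, reply, wind_lock):
    """Client (dht) side: on the special error, wind the request to dst.

    wind_lock(dst, req) is an assumed callback that sends the lock request.
    If it fails, the error is returned so it can reach the application.
    """
    if isinstance(reply, FileMigrated):
        try:
            return wind_lock(reply.dst, req)
        except OSError as err:
            return err
    return reply
```

Since active/granted locks are already on dst by then, a request re-wound
through dht_on_lock_reply blocks on dst for the same reason it was
blocked on src.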

* Atomicity of getlkinfo and setlkinfo done by rebalance process during
Active/granted lock migration:

  While the rebalance process is migrating active locks, the lock state can
change on src between getlkinfo on src and setlkinfo on dst. To preserve
atomicity, we are proposing that the rebalance process hold a mandatory
write-lock (which is a "meta-lock" guarding lock-state as opposed to file
data) on src during the time-period it is doing active/granted lock
migration. Any lock requests from mnt/client process between getlkinfo and
setlkinfo of rebalance process will be unwound with EAGAIN errors by the
brick with relevant information. To make things simple for clients to
handle these errors, posix/locks on src-brick will queue these lock
requests during the time-window specified above till an unlock is issued by
rebalance-process on the "meta" mandatory lock it had acquired before lock
migration. Once unlock is issued on the "meta-lock", src-brick replays all
the locks in the queue before a response to unlock is sent to rebalance
process. What happens next depends on the types of lock-requests in the
queue:

a. If lock gets granted, a successful response is unwound to client along
with information that file is under migration (similar to phase1 of data
migration in dht), the client then acquires the lock on destination too.
b. If lock cannot be granted and it's a SETLK request, client unwinds an
error response to application.
c. If lock cannot be granted and it's a SETLKW request, lock request is
queued in the blocked-lock request queue in posix/locks on src. Note that
since rebalance process has not attempted to do blocked lock migration at
this point in time, rebalance process will migrate this lock later.
d. If lock request is an UNLCK, it is replayed on src and response is
unwound to client. If response is successful, client also issues UNLCK on
destination (similar to handling of phase1 stage of data migration).
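
The queue-and-replay behaviour and cases a to d can be sketched roughly as
below. This is a toy model in Python, not posix/locks code: SrcBrick,
LockReq and the outcome strings are invented for illustration, UNLCK only
matches an identical range of the same owner, and blocked locks are not
woken up when a conflicting lock goes away.

```python
from dataclasses import dataclass, field


@dataclass
class LockReq:
    owner: str
    cmd: str    # "SETLK" or "SETLKW"
    ltype: str  # "RD", "WR" or "UNLCK"
    start: int
    end: int


def conflicts(a, b):
    """Overlapping ranges, different owners, at least one write lock."""
    overlap = a.start <= b.end and b.start <= a.end
    return overlap and a.owner != b.owner and "WR" in (a.ltype, b.ltype)


@dataclass
class SrcBrick:
    granted: list = field(default_factory=list)
    blocked: list = field(default_factory=list)
    queued: list = field(default_factory=list)
    meta_locked: bool = False

    def meta_lock(self):
        """Rebalance takes the meta-lock; returns the state for getlkinfo."""
        self.meta_locked = True
        return list(self.granted)

    def lk(self, req):
        """Lock request from a client; queued while the meta-lock is held."""
        if self.meta_locked:
            self.queued.append(req)
            return "QUEUED"
        return self._apply(req)

    def _apply(self, req):
        if req.ltype == "UNLCK":
            self.granted = [l for l in self.granted
                            if (l.owner, l.start, l.end)
                            != (req.owner, req.start, req.end)]
            return "UNLOCKED"
        if any(conflicts(req, l) for l in self.granted):
            if req.cmd == "SETLKW":
                self.blocked.append(req)  # case c: migrated later
                return "BLOCKED"
            return "EAGAIN"               # case b
        self.granted.append(req)
        return "GRANTED"

    def meta_unlock(self):
        """Replay the queue before replying to rebalance's unlock."""
        self.meta_locked = False
        pending, self.queued = self.queued, []
        results = []
        for req in pending:
            outcome = self._apply(req)
            if outcome in ("GRANTED", "UNLOCKED"):
                outcome += "+REPLAY_ON_DST"  # cases a and d
            results.append((req.owner, outcome))
        return results
```

The "+REPLAY_ON_DST" suffix stands in for the "file is under migration"
information in the response, which tells the client to repeat the lock or
unlock on the destination.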

On a slightly unrelated note, a similar read/write atomicity issue is present
on the data migration code-path of rebalance process. A similar solution of
holding a mandatory-lock by rebalance process on file-data can be used to
solve it too. Thanks to Raghavendra Talur for suggestion of such a simple
solution :).

* A note on handling of potentially blocking locks on src post-migration:

Since clients can have an fd open on src post-migration, lock requests are
still routed to src even after a successful migration to dst.
Active/Granted locks during this stage are not much of an issue, but locks
that can block need to be handled carefully as we cannot block on src. To
handle this scenario, rebalance process has to instruct src that migration
is complete. For any lock request that would block post migration, src-brick
simply unwinds it with a special error asking client to replay the lock on
destination (as we cannot _block_ on src post migration). Active/granted
locks don't have this issue as a response is unwound to client and client
has a chance to acquire a lock on destination. Of course a lock granted on
src post migration can block on dst (or vice-versa). But, it's not really an
issue.

Thanks to Pranith, Raghavendra Talur, Poornima and members of upcall team
for their valuable inputs. Special thanks to Pranith for poking holes in
previous design :).

Comments are welcome.

regards,

On Thu, Jul 2, 2015 at 10:21 AM, Raghavendra G <raghavendra at gluster.com>
wrote:

> One solution I can think of is to have the responsibility of lock
> migration process spread between both client and rebalance process. A rough
> algo is outlined below:
>
> 1. We should've a static identifier for client process (something like
> process-uuid of mount process - lets call it client-uuid) in the lock
> structure. This identifier won't change across reconnects.
> 2. rebalance just copies the entire lock-state verbatim to dst-node
> (fd-number and client-uuid values would be same as the values on src-node).
> 3. rebalance process marks these half-migrated locks as
> "migration-in-progress" on dst-node. Any lock request which overlaps with
> "migration-in-progress" locks is considered as conflicting and dealt with
> appropriately (if SETLK unwind with EAGAIN and if SETLKW block till these
> locks are released). Same approach is followed for mandatory locking too.
> 4. whenever an fd based operation (like writev, release, lk, flush etc)
> happens on the fd, the client (through which lock was acquired), "migrates"
> the lock. Migration is basically,
>      * does a fgetxattr (fd, LOCKINFO_KEY, src-subvol). This will fetch
> the fd number on src subvol - lockinfo.
>      * opens new fd on dst-subvol. Then does fsetxattr (new-fd,
> LOCKINFO_KEY, lockinfo, dst-subvol). The brick on receiving setxattr on
> virtual xattr LOCKINFO_KEY looks for all the locks with ((fd == lockinfo)
> && (client-uuid == uuid-of-client-on-which-this-setxattr-came)) and then
> fills in appropriate values for client_t and fd (basically sets lock->fd =
> fd-num-of-the-fd-on-which-setxattr-came).
>
> Some issues and solutions:
>
> 1. What if client never connects to dst brick?
>
>      We'll have a time-out for "migration-in-progress" locks to be
> converted into "complete" locks. If DHT doesn't migrate within this
> timeout, server will cleanup these locks. This is similar to current
> protocol/client implementation of lock-heal (This functionality is disabled
> on client as of now. But, upcall needs this feature too and we can get this
> functionality working). If a dht tries to migrate the locks after this
> timeout, it will have to re-acquire lock on destination (This has to be a
> non-blocking lock request, irrespective of mode of original lock). We get
> information of current locks opened through the fd opened on src. If lock
> acquisition fails for some reason, dht marks the fd bad, so that
> application will be notified about lost locks. One problem unsolved with
> this solution is another client (say c2) acquiring and releasing the lock
> during the period between the timeout and the point where client (c1)
> initiates lock migration. However, that problem is present even with
> existing lock implementation and not really something new introduced by
> lock migration.
>
> 2. What if client connects but disconnects before it could've attempted to
> migrate "migration-in-progress" locks?
>
>     The server can identify locks belonging to this client using
> client-uuid and cleans them up. Dht trying to migrate locks after first
> disconnect will try to reacquire locks as outlined in 1.
>
> 3. What if client disconnects with src subvol and cannot get lock
> information from src for handling issues 1 and 2?
>
>     We'll mark the fd bad. We can optimize this to mark fd bad only if
> locks have been acquired. To do this client has to store some history in
> the fd on successful lock acquisition.
>
> regards,
>
> On Wed, Dec 17, 2014 at 12:45 PM, Raghavendra G <raghavendra at gluster.com>
> wrote:
>
>>
>>
>> On Wed, Dec 17, 2014 at 1:25 AM, Shyam <srangana at redhat.com> wrote:
>>>
>>> This mail intends to present the lock migration across subvolumes
>>> problem and seek solutions/thoughts around the same, so any
>>> feedback/corrections are appreciated.
>>>
>>> # Current state of file locks post file migration during rebalance
>>> Currently when a file is migrated during rebalance, its lock information
>>> is not transferred over from the old subvol to the new subvol, that the
>>> file now resides on.
>>>
>>> As further lock requests, post migration of the file, would now be sent
>>> to the new subvol, any potential lock conflicts would not be detected,
>>> until the locks are migrated over.
>>>
>>> The term locks above can refer to the POSIX locks acquired using the FOP
>>> lk by consumers of the volume, or to the gluster internal(?) inode/dentry
>>> locks. For now we limit the discussion to the POSIX locks supported by the
>>> FOP lk.
>>>
>>> # Other areas in gluster that migrate locks
>>> Current scheme of migrating locks in gluster on graph switches, trigger
>>> an fd migration process that migrates the lock information from the old fd
>>> to the new fd. This is driven by the gluster client stack, protocol layer
>>> (FUSE, gfapi).
>>>
>>> This is done using the (set/get)xattr call with the attr name,
>>> "trusted.glusterfs.lockinfo". Which in turn fetches the required key for
>>> the old fd, and migrates the lock from this old fd to new fd. IOW, there is
>>> very little information transferred as the locks are migrated across fds on
>>> the same subvolume and not across subvolumes.
>>>
>>> Additionally locks that are in the blocked state, do not seem to be
>>> migrated (at least the function to do so in FUSE is empty
>>> (fuse_handle_blocked_locks), need to run a test case to confirm), or
>>> responded to with an error.
>>>
>>> # High level solution requirements when migrating locks across subvols
>>> 1) Block/deny new lock acquisitions on the new subvol, till locks are
>>> migrated
>>>   - So that new locks that have overlapping ranges to the older ones are
>>> not granted
>>>   - Potentially return EINTR on such requests?
>>> 2) Ensure all _acquired_ locks from all clients are migrated first
>>>   - So that if and when placing blocked lock requests, these really do
>>> block for previous reasons and are not granted now
>>> 3) Migrate blocked locks post acquired locks are migrated (in any order?)
>>>     - OR, send back EINTR for the blocked locks
>>>
>>> (When we have upcalls/delegations added as features, those would have
>>> similar requirements for migration across subvolumes)
>>>
>>> # Potential processes that could migrate the locks and issues thereof
>>> 1) The rebalance process, that migrates the file can help with migrating
>>> the locks, which would not involve any clients to the gluster volume
>>>
>>> Issues:
>>>    - Lock information is fd specific, when migrating these locks, the
>>> clients need not have detected that the file is migrated, and hence opened
>>> an fd against the new subvol, which when missing, would make this form of
>>> migration a little more interesting
>>>    - Lock information also has client connection specific pointer
>>> (client_t) that needs to be reassigned on the new subvol
>>>    - Other subvol specific information, maintained in the lock, that
>>> needs to be migrated over will suffer the same limitations/solutions
>>>
>>
>> The tricky thing here is that rebalance process has no control over when
>> 1. fd will be opened on dst-node, since clients open fd on dst-node
>> on-demand based on the I/O happening through them.
>> 2. client establishes connection on dst-node (client might've been cut
>> off from dst-node).
>>
>> Unless we've a global mapping (like a client can always be identified
>> using same uuid irrespective of the brick we are looking) this seems like a
>> difficult thing to achieve.
>>
>>
>>> Benefits:
>>>    - Can lock out/block newer lock requests effectively
>>>    - Need not _wait_ till all clients have registered that the file is
>>> under migration and/or migrated their locks
>>>
>>> 2) DHT xlator in each client could be held responsible to migrate its
>>> locks to the new subvolume
>>>
>>> Issues:
>>>    - Somehow need to let every client know that locks need to be
>>> migrated (upcall infrastructure?)
>>>    - What if some client is not reachable at the given time?
>>>    - Have to wait till all clients replay the locks
>>>
>>> Benefits:
>>>    - Hmmm... Nothing really, if we could do it by the rebalance process
>>> itself the solution maybe better.
>>>
>>> # Overall thoughts
>>> - We could/should return EINTR for blocked locks, in the case of a graph
>>> switch, and the case of a file migration, this would relieve the design of
>>> that particular complexity, and is a legal error to return from a
>>> flock/fcntl operation
>>>
>>> - If we can extract and map out all relevant lock information across
>>> subvolumes, then having rebalance do this work seems like a good fit.
>>> Additionally this could serve as a good way to migrate upcall requests and
>>> state as well
>>>
>>> Thoughts?
>>>
>>> Shyam
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
>>
>> --
>> Raghavendra G
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G