[Gluster-devel] RFC on posix locks migration to new graph after a switch

Raghavendra Gowdappa rgowdapp at redhat.com
Tue Oct 16 08:43:40 UTC 2012


My previous mail addressed a graph-switch scenario where no bricks were added or removed. However, when we increase the replica/stripe count, we need to acquire the locks freshly on the newly added bricks. This can be done in client_open_cbk: if there is a lock context stored in the fd, client_open_cbk needs to reacquire all the locks stored there. On bricks which were already part of the old volume these lock requests are pure overhead, since they will surely conflict with the locks held by the old graph; however, there is no loss of correctness.
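
A minimal sketch of that reacquisition (illustrative only, not the
actual protocol/client code; saved_lock_t, send_setlk and
reacquire_locks are made-up names standing in for the real fd-context
and rpc machinery):

#include <stdint.h>
#include <stdio.h>

/* one granted lock remembered in the fd context */
typedef struct saved_lock {
        uint64_t owner;                 /* lock-owner */
        int      type;                  /* F_RDLCK / F_WRLCK */
        uint64_t start;                 /* byte range */
        uint64_t len;
        struct saved_lock *next;
} saved_lock_t;

/* stand-in for sending a SETLK request to the brick over the new fd */
static int
send_setlk (int newfd, saved_lock_t *l)
{
        printf ("SETLK fd=%d owner=%llu range=[%llu,+%llu) type=%d\n",
                newfd, (unsigned long long)l->owner,
                (unsigned long long)l->start,
                (unsigned long long)l->len, l->type);
        return 0;
}

/* called from the open callback on the fd opened in the new graph:
 * walk the locks remembered in the fd context and re-issue each one.
 * On bricks that were already part of the old volume these requests
 * are redundant; on newly added bricks they acquire the locks fresh. */
static void
reacquire_locks (int newfd, saved_lock_t *locks)
{
        saved_lock_t *l = NULL;

        for (l = locks; l; l = l->next)
                send_setlk (newfd, l);
}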

regards,
----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "Anand Avati" <aavati at redhat.com>
> Cc: gluster-devel at nongnu.org
> Sent: Monday, October 15, 2012 5:13:59 PM
> Subject: Re: [Gluster-devel] RFC on posix locks migration to new graph after a	switch
> 
> 1. For every fd that is open in the old graph when the graph switch
> happens, fuse does a getxattr (oldfd, LOCKINFO, dict).
> 
> 2. posix-locks fills in appropriate values for the LOCKINFO xattr. Each
> posix-locks translator in the volume fills the dict with the following
> key/value pair:
> 
>    key: combination of its hostname and brick-name
>    value: if locks are held on this fd, the fd_no field of the lock
>    structure. Currently fd_no is (uint64_t)fdptr and hence it is
>    guaranteed to be unique across all the connections (if this changes
>    in future, we need to add a connection identifier to the value
>    too).
> 
>    Cluster translators send the getxattr and setxattr calls with
>    LOCKINFO as the key to the same children to which a setlk would
>    have been sent. If the getxattr is sent to more than one child, the
>    results are aggregated by the cluster translators. (A rough sketch
>    of this step and step 4 follows after the list.)
> 
> 3. fuse does a setxattr (newfd, LOCKINFO, dict), where dict is the
> result of the getxattr in step 1 and newfd is the fd opened in the new
> graph.
> 
> 4. a. posix-locks looks into the dict with its own <hostname, brick-name>
>    combination as the key.
>    b. if there is a value, the value is treated as oldfd_no. For all
>    the locks held on oldfd_no, it changes the following fields of the
>    lock structure:
>       i) lock->fd_no = fd_to_fdno (newfd)
>       ii) lock->trans = connection identifier of the connection on
>       which the setxattr came
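> 
> A rough sketch of steps 2 and 4 (illustrative only, not the actual
> posix-locks code; posix_lock_t here is stripped down and the helper
> names are made up):
> 
> #include <stdint.h>
> #include <stdio.h>
> 
> typedef struct posix_lock {
>         uint64_t fd_no;              /* currently (uint64_t)fdptr */
>         void    *trans;              /* connection the lock was taken on */
>         struct posix_lock *next;
> } posix_lock_t;
> 
> /* key under which each brick stores its entry in the LOCKINFO dict */
> static void
> lockinfo_key (char *buf, size_t len, const char *hostname,
>               const char *brickname)
> {
>         snprintf (buf, len, "%s:%s", hostname, brickname);
> }
> 
> /* step 2 (getxattr side): if any lock is held on this fd, publish the
>  * fd_no on which it was taken as the value for our key */
> static int
> fill_lockinfo (posix_lock_t *locks_on_fd, uint64_t *value)
> {
>         if (!locks_on_fd)
>                 return -1;              /* nothing to migrate */
>         *value = locks_on_fd->fd_no;
>         return 0;
> }
> 
> /* step 4 (setxattr side): re-point every lock taken on oldfd_no to the
>  * newly opened fd and to the connection the setxattr came in on */
> static void
> migrate_locks (posix_lock_t *locks_on_inode, uint64_t oldfd_no,
>                uint64_t newfd_no, void *new_trans)
> {
>         posix_lock_t *l = NULL;
> 
>         for (l = locks_on_inode; l; l = l->next) {
>                 if (l->fd_no != oldfd_no)
>                         continue;
>                 l->fd_no = newfd_no;    /* i)  fd_to_fdno (newfd) */
>                 l->trans = new_trans;   /* ii) new connection id */
>         }
> }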
> 
> regards,
> Raghavendra.
> 
> ----- Original Message -----
> > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > To: "Anand Avati" <aavati at redhat.com>
> > Cc: gluster-devel at nongnu.org
> > Sent: Thursday, June 21, 2012 1:07:11 AM
> > Subject: [Gluster-devel] RFC on posix locks migration to new graph
> > after a	switch
> > 
> > Avati,
> > 
> > We had relied on posix lock-healing (hereafter, "locks" refers to
> > posix locks) done by protocol/client to migrate locks to the new
> > graph. Lock healing is a feature implemented by protocol/client
> > which simply reacquires all the granted locks stored in the fd
> > context after a reconnect to the server. The way we leverage this
> > lock-healing feature of protocol/client to migrate posix locks to
> > the new graph is: we migrate fds to the new graph by opening a new
> > fd on the same file in the new graph (with the fd context copied
> > from the old graph), and protocol/client reacquires all the granted
> > locks in the fd context. But this solution has the following
> > issues:
> > 
> > 1. If we open fds in the new graph even before cleaning up the old
> > transport, the lock requests sent by protocol/client as part of
> > healing will conflict with the locks held on the old transport and
> > hence will fail (note that with only a client-side graph switch
> > there is a single inode on the server corresponding to two inodes
> > on the client - one for each of the old and new graphs). As a
> > result, the locks are not migrated. The problem could have been
> > solved if protocol/client had issued SETLKW requests instead of
> > SETLK (the lock requests issued as part of healing would then be
> > granted when the old transport eventually disconnects), but that
> > has a different set of issues. Even then, it is not a fool-proof
> > solution, since there might already be other conflicting lock
> > requests in the lock wait queue when protocol/client starts lock
> > healing, resulting in failure of the lock-heal.
> > 
> > 2. If we open fds in the new graph after cleaning up the old
> > transport, there is a window of time between old-transport cleanup
> > and lock-heal in the new graph where potentially conflicting lock
> > requests could be granted, thereby causing the lock requests sent
> > as part of lock healing to fail.
> > 
> > One solution I can think of is to bring in a SETLK_MIGRATE lock
> > command. SETLK_MIGRATE takes a transport identifier as a parameter
> > along with the usual arguments SETLK/SETLKW take (lock range,
> > lock-owner etc.). SETLK_MIGRATE migrates a lock from the transport
> > passed as the parameter to the transport on which the request came
> > in, if the two locks conflict only because they came from two
> > different transports (all else - lock range, lock-owner etc. -
> > being the same). In the absence of any live locks, SETLK_MIGRATE
> > behaves like the SETLK command.
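> > 
> > A sketch of the intended conflict check (illustrative only; plock_t
> > and the helpers are simplified stand-ins, not the posix-locks data
> > structures):
> > 
> > #include <stddef.h>
> > #include <stdint.h>
> > 
> > typedef struct plock {
> >         uint64_t owner;
> >         uint64_t start, len;         /* lock range */
> >         int      type;               /* F_RDLCK / F_WRLCK */
> >         void    *trans;              /* transport the lock is held from */
> >         struct plock *next;
> > } plock_t;
> > 
> > /* same lock in every respect except the transport it came from */
> > static int
> > same_lock_except_trans (const plock_t *a, const plock_t *b)
> > {
> >         return (a->owner == b->owner && a->start == b->start &&
> >                 a->len == b->len && a->type == b->type &&
> >                 a->trans != b->trans);
> > }
> > 
> > /* req carries old_trans as the extra SETLK_MIGRATE parameter.  If a
> >  * held lock differs from req only in its transport, and that
> >  * transport is the one named in the request, adopt it onto the new
> >  * transport instead of treating it as a conflict. */
> > static int
> > setlk_migrate (plock_t *held, plock_t *req, void *old_trans,
> >                void *new_trans)
> > {
> >         plock_t *l = NULL;
> > 
> >         for (l = held; l; l = l->next) {
> >                 if (same_lock_except_trans (l, req) &&
> >                     l->trans == old_trans) {
> >                         l->trans = new_trans;   /* migrate, no conflict */
> >                         return 0;
> >                 }
> >         }
> >         /* no live lock to adopt: fall back to plain SETLK (not shown) */
> >         return 1;
> > }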
> > 
> > protocol/client can make use of this SETLK_MIGRATE command in the
> > lock requests it sends as part of lock heal during the open fop, to
> > migrate locks to the new graph. Assuming the old transport is not
> > yet cleaned up at the time of lock-heal, SETLK_MIGRATE atomically
> > migrates the locks from the old transport to the new transport (on
> > the server). Now, the difficulty is in getting the identifier of
> > the old transport on the server from which the locks are currently
> > held. This can be solved if we store the peer transport identifier
> > in the lk-context on the client (it can easily be obtained in an lk
> > reply) and pass the same transport identifier to the server during
> > healing.
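> > 
> > A small sketch of that client-side bookkeeping (hypothetical names;
> > lk_ctx_t stands in for whatever we would keep in the lk-context):
> > 
> > #include <stdint.h>
> > 
> > typedef struct lk_ctx {
> >         uint64_t peer_transport_id;  /* learnt from the lk reply */
> >         int      have_peer_id;
> > } lk_ctx_t;
> > 
> > /* called from the lk callback with the identifier the server sent */
> > static void
> > lk_ctx_remember_peer (lk_ctx_t *ctx, uint64_t peer_id)
> > {
> >         ctx->peer_transport_id = peer_id;
> >         ctx->have_peer_id = 1;
> > }
> > 
> > /* called while building a SETLK_MIGRATE request during lock-heal */
> > static int
> > lk_ctx_peer_for_migrate (const lk_ctx_t *ctx, uint64_t *peer_id)
> > {
> >         if (!ctx->have_peer_id)
> >                 return -1;          /* nothing to migrate from */
> >         *peer_id = ctx->peer_transport_id;
> >         return 0;
> > }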
> > 
> > I haven't yet completely thought through some issues, like whether
> > protocol/client can unconditionally use SETLK_MIGRATE in all the
> > lock requests it sends as part of healing, or whether it should use
> > SETLK_MIGRATE only during the first attempt at healing after a
> > graph switch. However, even if protocol/client wants to make such a
> > distinction, it can easily be worked out (either by fuse setting a
> > special "migrate" key in the xdata of the open calls it sends as
> > part of fd-migration, or by some other mechanism).
> > 
> > Please let me know your thoughts on this.
> > 
> > regards,
> > Raghavendra.
> > 
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at nongnu.org
> > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> > 
> 



