[Gluster-devel] dht renamedir transactions, failures and crash consistency

Thu Nov 3 04:25:37 UTC 2016


----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "Gluster Devel" <gluster-devel at gluster.org>
> Cc: "Nithya Balachandran" <nbalacha at redhat.com>, "Csaba Henk" <chenk at redhat.com>, "Kotresh Hiremath Ravishankar"
> <khiremat at redhat.com>
> Sent: Thursday, November 3, 2016 9:51:42 AM
> Subject: Re: dht renamedir transactions, failures and crash consistency
> 
> + gluster-devel
> 
> ----- Original Message -----
> > > > 
> > > > Hi all,
> > > > 
> > > > This mail is to consolidate three efforts that are in progress to fix
> > > > issues around the renamedir codepath in dht:
> > > > 
> > > > 1. Transactions by Kotresh [1] - this makes renamedir atomic (barring
> > > > failures and crash consistency issues) with respect to ops like mkdir,
> > > > lookup-heal and rmdir.
> > > 
> > > Please note that transactions too are inadequate to address
> > > crash-consistency/snapshot related issues.
> > > 
> > > > 
> > > > 2. Rollback of renamedirs successfully completed on some subvols in
> > > > case of a failed renamedir - Csaba is working on this (patch yet to be
> > > > posted).

Rollback patch for renamedir can be found at http://review.gluster.org/15739

> > > > The idea discussed involves dht_renamedir remembering the result of
> > > > renamedir from each subvol and rolling back the successful operations
> > > > in case of renamedir failure. Note that this approach won't solve the
> > > > issues with a client crashing in the middle of a renamedir, or issues
> > > > with snapshots taken while a renamedir is in progress (seen after
> > > > restoring them).
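
A minimal sketch of the rollback idea described above, not the code from the patch. It assumes each subvol can be modelled as a local directory and that renames are issued one subvol at a time; the function name renamedir_with_rollback and its arguments are hypothetical. It does not handle a failure of the rollback itself, nor a client crash mid-way, which is exactly the limitation noted in the quoted text.

```python
import os


def renamedir_with_rollback(subvols, src, dst):
    """Rename src to dst on every subvol; undo completed renames on failure.

    subvols is a list of directory roots standing in for dht subvolumes
    (an assumption made for illustration only).
    Returns (True, None) on success or (False, error) after rolling back.
    """
    done = []  # subvols on which the rename has already succeeded
    for root in subvols:
        try:
            os.rename(os.path.join(root, src), os.path.join(root, dst))
            done.append(root)
        except OSError as err:
            # Undo the successful renames, most recent first.
            # A failure here is not handled in this sketch.
            for undo_root in reversed(done):
                os.rename(os.path.join(undo_root, dst),
                          os.path.join(undo_root, src))
            return False, err
    return True, None
```

If src is missing on the last of three subvols, the first two renames are undone and the caller sees the error from the third.
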
> > > > 
> > > > 3. A proposal by Nithya to fail mkdir during directory self-heal
> > > > initiated by the dht_lookup codepath. This will:
> > > >    3a. Solve the race between a lookup(src)/lookup(dst) and
> > > >    rename(src, dst), as lookup won't be able to create src/dst.
> > > >    3b. Avoid worsening the situation by messing up gfid handles (on
> > > >    the backend) when lookup-heal creates src, dst, or both after a
> > > >    failed renamedir.
> > > > 
> > > >    However, solution 3 is only damage control and won't fix everything
> > > >    that goes wrong with a failed renamedir.
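
To illustrate proposal 3, here is a rough sketch of a heal-time mkdir that refuses to create a second directory for a gfid that is already linked to another path. The Brick class, its gfid_to_path map and heal_mkdir are hypothetical names used only for this example; a real brick tracks gfid handles on the backend rather than in a Python dict.

```python
import errno


class Brick:
    """Toy stand-in for a brick: remembers which path each gfid is linked to."""

    def __init__(self):
        self.gfid_to_path = {}  # hypothetical in-memory gfid index

    def heal_mkdir(self, path, gfid):
        """mkdir issued by lookup-heal; fails if gfid already has another path."""
        existing = self.gfid_to_path.get(gfid)
        if existing is not None and existing != path:
            # Fail with EEXIST and hand back the path already holding the gfid.
            raise FileExistsError(errno.EEXIST,
                                  "gfid already linked to another directory",
                                  existing)
        self.gfid_to_path[gfid] = path
```

The caller can read the conflicting path from the exception's filename attribute, which corresponds to "return the other path associated with gfid" in the discussion.
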
> > > > 
> > > > I think there is quite a bit of dependency among the three approaches.
> > > > 
> > > > Approach 2 depends on 1 and 3 because:
> > > > 1. lookup-heal could've already healed src, dst, or both before we try
> > > > to roll back.
> > > > 2. Transactions (by locking out lookup-heal) or proposal 3 (by failing
> > > > heal) make sure that the directory namespace is not tampered with until
> > > > a renamedir is complete, paving the way for rollback.
> > > > 
> > > > Also, we can build on top of 3 to recover from a crashed renamedir or
> > > > restored snapshots in lookup-heal (essentially solution 2, implemented
> > > > in lookup-heal to either roll back or roll forward). My thoughts are
> > > > below:
> > > > 
> > > > Once transactions for entry operations on directories are in place,
> > > > lookup-selfheal will be able to identify a failed renamedir operation
> > > > as follows:
> > > > 
> > > > 1. It can figure out that a gfid has been associated with more than one
> > > > directory. For this, we need to either make mkdir during healing fail
> > > > with EEXIST if the directory exists (Proposal 3 above), possibly
> > > > returning the other path associated with the gfid, or do a lookup on
> > > > the gfid and fetch the paths associated with it.
> > > > 2. No renamedir is in progress (as we are in a transaction), and
> > > > renamedir is the only operation (apart from mkdir and rmdir) that
> > > > changes the association between a path and a gfid for directories.
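
The first check above (a gfid associated with more than one directory) can be sketched as below. This assumes the per-subvol path-to-gfid associations have already been gathered, for example through lookups; the function name find_split_gfids and the input shape are made up for illustration.

```python
from collections import defaultdict


def find_split_gfids(subvol_dirs):
    """Return {gfid: sorted paths} for gfids seen under more than one path.

    subvol_dirs maps a subvol name to a {path: gfid} dict describing the
    directories present on that subvol (a hypothetical input format).
    """
    paths_by_gfid = defaultdict(set)
    for entries in subvol_dirs.values():
        for path, gfid in entries.items():
            paths_by_gfid[gfid].add(path)
    # More than one path for a gfid suggests an incomplete renamedir.
    return {gfid: sorted(paths)
            for gfid, paths in paths_by_gfid.items()
            if len(paths) > 1}
```

For a renamedir that completed on one subvol only, the same gfid shows up as dst there and as src elsewhere, so it is reported with both paths. This is only meaningful under the second condition above, that is, when no renamedir is currently in progress.
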
> > > > 
> > > > Once we are able to identify a failed renamedir, we can possibly roll
> > > > back. The ambiguity here is whether the renamedir failed (client crash
> > > > scenario) or succeeded (snapshots). Since for snapshots it doesn't make
> > > > a difference whether the renamedir succeeded or failed, we can always
> > > > assume failure and implement rollback.
> > 
> > After today's meeting, the following are the problems with rollback after
> > a crash of the client doing renamedir (or recovery of a snapshotted volume
> > with a renamedir in progress):
> > 
> > 1. Where to put recovery code?
> >    The code has to be put in all places which modify the directory path,
> >    i.e., rmdir, renamedir and lookup-heal. The reason is that another
> >    client might've already issued a parallel operation and blocked on
> >    locks. The moment the client with a renamedir in progress crashes, the
> >    other rmdir/renamedir/lookup-heal would get the lock and proceed. So,
> >    all these fops should be able to identify a crashed renamedir op and
> >    recover from it.
> > 
> > 2. How to identify src/dst (of crashed renamedir) for rollback?
> >    The preferred way is to store the src and dst on the brick and use that
> >    information for rollback. There is a proposal to see whether JBR helps.
> > 
> > We decided not to go ahead with providing crash consistency for renamedir,
> > given the above complexity and the relative infrequency of this issue.
> > However, if snapshots become popular we may have to revisit the problem.
> > 
> > The other three efforts will continue.
> > 
> > > > 
> > > > In a nutshell, 1 and 3 are two relatively independent changes which can
> > > > be leveraged by 2.
> > > > 
> > > > Comments?
> > > > 
> > > > [1] http://review.gluster.org/15472
> > > > 
> > > > regards,
> > > > Raghavendra
> > > > 
> > > > 
> > > 
> > > 
> > 
>