[Gluster-devel] Lack of named lookups during resolution of inodes after graph switch (was Discuss: http://review.gluster.org/#/c/11368/)

Raghavendra G raghavendra at gluster.com
Mon Jul 20 07:17:35 UTC 2015


On Mon, Jul 20, 2015 at 12:39 PM, Raghavendra Gowdappa <rgowdapp at redhat.com>
wrote:

> +gluster-devel
>
> ----- Original Message -----
> > From: "Dan Lambright" <dlambrig at redhat.com>
> > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > <nbalacha at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > Sent: Monday, July 20, 2015 8:23:16 AM
> > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> >
> >
> > I am posting another version of the patch for discussion. Here is a
> > summary in its simplest form:
> >
> > The fix tries to address problems we have with tiered volumes and
> > fix-layout.
> >
> > If we try to use both the hot and cold tiers before fix-layout has
> > completed, we get many "stale file" errors; the new hot tier does not
> > have layouts for the inodes.
> >
> > To avoid such problems, we only use the cold tier until fix-layout is
> > done (subvolume count = 1).
> >
> > When we detect that fix-layout is done, we will do a graph switch, which
> > will create new layouts on demand. We would like to switch to using both
> > tiers (subvolume_cnt = 2) only once the graph switch is done.
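> >
> > (For reference, a rough picture of the relevant part of a tiered client
> > graph once both tiers are in play; the volume/xlator names here are only
> > illustrative, not the exact names glusterd generates. The interim state
> > described above behaves as if only the cold subvolume were wired into
> > this xlator, and the graph switch is what brings the hot subvolume into
> > use.)
> >
> > volume demo-tier-dht
> >     type cluster/tier
> >     subvolumes demo-cold-dht demo-hot-dht
> > end-volume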
> >
> > There is a hole in that solution. If we make a directory after fix-layout
> > has already passed its parent, fix-layout will not create the new
> > directory on the new tier.
> >
> > If we try to access such directories, the code fails (dht_access does not
> > have a cached sub volume).
> >
> > So, we detect such directories when we do a lookup/revalidate/discover,
> > and store their peculiar state in the layout if they are only accessible
> > on the cold tier. Eventually a self heal will happen, and this state will
> > age out.
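> >
> > A minimal sketch of what I mean by storing that state, with made-up names
> > rather than the real dht structures (the actual patch hooks into
> > dht_get_cached_subvol and keeps the state in the layout; this only shows
> > the routing decision):
> >
> > #include <stdio.h>
> >
> > struct subvol { const char *name; };
> >
> > /* Per-directory state remembered at lookup/revalidate/discover time. */
> > struct dir_layout {
> >         int on_cold_only;  /* set while the dir exists only on the cold tier */
> > };
> >
> > /* Stand-in for the cached-subvol hook: until a self heal creates the
> >  * directory on the hot tier and clears the flag, keep routing to cold. */
> > static struct subvol *
> > tier_pick_subvol(struct dir_layout *layout, struct subvol *cold,
> >                  struct subvol *cached)
> > {
> >         return layout->on_cold_only ? cold : cached;
> > }
> >
> > int
> > main(void)
> > {
> >         struct subvol cold = { "cold-dht" }, hot = { "hot-dht" };
> >         struct dir_layout newdir = { 1 };
> >
> >         printf("route to: %s\n",
> >                tier_pick_subvol(&newdir, &cold, &hot)->name);
> >         return 0;
> > }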
> >
> > I have a unit test and system test for this.
> >
> > Basically my questions are:
> > - What is the cleanest way to invoke the graph switch from using just the
> > cold tier to using both the cold and hot tiers?
>
> This is a long-standing problem which needs a fix very badly. I think the
> client/mount cannot rely on the rebalance/tier process for directory
> creation, since I/O on the client is independent and there is no way to
> synchronize it with the rebalance directory heal. The culprit here is the
> lack of hierarchical named lookups from root down to that directory after a
> graph switch in the mount process. If named lookups are sent, dht is quite
> capable of creating directories on newly added subvols. So, I am proposing
> some solutions below.
>
> Interface layers (fuse-bridge, gfapi, NFS, etc.) should make sure that the
> entire directory hierarchy up to root is looked up at least once before
> sending fops on an inode after a graph switch. For dht, it is sufficient if
> only directory inodes are looked up in this fashion. However, non-directory
> inodes might also benefit from this, since VFS essentially would've done a
> hierarchical lookup before doing fops. It is only glusterfs which has
> introduced nameless lookups, but much of the logic is designed around named
> hierarchical lookups. Now, to address the question of whether it is possible
> for interface layers to figure out the ancestry of an inode:
>
>     * With fuse-bridge, the entire dentry structure is preserved (at least
> in the first graph which witnessed named lookups from the kernel, and we can
> migrate this structure to newer graphs too). We can use the dentry structure
> from the older graph to send these named lookups and build a similar dentry
> structure in the newer graph too. This resolution is still on-demand, when a
> fop is sent on an inode (like the existing code, the change being that
> instead of one nameless lookup on the inode, we do named lookups of the
> parents and the inode in the newer graph). So, named lookups can be sent for
> all inodes, irrespective of whether the inode corresponds to a directory or
> a non-directory. (A rough sketch of this resolution follows the list below.)
>
>     * I am assuming gfapi is similar to fuse-bridge. This would need
> verification from the people maintaining gfapi as to whether my assumption
> is correct.
>
>     * The NFSv3 server allows the client to just pass a file-handle, and it
> can construct the relevant state to access the files from that (one of the
> reasons why nameless lookups were introduced in the first place). Since it
> relies heavily on nameless lookups, the dentry structure need not always be
> present in the NFS server process. However, we can borrow some ideas from
> [1]. If maintaining the list of parents of a file in xattrs seems like
> overkill (basically we would be constructing a reverse dentry tree), then at
> least for the problems faced by dht/tier it is good enough to get this
> hierarchy for directory inodes. With the gfid-based backend, we can always
> get the path/hierarchy of a directory from the gfid of its inode using the
> .glusterfs directory (within .glusterfs there is a symbolic link named after
> the gfid whose contents can get us the ancestry up to root; see the second
> sketch below). This solution works for _all_ interface layers.
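>
> A rough, self-contained sketch of the fuse-bridge resolution above
> (simplified stand-in types, not the real inode_t/dentry_t API): collect the
> ancestry from the old graph's dentry chain and replay named lookups
> top-down on the new graph, so every ancestor is resolved by a named lookup
> before the inode itself.
>
> #include <stdio.h>
>
> #define MAX_DEPTH 64
>
> /* Toy stand-ins for inode_t/dentry_t; a real resolver would walk the old
>  * graph's inode table and issue the lookups on the new graph. */
> struct fake_inode {
>         struct fake_inode *parent;  /* NULL for root */
>         const char        *name;    /* basename within the parent */
> };
>
> /* Hypothetical named-lookup hook into the new graph. */
> static int
> named_lookup_on_new_graph(struct fake_inode *parent, const char *name)
> {
>         printf("LOOKUP %s (parent %s)\n", name,
>                (parent && parent->name) ? parent->name : "/");
>         return 0;
> }
>
> /* Collect ancestors bottom-up, then replay named lookups top-down. */
> static int
> resolve_by_named_lookups(struct fake_inode *inode)
> {
>         struct fake_inode *chain[MAX_DEPTH];
>         int depth = 0;
>
>         for (struct fake_inode *i = inode; i && i->parent; i = i->parent) {
>                 if (depth == MAX_DEPTH)
>                         return -1;
>                 chain[depth++] = i;
>         }
>         while (depth--)
>                 if (named_lookup_on_new_graph(chain[depth]->parent,
>                                               chain[depth]->name) != 0)
>                         return -1;
>         return 0;
> }
>
> int
> main(void)
> {
>         struct fake_inode root = { NULL, NULL };
>         struct fake_inode z = { &root, "z" }, z1 = { &z, "z1" };
>         struct fake_inode z2 = { &z1, "z2" };
>
>         return resolve_by_named_lookups(&z2);  /* z, then z1, then z2 */
> }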
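>
> And a rough illustration of the .glusterfs trick for directory inodes, run
> against a brick's backend path (error handling trimmed; this relies on the
> standard backend layout where the symlink for a directory gfid points at
> ../../<xx>/<yy>/<parent-gfid>/<basename>):
>
> #include <limits.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
>
> #define ROOT_GFID "00000000-0000-0000-0000-000000000001"
>
> /* Walk .glusterfs/<g[0:2]>/<g[2:4]>/<gfid> symlinks from a directory gfid
>  * up to the root gfid, printing (basename, parent-gfid) pairs; reversed,
>  * that is the directory's path relative to the brick root. */
> static int
> print_ancestry(const char *brick, const char *gfid)
> {
>         char cur[64];
>
>         snprintf(cur, sizeof(cur), "%s", gfid);
>         while (strcmp(cur, ROOT_GFID) != 0) {
>                 char linkpath[PATH_MAX], target[PATH_MAX];
>                 char *base, *pgfid;
>                 ssize_t n;
>
>                 snprintf(linkpath, sizeof(linkpath),
>                          "%s/.glusterfs/%.2s/%.2s/%s",
>                          brick, cur, cur + 2, cur);
>                 n = readlink(linkpath, target, sizeof(target) - 1);
>                 if (n < 0)
>                         return -1;
>                 target[n] = '\0';
>
>                 base = strrchr(target, '/');    /* -> "/<basename>" */
>                 if (!base)
>                         return -1;
>                 *base++ = '\0';
>                 pgfid = strrchr(target, '/');   /* -> "/<parent-gfid>" */
>                 pgfid = pgfid ? pgfid + 1 : target;
>
>                 printf("'%s' is a child of %s\n", base, pgfid);
>                 snprintf(cur, sizeof(cur), "%s", pgfid);
>         }
>         return 0;
> }
>
> int
> main(int argc, char **argv)
> {
>         if (argc != 3) {
>                 fprintf(stderr, "usage: %s <brick-path> <directory-gfid>\n",
>                         argv[0]);
>                 return 1;
>         }
>         return print_ancestry(argv[1], argv[2]) != 0;
> }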
>
> I suspect it is not just dht, but also other cluster xlators like EC and
> afr, and non-cluster entities like quota and geo-rep, which face this issue.
> I am aware of at least one problem in afr - difficulty in identifying a gfid
> mismatch of an entry across subvols after a graph switch. Geo-replication
> too is using some form of gfid-to-path conversion. So, comments from other
> maintainers/developers are highly appreciated.
>
> [1] http://review.gluster.org/5951
>
> > - Is the mechanism I am using (a hook in dht_get_cached_subvol, state in
> > the layout structure) sufficient to prevent access to the hot tier before
> > a self heal happens?
> >
> > ----- Original Message -----
> > > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > To: "Dan Lambright" <dlambrig at redhat.com>
> > > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > > <nbalacha at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > > Sent: Friday, July 17, 2015 9:50:27 PM
> > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > >
> > > Sure. I am fine with it. We'll have a google hangout then.
> > >
> > > ----- Original Message -----
> > > > From: "Dan Lambright" <dlambrig at redhat.com>
> > > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > > > <nbalacha at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > > > Sent: Friday, July 17, 2015 10:44:47 PM
> > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > >
> > > > Hi Du,
> > > >
> > > > 7:30 PM IST on Monday? Just like last time.
> > > >
> > > > Dan
> > > >
> > > > ----- Original Message -----
> > > > > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > > > To: "Dan Lambright" <dlambrig at redhat.com>
> > > > > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > > > > <nbalacha at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > > > > Sent: Friday, July 17, 2015 12:47:21 PM
> > > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > >
> > > > > Hi Dan,
> > > > >
> > > > > Monday is fine with me. Let us know the time you'll be available.
> > > > >
> > > > > regards,
> > > > > Raghavendra.
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Dan Lambright" <dlambrig at redhat.com>
> > > > > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > > > > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > > > > > <nbalacha at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > > > > > Sent: Friday, July 17, 2015 6:48:16 PM
> > > > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > > >
> > > > > > Du, Shyam,
> > > > > >
> > > > > > Let's follow up with a meeting. Is today or Monday possible?
> > > > > >
> > > > > > Dan
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Dan Lambright" <dlambrig at redhat.com>
> > > > > > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > > > > > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > > > > > > <nbalacha at redhat.com>, "Susant Palai" <spalai at redhat.com>,
> > > > > > > "Sakshi Bansal" <sabansal at redhat.com>
> > > > > > > Sent: Wednesday, July 15, 2015 9:44:51 PM
> > > > > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > > > >
> > > > > > > Du,
> > > > > > >
> > > > > > > Per our discussion today, here is a bit more info on the
> > > > > > > problem.
> > > > > > >
> > > > > > > In QE, they untar a large file, and while that happens they
> > > > > > > attach a tier. This causes us to use the hot subvolume (the
> > > > > > > hashed subvolume) before fix-layout has finished, which leads
> > > > > > > to stale file handle errors.
> > > > > > >
> > > > > > > I can recreate this with the steps below.
> > > > > > >
> > > > > > > 1. create a dist rep volume.
> > > > > > > 2. mount it over FUSE.
> > > > > > > 3. mkdir -p /mnt/z/z1/z2/z3
> > > > > > > 4. cd /mnt/z/z1
> > > > > > >
> > > > > > > # The next steps ensure that fix-layout is NOT done: we do not
> > > > > > > start the rebalance daemon.
> > > > > > >
> > > > > > > 5. stop volume
> > > > > > > 6. attach tier
> > > > > > > 7. start volume
> > > > > > >
> > > > > > > 8.example1: stat z2/z3
> > > > > > >
> > > > > > > 8.example2: mkdir z2/newdir
> > > > > > >
> > > > > > > Either example 1 or example 2 produces the problem. We can end
> > > > > > > up in the underlying hot DHT translator, in
> > > > > > > dht_log_new_layout_for_dir_selfheal(). But no directories have
> > > > > > > been created on the hot subvolume, so it cannot heal anything
> > > > > > > and returns a stale file handle error.
> > > > > > >
> > > > > > > The flow for example 2 is:
> > > > > > >
> > > > > > > tier DHT: fresh lookup / hashed subvol is cold, calls lookup on
> > > > > > > the cold subvol
> > > > > > >
> > > > > > > tier DHT: lookup_cbk calls dht_lookup_directory on both the hot
> > > > > > > AND cold subvolumes
> > > > > > >
> > > > > > > cold DHT: is_revalidate is true, this works
> > > > > > >
> > > > > > > hot DHT: fresh lookup / no hashed subvol
> > > > > > >
> > > > > > > hot DHT: lookup dir cbk gets a -1 / 116 (ESTALE) error for each
> > > > > > > subvol
> > > > > > >
> > > > > > > tier DHT: lookup dir cbk gets -1 / 116 from the hot tier
> > > > > > >
> > > > > > > tier DHT: lookup dir cbk gets 0 / 117 from the cold tier (ok)
> > > > > > >
> > > > > > > tier DHT: then goes to self heal - dht_selfheal_directory
> > > > > > >
> > > > > > > tier DHT: dht_selfheal_dir_mkdir is called; returns 0;
> > > > > > > missing_dirs = 0.
> > > > > > >
> > > > > > > fuse apparently retries this, and the process repeats a few
> > > > > > > times before failing to the user.
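> > > > > > >
> > > > > > > A stripped-down illustration (made-up structures, not the real
> > > > > > > dht ones) of why that heal is a no-op here: as far as I can
> > > > > > > tell, only subvols whose lookup reported ENOENT are counted as
> > > > > > > missing, so the hot tier's -1 / 116 (ESTALE) leaves missing_dirs
> > > > > > > at 0 and nothing gets created there.
> > > > > > >
> > > > > > > #include <errno.h>
> > > > > > > #include <stdio.h>
> > > > > > >
> > > > > > > struct fake_layout_entry {
> > > > > > >         int err;  /* per-subvol errno from the lookup */
> > > > > > > };
> > > > > > >
> > > > > > > static int
> > > > > > > count_missing_dirs(struct fake_layout_entry *list, int cnt,
> > > > > > >                    int force)
> > > > > > > {
> > > > > > >         int missing_dirs = 0;
> > > > > > >
> > > > > > >         for (int i = 0; i < cnt; i++)
> > > > > > >                 if (list[i].err == ENOENT || force)
> > > > > > >                         missing_dirs++;
> > > > > > >         return missing_dirs;
> > > > > > > }
> > > > > > >
> > > > > > > int
> > > > > > > main(void)
> > > > > > > {
> > > > > > >         /* cold tier answered the lookup, hot tier gave ESTALE */
> > > > > > >         struct fake_layout_entry layout[2] = { { 0 }, { ESTALE } };
> > > > > > >
> > > > > > >         printf("missing_dirs = %d\n",
> > > > > > >                count_missing_dirs(layout, 2, 0));
> > > > > > >         return 0;
> > > > > > > }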
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > > > > > > To: "Shyam" <srangana at redhat.com>
> > > > > > > > Cc: "Nithya Balachandran" <nbalacha at redhat.com>, "Dan
> > > > > > > > Lambright"
> > > > > > > > <dlambrig at redhat.com>, "Susant Palai"
> > > > > > > > <spalai at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > > > > > > > Sent: Wednesday, July 15, 2015 4:39:56 AM
> > > > > > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > > > > >
> > > > > > > > If possible, can we start at 7:00 PM IST? I have to leave by
> > > > > > > > 8:15 PM. The discussion might not be over if we start at
> > > > > > > > 7:30 PM.
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Shyam" <srangana at redhat.com>
> > > > > > > > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "Nithya
> > > > > > > > > Balachandran"
> > > > > > > > > <nbalacha at redhat.com>, "Dan Lambright"
> > > > > > > > > <dlambrig at redhat.com>, "Susant Palai" <spalai at redhat.com>
> > > > > > > > > Sent: Tuesday, July 14, 2015 11:04:01 PM
> > > > > > > > > Subject: Discuss: http://review.gluster.org/#/c/11368/
> > > > > > > > >
> > > > > > > > > Tier xlator needs some discussion on this change:
> > > > > > > > > http://review.gluster.org/#/c/11368/
> > > > > > > > >
> > > > > > > > > I think we can leverage tomorrow's Team on-demand meeting
> > > > > > > > > for the same.
> > > > > > > > >
> > > > > > > > > So I request that we convene at 7:30 PM IST tomorrow for
> > > > > > > > > this; do let us know if you cannot make it.
> > > > > > > > >
> > > > > > > > > If there is a better time, let us know.
> > > > > > > > >
> > > > > > > > > Shyam
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G