[Gluster-devel] Lack of named lookups during resolution of inodes after graph switch (was Discuss: http://review.gluster.org/#/c/11368/)

Raghavendra Gowdappa rgowdapp at redhat.com
Mon Jul 20 07:09:28 UTC 2015


+gluster-devel

----- Original Message -----
> From: "Dan Lambright" <dlambrig at redhat.com>
> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran" <nbalacha at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> Sent: Monday, July 20, 2015 8:23:16 AM
> Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> 
> 
> I am posting another version of the patch to discuss. Here is a summary in
> its simplest form:
> 
> The fix tries to address problems we have with tiered volumes and fix-layout.
> 
> If we try to use both the hot and cold tier before fix-layout has completed,
> we get many "stale file" errors; the new hot tier does not have layouts for
> the inodes.
> 
> To avoid such problems, we only use the cold tier until fix-layout is done
> (subvolume count = 1).
> 
> When we detect fix-layout is done, we will do a graph switch which will
> create new layouts on demand. We would like to switch to using both tiers
> (subvolume_cnt=2) only once the graph switch is done.
> 
> There is a hole in that solution. If we make a directory after fix-layout
> has passed the parent (of the new directory), fix-layout will not copy the
> new directory to the new tier.
> 
> If we try to access such directories, the code fails (dht_access does not
> have a cached subvolume).
> 
> So, we detect such directories when we do a lookup/revalidate/discover, and
> store their peculiar state in the layout if they are only accessible on the
> cold tier. Eventually a self-heal will happen, and this state will age out.
> 
> I have a unit test and system test for this.
> 
> Basically my questions are:
> - What is the cleanest way to invoke the graph switch from using just the
> cold tier to using both the cold and hot tiers?

This is a long-standing problem that needs a fix very badly. I think the client/mount cannot rely on the rebalance/tier process for directory creation, since I/O on the client is independent and there is no way to synchronize it with the rebalance directory heal. The culprit here is the lack of hierarchical named lookups from the root down to the directory after a graph switch in the mount process. If named lookups are sent, dht is quite capable of creating directories on newly added subvols. So, I am proposing some solutions below.
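
To make the distinction concrete before getting to the solutions, here is a minimal sketch (glusterfs-style C, not code from any existing xlator or from the patch under review) of the two kinds of lookup a client-side layer can build. It assumes the standard loc_t fields from libglusterfs; ref-counting and error handling are elided.

/* Illustrative sketch only.  Assumes the standard loc_t fields
 * (inode, parent, name, gfid, pargfid) from libglusterfs. */

/* Nameless (gfid-only) lookup: all that reaches dht today after a
 * graph switch.  With no parent/name available, dht cannot create a
 * directory that is missing on the newly added (hot) subvolume, so
 * the lookup ends up failing with a stale file handle. */
static void
build_nameless_loc (loc_t *loc, inode_t *inode)
{
        loc->inode = inode_ref (inode);
        gf_uuid_copy (loc->gfid, inode->gfid);
}

/* Named lookup: what an interface layer would send for every ancestor,
 * root downwards.  With pargfid and basename present, dht's directory
 * self-heal can mkdir the directory on subvols where it is missing. */
static void
build_named_loc (loc_t *loc, inode_t *parent, const char *name,
                 inode_t *inode)
{
        loc->parent = inode_ref (parent);
        loc->inode  = inode_ref (inode);
        loc->name   = name;
        gf_uuid_copy (loc->pargfid, parent->gfid);
        gf_uuid_copy (loc->gfid, inode->gfid);
}

The point is simply that only the second form gives dht enough information (pargfid + basename) to run its directory self-heal on the newly added subvolume.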

Interface layers (fuse-bridge, gfapi, nfs etc.) should make sure that the entire directory hierarchy up to the root is looked up at least once before sending fops on an inode after a graph switch. For dht it is sufficient if only directory inodes are looked up in this fashion. However, non-directory inodes might also benefit from this, since VFS essentially would have done a hierarchical lookup before doing fops. It is only glusterfs that has introduced nameless lookups, while much of the logic is still designed around named hierarchical lookups. Now, to address the question of whether it is possible for interface layers to figure out the ancestry of an inode:

    * With fuse-bridge, the entire dentry structure is preserved (at least in the first graph, which witnessed named lookups from the kernel; we can migrate this structure to newer graphs too). We can use the dentry structure from the older graph to send these named lookups and build a similar dentry structure in the newer graph. This resolution is still on-demand, when a fop is sent on an inode (like the existing code, the only change being that instead of one nameless lookup on the inode, we do named lookups of its parents and then the inode itself in the newer graph); a rough sketch follows after this list. So, named lookups can be sent for all inodes, irrespective of whether the inode corresponds to a directory or a non-directory.

    * I am assuming gfapi is similar to fuse-bridge. This would need verification from the people maintaining gfapi.

    * The NFS-v3 server allows the client to pass just a file handle and can construct the relevant state to access the file (one of the reasons why nameless lookups were introduced in the first place). Since it relies heavily on nameless lookups, the dentry structure need not always be present in the NFS server process. However, we can borrow some ideas from [1]. If maintaining the list of parents of a file in xattrs seems like overkill (basically we would be constructing a reverse dentry tree), then at least for the problems faced by dht/tier it is good enough if we get this hierarchy for directory inodes. With a gfid-based backend, we can always get the path/hierarchy for a directory from the gfid of its inode using the .glusterfs directory (within .glusterfs there is a symbolic link named after the gfid whose contents give us the ancestry up to the root); a standalone sketch follows after reference [1] below. This solution works for _all_ interface layers.
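
The rough sketch referred to in the first bullet above: glusterfs-style C written against the generic inode/loc helpers in libglusterfs. named_lookup_sync() is a hypothetical stand-in for whatever synchronous lookup helper the interface layer already has (the syncop framework, for instance), and is assumed to return a ref'd inode linked into the new graph's inode table. This is not the existing fuse resolver, just an illustration of the idea.

/* Sketch of the on-demand, top-down resolution described above.
 * named_lookup_sync() is hypothetical; everything else uses the
 * standard libglusterfs inode/loc helpers. */
static int
resolve_by_named_lookups (xlator_t *new_subvol, inode_table_t *new_table,
                          inode_t *old_inode, inode_t **resolved)
{
        char    *path = NULL, *component = NULL, *saveptr = NULL;
        inode_t *parent = NULL, *linked = NULL;
        int      ret = 0;

        /* The old graph's inode table still holds the dentry chain the
         * kernel built via named lookups; recover the path from it. */
        ret = inode_path (old_inode, NULL, &path);
        if (ret < 0)
                return ret;

        /* Walk the path root-downwards, sending one named lookup per
         * component into the new graph.  Each named lookup lets dht/tier
         * create the directory on subvolumes where it is missing, and
         * builds an equivalent dentry in the new inode table. */
        parent = inode_ref (new_table->root);
        for (component = strtok_r (path, "/", &saveptr); component;
             component = strtok_r (NULL, "/", &saveptr)) {
                loc_t loc = {0, };

                loc.parent = inode_ref (parent);
                loc.name   = component;
                loc.inode  = inode_new (new_table);
                gf_uuid_copy (loc.pargfid, parent->gfid);

                ret = named_lookup_sync (new_subvol, &loc, &linked);
                loc_wipe (&loc);
                if (ret < 0)
                        break;

                inode_unref (parent);
                parent = linked;
        }

        if (ret == 0)
                *resolved = parent;
        else
                inode_unref (parent);

        GF_FREE (path);
        return ret;
}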

I suspect it is not just dht but also other cluster xlators like EC and afr, as well as non-cluster entities like quota and geo-rep, which face this issue. I am aware of at least one problem in afr - difficulty in identifying a gfid mismatch of an entry across subvols after a graph switch. Geo-replication too uses some form of gfid-to-path conversion. So, comments from other maintainers/developers are highly appreciated.

[1] http://review.gluster.org/5951
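
And the standalone sketch referred to in the NFS bullet: plain POSIX C that can be run against a brick's backend directory to recover a directory's ancestry from the gfid symlinks under .glusterfs. It assumes the usual backend layout where .glusterfs/<g1>/<g2>/<gfid> for a directory is a symlink of the form ../../<p1>/<p2>/<parent-gfid>/<basename>; the same walk done in-process (or exposed through some server-side interface) would give any interface layer the list of ancestors on which to send named lookups, root downwards.

/* Standalone sketch, not glusterfs code: given a brick path and a
 * directory's gfid, follow the .glusterfs gfid symlinks upwards and
 * print the ancestry until the root gfid is reached. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <limits.h>

#define ROOT_GFID "00000000-0000-0000-0000-000000000001"

int
main (int argc, char *argv[])
{
        char    gfid[64]         = {0};
        char    link[PATH_MAX]   = {0};
        char    target[PATH_MAX] = {0};
        char   *name             = NULL;
        char   *pgfid            = NULL;
        ssize_t len              = 0;

        if (argc != 3) {
                fprintf (stderr, "usage: %s <brick-path> <dir-gfid>\n",
                         argv[0]);
                return 1;
        }

        snprintf (gfid, sizeof (gfid), "%s", argv[2]);

        while (strcmp (gfid, ROOT_GFID) != 0) {
                /* .glusterfs/<first two chars>/<next two chars>/<gfid> */
                snprintf (link, sizeof (link), "%s/.glusterfs/%.2s/%.2s/%s",
                          argv[1], gfid, gfid + 2, gfid);

                len = readlink (link, target, sizeof (target) - 1);
                if (len < 0) {
                        perror (link);
                        return 1;
                }
                target[len] = '\0';

                /* For a directory the link target looks like
                 * ../../<p1>/<p2>/<parent-gfid>/<basename>: the last
                 * component is this directory's name, the one before it
                 * is the parent's gfid. */
                name = strrchr (target, '/');
                if (!name)
                        break;
                *name++ = '\0';
                pgfid = strrchr (target, '/');
                pgfid = pgfid ? pgfid + 1 : target;

                printf ("%s is named \"%s\" under parent %s\n",
                        gfid, name, pgfid);

                snprintf (gfid, sizeof (gfid), "%s", pgfid);
        }

        return 0;
}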

> - Is the mechanism I am using (a hook in dht_get_cached_subvol, state in the
> layout structure) sufficient to prevent access to the hot tier before a
> self-heal happens?
> 
> ----- Original Message -----
> > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > To: "Dan Lambright" <dlambrig at redhat.com>
> > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > <nbalacha at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > Sent: Friday, July 17, 2015 9:50:27 PM
> > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > 
> > Sure. I am fine with it. We'll have a google hangout then.
> > 
> > ----- Original Message -----
> > > From: "Dan Lambright" <dlambrig at redhat.com>
> > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > > <nbalacha at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > > Sent: Friday, July 17, 2015 10:44:47 PM
> > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > 
> > > Hi Du,
> > > 
> > > 7:30 PM IST Monday? Just like last time.
> > > 
> > > Dan
> > > 
> > > ----- Original Message -----
> > > > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > > To: "Dan Lambright" <dlambrig at redhat.com>
> > > > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > > > <nbalacha at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > > > Sent: Friday, July 17, 2015 12:47:21 PM
> > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > 
> > > > Hi Dan,
> > > > 
> > > > Monday is fine with me. Let us know the time you'll be available.
> > > > 
> > > > regards,
> > > > Raghavendra.
> > > > 
> > > > ----- Original Message -----
> > > > > From: "Dan Lambright" <dlambrig at redhat.com>
> > > > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > > > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > > > > <nbalacha at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > > > > Sent: Friday, July 17, 2015 6:48:16 PM
> > > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > > 
> > > > > Du, Shyam,
> > > > > 
> > > > > Let's follow up with a meeting. Is today or Monday possible?
> > > > > 
> > > > > Dan
> > > > > 
> > > > > ----- Original Message -----
> > > > > > From: "Dan Lambright" <dlambrig at redhat.com>
> > > > > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > > > > Cc: "Shyam" <srangana at redhat.com>, "Nithya Balachandran"
> > > > > > <nbalacha at redhat.com>, "Susant Palai" <spalai at redhat.com>,
> > > > > > "Sakshi Bansal" <sabansal at redhat.com>
> > > > > > Sent: Wednesday, July 15, 2015 9:44:51 PM
> > > > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > > > 
> > > > > > Du,
> > > > > > 
> > > > > > Per our discussion today- here is a bit more info on the problem.
> > > > > > 
> > > > > > In QE, they untar a large file, and while that happens attach a
> > > > > > tier. This causes us to use the hot subvolume (hashed subvolume)
> > > > > > before fix-layout has finished. This leads to stale file handle
> > > > > > errors.
> > > > > > 
> > > > > > I can recreate this with the steps below.
> > > > > > 
> > > > > > 1. create a dist rep volume.
> > > > > > 2. mount it over FUSE.
> > > > > > 3. mkdir -p /mnt/z/z1/z2/z3
> > > > > > 4. cd /mnt/z/z1
> > > > > > 
> > > > > > # The next steps ensure that fix-layout is NOT done: we do not
> > > > > > start the rebalance daemon.
> > > > > > 
> > > > > > 5. stop volume
> > > > > > 6. attach tier
> > > > > > 7. start volume
> > > > > > 
> > > > > > 8.example1: stat z2/z3
> > > > > > 
> > > > > > 8.example2: mkdir z2/newdir
> > > > > > 
> > > > > > Either example1 or example2 produces the problem. We can end up in
> > > > > > the underlying hot DHT translator, in
> > > > > > dht_log_new_layout_for_dir_selfheal(). But no directories have been
> > > > > > created on the hot subvolume. It cannot heal anything, and returns a
> > > > > > stale file handle error.
> > > > > > 
> > > > > > The flow for example 2 is:
> > > > > > 
> > > > > > tier DHT: fresh lookup / hashed subvol is cold, calls lookup on the
> > > > > > cold tier
> > > > > > 
> > > > > > tier DHT: lookup_cbk calls dht_lookup_directory on both the hot AND
> > > > > > cold subvolumes
> > > > > > 
> > > > > > cold DHT: revalidate is true, this works
> > > > > > 
> > > > > > hot DHT: fresh lookup / no hashed subvol
> > > > > > 
> > > > > > hot DHT: lookup dir cbk gets -1 / 116 (ESTALE) for each subvol
> > > > > > 
> > > > > > tier DHT: lookup dir cbk gets -1 / 116 from hot tier
> > > > > > 
> > > > > > tier DHT: lookup dir cbk gets 0 / 117 from cold tier (ok)
> > > > > > 
> > > > > > tier DHT: then goes to self-heal - dht_selfheal_directory
> > > > > > 
> > > > > > tier DHT: dht_selfheal_dir_makedir is called; returns 0;
> > > > > > missing_dirs = 0.
> > > > > > 
> > > > > > fuse apparently retries this, and the process repeats a few times
> > > > > > before failing to the user.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > ----- Original Message -----
> > > > > > > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > > > > > To: "Shyam" <srangana at redhat.com>
> > > > > > > Cc: "Nithya Balachandran" <nbalacha at redhat.com>, "Dan Lambright"
> > > > > > > <dlambrig at redhat.com>, "Susant Palai"
> > > > > > > <spalai at redhat.com>, "Sakshi Bansal" <sabansal at redhat.com>
> > > > > > > Sent: Wednesday, July 15, 2015 4:39:56 AM
> > > > > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > > > > 
> > > > > > > If possible, can we start at 7:00 PM IST? I have to leave by
> > > > > > > 8:15 PM. The discussion might not be over if we start at 7:30 PM.
> > > > > > > 
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Shyam" <srangana at redhat.com>
> > > > > > > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "Nithya
> > > > > > > > Balachandran"
> > > > > > > > <nbalacha at redhat.com>, "Dan Lambright"
> > > > > > > > <dlambrig at redhat.com>, "Susant Palai" <spalai at redhat.com>
> > > > > > > > Sent: Tuesday, July 14, 2015 11:04:01 PM
> > > > > > > > Subject: Discuss: http://review.gluster.org/#/c/11368/
> > > > > > > > 
> > > > > > > > Tier xlator needs some discussion on this change:
> > > > > > > > http://review.gluster.org/#/c/11368/
> > > > > > > > 
> > > > > > > > I think we can leverage tomorrow's Team on demand meeting for
> > > > > > > > the same.
> > > > > > > > 
> > > > > > > > So I request that we convene at 7:30 PM IST for this tomorrow;
> > > > > > > > do let us know if you cannot make it.
> > > > > > > > 
> > > > > > > > If there is a better time, let us know.
> > > > > > > > 
> > > > > > > > Shyam
> > > > > > > > 
> > > > > > >
> > > > > 
> > > > 
> > > 
> > 
> 

