[Gluster-devel] Readdir plus implementation in tier xlator

Sat Apr 23 02:20:08 UTC 2016

----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "Vijay Bellur" <vbellur at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Friday, April 22, 2016 11:34:35 PM
> Subject: Re: [Gluster-devel] Readdir plus implementation in tier xlator
> 
> 
> 
> ----- Original Message -----
> > From: "Vijay Bellur" <vbellur at redhat.com>
> > To: "Mohammed Rafi K C" <rkavunga at redhat.com>
> > Cc: "Gluster Devel" <gluster-devel at gluster.org>
> > Sent: Friday, April 22, 2016 9:41:34 AM
> > Subject: Re: [Gluster-devel] Readdir plus implementation in tier xlator
> > 
> > On Mon, Apr 18, 2016 at 3:28 AM, Mohammed Rafi K C <rkavunga at redhat.com>
> > wrote:
> > >
> > > Hi All,
> > >
> > > Currently we are experiencing some issues with the implementation of
> > > readdirp in data tiering.
> > >
> > > Problem statement:
> > >
> > > When we do a readdirp, tiering reads entries only from cold tier. Since
> > > the hashed subvol for all files has been set as cold tier by default we
> > > will have all the files in cold tier. Some of them will be data files
> > > and remaining will be pointer files(T files), which points to original
> > > file in hot tier. The motivation behind this implementation was to
> > > increase the performance of readdir by only looking up entries in one
> > > tier. Also we ran into an issue where some files were not listed while
> > > using the default dht_readdirp. This is because dht_readdir reads
> > > entries from each subvol sequentially. Since tiering migrates files
> > > frequently this led to an issue where if a file was migrated off a
> > > subvol before the readdir got to it, but after the readdir had processed
> > > the target subvol, it would not show up in the listing [1].
> 
> IIRC, missing of some files in directory listing was the primary motivation
> for existing implementation of tier_readdirp rather than performance. With
> cold tier being some sort of MDS (at least for all dentries of a directory),
> we don't have to address lots of complexity that comes with frequent
> migration of files across subvols.
> 
> > >
> > > So for the files residing in hot tier we will fallback to readdir i.e,
> > > we won't give stat for such entries to application. This is because the
> > > corresponding pointer file in cold tier won't be having a proper stat.
> > > > So we forced fuse clients to do an explicit lookup/stat for such entries
> > > by setting nodeid as null. Similarly in case of native nfs, we marked
> > > such entries as stale stat by setting attributes_follow = FALSE.
> > >
> > 
> > Is the explicit lookup done by the kernel fuse module or is it done in
> > our bridge layer?
> > 
> > Also does md-cache handle the case where nodeid is NULL in a readdirp
> > response?
> > 
> > 
> > 
> > > But the problem comes when we use gf_api, where we don't have any
> > > control over client behavior. So to fix this issue we have to give stat
> > > information for all the entries.
> > >
> > 
> > Apart from Samba, what other consumers of gfapi have this problem?
> > 
> > 
> > > Possible solutions:
> > > > 1. Revert the tier_readdirp to something similar to dht_readdirp, then
> > > fix problem in [1].
> > > 2. Have the tier readdirp do a lookup for every linkfile entry it finds
> > > and populate the data (which would cause a performance drop). This would
> > > mean that other translators do not need to be aware of the tier
> > > behaviour.
> > > 3. Do some sort of batched lookup in the tier readdirp layer to improve
> > > the performance.
> > >
> > > Both 2 and 3 won't give any performance benefit, but solve the problem
> > > > in [1]. In fact this is also not complete, because when we do the lookup
> > > > (batched or single), by that time the file could have moved from the hot
> > > > tier or vice versa, which will again result in stale data.
> 
> Doesn't this mean lookup (irrespective of whether it's done independently or as
> part of readdirp) in tier (for that matter dht_lookup during rebalance) is
> broken? The file is present in the volume, but lookup returns ENOENT.
> Probably we should think about ways of fixing that. I cannot think of a
> solution right now as not finding a data file even after finding a linkto
> file is a valid scenario (imagine a lookup racing with unlink). But
> nevertheless, this is something that needs to be fixed.

Note that at any instant there are three possibilities:
1. A data file is present on exactly one of the hot and cold tiers. This condition holds for most of the lifetime of a file.
2. Two data files are present, one on each of the hot and cold tiers. This is a small race window at the end of migration, but within it both files are equal in all respects (at least in terms of the major attributes of iatt), so that's not an issue.
3. No data file is present because somebody unlinked it.

So, for 1 and 2, if we do a lookup "simultaneously" on both the hot and cold tiers, using only the gfid we got from reading the entry from the cold tier, there is theoretically no way of missing the file. In practice, making the lookup hit both tiers at exactly the same instant is tricky, but migration itself also takes some finite time. So, for most practical use cases, winding a nameless lookup to both the hot and cold tiers in parallel (like dht_discover) should solve the problem.
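
To make the idea concrete, here is a rough sketch in Python rather than xlator C. It is only a toy model and not GlusterFS code: Tier, nameless_lookup, discover and the "linkto" flag are hypothetical names invented for illustration. It shows a gfid-only lookup sent to both tiers in parallel, with linkto entries ignored so that only a data file counts as a hit.

```python
from concurrent.futures import ThreadPoolExecutor


class Tier:
    """Hypothetical stand-in for one tier subvolume: maps gfid -> stat dict."""

    def __init__(self, name, files=None):
        self.name = name
        self.files = dict(files or {})

    def nameless_lookup(self, gfid):
        # Returns the stat for the gfid, or None (think ENOENT).
        return self.files.get(gfid)


def discover(gfid, hot, cold):
    """Look up gfid on both tiers in parallel.

    Returns (tier_name, stat) for a data file, or None if no tier has one.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = [(t.name, pool.submit(t.nameless_lookup, gfid))
                   for t in (hot, cold)]
        replies = [(name, fut.result()) for name, fut in pending]

    # A linkto (T) file on the cold tier is not a data file, so skip it.
    hits = [(name, st) for name, st in replies
            if st is not None and not st.get("linkto", False)]

    # Possibility 2: both tiers answer with equal stats, so either will do.
    return hits[0] if hits else None


hot = Tier("hot", {"g1": {"size": 10}, "g2": {"size": 5}})
cold = Tier("cold", {
    "g1": {"size": 0, "linkto": True},  # pointer file for g1
    "g2": {"size": 5},                  # end-of-migration window
    "g3": {"size": 7},
})
```

With these sample tiers, g1 resolves to the hot tier even though the cold tier holds its linkto file, g2 resolves from either copy, g3 resolves to the cold tier, and an unlinked gfid returns None. The sketch does not model the remaining race where a file moves between the two lookups being served.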

> 
> > >
> > 
> > Isn't this problem common with any of the solutions? Since tiering
> > keeps moving data without any of the clients being aware, any
> > attribute cache in the client stack can quickly go stale.
> > 
> > 
> > > 4. Revert to dht_readdirp and then instead of taking all entries from
> > > > hot tier, just take only entries which have a T file in cold tier.
> 
> I thought with existing model of cold tier being hashed subvol for all files,
> hot tier will only have data-files with linkto files being present on cold
> tier. Am I missing anything here?
> 
> > (We can
> > > delay deleting of data file after demotion, so that we will get the stat
> > > from hot tier)
> > >
> > 
> > Going by the architectural model of xlators, tier should provide the
> > right entries with attributes to the upper layers (xlators/vfs etc.).
> > Relying on a specific behavior from layers above us to mask a problem
> > in our layer does not seem ideal.  I would go with something like 2 or
> > 3.  If we want to retain the current behavior, we should make it
> > conditional as I am not certain that this behavior is foolproof too.
> > 
> > Thanks,
> > Vijay
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> > 