[Gluster-devel] Readdir plus implementation in tier xlator

Mon Apr 18 07:28:29 UTC 2016

Hi All,

Currently we are experiencing some issues with the implementation of
readdirp in data tiering.

Problem statement:

When we do a readdirp, tiering reads entries only from the cold tier.
Since the hashed subvol for all files is set to the cold tier by default,
every file has an entry in the cold tier. Some of these are data files
and the rest are pointer files (T files), which point to the original
file in the hot tier. The motivation behind this implementation was to
improve readdir performance by looking up entries in only one tier. We
also ran into an issue where some files were not listed when using the
default dht_readdirp. This is because dht_readdir reads entries from each
subvol sequentially. Since tiering migrates files frequently, a file
could be migrated off a subvol before the readdir got to it, but after
the readdir had already processed the target subvol, and so it would not
show up in the listing [1].
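
To make the race in [1] concrete, here is a minimal sketch in Python. It
is only a toy model and not Gluster code: the function name, the
dictionary of subvols, and the migration hook are all made up for
illustration.

```python
def sequential_readdir(subvols, order, migration=None):
    """Toy model: read each subvol in turn and concatenate the entries.

    `migration` is a hypothetical (name, src, dst) tuple that is applied
    after the first subvol has been read and before the second is read.
    """
    seen = []
    for index, subvol in enumerate(order):
        seen.extend(sorted(subvols[subvol]))
        if migration is not None and index == 0:
            name, src, dst = migration
            subvols[src].remove(name)
            subvols[dst].add(name)
    return seen


subvols = {"cold": {"a", "b"}, "hot": {"c"}}

# "c" is demoted hot -> cold after cold was already read, but before the
# readdir reaches hot, so neither pass sees it.
listing = sequential_readdir(subvols, ["cold", "hot"], ("c", "hot", "cold"))
print(listing)
```

The file "c" still exists (it now lives in the cold subvol), but it is
absent from the listing because each subvol was read at a different
point in time.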

So for the files residing in the hot tier we fall back to readdir, i.e.,
we don't give a stat for such entries to the application. This is because
the corresponding pointer file in the cold tier won't have a proper stat.
So we force fuse clients to do an explicit lookup/stat for such entries
by setting nodeid to null. Similarly, in the case of native nfs, we mark
such entries as having a stale stat by setting attributes_follow = FALSE.
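
As a rough illustration of the current behaviour, the sketch below
models a cold-tier-only readdirp that drops the stat for T files so the
client has to look them up. The DirEntry type and the tier_readdirp
function here are hypothetical stand-ins, not the real xlator
structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirEntry:
    name: str
    is_linkfile: bool       # True for a T file pointing at the hot tier
    stat: Optional[dict]    # None means "no valid stat, do a lookup"


def tier_readdirp(cold_entries):
    """Toy model: list the cold tier only and strip the stat from T files."""
    result = []
    for entry in cold_entries:
        if entry.is_linkfile:
            # The T file's own stat is not meaningful, so do not return it.
            result.append(DirEntry(entry.name, True, None))
        else:
            result.append(entry)
    return result


cold = [
    DirEntry("a", False, {"size": 10}),
    DirEntry("c", True, {"size": 0}),
]
listing = tier_readdirp(cold)
needs_lookup = [e.name for e in listing if e.stat is None]
print(needs_lookup)
```

In this model the caller has to notice the missing stat and issue its
own lookup, which is what fuse and native nfs are made to do today.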

But the problem comes when we use gf_api, where we don't have any
control over client behavior. So to fix this issue we have to give stat
information for all the entries.

Possible solutions:
1. Revert the tier_readdirp to something similar to dht_readdirp, then
fix the problem in [1].
2. Have the tier readdirp do a lookup for every linkfile entry it finds
and populate the data (which would cause a performance drop). This would
mean that other translators do not need to be aware of the tier behaviour.
3. Do some sort of batched lookup in the tier readdirp layer to improve
the performance.

Neither 2 nor 3 gives any performance benefit, but both solve the problem
in [1]. In fact, this is also not complete, because by the time we do the
lookup (batched or single), the file could have moved from the hot tier
to the cold tier or vice versa, which will again result in stale data.
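
To compare options 2 and 3, here is a small Python sketch that counts
the lookup round trips in each case. Everything in it is an assumption
made for illustration: the (name, stat) listing format, the hot_lookup
helper, and the batch size are not taken from the tier xlator.

```python
def resolve_per_entry(cold_listing, hot_lookup):
    """Option 2 (toy model): one lookup per linkfile entry."""
    result, calls = {}, 0
    for name, stat in cold_listing:
        if stat is None:                      # None marks a T file
            stat = hot_lookup([name])[name]
            calls += 1
        result[name] = stat
    return result, calls


def resolve_batched(cold_listing, hot_lookup, batch_size=2):
    """Option 3 (toy model): collect linkfiles and look them up in batches."""
    result = {name: stat for name, stat in cold_listing if stat is not None}
    pending = [name for name, stat in cold_listing if stat is None]
    calls = 0
    for start in range(0, len(pending), batch_size):
        result.update(hot_lookup(pending[start:start + batch_size]))
        calls += 1
    return result, calls


# Hypothetical hot tier contents and lookup helper.
hot = {"c": {"size": 30}, "d": {"size": 40}, "e": {"size": 50}}


def hot_lookup(names):
    return {name: hot[name] for name in names}


cold_listing = [("a", {"size": 10}), ("c", None), ("d", None), ("e", None)]
per_entry, per_entry_calls = resolve_per_entry(cold_listing, hot_lookup)
batched, batched_calls = resolve_batched(cold_listing, hot_lookup)
print(per_entry_calls, batched_calls)
```

Both variants return the same stats, and batching only reduces the
number of round trips. Neither closes the window in which a file can
migrate between the readdir and the lookup.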

4. Revert to dht_readdirp and then, instead of taking all entries from
the hot tier, take only the entries which have a T file in the cold tier.
(We can delay deleting the data file after demotion, so that we will
still get the stat from the hot tier.)
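
A minimal sketch of how option 4 might merge the two listings is given
below. It is written in Python for brevity, and the dictionaries and the
merged_readdirp name are hypothetical; the real change would of course
live in the readdirp callback path.

```python
def merged_readdirp(cold_entries, hot_entries):
    """Toy model of option 4.

    cold_entries: name -> stat, with None marking a T file.
    hot_entries:  name -> stat for data files on the hot tier.
    """
    listing = {}
    for name, stat in cold_entries.items():
        if stat is not None:
            listing[name] = stat                 # data file on cold tier
        elif name in hot_entries:
            listing[name] = hot_entries[name]    # T file: use hot tier stat
        else:
            listing[name] = None                 # no hot copy seen; needs lookup
    # Hot tier entries without a T file in the cold tier are ignored.
    return listing


cold = {"a": {"size": 10}, "c": None}
hot = {"c": {"size": 30}, "x": {"size": 99}}
listing = merged_readdirp(cold, hot)
print(listing)
```

Here the cold tier decides which names are listed, and the hot tier is
consulted only to fill in the stat for names that have a T file.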

Please reply with your valuable suggestions/concerns. We would be glad
to look into any other possible solutions for addressing the issue.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1283923

Regards,
Rafi KC