[Gluster-devel] Readdir plus implementation in tier xlator

Mohammed Rafi K C rkavunga at redhat.com
Fri Apr 22 05:46:48 UTC 2016


Comments are inline.

On 04/22/2016 09:41 AM, Vijay Bellur wrote:
> On Mon, Apr 18, 2016 at 3:28 AM, Mohammed Rafi K C <rkavunga at redhat.com> wrote:
>> Hi All,
>>
>> Currently we are experiencing some issues with the implementation of
>> readdirp in data tiering.
>>
>> Problem statement:
>>
>> When we do a readdirp, tiering reads entries only from the cold tier.
>> Since the hashed subvol for all files is set to the cold tier by
>> default, every file has an entry in the cold tier. Some of these are
>> data files and the rest are pointer files (T files), which point to
>> the original files in the hot tier. The motivation behind this
>> implementation was to increase readdir performance by looking up
>> entries in only one tier. We had also run into an issue where some
>> files were not listed when using the default dht_readdirp, because
>> dht_readdir reads entries from each subvol sequentially while tiering
>> migrates files frequently: if a file was migrated off a subvol before
>> the readdir got to it, but after the readdir had already processed the
>> target subvol, it would not show up in the listing [1].
>>
>> So for files residing in the hot tier we fall back to readdir, i.e. we
>> don't return stat information for such entries to the application,
>> because the corresponding pointer file in the cold tier won't have a
>> proper stat. Instead, we force fuse clients to do an explicit
>> lookup/stat for such entries by setting the nodeid as null. Similarly,
>> in the case of native NFS, we mark such entries as having stale
>> attributes by setting attributes_follow = FALSE.
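>>
>> Roughly, the fuse side of this looks like the sketch below (not the
>> actual tier code; is_hot_tier_entry() is a made-up helper standing in
>> for the real pointer-file check). The NFS side simply sets
>> attributes_follow = FALSE while marshalling the reply.
>>
>>     gf_dirent_t *entry = NULL;
>>
>>     /* In the tier readdirp callback: dropping the inode for entries
>>      * whose data lives on the hot tier makes fuse-bridge send
>>      * nodeid = 0 for them, which forces the kernel fuse module to
>>      * issue a fresh lookup before trusting the entry. */
>>     list_for_each_entry (entry, &entries->list, list) {
>>             if (is_hot_tier_entry (entry) && entry->inode) {
>>                     inode_unref (entry->inode);
>>                     entry->inode = NULL;  /* nodeid = 0 for fuse */
>>             }
>>     }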
>>
> Is the explicit lookup done by the kernel fuse module or is it done in
> our bridge layer?

It is an explicit lookup done by the kernel fuse module.



>
> Also does md-cache handle the case where nodeid is NULL in a readdirp response?

If entry->inode is set to NULL, readdirp won't cache that entry.
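
Roughly what md-cache does in its readdirp callback (a sketch,
abbreviated from mdc_readdirp_cbk):

    gf_dirent_t *entry = NULL;

    list_for_each_entry (entry, &entries->list, list) {
            if (!entry->inode)
                    continue;   /* no inode => entry is not cached */
            /* otherwise the returned iatt is cached against the inode */
            mdc_inode_iatt_set (this, entry->inode, &entry->d_stat);
    }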

>
>
>
>> But the problem comes when we use gfapi, where we don't have any
>> control over client behavior. So to fix this issue we have to return
>> stat information for all the entries.
>>
> Apart from Samba, what other consumers of gfapi have this problem?

In nfs-ganesha, my understanding is that it does not send readdirp, so
we are good there. But any other application that always expects a
valid stat in a readdirp response will fail.


>
>
>> Possible solutions:
>> 1. Revert tier_readdirp to something similar to dht_readdirp, then fix
>> the problem in [1].
>> 2. Have the tier readdirp do a lookup for every linkfile entry it finds
>> and populate the stat data (which would cause a performance drop; see
>> the sketch after this list). This would mean that other translators do
>> not need to be aware of the tier behaviour.
>> 3. Do some sort of batched lookup in the tier readdirp layer to improve
>> the performance.
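>>
>> A minimal sketch of option 2, assuming a syncop_lookup()-style helper
>> (argument order abbreviated here, it differs across releases) and a
>> made-up tier_is_linkfile() predicate:
>>
>>     gf_dirent_t *entry = NULL;
>>
>>     list_for_each_entry (entry, &entries->list, list) {
>>             loc_t loc = {0, };
>>             struct iatt buf = {0, };
>>
>>             if (!tier_is_linkfile (entry))
>>                     continue;
>>
>>             /* nameless (gfid-based) lookup on the hot tier */
>>             gf_uuid_copy (loc.gfid, entry->d_stat.ia_gfid);
>>             if (syncop_lookup (hot_subvol, &loc, &buf,
>>                                NULL, NULL, NULL) == 0)
>>                     entry->d_stat = buf;
>>             loc_wipe (&loc);
>>     }
>>
>> Option 3 would be the same idea, but winding the lookups in parallel
>> and waiting on the whole batch before unwinding the readdirp.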
>>
>> Neither 2 nor 3 gives any performance benefit, but both solve the
>> problem in [1]. Even then the fix is not complete: by the time we do
>> the lookup (batched or single), the file could already have moved off
>> the hot tier, or vice versa, which will again result in stale data.
>>
> Isn't this problem common with any of the solutions? Since tiering
> keeps moving data without any of the clients being aware, any
> attribute cache in the client stack can quickly go stale.

That is right.
>
>
>> 4. Revert to dht_readdirp, and then instead of taking all entries from
>> the hot tier, take only the entries which have a T file in the cold
>> tier. (We can delay deleting the data file after demotion, so that we
>> still get the stat from the hot tier. dht_readdirp's existing
>> linkto-file filtering is sketched below for reference.)
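>>
>> For context, the dht_readdirp merge that option 4 would build on
>> already hides pointer files, roughly along these lines (abbreviated
>> from dht_readdirp_cbk):
>>
>>     gf_dirent_t *orig_entry = NULL;
>>
>>     list_for_each_entry (orig_entry, &orig_entries->list, list) {
>>             /* linkto files are internal; don't expose them */
>>             if (check_is_linkfile (NULL, &orig_entry->d_stat,
>>                                    orig_entry->dict,
>>                                    conf->link_xattr_name))
>>                     continue;
>>             /* ... copy orig_entry into the outgoing list ... */
>>     }
>>
>> The option-4 twist would be to keep a hot-tier entry only when its
>> cold-tier counterpart is still a T file.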
>>
> Going by the architectural model of xlators, tier should provide the
> right entries with attributes to the upper layers (xlators/vfs etc.).
> Relying on a specific behavior from layers above us to mask a problem
> in our layer does not seem ideal.  I would go with something like 2 or
> 3.  If we want to retain the current behavior, we should make it
> conditional as I am not certain that this behavior is foolproof too.

If we make the changes in tier_readdirp, then it affects the
performance of plain readdir (when md-cache is on); we may need to turn
off the volume option "performance.force-readdirp". What do you think?

Rafi


>
> Thanks,
> Vijay


