[Gluster-devel] Readdir d_off encoding

Thu Dec 18 16:11:50 UTC 2014

On 12/17/2014 05:04 AM, Xavier Hernandez wrote:
> Just to consider all possibilities...
>
> The current architecture needs to create the entire directory structure on
> all bricks, and has the big problem that each directory in each brick will
> store the files in a different order and with different d_off values.

I gather that this is when EC or AFR is in place, as for DHT a file is 
on one brick only.

>
> This is a serious scalability issue and has many inconveniences when
> trying to heal or detect inconsistencies between bricks (basically we
> would need to read the full directory contents of each brick to compare them).

I am not quite familiar with EC so pardon the ignorance.
Why/How does d_off play a role in this healing/crawling?

>
> An alternative would be to convert directories into regular files from
> the brick point of view.
>
> The benefits of this would be:
>
> * d_off would be controlled by gluster, so all bricks would have the
> same d_off and order. No need to use any d_off mapping or transformation.
>
> * Directories could take advantage of replication and disperse self-heal
> procedures. They could be treated as files and be healed more easily. A
> corrupted brick would not produce invalid directory contents, and file
> duplication in directory listing would be avoided.
>
> * Many of the complexities in DHT, AFR and EC to manage directories
> would be removed.
>
> The main issue could be the need for an upper level xlator that would
> transform directory requests into file modifications and would be
> responsible for managing all d_off assignment and directory manipulation
> (renames, links, unlinks, ...).
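
To make the "d_off controlled by gluster" idea a bit more concrete, here is a
minimal Python sketch of a directory kept as an ordered list of entries. The
class and method names (DirFile, link, unlink, readdir) are hypothetical and
only illustrate the proposal quoted above; nothing like this exists in the
current code. The assumption is that every brick applies the same operations
in the same order, so every brick ends up with the same d_off values.

```python
class DirFile:
    """Hypothetical directory stored as a regular file of (d_off, name) records."""

    def __init__(self):
        self._next_off = 1      # d_off values are assigned by us, not by the backend FS
        self._entries = {}      # d_off -> entry name

    def link(self, name):
        """Add an entry and return the d_off assigned to it."""
        off = self._next_off
        self._next_off += 1
        self._entries[off] = name
        return off

    def unlink(self, name):
        """Remove an entry; the d_off values of other entries do not change."""
        for off, entry in self._entries.items():
            if entry == name:
                del self._entries[off]
                return
        raise FileNotFoundError(name)

    def readdir(self, after=0):
        """Return entries whose d_off is greater than `after`, in d_off order."""
        return [(off, name) for off, name in sorted(self._entries.items())
                if off > after]


# Two "bricks" that receive the same operations produce identical listings.
brick_a, brick_b = DirFile(), DirFile()
for brick in (brick_a, brick_b):
    for name in ("a", "b", "c"):
        brick.link(name)
    brick.unlink("b")
```

With this layout a readdir can be resumed on any brick with a d_off obtained
from another brick, which is what removes the need for a d_off transformation.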

This is tending towards some thoughts for Gluster 4.0 and specifically 
DHT in 4.0. I am going to wait for the same/similar comments as we 
discuss those specifics (hopefully published before Christmas (2014)).

>
> Xavi
>
> On 12/16/2014 03:06 AM, Anand Avati wrote:
>> Replies inline
>>
>> On Mon Dec 15 2014 at 12:46:41 PM Shyam <srangana at redhat.com> wrote:
>>
>>     With the changes present in [1] and [2],
>>
>>     A short explanation of the change would be, we encode the subvol ID in
>>     the d_off, losing 'n + 1' bits in case the high order n+1 bits of the
>>     d_off returned by the underlying xlator are not free. (Best to read the
>>     commit message for [1] :) )
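
As a rough illustration of the scheme summarized above, here is a minimal
Python sketch. It is not the code from [1]; the function names, the 63-bit
limit, and the use of the top bit as a "lossy" flag are assumptions made for
this example. It only shows the idea: if the high order n+1 bits are free the
d_off is shifted and nothing is lost, otherwise the low n+1 bits are dropped.

```python
MAX_BITS = 63  # assumption: keep the result non-negative as a signed 64-bit value
LOSSY_FLAG = 1 << (MAX_BITS - 1)


def bits_for(subvol_count):
    """Number of bits (n) needed to hold a subvol ID."""
    return max(1, (subvol_count - 1).bit_length())


def encode_d_off(d_off, subvol_id, subvol_count):
    n = bits_for(subvol_count)
    if not 0 <= subvol_id < subvol_count:
        raise ValueError("subvol_id out of range")
    top_mask = ((1 << (n + 1)) - 1) << (MAX_BITS - (n + 1))
    if d_off & top_mask == 0:
        # High n+1 bits are free: shift up and store the ID, nothing is lost.
        return (d_off << n) | subvol_id
    # High bits are in use: drop the low n+1 bits and mark the value as lossy.
    return LOSSY_FLAG | ((d_off >> (n + 1)) << n) | subvol_id


def decode_d_off(encoded, subvol_count):
    n = bits_for(subvol_count)
    subvol_id = encoded & ((1 << n) - 1)
    if encoded & LOSSY_FLAG:
        # Low n+1 bits of the original offset come back as zeroes.
        d_off = ((encoded & ~LOSSY_FLAG) >> n) << (n + 1)
    else:
        d_off = encoded >> n
    return d_off, subvol_id


small = 0x1234                      # XFS-like offset, high bits free
large = (1 << 62) | 0xABCDEF        # Ext4-like offset, high bits in use
small_rt = decode_d_off(encode_d_off(small, 2, 4), 4)
large_rt = decode_d_off(encode_d_off(large, 3, 4), 4)
```

Here small_rt gets the original offset back exactly, while large_rt gets the
offset with its low 3 bits (n + 1, for 4 subvols) cleared.
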
>>
>>     Although not related to the latest patch, here is something to consider
>>     for the future:
>>
>>     We now have DHT, AFR, EC(?), DHT over DHT (Tier) which need subvol
>>     encoding in the returned readdir offset. Due to this, the loss in bits
>>     _may_ cause unwanted offset behavior when used in the current scheme,
>>     as we would end up eating more bits than we do at present.
>>
>>     Or IOW, we could be invalidating the assumption "both EXT4/XFS are
>>     tolerant in terms of the accuracy of the value presented
>>     back in seekdir().
>>
>>
>> XFS has not been a problem, since it always returns 32bit d_off. With
>> Ext4, it has been noted that it is tolerant to sacrificing the lower
>> bits in accuracy.
>>
>>     i.e, a seekdir(val) actually seeks to the entry which
>>     has the "closest" true offset."
>>
>>     Should we reconsider an in memory _cookie_ like approach that can help
>>     in this case?
>>
>>     It would invalidate (some or all, based on the implementation) the
>>     following constraints that the current design resolves (from [1]):
>>     - Nothing to "remember in memory" or evict "old entries".
>>     - Works fine across NFS server reboots and also NFS head failover.
>>     - Tolerant to seekdir() to arbitrary locations.
>>
>>     But, would provide a more reliable readdir offset for use (when valid
>>     and not evicted, say).
>>
>>     How would NFS adapt to this? Does Ganesha need a better scheme when
>>     doing multi-head NFS fail over?
>>
>>
>> Ganesha just offloads the responsibility to the FSAL layer to give
>> stable dir cookies (as it rightly should)
>>
>>
>>     Thoughts?
>>
>>
>> I think we need to analyze the actual assumption/problem here.
>> Remembering things in memory comes with the limitations you note above,
>> and may, after all, still not be necessary. Let's look at the two
>> approaches taken:
>>
>> - Small backend offsets: like XFS, the offsets fit in 32bits, and we are
>> left with another 32bits of freedom to encode what we want. There is no
>> problem here until our nested encoding requirements cross 32bits of
>> space. So let's ignore this for now.
>>
>> - Large backend offsets: Ext4 being the primary target. Here we observe
>> that the backend filesystem is tolerant to sacrificing the accuracy of
>> lower bits. So we overwrite the lower bits with our subvolume encoding
>> information, and the number of bits used to encode is implicit in the
>> subvolume cardinality of that translator. While this works fine with a
>> single transformation, it is clearly a problem when the transformation
>> is nested with the same algorithm. The reason is quite simple: while the
>> lower bits were disposable when the cookie was taken fresh from Ext4,
>> once transformed the same lower bits are now "holy" and cannot be
>> overwritten carelessly, at least without dire consequences. The higher
>> level xlators need to take up the "next higher bits", past the previous
>> transformation boundary, to encode the next subvolume information. Once
>> the d_off transformation algorithms are fixed to give such due "respect"
>> to the lower layer's transformation and use a different real estate, we
>> might actually notice that the problem may not need such a deep redesign
>> after all.
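
To make the point about nested transformations concrete, here is a small
Python sketch. The helper names (overwrite_bits, read_bits), the example
offset, and the layer layout (an inner 2-way layer under an outer 4-way layer)
are made up for illustration; this is not the code from [1] or [2]. It shows
an outer layer clobbering the inner layer's bits when both use the lowest
bits, and the same outer layer preserving them when it takes the next higher
bits instead.

```python
def overwrite_bits(value, field, shift, width):
    """Replace `width` bits of `value`, starting at bit `shift`, with `field`."""
    mask = ((1 << width) - 1) << shift
    return (value & ~mask) | (field << shift)


def read_bits(value, shift, width):
    """Read back `width` bits of `value`, starting at bit `shift`."""
    return (value >> shift) & ((1 << width) - 1)


backend_off = 0x5F3A9C0012345678   # example large (Ext4-like) d_off

# Inner layer (2 subvols, 1 bit) writes its subvol ID into bit 0.
inner = overwrite_bits(backend_off, 1, shift=0, width=1)

# Naive outer layer (4 subvols, 2 bits) also starts at bit 0 and
# overwrites the inner layer's ID.
naive = overwrite_bits(inner, 2, shift=0, width=2)
naive_inner_id = read_bits(naive, shift=0, width=1)

# Layered outer layer starts past the inner layer's boundary (bit 1),
# so both IDs survive.
layered = overwrite_bits(inner, 2, shift=1, width=2)
layered_inner_id = read_bits(layered, shift=0, width=1)
layered_outer_id = read_bits(layered, shift=1, width=2)
```

In the naive case the inner ID reads back as 0 instead of 1. In the layered
case both IDs are recovered, and the value still differs from the backend
offset only in its lowest 3 bits, which is the part the backend is assumed to
tolerate losing.
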
>>
>> Hope that helps
>> Thanks
>>
>>     Shyam
>>     [1] http://review.gluster.org/#/c/4711/
>>     [2] http://review.gluster.org/#/c/8201/
>>     _________________________________________________
>>     Gluster-devel mailing list
>>     Gluster-devel at gluster.org
>>     http://supercolony.gluster.org/mailman/listinfo/gluster-devel