[Gluster-devel] Readdir d_off encoding

Tue Dec 16 02:06:36 UTC 2014

Replies inline

On Mon Dec 15 2014 at 12:46:41 PM Shyam <srangana at redhat.com> wrote:

> This is about the changes present in [1] and [2].
>
> A short explanation of the change: we encode the subvol ID in the d_off,
> losing 'n + 1' bits when the high-order n + 1 bits of the d_off returned
> by the underlying xlator are not free. (Best to read the commit message
> for [1] :) )
>
> Although not related to the latest patch, here is something to consider
> for the future:
>
> We now have DHT, AFR, EC(?), and DHT over DHT (Tier), all of which need
> subvol encoding in the returned readdir offset. Stacked under the current
> scheme, they would eat more bits than we do at present, and that loss
> _may_ cause unwanted offset behavior.
>
> Or IOW, we could be invalidating the assumption "both EXT4/XFS are
> tolerant in terms of the accuracy of the value presented
> back in seekdir().

XFS has not been a problem, since it always returns 32-bit d_off values.
Ext4 has been observed to tolerate losing accuracy in the lower bits.
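
As an illustration of that tolerance, here is a minimal Python sketch of a
single-layer transformation that overwrites the low bits of a large backend
offset with a subvolume ID. It is not the code from [1]: the function names,
the sample offset, and the rule for choosing the bit count are assumptions
made for this example.

```python
MASK64 = (1 << 64) - 1  # d_off is treated as an unsigned 64-bit value


def bits_needed(subvol_count):
    """Bits required to number subvol_count subvolumes (at least 1)."""
    return max(1, (subvol_count - 1).bit_length())


def encode_low(backend_off, subvol_id, subvol_count):
    """Overwrite the low bits of backend_off with subvol_id (lossy)."""
    n = bits_needed(subvol_count)
    low_mask = (1 << n) - 1
    if not 0 <= subvol_id <= low_mask:
        raise ValueError("subvol_id does not fit in the reserved bits")
    return ((backend_off & ~low_mask) | subvol_id) & MASK64


def decode_low(d_off, subvol_count):
    """Return (subvol_id, backend offset with its low bits zeroed)."""
    n = bits_needed(subvol_count)
    low_mask = (1 << n) - 1
    return d_off & low_mask, d_off & ~low_mask & MASK64
```

The offset handed back to the backend differs from the original only in the
low n bits, which is the kind of inaccuracy Ext4 has been seen to tolerate.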

> i.e., a seekdir(val) actually seeks to the entry which
> has the "closest" true offset."
>
> Should we reconsider an in memory _cookie_ like approach that can help
> in this case?
>
> It would give up some or all (depending on the implementation) of the
> following properties that the current design provides (from [1]):
> - Nothing to "remember in memory" or evict "old entries".
> - Works fine across NFS server reboots and also NFS head failover.
> - Tolerant to seekdir() to arbitrary locations.
>
> But it would provide a more reliable readdir offset (while the cookie is
> still valid and not evicted, say).
>
> How would NFS adapt to this? Does Ganesha need a better scheme when
> doing multi-head NFS failover?
>

Ganesha just offloads the responsibility of providing stable dir cookies to
the FSAL layer (as it rightly should).

>
> Thoughts?
>
>
I think we need to analyze the actual assumption/problem here. Remembering
things in memory comes with the limitations you note above, and may, after
all, still not be necessary. Let's look at the two approaches taken:

- Small backend offsets: with backends like XFS, the offsets fit in 32 bits,
and we are left with another 32 bits of freedom to encode what we want.
There is no problem here until our nested encoding requirements exceed 32
bits of space. So let's ignore this case for now.

- Large backend offsets: Ext4 is the primary example. Here we observe that
the backend filesystem tolerates sacrificing the accuracy of the lower
bits. So we overwrite the lower bits with our subvolume encoding
information, and the number of bits used is implied by the subvolume count
of that translator. While this works fine with a single transformation, it
is clearly a problem when the transformation is nested using the same
algorithm. The reason is quite simple: while the lower bits were disposable
when the cookie was taken fresh from Ext4, once transformed the same lower
bits are now "holy" and cannot be overwritten carelessly, at least not
without dire consequences. The higher level xlators need to take up the
"next higher bits", past the previous transformation boundary, to encode
the next subvolume information. Once the d_off transformation algorithms
are fixed to give such due "respect" to the lower layer's transformation
and use different real estate, we may find that the problem does not need
such a deep redesign after all.
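
The nesting problem and the suggested fix can be sketched in a few lines of
Python. This is a hypothetical illustration, not the xlator code: encode_at,
decode_at, and the explicit shift argument (how many low bits the layers
below have already claimed) are names and mechanisms assumed for the
example, and how a real translator would learn that boundary is left open.

```python
def bits_needed(subvol_count):
    """Bits required to number subvol_count subvolumes (at least 1)."""
    return max(1, (subvol_count - 1).bit_length())


def encode_at(d_off, subvol_id, subvol_count, shift):
    """Write subvol_id into the bit field that starts at bit `shift`.

    Returns (new_d_off, next_shift), where next_shift is the boundary the
    next layer up must respect.
    """
    n = bits_needed(subvol_count)
    field = ((1 << n) - 1) << shift
    return (d_off & ~field) | (subvol_id << shift), shift + n


def decode_at(d_off, subvol_count, shift):
    """Read back the subvolume ID stored in the field starting at `shift`."""
    n = bits_needed(subvol_count)
    return (d_off >> shift) & ((1 << n) - 1)


backend = 0x7A3F9C5512345677                 # made-up large backend offset
# Lower layer: ID 1 of 2 subvolumes, stored in bit 0.
inner, shift = encode_at(backend, 1, 2, 0)
# Upper layer done right: ID 2 of 4, stored in the next bits up.
outer, _ = encode_at(inner, 2, 4, shift)
# Upper layer done wrong: same algorithm again, starting at bit 0.
clobbered, _ = encode_at(inner, 2, 4, 0)
```

With these values, decode_at(outer, 2, 0) still returns the lower layer's
ID of 1, while decode_at(clobbered, 2, 0) returns 0 because the upper layer
overwrote the bit the lower layer depended on.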

Hope that helps
Thanks

> Shyam
> [1] http://review.gluster.org/#/c/4711/
> [2] http://review.gluster.org/#/c/8201/