[Gluster-devel] Readdir d_off encoding

Tue Dec 16 20:21:17 UTC 2014

On Tue, Dec 16, 2014 at 11:46:46AM -0500, Shyam wrote:
> On 12/15/2014 09:06 PM, Anand Avati wrote:
> >Replies inline
> >
> >On Mon Dec 15 2014 at 12:46:41 PM Shyam <srangana at redhat.com
> ><mailto:srangana at redhat.com>> wrote:
> >
> >    With the changes present in [1] and [2],
> >
> >    A short explanation of the change would be, we encode the subvol ID in
> >    the d_off, losing 'n + 1' bits in case the high order n+1 bits of the
> >    underlying xlator returned d_off is not free. (Best to read the commit
> >    message for [1] :) )
> >
> >    Although not related to the latest patch, here is something to consider
> >    for the future:
> >
> >    We now have DHT, AFR, EC(?), DHT over DHT (Tier) which need subvol
> >    encoding in the returned readdir offset. Due to this, the loss in bits
> >    _may_ cause unwanted offset behavior, when used in the current scheme.
> >    As we would end up eating more bits than what we do at present.
> >
> >    Or IOW, we could be invalidating the assumption "both EXT4/XFS are
> >    tolerant in terms of the accuracy of the value presented
> >    back in seekdir().
> >
> >
> >XFS has not been a problem, since it always returns 32bit d_off. With
> >Ext4, it has been noted that it is tolerant to sacrificing the lower
> >bits in accuracy.
> >
> >    i.e, a seekdir(val) actually seeks to the entry which
> >    has the "closest" true offset."
> >
> >    Should we reconsider an in memory _cookie_ like approach that can help
> >    in this case?
> >
> >    It would invalidate (some or all based on the implementation) the
> >    following constraints that the current design resolves, (from, [1])
> >    - Nothing to "remember in memory" or evict "old entries".
> >    - Works fine across NFS server reboots and also NFS head failover.
> >    - Tolerant to seekdir() to arbitrary locations.
> >
> >    But, would provide a more reliable readdir offset for use (when valid
> >    and not evicted, say).
> >
> >    How would NFS adapt to this? Does Ganesha need a better scheme when
> >    doing multi-head NFS fail over?
> >
> >
> >Ganesha just offloads the responsibility to the FSAL layer to give
> >stable dir cookies (as it rightly should)
> >
> >
> >    Thoughts?
> >
> >
> >I think we need to analyze the actual assumption/problem here.
> >Remembering things in memory comes with the limitations you note above,
> >and may after all, still not be necessary. Let's look at the two
> >approaches taken:
> >
> >- Small backend offsets: like XFS, the offsets fit in 32bits, and we are
> >left with another 32bits of freedom to encode what we want. There is no
> >problem here until our nested encoding requirements cross 32bits of
> >space. So let's ignore this for now.
> >
> >- Large backend offsets: Ext4 being the primary target. Here we observe
> >that the backend filesystem is tolerant to sacrificing the accuracy of
> >lower bits. So we overwrite the lower bits with our subvolume encoding
> >information, and the number of bits used to encode is implicit in the
> >subvolume cardinality of that translator. While this works fine with a
> >single transformation, it is clearly a problem when the transformation
> >is nested with the same algorithm. The reason is quite simple: while the
> >lower bits were disposable when the cookie was taken fresh from Ext4,
> >once transformed the same lower bits are now "holy" and cannot be
> >overwritten carelessly, at least without dire consequences. The higher
> >level xlators need to take up the "next higher bits", past the previous
> >transformation boundary, to encode the next subvolume information. Once
> >the d_off transformation algorithms are fixed to give such due "respect"
> >to the lower layer's transformation and use a different real estate, we
> >might actually notice that the problem may not need such a deep redesign
> >after all.
> 
> Agreed, my lack of understanding though is how may bits can be
> sacrificed for ext4? I do not have that data, any pointers there
> would help. (did go through https://lwn.net/Articles/544520/ but
> that does not have the tolerance information in it)
> 
> Here is what I have as the current bits lost based on the following
> volume configuration,
> - 2 Tiers (DHT over DHT)
> - 128 subvols per DHT
> - Each DHT instance is either AFR or EC subvolumes, with 2 replicas
> and say 6 bricks per EC instance
> 
> So EC side of the subvol needs log(2)6 (EC) + log(2)128 (DHT) +
> log(2)2 (Tier) = 3 + 7 + 1, or 11 bits of the actual d_off used to
> encode the volume, +1 for the high order bit to denote the encoding.
> (AFR would have 1 bit less, so we can consider just the EC side of
> things for the maximum loss computation at present)
> 
> Is 12 bits still a tolerable loss for ext4? Or, till how many bits
> can we still use the current scheme?

To a first approximation, assume ext4's offsets are random numbers (it's
actually some kind of weak cryptographic hash).  So in a directory with
n entries, you're basically picking n random numbers out of a hat and
hoping you never pick the same number twice.

So following preshing.com/20110504/hash-collision-probabilities, the
chance that two entries in an n-entry directory will have the same b-bit
cookie are (for small n/2^b), roughly

	n^2 / 2^(b+1)

I think 52 bits is still a lot.  Even on million-entry directories I
only get a one-in-ten-thousand chance.   But play with the numbers and
make sure I haven't messed up somewhere.

Anyway, a longer-term approach might be to fix the NFS protocol's
reliance on stable directory cookies, but of course then you have to
wait till the NFS clients get upgraded.

--b.

> 
> If we move to 1000/10000 node gluster in 4.0, assuming everything
> remains the same except DHT, we need an additional 3-5 bits for the
> DHT subvol encoding. Would this still survive the ext4 encoding
> scheme for d_off?
> 
> >
> >Hope that helps
> >Thanks
> >
> >    Shyam
> >    [1] http://review.gluster.org/#/c/__4711/
> >    <http://review.gluster.org/#/c/4711/>
> >    [2] http://review.gluster.org/#/c/__8201/
> >    <http://review.gluster.org/#/c/8201/>
> >    _________________________________________________
> >    Gluster-devel mailing list
> >    Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
> >    http://supercolony.gluster.__org/mailman/listinfo/gluster-__devel
> >    <http://supercolony.gluster.org/mailman/listinfo/gluster-devel>
> >
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel