[Gluster-devel] Readdir d_off encoding

Shyam srangana at redhat.com
Tue Dec 16 21:22:20 UTC 2014


On 12/16/2014 03:21 PM, J. Bruce Fields wrote:
> On Tue, Dec 16, 2014 at 11:46:46AM -0500, Shyam wrote:
>> On 12/15/2014 09:06 PM, Anand Avati wrote:
>>> Replies inline
>>>
>>> On Mon Dec 15 2014 at 12:46:41 PM Shyam <srangana at redhat.com
>>> <mailto:srangana at redhat.com>> wrote:
>>>
>>>     With the changes present in [1] and [2],
>>>
>>>     A short explanation of the change would be, we encode the subvol ID in
>>>     the d_off, losing 'n + 1' bits in case the high order n+1 bits of the
>>>     underlying xlator returned d_off is not free. (Best to read the commit
>>>     message for [1] :) )
>>>
>>>     Although not related to the latest patch, here is something to consider
>>>     for the future:
>>>
>>>     We now have DHT, AFR, EC(?), DHT over DHT (Tier) which need subvol
>>>     encoding in the returned readdir offset. Due to this, the loss in bits
>>>     _may_ cause unwanted offset behavior, when used in the current scheme.
>>>     As we would end up eating more bits than what we do at present.
>>>
>>>     Or IOW, we could be invalidating the assumption "both EXT4/XFS are
>>>     tolerant in terms of the accuracy of the value presented
>>>     back in seekdir().
>>>
>>>
>>> XFS has not been a problem, since it always returns 32bit d_off. With
>>> Ext4, it has been noted that it is tolerant to sacrificing the lower
>>> bits in accuracy.
>>>
>>>     i.e, a seekdir(val) actually seeks to the entry which
>>>     has the "closest" true offset."
>>>
>>>     Should we reconsider an in memory _cookie_ like approach that can help
>>>     in this case?
>>>
>>>     It would invalidate (some or all based on the implementation) the
>>>     following constraints that the current design resolves, (from, [1])
>>>     - Nothing to "remember in memory" or evict "old entries".
>>>     - Works fine across NFS server reboots and also NFS head failover.
>>>     - Tolerant to seekdir() to arbitrary locations.
>>>
>>>     But, would provide a more reliable readdir offset for use (when valid
>>>     and not evicted, say).
>>>
>>>     How would NFS adapt to this? Does Ganesha need a better scheme when
>>>     doing multi-head NFS fail over?
>>>
>>>
>>> Ganesha just offloads the responsibility to the FSAL layer to give
>>> stable dir cookies (as it rightly should)
>>>
>>>
>>>     Thoughts?
>>>
>>>
>>> I think we need to analyze the actual assumption/problem here.
>>> Remembering things in memory comes with the limitations you note above,
>>> and may after all, still not be necessary. Let's look at the two
>>> approaches taken:
>>>
>>> - Small backend offsets: like XFS, the offsets fit in 32bits, and we are
>>> left with another 32bits of freedom to encode what we want. There is no
>>> problem here until our nested encoding requirements cross 32bits of
>>> space. So let's ignore this for now.
>>>
>>> - Large backend offsets: Ext4 being the primary target. Here we observe
>>> that the backend filesystem is tolerant to sacrificing the accuracy of
>>> lower bits. So we overwrite the lower bits with our subvolume encoding
>>> information, and the number of bits used to encode is implicit in the
>>> subvolume cardinality of that translator. While this works fine with a
>>> single transformation, it is clearly a problem when the transformation
>>> is nested with the same algorithm. The reason is quite simple: while the
>>> lower bits were disposable when the cookie was taken fresh from Ext4,
>>> once transformed the same lower bits are now "holy" and cannot be
>>> overwritten carelessly, at least without dire consequences. The higher
>>> level xlators need to take up the "next higher bits", past the previous
>>> transformation boundary, to encode the next subvolume information. Once
>>> the d_off transformation algorithms are fixed to give such due "respect"
>>> to the lower layer's transformation and use a different real estate, we
>>> might actually notice that the problem may not need such a deep redesign
>>> after all.
>>
>> Agreed, my lack of understanding though is how may bits can be
>> sacrificed for ext4? I do not have that data, any pointers there
>> would help. (did go through https://lwn.net/Articles/544520/ but
>> that does not have the tolerance information in it)
>>
>> Here is what I have as the current bits lost based on the following
>> volume configuration,
>> - 2 Tiers (DHT over DHT)
>> - 128 subvols per DHT
>> - Each DHT instance is either AFR or EC subvolumes, with 2 replicas
>> and say 6 bricks per EC instance
>>
>> So EC side of the subvol needs log(2)6 (EC) + log(2)128 (DHT) +
>> log(2)2 (Tier) = 3 + 7 + 1, or 11 bits of the actual d_off used to
>> encode the volume, +1 for the high order bit to denote the encoding.
>> (AFR would have 1 bit less, so we can consider just the EC side of
>> things for the maximum loss computation at present)
>>
>> Is 12 bits still a tolerable loss for ext4? Or, till how many bits
>> can we still use the current scheme?
>
> To a first approximation, assume ext4's offsets are random numbers (it's
> actually some kind of weak cryptographic hash).  So in a directory with
> n entries, you're basically picking n random numbers out of a hat and
> hoping you never pick the same number twice.
>
> So following preshing.com/20110504/hash-collision-probabilities, the
> chance that two entries in an n-entry directory will have the same b-bit
> cookie are (for small n/2^b), roughly
>
> 	n^2 / 2^(b+1)
>
> I think 52 bits is still a lot.  Even on million-entry directories I
> only get a one-in-ten-thousand chance.   But play with the numbers and
> make sure I haven't messed up somewhere.

Well I did the (different) math on the following premise, and here is 
what I have,

For even distribution of hashes in a given range	
n	hash values generated (directory entries)
N	maximum value in the hash range
s	Separation distance of 2 values, (N/n)
b	potential bits lost, but separation still preserved, floor(log(s)(base 2))

Basically as we are losing some bits to our encoding (lower bits of the 
d_off), given a 'k' (as the d_off) finding the nearest hash that it 
could represent, would mean losing less than 1/2 the bits of the 
distance separating 2 hash values (assuming even distribution of hash 
values). Which is what 'b' is above.

With this and N set to represent a 64 bit range, we would need > ~16 
million entries, before losing 20 bits of information starts mattering.

Considering this, it seems safe enough against ext4 (assuming the hash 
function that is used has the property of value distribution evenly 
across the range (at least to a point)) and assuming we even grow to 1/2 
a million bricks in a gluster volume (~2^19).

Alert: Above math needs checking and verification

>
> Anyway, a longer-term approach might be to fix the NFS protocol's
> reliance on stable directory cookies, but of course then you have to
> wait till the NFS clients get upgraded.

Also, other file systems may choose to use the d_off cookie differently, 
but I assume that is handled when it appears on the horizon.

>
> --b.

Thanks,
Shyam

>
>>
>> If we move to 1000/10000 node gluster in 4.0, assuming everything
>> remains the same except DHT, we need an additional 3-5 bits for the
>> DHT subvol encoding. Would this still survive the ext4 encoding
>> scheme for d_off?
>>
>>>
>>> Hope that helps
>>> Thanks
>>>
>>>     Shyam
>>>     [1] http://review.gluster.org/#/c/__4711/
>>>     <http://review.gluster.org/#/c/4711/>
>>>     [2] http://review.gluster.org/#/c/__8201/
>>>     <http://review.gluster.org/#/c/8201/>
>>>     _________________________________________________
>>>     Gluster-devel mailing list
>>>     Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
>>>     http://supercolony.gluster.__org/mailman/listinfo/gluster-__devel
>>>     <http://supercolony.gluster.org/mailman/listinfo/gluster-devel>
>>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://supercolony.gluster.org/mailman/listinfo/gluster-devel


More information about the Gluster-devel mailing list