[Gluster-devel] Readdir d_off encoding

Anand Avati avati at gluster.org
Tue Dec 16 18:12:19 UTC 2014


On Tue Dec 16 2014 at 8:46:48 AM Shyam <srangana at redhat.com> wrote:

> On 12/15/2014 09:06 PM, Anand Avati wrote:
> > Replies inline
> >
> > On Mon Dec 15 2014 at 12:46:41 PM Shyam <srangana at redhat.com
> > <mailto:srangana at redhat.com>> wrote:
> >
> >     With the changes present in [1] and [2],
> >
> >     A short explanation of the change: we encode the subvol ID in the
> >     d_off, losing n + 1 bits when the high-order n + 1 bits of the
> >     d_off returned by the underlying xlator are not free. (Best to
> >     read the commit message for [1] :) )
> >
> >     Although not related to the latest patch, here is something to
> consider
> >     for the future:
> >
> >     We now have DHT, AFR, EC(?), and DHT over DHT (Tier), all of which
> >     need subvol encoding in the returned readdir offset. The resulting
> >     loss of bits _may_ cause unwanted offset behavior under the
> >     current scheme, as we would end up eating more bits than we do at
> >     present.
> >
> >     Or IOW, we could be invalidating the assumption "both EXT4/XFS are
> >     tolerant in terms of the accuracy of the value presented
> >     back in seekdir().
> >
> >
> > XFS has not been a problem, since it always returns a 32-bit d_off.
> > With Ext4, it has been noted that it is tolerant to sacrificing
> > accuracy in the lower bits.
> >
> >     i.e., a seekdir(val) actually seeks to the entry which
> >     has the "closest" true offset."
> >
> >     Should we reconsider an in-memory _cookie_-like approach that can
> >     help in this case?
> >
> >     It would invalidate (some or all, depending on the implementation)
> >     the following constraints that the current design satisfies (from
> >     [1]):
> >     - Nothing to "remember in memory" or evict "old entries".
> >     - Works fine across NFS server reboots and also NFS head failover.
> >     - Tolerant to seekdir() to arbitrary locations.
> >
> >     But it would provide a more reliable readdir offset for use (when
> >     valid and not evicted, say).
> >
> >     How would NFS adapt to this? Does Ganesha need a better scheme for
> >     multi-head NFS failover?
> >
> >
> > Ganesha just offloads the responsibility to the FSAL layer to give
> > stable dir cookies (as it rightly should)
> >
> >
> >     Thoughts?
> >
> >
> > I think we need to analyze the actual assumption/problem here.
> > Remembering things in memory comes with the limitations you note above,
> > and may, after all, not be necessary. Let's look at the two approaches
> > taken:
> >
> > - Small backend offsets: with XFS, the offsets fit in 32 bits, and we
> > are left with another 32 bits of freedom to encode what we want. There
> > is no problem here until our nested encoding requirements cross 32 bits
> > of space. So let's ignore this for now.
> >
> > - Large backend offsets: Ext4 being the primary target. Here we observe
> > that the backend filesystem is tolerant to sacrificing the accuracy of
> > the lower bits. So we overwrite the lower bits with our subvolume
> > encoding information, and the number of bits used to encode is implicit
> > in the subvolume cardinality of that translator. While this works fine
> > with a single transformation, it is clearly a problem when the
> > transformation is nested with the same algorithm. The reason is quite
> > simple: while the lower bits were disposable when the cookie was taken
> > fresh from Ext4, once transformed, those same lower bits are now "holy"
> > and cannot be overwritten carelessly without dire consequences. The
> > higher-level xlators need to take up the "next higher bits", past the
> > previous transformation boundary, to encode the next subvolume
> > information. Once the d_off transformation algorithms are fixed to give
> > such due "respect" to the lower layer's transformation and use
> > different real estate, we might notice that the problem does not need
> > such a deep redesign after all.
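(To make the lower-bit transform described above concrete, here is a
rough sketch in C. This is not the actual code from [1]/[2]; the helper
names and the choice of the top bit as the "encoded" marker are
illustrative assumptions:)

    #include <stdint.h>

    /* ceil(log2(cnt)): how many low bits a subvol id needs */
    static int bits_for(int cnt)
    {
            int bits = 0;
            while ((1 << bits) < cnt)
                    bits++;
            return bits;
    }

    /* Mark the d_off as carrying an encoding (top bit) and overwrite
     * its n low bits with the subvol id. */
    static uint64_t doff_encode(uint64_t d_off, int subvol_id, int subvol_cnt)
    {
            int n = bits_for(subvol_cnt);
            return (1ULL << 63) | (d_off & ~((1ULL << n) - 1)) |
                   (uint64_t)subvol_id;
    }

    /* Recover the subvol id; the offset handed back to the backend has
     * its low bits zeroed, which is where Ext4's "closest offset"
     * tolerance is relied upon. */
    static void doff_decode(uint64_t d_off, int subvol_cnt,
                            uint64_t *backend_doff, int *subvol_id)
    {
            int n = bits_for(subvol_cnt);
            *subvol_id    = (int)(d_off & ((1ULL << n) - 1));
            *backend_doff = d_off & ~(1ULL << 63) & ~((1ULL << n) - 1);
    }

This also shows why naive nesting breaks: a second layer calling
doff_encode() again would clobber the first layer's subvol id, so it
must place its own id in the bits just above the first layer's
transformation boundary instead.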
>
> Agreed; what I do not understand, though, is how many bits can be
> sacrificed for ext4. I do not have that data, and any pointers there
> would help. (I did go through https://lwn.net/Articles/544520/ but it
> does not contain the tolerance information.)
>
> Here is what I have as the current bits lost, based on the following
> volume configuration:
> - 2 tiers (DHT over DHT)
> - 128 subvols per DHT
> - Each DHT instance is composed of either AFR or EC subvolumes, with 2
> replicas (AFR) and, say, 6 bricks per EC instance
>
> So the EC side of the volume needs ceil(log2(6)) (EC) + log2(128) (DHT)
> + log2(2) (Tier) = 3 + 7 + 1 = 11 bits of the actual d_off to encode
> the subvol, +1 for the high-order bit that denotes the encoding. (AFR
> would need 1 bit less, so we can consider just the EC side for the
> maximum-loss computation at present.)
>
> Is 12 bits still a tolerable loss for ext4? Or, up to how many bits can
> we keep using the current scheme?
>
> If we move to a 1000/10000-node gluster in 4.0, assuming everything
> remains the same except DHT, we need an additional 3-5 bits for the DHT
> subvol encoding. Would this still survive the ext4 d_off encoding
> scheme?
>
>

In theory we need at least ceil(log2(#bricks)) bits to store the
information. If we are creative enough in making the various layers
co-operate, we could get away with just that minimum, independent of the
number of xlator layers.
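(As a quick sanity check against the numbers above: a single global
encoding of 2 * 128 * 6 = 1536 bricks needs ceil(log2(1536)) = 11 bits,
+1 for the marker bit. In general ceil(log2(prod(n_i))) <=
sum(ceil(log2(n_i))), so a global encoding never does worse than the
per-layer scheme, and wins whenever more than one layer rounds up: e.g.
a hypothetical 6 EC bricks x 100 DHT subvols x 3 tiers costs
3 + 7 + 2 = 12 bits per-layer, but only ceil(log2(1800)) = 11 bits
globally. Even 10000 bricks fit in 14 bits.)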

One example approach (not necessarily the best): have every xlator know
the total number of leaf xlators (protocol/clients), and also the number
of leaf xlators under each of its subvolumes. This way, the
protocol/client xlators (alone) do the encoding, each knowing its global
brick# and the total # of bricks. The cluster xlators blindly forward
the readdir_cbk without any further transformation of the d_offs, and
also route the next readdir(old_doff) request to the appropriate
subvolume based on the weighted graph (of counts of protocol/clients in
the subtrees) till it reaches the right protocol/client to resume the
enumeration.
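To illustrate the routing half, a minimal sketch (invented names; it
assumes each cluster xlator keeps the count of protocol/client leaves
under each of its subvolumes, and that the leaf's global brick# has
already been recovered from old_doff):

    /* Pick which subvolume's subtree contains the leaf index 'leaf_id'
     * and compute the index relative to that subtree. Each cluster
     * xlator applies this to the (already relative) id it received, so
     * the id shrinks as the readdir descends, until it lands on a
     * single protocol/client. */
    static int route_to_subvol(const int *leaf_cnt, int n_subvols,
                               int leaf_id, int *relative_id)
    {
            int i, base = 0;

            for (i = 0; i < n_subvols; i++) {
                    if (leaf_id < base + leaf_cnt[i]) {
                            *relative_id = leaf_id - base;
                            return i;   /* forward readdir here */
                    }
                    base += leaf_cnt[i];
            }
            return -1;                  /* invalid/stale leaf id */
    }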

There may be better or even simpler approaches (especially ones that do
not need global awareness of xlator counts); finding such a stateless,
NFS-friendly solution is well worth the effort IMO.

Thanks



> >
> > Hope that helps
> > Thanks
> >
> >     Shyam
> >     [1] http://review.gluster.org/#/c/4711/
> >     [2] http://review.gluster.org/#/c/8201/