[Gluster-devel] RFC: d_off encoding at client/protocol layer

Dan Lambright dlambrig at redhat.com
Mon Feb 2 14:11:36 UTC 2015


Hello,

We have a prototype of this working on the tiering forge site, and noticed that in this scheme each client translator needs to "know" the total number of bricks in the volume. We can compute that number when the graph is created, and again on a graph switch. A downside, though, is that it runs against the gluster design principle that a translator is only "aware" of the translators it is attached to.
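
For illustration, computing that count at graph-creation time is a simple walk over the graph. This is only a minimal sketch; the types below are simplified stand-ins for the real ones in xlator.h, and the helper name is made up:

    /* Simplified stand-ins for the graph types in libglusterfs's xlator.h. */
    typedef struct xlator xlator_t;
    typedef struct xlator_list {
            xlator_t           *xlator;
            struct xlator_list *next;
    } xlator_list_t;
    struct xlator {
            xlator_list_t *children;
    };

    /* Count the leaves (protocol/client xlators) at or below 'this';
     * a leaf is any xlator with no children. */
    static int
    count_leaf_xlators (xlator_t *this)
    {
            xlator_list_t *child = NULL;
            int            count = 0;

            if (!this->children)
                    return 1;

            for (child = this->children; child; child = child->next)
                    count += count_leaf_xlators (child->xlator);

            return count;
    }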

I can imagine an alternative: the d_off could set aside a fixed number of bits for the sub volume ID, as many as are needed for the maximum number of sub volumes we support. For example, if we support 2048 sub volumes, 11 bits would be set aside in the d_off. This would mean the client translator would not need to know the number of bricks.

The disadvantage of this approach is that the more bits we use for the sub volume, the fewer remain available for the offset handed to us by the file system, and this raises the probability of losing track of which file in a directory we left off on.
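
To make the trade-off concrete, here is a minimal sketch of what a fixed 11-bit encoding could look like (names and layout are illustrative, not the prototype code; the masking in doff_encode is exactly where backend offset bits get dropped):

    #include <stdint.h>

    #define SUBVOL_BITS 11                       /* up to 2048 sub volumes */
    #define OFF_BITS    (64 - SUBVOL_BITS)       /* 53 bits left for the fs offset */
    #define OFF_MASK    ((1ULL << OFF_BITS) - 1)

    /* Pack the subvol ID into the top bits; any fs offset bits above
     * OFF_BITS are silently lost here. */
    static inline uint64_t
    doff_encode (uint64_t fs_off, uint64_t subvol_id)
    {
            return (subvol_id << OFF_BITS) | (fs_off & OFF_MASK);
    }

    static inline uint64_t
    doff_subvol (uint64_t d_off)
    {
            return d_off >> OFF_BITS;
    }

    static inline uint64_t
    doff_fs_off (uint64_t d_off)
    {
            return d_off & OFF_MASK;
    }

If I have the ext4 numbers right, its hashed directory offsets use up to 63 significant bits, so masking to 53 bits throws away roughly 10 bits of hash per entry; that is where the higher chance of resuming at the wrong entry comes from.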

Overall, the arguments I've heard in support of keeping the current approach (a dynamic number of bits set aside for the sub volume) seem stronger, given that in most deployments the number of sub volumes fits in far fewer bits.
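
For contrast, a minimal sketch of the dynamic variant (again illustrative, not the prototype): only as many bits as the brick count actually requires are reserved.

    #include <stdint.h>

    /* Smallest number of bits that can represent subvol IDs 0..nbricks-1. */
    static inline int
    subvol_bits (int nbricks)
    {
            int bits = 0;

            while ((1 << bits) < nbricks)
                    bits++;

            return bits;
    }

    /* A 4-brick volume spends 2 bits; a 2048-brick volume spends 11. */
    static inline uint64_t
    doff_encode_dyn (uint64_t fs_off, uint64_t subvol_id, int nbricks)
    {
            int bits = subvol_bits (nbricks);

            if (bits == 0)
                    return fs_off;   /* single brick: nothing to encode */

            return (subvol_id << (64 - bits)) |
                   (fs_off & ((1ULL << (64 - bits)) - 1));
    }

With the typical handful of bricks this spends only 2 or 3 bits of the d_off, which is the crux of the argument above.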

Dan

----- Original Message -----
> From: "Shyam" <srangana at redhat.com>
> To: "Gluster Devel" <gluster-devel at gluster.org>
> Cc: "Dan Lambright" <dlambrig at redhat.com>
> Sent: Monday, January 26, 2015 8:59:14 PM
> Subject: RFC: d_off encoding at client/protocol layer
> 
> Hi,
> 
> Some parts of this topic have been discussed in the recent past here [1].
> 
> The current mechanism, where each xlator encodes the subvol in the lower
> or higher bits, has its pitfalls, as discussed in those threads and in
> this review [2].
> 
> Here is a solution design from one of the comments posted on this by
> Avati [3]:
> 
> "One example approach (not necessarily the best): Make every xlator
> knows the total number of leaf xlators (protocol/clients), and also the
> number of all leaf xlators from each of its subvolumes. This way, the
> protocol/client xlators (alone) do the encoding, by knowing its global
> brick# and total #of bricks. The cluster xlators blindly forward the
> readdir_cbk without any further transformations of the d_offs, and also
> route the next readdir(old_doff) request to the appropriate subvolume
> based on the weighted graph (of counts of protocol/clients in the
> subtrees) till it reaches the right protocol/client to resume the
> enumeration."
> 
> So the currently proposed scheme being worked on is as follows:
> - encode the d_off with the client/protocol ID, which is generated as
> its leaf position/number
> - no further encoding in any other xlator
> - on receiving further readdir requests with the d_off, consult the
> graph (or the immediate children) for the ID encoded in the d_off, and
> send the request down that subvol path
> 
> IOW, there would be a common routine: pass it a d_off and this (i.e.,
> the current xlator) to get back the subvol that the d_off belongs to.
> The routine would decode the d_off into the leaf ID encoded at the
> client/protocol layer, match that ID to one of this xlator's subvols,
> and return it for further processing. (It may consult the graph, or
> store the range of leaf IDs under each subvol, and deliver the result
> appropriately.)
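> 
> A minimal sketch of what that common routine could look like (names
> here are illustrative, not actual GlusterFS APIs; the caller is assumed
> to already hold the per-child leaf counts from the weighted graph, and
> leaf_id is whatever the protocol/client layer encoded into the d_off):
> 
>     #include <stdint.h>
>     #include <stddef.h>
> 
>     typedef struct xlator xlator_t;   /* opaque for this sketch */
> 
>     /* Each child owns a contiguous range of leaf IDs; return the child
>      * whose range contains leaf_id, or NULL for a stale d_off. */
>     static xlator_t *
>     subvol_for_doff (xlator_t **children, const uint64_t *leaves,
>                      int nchildren, uint64_t leaf_id)
>     {
>             uint64_t first = 0;
>             int      i     = 0;
> 
>             for (i = 0; i < nchildren; i++) {
>                     if (leaf_id < first + leaves[i])
>                             return children[i];
> 
>                     first += leaves[i];
>             }
> 
>             return NULL;
>     }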
> 
> Given the current situation of ext4 and xfs, and continuing with the ID
> encoding scheme, this seems to be the best way of preventing multiple
> subvol encodings from stomping on each other, while also avoiding (in a
> sense) further loss of bits. This scheme would also give AFR/EC the
> ability to better load balance readdir requests across their subvols,
> rather than sticking to a static subvol for a longer duration.
> 
> Thoughts/comments?
> 
> Shyam
> 
> [1] https://www.mail-archive.com/gluster-devel@gluster.org/msg02834.html
> [2] review.gluster.org/#/c/8201/4/xlators/cluster/afr/src/afr-dir-read.c
> [3] https://www.mail-archive.com/gluster-devel@gluster.org/msg02847.html
> 

