[Gluster-devel] Single layout at root (Was EHT / DHT)

Tue Nov 25 22:03:40 UTC 2014

On Tue Nov 25 2014 at 1:28:59 PM Shyam <srangana at redhat.com> wrote:

> On 11/12/2014 01:55 AM, Anand Avati wrote:
> >
> >
> > On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy <jdarcy at redhat.com> wrote:
> >
> >       (Personally I would have
> >     done this by "mixing in" the parent GFID to the hash calculation, but
> >     that alternative was ignored.)
> >
> >
> > Actually when DHT was implemented, the concept of GFID did not (yet)
> > exist. Due to backward compatibility it has just remained this way even
> > later. Including the GFID into the hash has benefits.
>
> I am curious here as this is interesting.
>
> So the layout start subvol assignment for a directory to be based on its
> GFID was provided so that files with the same name distribute better
> than ending up in the same bricks, right?
>

Right, e.g., we wouldn't want all the README.txt files in various directories
of a volume to end up on the same server. The way this is achieved today is
that the per-server hash-range assignment is "rotated" by a certain amount
(how much it is rotated is determined by a separate hash on the directory
path) at the time of mkdir.

> Instead as we _now_ have GFID, we could use that including the name to
> get a similar/better distribution, or GFID+name to determine hashed subvol.
>

What we could do now is include the parent directory GFID as an input to
the DHT hash function.

Today, we do approximately:
  int hashval = dm_hash ("readme.txt")
  hash_ranges[] = inode_ctx_get (parent_dir)
  subvol = find_subvol (hash_ranges, hashval)

Instead, we could:
  int hashval = new_hash ("readme.txt", parent_dir.gfid)
  hash_ranges[] = global_value
  subvol = find_subvol (hash_ranges, hashval)
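
To make the comparison above concrete, here is a minimal Python sketch of both schemes. It is an illustration under stated assumptions, not Gluster code: `h32` is a stand-in hash built on SHA-1 (it is not `dm_hash`), the layout is a plain list of `(start, end, subvol)` tuples, and the helper names `per_dir_layout`, `global_layout`, `place_per_dir` and `place_with_gfid` are hypothetical.

```python
import hashlib
import uuid

HASH_SPACE = 2**32  # 32-bit hash space, split among subvolumes


def h32(data: bytes) -> int:
    """Stand-in 32-bit hash (assumption: NOT Gluster's dm_hash)."""
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "big")


def even_ranges(n):
    """Split [0, 2^32) into n contiguous (start, end_exclusive) ranges."""
    step = HASH_SPACE // n
    return [(i * step, HASH_SPACE if i == n - 1 else (i + 1) * step)
            for i in range(n)]


def per_dir_layout(n, dir_path):
    """Today's idea: same ranges, but the range-to-subvol assignment is
    rotated by an offset derived from a hash of the directory path."""
    offset = h32(dir_path.encode()) % n
    return [(s, e, (i + offset) % n) for i, (s, e) in enumerate(even_ranges(n))]


def global_layout(n):
    """Proposed idea: one unrotated layout shared by the whole volume."""
    return [(s, e, i) for i, (s, e) in enumerate(even_ranges(n))]


def find_subvol(layout, hashval):
    for start, end, subvol in layout:
        if start <= hashval < end:
            return subvol
    raise ValueError("hash value falls into a hole in the layout")


def place_per_dir(name, dir_layout):
    # Hash depends on the name only; the directory's own layout varies.
    return find_subvol(dir_layout, h32(name.encode()))


def place_with_gfid(name, parent_gfid, layout):
    # Hash mixes in the parent GFID; the layout is the same everywhere.
    return find_subvol(layout, h32(parent_gfid.bytes + name.encode()))
```

In both variants the same file name in different directories can land on different subvolumes; the difference is whether the per-directory variation lives in the layout (first variant) or in the hash input (second variant).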

> The idea here would be that on dentry creates we would need to generate
> the GFID and not let the bricks generate the same, so that we can choose
> the subvol to wind the FOP to.
>

The GFID would be that of the parent (as an entry name is always in the
context of a parent directory/inode). Also, the GFID for a new entry is
already generated by the client; the brick does not generate a GFID.

> This eliminates the need for a layout per sub-directory and all the
> (interesting) problems that it comes with and instead can be replaced by
> a layout at root. Not sure if it handles all use cases and paths that we
> have now (which needs more understanding).
>
> I do understand there is a backward compatibility issue here, but other
> than this, this sounds better than the current scheme, as there is a
> single layout to read/optimize/stash/etc. across clients.
>
> Can I understand the rationale of this better, as to what you folks are
> thinking. Am I missing something or over reading on the benefits that
> this can provide?
>

I think you understand it right. The benefit is that one could have a single
hash layout for the entire volume, with the directory "specific-ness"
implemented by including the directory GFID in the hash function. The way
I see it, the trade-offs would be something like:

Pro per directory range: By having per-directory hash ranges, we can do
easier incremental rebalance. Partial progress is well tolerated and does
not impact the entire volume. While a given directory is undergoing
rebalance, we need to enter "unhashed lookup" mode for that directory
alone, and only for that period of time.

Con per directory range: Just the new "hash assignment" phase (to impact
placement of new files/data, not move old data) itself is an extended
process, crawling the entire volume with complex per-directory operations.
The number of points in the system where things can "break" (i.e., result in
overlaps and holes in ranges) is high.
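
As an illustration of what "overlaps and holes" mean for a set of hash ranges, here is a small Python sketch of a layout check. It assumes a layout is simply a list of `(start, end_exclusive)` ranges over a 32-bit hash space; the function name `check_layout` is hypothetical and this is not how Gluster itself validates layouts.

```python
HASH_SPACE = 2**32  # assumed 32-bit hash space


def check_layout(ranges):
    """Return (holes, overlaps) for a list of (start, end_exclusive) ranges.

    A hole is a part of the hash space no range covers; an overlap is a
    part covered by more than one range.
    """
    holes, overlaps = [], []
    cursor = 0  # everything below cursor is already covered
    for start, end in sorted(ranges):
        if start > cursor:
            holes.append((cursor, start))
        elif start < cursor:
            overlaps.append((start, min(end, cursor)))
        cursor = max(cursor, end)
    if cursor < HASH_SPACE:
        holes.append((cursor, HASH_SPACE))
    return holes, overlaps
```

With per-directory ranges, a check of this kind would have to hold for every directory in the volume; with a single layout it would have to hold once.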

Pro single layout with dir GFID in hash: Avoid the numerous parts (per-dir
hash ranges) which can potentially "break".

Con single layout with dir GFID in hash: Rebalance phase 1 (assigning new
layout) is atomic for the entire volume - unhashed lookup has to be "on"
for all dirs for the entire period. To mitigate this, we could explore
versioning the centralized hash ranges, and store the version used by each
directory in its xattrs (and update the version as the rebalance
progresses). But now we have more centralized metadata (which may or may
not be a worthy compromise; not sure).
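
A rough Python sketch of the versioning idea, purely as an assumption-laden illustration: the volume keeps a table of layouts keyed by version, each directory records the version it currently uses (modelled here as a dict standing in for an xattr), and rebalance bumps a directory's version once it is done with it. The names `h32`, `even_layout`, `lookup_subvol` and `finish_dir_rebalance` are hypothetical, and `h32` is a SHA-1 based stand-in rather than Gluster's hash.

```python
import hashlib
import uuid

HASH_SPACE = 2**32


def h32(data: bytes) -> int:
    """Stand-in 32-bit hash (assumption: not Gluster's real hash)."""
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "big")


def even_layout(n):
    """One volume-wide layout: n contiguous (start, end_exclusive, subvol)."""
    step = HASH_SPACE // n
    return [(i * step, HASH_SPACE if i == n - 1 else (i + 1) * step, i)
            for i in range(n)]


def lookup_subvol(name, parent_gfid, dir_versions, layouts):
    """Hash name + parent GFID, then resolve it against the layout version
    recorded for that directory."""
    layout = layouts[dir_versions[parent_gfid]]
    hashval = h32(parent_gfid.bytes + name.encode())
    for start, end, subvol in layout:
        if start <= hashval < end:
            return subvol
    raise ValueError("hash value falls into a hole in the layout")


def finish_dir_rebalance(parent_gfid, dir_versions, new_version):
    """Called once a directory's files have been moved to the new layout."""
    dir_versions[parent_gfid] = new_version


# Example: version 1 has 3 subvolumes, version 2 adds a fourth.
layouts = {1: even_layout(3), 2: even_layout(4)}
docs = uuid.uuid4()
dir_versions = {docs: 1}
before = lookup_subvol("README.txt", docs, dir_versions, layouts)
finish_dir_rebalance(docs, dir_versions, 2)
after = lookup_subvol("README.txt", docs, dir_versions, layouts)
```

Directories not yet visited by rebalance keep resolving names against the old version, which is what would let hashed lookups keep working for them while the crawl progresses.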

In summary, including the GFID in the hash calculation does open up
interesting possibilities and is worthy of serious consideration.

HTH,
Avati