[Gluster-devel] Single layout at root (Was EHT / DHT)

Shyam srangana at redhat.com
Wed Nov 26 01:15:58 UTC 2014


On 11/25/2014 05:03 PM, Anand Avati wrote:
>
>
> On Tue Nov 25 2014 at 1:28:59 PM Shyam <srangana at redhat.com> wrote:
>
>     On 11/12/2014 01:55 AM, Anand Avati wrote:
>      >
>      >
>      > On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy <jdarcy at redhat.com> wrote:
>      >
>      >       (Personally I would have
>      >     done this by "mixing in" the parent GFID to the hash
>     calculation, but
>      >     that alternative was ignored.)
>      >
>      >
>      > Actually when DHT was implemented, the concept of GFID did not (yet)
>      > exist. Due to backward compatibility it has just remained this
>     way even
>      > later. Including the GFID into the hash has benefits.
>
>     I am curious here as this is interesting.
>
>     So the assignment of a directory's layout start subvol based on its
>     GFID was done so that files with the same name distribute better,
>     rather than all ending up on the same bricks, right?
>
>
> Right; e.g. we wouldn't want all the README.txt files in various
> directories of a volume to end up on the same server. The way this is
> achieved today is that the per-server hash-range assignment is "rotated"
> by a certain amount (how much it is rotated is determined by a separate
> hash of the directory path) at the time of mkdir.
>
>     Instead, as we _now_ have the GFID, we could use it along with the
>     name to get a similar/better distribution, i.e. GFID+name to
>     determine the hashed subvol.
>
> What we could do now is, include the parent directory gfid as an input
> into the DHT hash function.
>
> Today, we do approximately:
>    int hashval = dm_hash ("readme.txt")
>    hash_ranges[] = inode_ctx_get (parent_dir)
>    subvol = find_subvol (hash_ranges, hashval)
>
> Instead, we could:
>    int hashval = new_hash ("readme.txt", parent_dir.gfid)
>    hash_ranges[] = global_value
>    subvol = find_subvol (hash_ranges, hashval)
>
>     The idea here would be that on dentry creates we would need to
>     generate the GFID ourselves and not let the bricks generate it, so
>     that we can choose the subvol to wind the FOP to.
>
>
> The GFID would be that of the parent (as an entry name is always in the
> context of a parent directory/inode). Also, the GFID for a new entry is
> already generated by the client, the brick does not generate a GFID.
>
>     This eliminates the need for a layout per sub-directory and all the
>     (interesting) problems that it comes with and instead can be replaced by
>     a layout at root. Not sure if it handles all use cases and paths that we
>     have now (which needs more understanding).
>
>     I do understand there is a backward compatibility issue here, but other
>     than this, this sounds better than the current scheme, as there is a
>     single layout to read/optimize/stash/etc. across clients.
>
>     Can I understand the rationale of this better, as to what you folks are
>     thinking. Am I missing something or over reading on the benefits that
>     this can provide?
>
>
> I think you understand it right. The benefit is one could have a single
> hash layout for the entire volume and the directory "specific-ness" is
> implemented by including the directory gfid into the hash function. The
> way I see it, the compromise would be something like:
>
> Pro per-directory ranges: By having per-directory hash ranges, we can do
> easier incremental rebalance. Partial progress is well tolerated and
> does not impact the entire volume. While a given directory is undergoing
> rebalance, we need to enter "unhashed lookup" mode for that directory
> alone, and only for that period of time.
>
> Con per-directory ranges: Just the new "hash assignment" phase (to impact
> placement of new files/data, not move old data) is itself an extended
> process, crawling the entire volume with complex per-directory
> operations. The number of points in the system where things can "break"
> (i.e., result in overlaps and holes in ranges) is high.
>
> Pro single layout with dir GFID in hash: Avoid the numerous parts
> (per-dir hash ranges) which can potentially "break".
>
> Con single layout with dir GFID in hash: Rebalance phase 1 (assigning
> new layout) is atomic for the entire volume - unhashed lookup has to be
> "on" for all dirs for the entire period. To mitigate this, we could
> explore versioning the centralized hash ranges, and store the version
> used by each directory in its xattrs (and update the version as the
> rebalance progresses). But now we have more centralized metadata (which
> may or may not be a worthy compromise - not sure).

Agreed, auto-unhashed would have to wait longer before being rearmed.

Just throwing some more thoughts on the same,

Auto-unhashed can also benefit from just linkto creation, rather than 
requiring a data rebalance (i.e., movement of data). So in phase-0 we 
could just create the linkto files and then turn auto-unhashed back on, 
as lookups would find the (linkto) file.

Other abilities could be affected, such as giving directories weighted 
layout ranges based on brick sizes: increasing a brick's size would force 
a rebalance, since it would need a change to the root layout, rather than 
just newly created directories picking up the better weights.

>
> In summary, including GFID into the hash calculation does open up
> interesting possibilities and worthy of serious consideration.

Yes, something to consider for Gluster 4.0 (or earlier, if done right 
with backward compatibility handled).

Thanks,
Shyam

