[Gluster-devel] Single layout at root (Was EHT / DHT)
Shyam
srangana at redhat.com
Wed Nov 26 01:15:58 UTC 2014
On 11/25/2014 05:03 PM, Anand Avati wrote:
>
>
> On Tue Nov 25 2014 at 1:28:59 PM Shyam <srangana at redhat.com> wrote:
>
> On 11/12/2014 01:55 AM, Anand Avati wrote:
> >
> >
> > On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy <jdarcy at redhat.com> wrote:
> >
> > (Personally I would have
> > done this by "mixing in" the parent GFID to the hash calculation, but
> > that alternative was ignored.)
> >
> >
> > Actually when DHT was implemented, the concept of GFID did not (yet)
> > exist. For backward compatibility it has simply remained this way
> > ever since. Including the GFID in the hash has benefits.
>
> I am curious here, as this is interesting.
>
> So the layout start-subvol assignment for a directory being based on its
> GFID was provided so that files with the same name distribute better,
> rather than all ending up on the same bricks, right?
>
>
> Right; e.g. we wouldn't want all the README.txt files in the various
> directories of a volume to end up on the same server. The way it is
> achieved today is that the per-server hash-range assignment is "rotated"
> by a certain amount (how much it is rotated is determined by a separate
> hash on the directory path) at the time of mkdir.
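A minimal sketch of the rotation described above (illustrative C only;
the real DHT code differs, and toy_hash/NSUBVOLS below are made-up
stand-ins):

/* Illustrative only: rotate the per-subvol hash-range ownership by an
 * amount derived from the directory path, so the same file name maps
 * to different subvols under different directories. */
#include <stdint.h>
#include <stdio.h>

#define NSUBVOLS 4 /* made-up subvolume count */

/* toy stand-in for the separate hash on the directory path */
static uint32_t toy_hash (const char *s)
{
        uint32_t h = 5381;
        while (*s)
                h = h * 33 + (unsigned char) *s++;
        return h;
}

static int find_subvol_rotated (const char *dirpath, uint32_t hashval)
{
        uint64_t span = ((uint64_t) UINT32_MAX + 1) / NSUBVOLS;
        int base = (int) (hashval / span);        /* un-rotated owner */
        int rot = (int) (toy_hash (dirpath) % NSUBVOLS); /* per-dir rotation */
        return (base + rot) % NSUBVOLS;
}

int main (void)
{
        uint32_t h = toy_hash ("README.txt");
        printf ("/docs -> subvol %d\n", find_subvol_rotated ("/docs", h));
        printf ("/src  -> subvol %d\n", find_subvol_rotated ("/src", h));
        return 0;
}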
>
> Instead, as we _now_ have GFID, we could use it along with the name to
> get a similar or better distribution, i.e. GFID+name to determine the
> hashed subvol.
>
> What we could do now is include the parent directory GFID as an input
> into the DHT hash function.
>
> Today, we do approximately:
>
>     int hashval = dm_hash ("readme.txt");
>     hash_ranges[] = inode_ctx_get (parent_dir);
>     subvol = find_subvol (hash_ranges, hashval);
>
> Instead, we could do:
>
>     int hashval = new_hash ("readme.txt", parent_dir.gfid);
>     hash_ranges[] = global_value;
>     subvol = find_subvol (hash_ranges, hashval);
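To make that concrete, one illustrative possibility for such a new_hash
(FNV-1a is used only as a stand-in, not a proposal for the actual hash,
and the gfid_t type below is a simplification):

#include <stdint.h>

typedef struct { unsigned char bytes[16]; } gfid_t;

/* Fold the parent directory's GFID into the name hash so the same name
 * lands on different subvols under different parents, while all
 * lookups consult one volume-wide range table. */
static uint32_t new_hash (const char *name, const gfid_t *parent_gfid)
{
        uint32_t h = 2166136261u;             /* FNV-1a offset basis */
        for (int i = 0; i < 16; i++) {        /* mix in the parent GFID */
                h ^= parent_gfid->bytes[i];
                h *= 16777619u;               /* FNV prime */
        }
        for (const char *p = name; *p; p++) { /* then the entry name */
                h ^= (unsigned char) *p;
                h *= 16777619u;
        }
        return h;
}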
>
> The idea here would be that on dentry creates we would need to generate
> the GFID on the client, and not let the bricks generate it, so that we
> can choose the subvol to wind the FOP to.
>
>
> The GFID would be that of the parent (as an entry name is always in the
> context of a parent directory/inode). Also, the GFID for a new entry is
> already generated by the client; the brick does not generate a GFID.
>
> This eliminates the need for a layout per sub-directory, and all the
> (interesting) problems that come with it, and instead can be replaced by
> a single layout at the root. Not sure if it handles all the use cases
> and paths that we have now (which needs more understanding).
>
> I do understand there is a backward compatibility issue here, but apart
> from that, this sounds better than the current scheme, as there is a
> single layout to read/optimize/stash/etc. across clients.
>
> Can I understand the rationale for this better, as to what you folks are
> thinking? Am I missing something, or over-reading the benefits that this
> can provide?
>
>
> I think you understand it right. The benefit is that one could have a
> single hash layout for the entire volume, with the directory
> "specific-ness" implemented by including the directory GFID in the hash
> function. The way I see it, the compromise would be something like:
>
> Pro per-directory range: By having per-directory hash ranges, we can do
> easier incremental rebalance. Partial progress is well tolerated and
> does not impact the entire volume. While a given directory is undergoing
> rebalance, we need to enter "unhashed lookup" mode for that directory
> alone, and only for that period of time.
>
> Con per-directory range: Just the new "hash assignment" phase (which
> impacts placement of new files/data, without moving old data) is itself
> an extended process, crawling the entire volume with complex
> per-directory operations. The number of points in the system where
> things can "break" (i.e., result in overlaps and holes in ranges) is
> high.
>
> Pro single layout with dir GFID in hash: Avoids the numerous parts
> (per-dir hash ranges) which can potentially "break".
>
> Con single layout with dir GFID in hash: Rebalance phase 1 (assigning
> the new layout) is atomic for the entire volume - unhashed lookup has to
> be "on" for all dirs for the entire period. To mitigate this, we could
> explore versioning the centralized hash ranges, storing the version used
> by each directory in its xattrs, and updating that version as the
> rebalance progresses. But now we have more centralized metadata (may or
> may not be a worthy compromise - not sure).
Agreed, the auto-unhashed lookup would have to wait longer before being
re-armed.
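To make the versioning idea concrete, a rough sketch of the data
involved (all structures here are assumed; nothing like this exists in
DHT today):

#include <stdbool.h>
#include <stdint.h>

#define NSUBVOLS 4 /* made-up subvolume count */

typedef struct {
        uint32_t start[NSUBVOLS]; /* hash-range start per subvol */
        uint32_t stop[NSUBVOLS];  /* hash-range stop per subvol  */
} layout_t;

typedef struct {
        layout_t gens[8]; /* retained layout generations */
        uint32_t current; /* newest generation number    */
} volume_layouts_t;

/* A directory would stamp the generation it was last balanced against
 * in its xattrs; only directories lagging the current generation need
 * the unhashed-lookup fallback. */
static bool needs_unhashed_lookup (const volume_layouts_t *v,
                                   uint32_t dir_layout_gen)
{
        return dir_layout_gen < v->current;
}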
Just throwing some more thoughts on the same:

Unhashed-auto can also benefit from just creating linktos, rather than
requiring a data rebalance (i.e., movement of data). So in phase-0 we
could just create the linkto files and then turn auto-unhashed back on,
as lookups would then find the (linkto) file.
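A rough sketch of what such a phase-0 pass amounts to (illustrative,
Linux-specific code; real DHT creates and marks linkto files on the
bricks, and its exact mode/xattr handling differs):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <unistd.h>

/* A linkto file is an empty placeholder whose
 * trusted.glusterfs.dht.linkto xattr names the subvolume that really
 * holds the data, so a lookup on the new hashed subvol finds a pointer
 * instead of nothing. */
int create_linkto (const char *path, const char *data_subvol)
{
        /* DHT marks linkto files with the sticky bit and no data */
        int fd = open (path, O_CREAT | O_EXCL | O_WRONLY, S_ISVTX);
        if (fd < 0)
                return -1;
        close (fd);
        return setxattr (path, "trusted.glusterfs.dht.linkto",
                         data_subvol, strlen (data_subvol) + 1, 0);
}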
Other abilities, like giving directories weighted layout ranges based on
the size of bricks, could also be affected: increasing a brick's size
would force a rebalance, since it would require a change to the root
layout, rather than just newly created directories picking up the better
weights.
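For reference, the weighting itself is simple arithmetic; an
illustrative sketch of carving the 32-bit hash space proportionally to
brick sizes (with a single root layout, this would have to be redone
volume-wide on any size change):

#include <stdint.h>

/* Give each subvol a slice of the 32-bit hash space proportional to
 * its brick's capacity. Changing any one size changes every slice,
 * hence the forced-rebalance concern above. */
static void weighted_ranges (const uint64_t sizes[], int n,
                             uint32_t start[], uint32_t stop[])
{
        uint64_t total = 0;
        double space = 4294967296.0; /* 2^32 */

        for (int i = 0; i < n; i++)
                total += sizes[i];

        uint64_t cursor = 0;
        for (int i = 0; i < n; i++) {
                uint64_t slice = (uint64_t) (space * sizes[i] / total);
                start[i] = (uint32_t) cursor;
                cursor += slice;
                stop[i] = (i == n - 1) ? UINT32_MAX
                                       : (uint32_t) (cursor - 1);
        }
}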
>
> In summary, including the GFID in the hash calculation does open up
> interesting possibilities and is worthy of serious consideration.
Yes, something to consider for Gluster 4.0 (or earlier, if done right
with backward compatibility handled).
Thanks,
Shyam