[Gluster-devel] Single layout at root (Was EHT / DHT)
Jan H Holtzhausen
janh at holtztech.info
Wed Nov 26 07:50:00 UTC 2014
OK, no current DHT workaround…
Wasn’t there an xlator that would tend to put files on the local brick
(maybe with an NFS mount)?
BR
Jan
On 2014/11/26, 1:15 AM, "Shyam" <srangana at redhat.com> wrote:
>On 11/25/2014 05:03 PM, Anand Avati wrote:
>>
>>
>> On Tue Nov 25 2014 at 1:28:59 PM Shyam <srangana at redhat.com
>> <mailto:srangana at redhat.com>> wrote:
>>
>> On 11/12/2014 01:55 AM, Anand Avati wrote:
>> >
>> >
>> > On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy <jdarcy at redhat.com
>> <mailto:jdarcy at redhat.com>
>> > <mailto:jdarcy at redhat.com <mailto:jdarcy at redhat.com>>> wrote:
>> >
>> > (Personally I would have
>> > done this by "mixing in" the parent GFID to the hash
>> calculation, but
>> > that alternative was ignored.)
>> >
>> >
>> > Actually, when DHT was implemented, the concept of GFID did not
>> > (yet) exist. For backward compatibility it has just remained this
>> > way even later. Including the GFID in the hash has benefits.
>>
>> I am curious here as this is interesting.
>>
>> So basing the layout start-subvol assignment for a directory on its
>> GFID was provided so that files with the same name distribute better,
>> rather than all ending up on the same bricks, right?
>>
>>
>> Right; for example, we wouldn't want all the README.txt files in the
>> various directories of a volume to end up on the same server. The way
>> it is achieved today is that the per-server hash-range assignment is
>> "rotated" by a certain amount (how much it is rotated is determined by
>> a separate hash on the directory path) at the time of mkdir.
>>
>> Instead, as we _now_ have GFIDs, we could use the GFID along with the
>> name to get a similar or better distribution, i.e. use GFID+name to
>> determine the hashed subvol.
>>
>> What we could do now is, include the parent directory gfid as an input
>> into the DHT hash function.
>>
>> Today, we do approximately:
>> int hashval = dm_hash ("readme.txt")
>> hash_ranges[] = inode_ctx_get (parent_dir)
>> subvol = find_subvol (hash_ranges, hashval)
>>
>> Instead, we could:
>> int hashval = new_hash ("readme.txt", parent_dir.gfid)
>> hash_ranges[] = global_value
>> subvol = find_subvol (hash_ranges, hashval)
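A hypothetical sketch of such a new_hash, seeding the name hash with the parent directory's GFID (the function names follow the pseudocode above, but the hash itself is an illustrative stand-in, not DHT's actual implementation):

```c
#include <stddef.h>
#include <stdint.h>

#define GFID_LEN 16 /* a GFID is a 16-byte UUID */

/* Davies-Meyer-style toy hash over a raw buffer. */
static uint32_t dm_hash_buf(const unsigned char *buf, size_t len)
{
    uint32_t h = 5381;
    while (len--)
        h = h * 33 + *buf++;
    return h;
}

/* new_hash("readme.txt", parent_gfid): seed the name hash with a
 * hash of the parent's GFID, so identical names under different
 * directories hash differently even with a single volume-wide
 * layout. */
static uint32_t new_hash(const char *name, const unsigned char gfid[GFID_LEN])
{
    uint32_t h = dm_hash_buf(gfid, GFID_LEN);
    /* continue hashing the entry name on top of the GFID seed */
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h;
}
```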
>>
>> The idea here would be that on dentry creates we would need to
>> generate the GFID ourselves and not let the bricks generate it, so
>> that we can choose the subvol to wind the FOP to.
>>
>>
>> The GFID would be that of the parent (as an entry name is always in the
>> context of a parent directory/inode). Also, the GFID for a new entry is
>> already generated by the client, the brick does not generate a GFID.
>>
>> This eliminates the need for a layout per sub-directory, and all the
>> (interesting) problems that come with it, replacing them with a single
>> layout at the root. I am not sure if it handles all the use cases and
>> paths that we have now (which needs more understanding).
>>
>> I do understand there is a backward-compatibility issue here, but
>> other than that, this sounds better than the current scheme, as there
>> is a single layout to read/optimize/stash/etc. across clients.
>>
>> Can I understand the rationale for this better, as to what you folks
>> are thinking? Am I missing something, or over-reading the benefits
>> that this can provide?
>>
>>
>> I think you understand it right. The benefit is one could have a single
>> hash layout for the entire volume and the directory "specific-ness" is
>> implemented by including the directory gfid into the hash function. The
>> way I see it, the compromise would be something like:
>>
>> Pro per-directory ranges: By having per-directory hash ranges, we can
>> do easier incremental rebalance. Partial progress is well tolerated
>> and does not impact the entire volume. While a given directory is
>> undergoing rebalance, we need to enter "unhashed lookup" mode for that
>> directory alone, and only for that period of time.
>>
>> Con per-directory ranges: Just the new "hash assignment" phase (to
>> affect placement of new files/data, not move old data) is itself an
>> extended process, crawling the entire volume with complex
>> per-directory operations. The number of points in the system where
>> things can "break" (i.e., result in overlaps and holes in the ranges)
>> is high.
>>
>> Pro single layout with dir GFID in hash: Avoids the numerous parts
>> (per-dir hash ranges) which can potentially "break".
>>
>> Con single layout with dir GFID in hash: Rebalance phase 1 (assigning
>> the new layout) is atomic for the entire volume - unhashed lookup has
>> to be "on" for all directories for the entire period. To mitigate
>> this, we could explore versioning the centralized hash ranges, and
>> store the version used by each directory in its xattrs (updating the
>> version as the rebalance progresses). But now we have more centralized
>> metadata (which may or may not be a worthy compromise - not sure).
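The versioning idea above could look roughly like this (struct and field names are hypothetical, just to illustrate the check a lookup would make):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical centralized, volume-wide layout. */
struct volume_layout {
    uint64_t version; /* bumped each time a rebalance reassigns ranges */
    /* ... centralized hash ranges ... */
};

/* dir_layout_version is what the directory's xattr recorded when it
 * was last brought up to date. A stale directory may still hold
 * files placed under an older layout, so a hashed-lookup miss there
 * must fall back to a broadcast (unhashed) lookup; up-to-date
 * directories can trust the hashed subvol. */
static bool need_unhashed_lookup(const struct volume_layout *vol,
                                 uint64_t dir_layout_version)
{
    return dir_layout_version != vol->version;
}
```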
>
>Agreed, the auto-unhashed mode would have to wait longer before being re-armed.
>
>Just throwing some more thoughts on the same,
>
>Auto-unhashed can also benefit from just linkto creation, rather than
>requiring a data rebalance (i.e., movement of data). So in phase-0 we
>could just create the linkto files and then turn on auto-unhashed, as
>lookups would then find the (linkto) file.
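That phase-0 resolution could be sketched like so (the types and names here are hypothetical; real DHT marks linkto files via a trusted.glusterfs.dht.linkto xattr on sticky-bit zero-byte files, which this glosses over):

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical result of probing the hashed subvol for a name. */
struct dentry {
    bool is_linkto;   /* zero-byte pointer file, not real data */
    int  data_subvol; /* where the data actually lives */
};

/* Try the hashed subvol first; if we only find a linkto marker
 * there, follow it to the real data subvol instead of broadcasting
 * an unhashed lookup to every subvol. Only a complete miss forces
 * the expensive fallback. */
static int resolve_subvol(const struct dentry *hashed_hit, int hashed_subvol)
{
    if (hashed_hit == NULL)
        return -1;                      /* fall back to unhashed lookup */
    if (hashed_hit->is_linkto)
        return hashed_hit->data_subvol; /* follow the linkto pointer */
    return hashed_subvol;               /* data is where the hash says */
}
```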
>
>Other abilities, like giving directories weighted layout ranges based
>on brick sizes, could be affected: increasing a brick's size would
>force a rebalance, as it would need a root layout change, rather than
>only newly created directories picking up the better weights.
>
>>
>> In summary, including GFID into the hash calculation does open up
>> interesting possibilities and worthy of serious consideration.
>
>Yes, something to consider for Gluster 4.0 (or earlier if done right
>with backward compatibility handled)
>
>Thanks,
>Shyam
>_______________________________________________
>Gluster-devel mailing list
>Gluster-devel at gluster.org
>http://supercolony.gluster.org/mailman/listinfo/gluster-devel