[Gluster-devel] Single layout at root (Was EHT / DHT)
Jan H Holtzhausen
janh at holtztech.info
Wed Nov 26 07:50:00 UTC 2014
OK, no current DHT workaround…
Wasn’t there an xlator that would tend to put files on the local brick
(maybe with an NFS mount)?
BR
Jan
On 2014/11/26, 1:15 AM, "Shyam" <srangana at redhat.com> wrote:
>On 11/25/2014 05:03 PM, Anand Avati wrote:
>>
>>
>> On Tue Nov 25 2014 at 1:28:59 PM Shyam <srangana at redhat.com
>> <mailto:srangana at redhat.com>> wrote:
>>
>> On 11/12/2014 01:55 AM, Anand Avati wrote:
>> >
>> >
>> > On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy <jdarcy at redhat.com
>> <mailto:jdarcy at redhat.com>
>> > <mailto:jdarcy at redhat.com <mailto:jdarcy at redhat.com>>> wrote:
>> >
>> > (Personally I would have
>> > done this by "mixing in" the parent GFID to the hash
>> calculation, but
>> > that alternative was ignored.)
>> >
>> >
>> > Actually, when DHT was implemented, the concept of GFID did not
>> > (yet) exist. For backward compatibility it has just remained this
>> > way even later. Including the GFID in the hash has benefits.
>>
>> I am curious here as this is interesting.
>>
>> So basing the layout start-subvol assignment for a directory on its
>> GFID was provided so that files with the same name distribute better,
>> rather than all ending up on the same bricks, right?
>>
>>
>> Right; for example, we wouldn't want all the README.txt files in the
>> various directories of a volume to end up on the same server. The way
>> it is achieved today is that the per-server hash-range assignment is
>> "rotated" by a certain amount (how much it is rotated is determined by
>> a separate hash on the directory path) at the time of mkdir.
>>
>> Instead, as we _now_ have GFIDs, we could use the GFID along with the
>> name to get a similar or better distribution, i.e. use GFID+name to
>> determine the hashed subvol.
>>
>> What we could do now is, include the parent directory gfid as an input
>> into the DHT hash function.
>>
>> Today, we do approximately:
>> int hashval = dm_hash ("readme.txt")
>> hash_ranges[] = inode_ctx_get (parent_dir)
>> subvol = find_subvol (hash_ranges, hashval)
>>
>> Instead, we could:
>> int hashval = new_hash ("readme.txt", parent_dir.gfid)
>> hash_ranges[] = global_value
>> subvol = find_subvol (hash_ranges, hashval)
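A hypothetical sketch of such a new_hash, seeding the name hash with the parent directory's GFID (the function names follow the pseudocode above, but the hash itself is an illustrative stand-in, not DHT's actual implementation):

```c
#include <stddef.h>
#include <stdint.h>

#define GFID_LEN 16 /* a GFID is a 16-byte UUID */

/* Davies-Meyer-style toy hash over a raw buffer. */
static uint32_t dm_hash_buf(const unsigned char *buf, size_t len)
{
    uint32_t h = 5381;
    while (len--)
        h = h * 33 + *buf++;
    return h;
}

/* new_hash("readme.txt", parent_gfid): seed the name hash with a
 * hash of the parent's GFID, so identical names under different
 * directories hash differently even with a single volume-wide
 * layout. */
static uint32_t new_hash(const char *name, const unsigned char gfid[GFID_LEN])
{
    uint32_t h = dm_hash_buf(gfid, GFID_LEN);
    /* continue hashing the entry name on top of the GFID seed */
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h;
}
```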
>>
>> The idea here would be that on dentry creates we would need to
>> generate the GFID ourselves and not let the bricks generate it, so
>> that we can choose the subvol to wind the FOP to.
>>
>>
>> The GFID would be that of the parent (as an entry name is always in the
>> context of a parent directory/inode). Also, the GFID for a new entry is
>> already generated by the client, the brick does not generate a GFID.
>>
>> This eliminates the need for a layout per sub-directory, and all the
>> (interesting) problems that come with it, replacing them with a single
>> layout at the root. I am not sure if it handles all the use cases and
>> paths that we have now (which needs more understanding).
>>
>> I do understand there is a backward-compatibility issue here, but
>> other than that, this sounds better than the current scheme, as there
>> is a single layout to read/optimize/stash/etc. across clients.
>>
>> Can I understand the rationale for this better, as to what you folks
>> are thinking? Am I missing something, or over-reading the benefits
>> that this can provide?
>>
>>
>> I think you understand it right. The benefit is one could have a single
>> hash layout for the entire volume and the directory "specific-ness" is
>> implemented by including the directory gfid into the hash function. The
>> way I see it, the compromise would be something like:
>>
>> Pro per-directory ranges: By having per-directory hash ranges, we can
>> do easier incremental rebalance. Partial progress is well tolerated
>> and does not impact the entire volume. While a given directory is
>> undergoing rebalance, we need to enter "unhashed lookup" mode for that
>> directory alone, and only for that period of time.
>>
>> Con per-directory ranges: Just the new "hash assignment" phase (to
>> affect placement of new files/data, not move old data) is itself an
>> extended process, crawling the entire volume with complex
>> per-directory operations. The number of points in the system where
>> things can "break" (i.e., result in overlaps and holes in the ranges)
>> is high.
>>
>> Pro single layout with dir GFID in hash: Avoids the numerous parts
>> (per-dir hash ranges) which can potentially "break".
>>
>> Con single layout with dir GFID in hash: Rebalance phase 1 (assigning
>> the new layout) is atomic for the entire volume - unhashed lookup has
>> to be "on" for all directories for the entire period. To mitigate
>> this, we could explore versioning the centralized hash ranges, and
>> store the version used by each directory in its xattrs (updating the
>> version as the rebalance progresses). But now we have more centralized
>> metadata (which may or may not be a worthy compromise - not sure).
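The versioning idea above could look roughly like this (struct and field names are hypothetical, just to illustrate the check a lookup would make):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical centralized, volume-wide layout. */
struct volume_layout {
    uint64_t version; /* bumped each time a rebalance reassigns ranges */
    /* ... centralized hash ranges ... */
};

/* dir_layout_version is what the directory's xattr recorded when it
 * was last brought up to date. A stale directory may still hold
 * files placed under an older layout, so a hashed-lookup miss there
 * must fall back to a broadcast (unhashed) lookup; up-to-date
 * directories can trust the hashed subvol. */
static bool need_unhashed_lookup(const struct volume_layout *vol,
                                 uint64_t dir_layout_version)
{
    return dir_layout_version != vol->version;
}
```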
>
>Agreed, the auto-unhashed mode would have to wait longer before being re-armed.
>
>Just throwing some more thoughts on the same,
>
>Auto-unhashed can also benefit from just linkto creation, rather than
>requiring a data rebalance (i.e., movement of data). So in phase-0 we
>could just create the linkto files and then turn on auto-unhashed, as
>lookups would then find the (linkto) file.
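That phase-0 resolution could be sketched like so (the types and names here are hypothetical; real DHT marks linkto files via a trusted.glusterfs.dht.linkto xattr on sticky-bit zero-byte files, which this glosses over):

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical result of probing the hashed subvol for a name. */
struct dentry {
    bool is_linkto;   /* zero-byte pointer file, not real data */
    int  data_subvol; /* where the data actually lives */
};

/* Try the hashed subvol first; if we only find a linkto marker
 * there, follow it to the real data subvol instead of broadcasting
 * an unhashed lookup to every subvol. Only a complete miss forces
 * the expensive fallback. */
static int resolve_subvol(const struct dentry *hashed_hit, int hashed_subvol)
{
    if (hashed_hit == NULL)
        return -1;                      /* fall back to unhashed lookup */
    if (hashed_hit->is_linkto)
        return hashed_hit->data_subvol; /* follow the linkto pointer */
    return hashed_subvol;               /* data is where the hash says */
}
```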
>
>Other abilities, like giving directories weighted layout ranges based
>on brick sizes, could be affected: increasing a brick's size would
>force a rebalance, as it would need a root layout change, rather than
>only newly created directories picking up the better weights.
>
>>
>> In summary, including GFID into the hash calculation does open up
>> interesting possibilities and worthy of serious consideration.
>
>Yes, something to consider for Gluster 4.0 (or earlier if done right
>with backward compatibility handled)
>
>Thanks,
>Shyam
>_______________________________________________
>Gluster-devel mailing list
>Gluster-devel at gluster.org
>http://supercolony.gluster.org/mailman/listinfo/gluster-devel