[Gluster-devel] EHT / DHT

Tue Nov 11 21:56:01 UTC 2014

> > I was wondering, is there a way (or a parameter to pass to Gluster's
> > DHT) to change the distribution algorithm so that it only takes into
> > account the filename and not the preceding filesystem path?
> > i.e. when a file is at: /mount/gluster/directory/filename.ext
> > To hash only on “filename.ext”?
> 
> Currently DHT hashes the file name and not the entire path.
> 
> See the callers of dht_hash_compute in the source (pretty much just
> dht_layout_search): they pass loc->name, which is the file name and
> not the entire path.

While that is true, there are a couple of caveats.  First, the hash is
based on the file name (last path component) but the *distribution* for
each directory (what we call a layout) is modified based on the
directory GFID.  This prevents the same file name in different
directories from always hashing to the same brick.  (Personally I would
have done this by "mixing in" the parent GFID to the hash calculation,
but that alternative was ignored.)
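
As a rough illustration of that first caveat, here is a toy model (not
the actual DHT code).  The hash depends only on the last path component,
while each directory gets its own assignment of hash ranges to bricks.
The hash function (zlib.crc32), the rule of rotating the layout by a
value derived from the directory GFID, and the names name_hash,
layout_for and brick_for are all assumptions made for this example.

```python
import os
import uuid
import zlib

HASH_SPACE = 2 ** 32  # toy 32-bit hash space


def name_hash(path):
    """Hash only the last path component (stand-in for the real DHT hash)."""
    return zlib.crc32(os.path.basename(path).encode("utf-8"))


def layout_for(dir_gfid, bricks):
    """Return (start, end, brick) ranges for one directory.

    The ranges are equal-sized; which brick owns which range is rotated
    by an offset derived from the directory GFID (an assumed rule).
    """
    n = len(bricks)
    offset = dir_gfid.int % n
    chunk = HASH_SPACE // n
    layout = []
    for i in range(n):
        start = i * chunk
        # The last range absorbs any remainder of the hash space.
        end = HASH_SPACE - 1 if i == n - 1 else start + chunk - 1
        layout.append((start, end, bricks[(i + offset) % n]))
    return layout


def brick_for(path, dir_gfid, bricks):
    """Find the brick whose range in this directory's layout holds the hash."""
    h = name_hash(path)
    for start, end, brick in layout_for(dir_gfid, bricks):
        if start <= h <= end:
            return brick
    raise ValueError("hash not covered by layout")
```

In this model the same file name always produces the same hash value,
but two directories whose GFIDs yield different offsets will place it on
different bricks.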

Secondly, there is a way to modify the hashing.  If you set the
"cluster.extra-hash-regex" option on a volume, that regular expression
will be used to "pick apart" the file name into a part that's used for
hashing and a part that's ignored.  Consider the case of rsync, which
for a file XXX will create a temporary file .XXX.123456 and rely on the
semantics of rename(2) to move it into place only after it's fully
written.  The "cluster.rsync-hash-regex" option is already set up to
remove the leading "." and trailing ".123456" so that "XXX" is again the
effective name for hashing/distribution purposes.  This lets the later
rename be done on one brick every time, which improves performance
significantly.  With "cluster.extra-hash-regex" you can do the same
thing for a second app, without affecting the rsync behavior.
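
To make the second point concrete, here is a small sketch of how such a
regex could pick out the part of the name that gets hashed.  The
patterns below are assumptions: the first is modeled on the rsync case
described above (leading ".", trailing ".suffix"), the second is a
made-up pattern for a hypothetical app that writes "name.part" files.
effective_name is a hypothetical helper, not a GlusterFS function.

```python
import re

# Assumed rsync-style pattern: leading ".", the real name (captured),
# then one final dot-free suffix such as ".123456".
RSYNC_LIKE = re.compile(r"^\.(.+)\.[^.]+$")

# Assumed extra pattern for a hypothetical app writing "<name>.part".
EXTRA = re.compile(r"^(.+)\.part$")


def effective_name(name, patterns=(RSYNC_LIKE,)):
    """Return the part of `name` that would be used for hashing.

    The first pattern that matches wins, and its first capture group is
    the effective name.  Names that match nothing are hashed as-is.
    """
    for pattern in patterns:
        match = pattern.match(name)
        if match:
            return match.group(1)
    return name


print(effective_name(".XXX.123456"))                          # XXX
print(effective_name("XXX"))                                  # XXX
print(effective_name("video.mp4.part", (RSYNC_LIKE, EXTRA)))  # video.mp4
```

Since the temporary name and the final name reduce to the same effective
name, they hash to the same place.  On a real volume the extra pattern
would presumably be set with something like
"gluster volume set <VOLNAME> cluster.extra-hash-regex '<regex>'"; check
the exact regex syntax and capture-group rules against your version.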