[Gluster-devel] EHT / DHT
Jeff Darcy
jdarcy at redhat.com
Tue Nov 11 21:56:01 UTC 2014
> > I was wondering, is there a way to change, or a parameter to pass to,
> > Gluster's DHT so that the distribution algorithm only takes into
> > account the filename and not the preceding filesystem path?
> > i.e. when a file is at: /mount/gluster/directory/filename.ext
> > to hash only on “filename.ext”?
>
> Currently DHT hashes the file name and not the entire path.
>
> See the callers of dht_hash_compute in the source (pretty much
> dht_layout_search), to which loc->name is passed; that is the file name
> and not the entire path.
While that is true, there are a couple of caveats. First, the hash is
based on the file name (last path component) but the *distribution* for
each directory (what we call a layout) is modified based on the
directory GFID. This prevents the same file name in different
directories from always hashing to the same brick. (Personally I would
have done this by "mixing in" the parent GFID to the hash calculation,
but that alternative was ignored.)
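
To illustrate the first caveat, here is a toy sketch in Python. It is not
Gluster's actual code: it assumes a stand-in hash (zlib.crc32 instead of
the real dht_hash_compute), four hypothetical bricks, and a made-up rule
that rotates the brick order by the directory GFID. The real layout
assignment differs in detail. The only point is that the name hash does
not depend on the directory, while the mapping from hash ranges to bricks
does.

```python
import uuid
import zlib

# Hypothetical brick names, for illustration only.
BRICKS = ["brick-0", "brick-1", "brick-2", "brick-3"]
HASH_SPACE = 2 ** 32


def name_hash(name):
    """Stand-in for dht_hash_compute: hashes only the last path component."""
    return zlib.crc32(name.encode("utf-8")) & 0xFFFFFFFF


def layout_for(dir_gfid):
    """Toy per-directory layout: equal hash ranges, with the brick order
    rotated by the directory GFID (an assumed rule, not Gluster's)."""
    n = len(BRICKS)
    start = dir_gfid.int % n
    order = BRICKS[start:] + BRICKS[:start]
    width = HASH_SPACE // n
    layout = []
    for i, brick in enumerate(order):
        lo = i * width
        # The last range absorbs any remainder of the hash space.
        hi = HASH_SPACE - 1 if i == n - 1 else lo + width - 1
        layout.append((lo, hi, brick))
    return layout


def brick_for(dir_gfid, name):
    """Find the brick whose hash range in this directory covers the name."""
    h = name_hash(name)
    for lo, hi, brick in layout_for(dir_gfid):
        if lo <= h <= hi:
            return brick
    raise ValueError("hash not covered by layout")


if __name__ == "__main__":
    dir_a = uuid.UUID(int=0)
    dir_b = uuid.UUID(int=1)
    # Same file name, same hash, but a different layout per directory.
    print(brick_for(dir_a, "filename.ext"))
    print(brick_for(dir_b, "filename.ext"))
```

In this sketch the two directories use different rotations, so the same
file name lands on different bricks even though its hash is identical.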

Secondly, there is a way to modify the hashing. If you set the
"cluster.extra-hash-regex" option on a volume, that regular expression
will be used to "pick apart" the file name into a part that's used for
hashing and a part that's ignored. Consider the case of rsync, which
for a file XXX will create a temporary file .XXX.123456 and rely on the
semantics of rename(2) to move it into place only after it's fully
written. The "rsync-hash-regex" is already set up to remove the leading
"." and trailing ".123456" so that "XXX" is again the effective name for
hashing/distribution purposes. This allows the later rename to be done
on one brick every time, which improves performance significantly. With
"extra-hash-regex" you can do the same thing for a second app, without
affecting the rsync behavior.
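
As a rough sketch of the name "picking apart" described above, in Python
rather than Gluster's C: the rsync-style pattern below is my assumption
of what such a regex looks like (a leading ".", a captured middle part,
and a trailing ".suffix"), and the ".part" pattern for a second app is
purely hypothetical. The helper effective_name is also invented for this
example. I assume here that the first capture group is the part used for
hashing.

```python
import re

# Assumed shape of an rsync-style pattern: ".XXX.123456" -> "XXX".
RSYNC_STYLE_REGEX = re.compile(r"^\.(.+)\.[^.]+$")

# Hypothetical pattern for a second app that writes "name.part" first.
EXTRA_STYLE_REGEX = re.compile(r"^(.+)\.part$")


def effective_name(name, regexes=(RSYNC_STYLE_REGEX, EXTRA_STYLE_REGEX)):
    """Return the part of the name that would be hashed.

    The first pattern that matches wins and its first capture group is
    used; names that match no pattern are hashed unchanged.
    """
    for rx in regexes:
        m = rx.match(name)
        if m:
            return m.group(1)
    return name


if __name__ == "__main__":
    for n in (".XXX.123456", "XXX", "video.mp4.part"):
        print(n, "->", effective_name(n))
```

Because the temporary name and the final name reduce to the same
effective name, both hash to the same place, which is what lets the
rename stay on one brick.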