[Gluster-users] dht hashing based on basename only?

Tue Aug 14 13:40:02 UTC 2012

On 08/14/2012 03:44 AM, Jochen Klein wrote:
> Looking at the implementation in the dht translator and checking
> calculated hashes it seems that only the basename is used for the hash
> calculation of a given file. With all directories having the same
> mappings for the hash intervals to bricks, this would explain our
> observation if only this file hash is used. However, I also see hashes
> calculated for directories but it's not clear to me for what they are
> used?
> 
> Am I missing something here? Is this behaviour intended? Is there a
> (supported) way to still distribute the files homogeneously to all
> bricks? E.g. by using the full path for the hashing (which is actually
> what I understood from the manual), or by shuffling the hash intervals
> per directory?

I tripped over the same issue a while ago.  Yes, the file hashes use only the
basename.  However, it's not (or at least shouldn't be) true that all
directories have the same mappings for the hash intervals.  The same ranges are
used, but rotated into different orders.  So, using letters for hash values,
different directories might have:

	A-I on brick1, J-R on brick2, S-Z on brick3
	A-I on brick2, J-R on brick3, S-Z on brick1
	A-I on brick3, J-R on brick1, S-Z on brick2
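
To make the rotation concrete, here is a rough Python sketch of the idea. It
is not the dht translator's code: the names layout_for, name_hash and
brick_for are made up for illustration, and CRC32 is only a stand-in for
whatever hash function GlusterFS really uses. The point is just that the same
basename hash can land on different bricks depending on how the directory's
ranges are rotated.

```python
import zlib

BRICKS = ["brick1", "brick2", "brick3"]
HASH_SPACE = 2 ** 32  # 32-bit hash space, split evenly across bricks


def layout_for(rotation, bricks=BRICKS):
    """Return [(start, stop, brick)]: same ranges, brick order rotated."""
    n = len(bricks)
    step = HASH_SPACE // n
    layout = []
    for i in range(n):
        start = i * step
        # The last range absorbs the remainder so the whole space is covered.
        stop = HASH_SPACE - 1 if i == n - 1 else (i + 1) * step - 1
        layout.append((start, stop, bricks[(i + rotation) % n]))
    return layout


def name_hash(basename):
    """Stand-in hash (CRC32). NOT the hash GlusterFS actually uses."""
    return zlib.crc32(basename.encode("utf-8")) & 0xFFFFFFFF


def brick_for(path, rotation):
    """Pick a brick from the basename only, using the directory's rotation."""
    basename = path.rstrip("/").rsplit("/", 1)[-1]
    h = name_hash(basename)
    for start, stop, brick in layout_for(rotation):
        if start <= h <= stop:
            return brick
    raise ValueError("hash not covered by layout")
```

With this model, /a/data.txt and /b/data.txt go to the same brick only if
directories a and b happen to share a rotation.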

I just ran a quick test creating a bunch of directories on a simple two-brick
distributed volume.  Sure enough, about half of the directories got one order,
and the other half got another.  If this isn't working the same way for you
(check using "getfattr -e hex -n trusted.glusterfs.dht" on each per-brick copy of
each directory) then it's probably a bug and we'll have to figure out why.
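
If you'd rather script the check, something like the following might help. It
is only a sketch: I'm assuming the xattr value is 16 bytes holding four
big-endian 32-bit integers, with the last two being the start and end of the
hash range for that brick. That field layout is an assumption, so verify it
against your version before trusting the output. The function names are
hypothetical. Reading the trusted.* namespace needs root, and os.getxattr is
Linux-only.

```python
import os
import struct

XATTR_NAME = "trusted.glusterfs.dht"


def parse_dht_xattr(raw):
    """Decode a raw xattr value, ASSUMING four big-endian uint32 fields."""
    if len(raw) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(raw))
    first, second, start, stop = struct.unpack(">IIII", raw)
    # The meaning of the first two fields is not interpreted here.
    return {"field0": first, "field1": second, "start": start, "stop": stop}


def read_dir_layout(brick_dir):
    """Read and decode the layout xattr from one brick's copy of a directory."""
    raw = os.getxattr(brick_dir, XATTR_NAME)  # run as root on the brick
    return parse_dht_xattr(raw)
```

Run read_dir_layout on the same directory under each brick's backend path and
compare the start/stop values between directories.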

Personally, I'd prefer if all directories *did* have the same hash layout, so
that those layouts could be inherited instead of having to be set separately on
each and every directory of a potentially-petabyte volume.  That would require
that the hash include some directory-specific value (such as the parent GFID)
as well as the basename, but that seems a small price to pay.  In other words,
though right now it's a bit non-obvious how the layouts and hashing work, some
day they might work as you (and I) had expected.
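
For what it's worth, the alternative I have in mind would look roughly like
this. Again, this is a sketch under assumptions and not anything that exists
in GlusterFS today: CRC32 stands in for the real hash, and SHARED_LAYOUT,
combined_hash and brick_for_entry are made-up names. Every directory uses one
layout, and the parent's GFID is mixed into the hash so that identical
basenames in different directories can still spread out.

```python
import uuid
import zlib

BRICKS = ["brick1", "brick2", "brick3"]
HASH_SPACE = 2 ** 32
STEP = HASH_SPACE // len(BRICKS)

# One layout shared (inherited) by every directory: (start, stop, brick).
SHARED_LAYOUT = [
    (i * STEP,
     HASH_SPACE - 1 if i == len(BRICKS) - 1 else (i + 1) * STEP - 1,
     brick)
    for i, brick in enumerate(BRICKS)
]


def combined_hash(parent_gfid, basename):
    """Hash the parent GFID together with the basename (CRC32 stand-in)."""
    data = parent_gfid.bytes + basename.encode("utf-8")
    return zlib.crc32(data) & 0xFFFFFFFF


def brick_for_entry(parent_gfid, basename):
    """Look up the brick in the single shared layout."""
    h = combined_hash(parent_gfid, basename)
    for start, stop, brick in SHARED_LAYOUT:
        if start <= h <= stop:
            return brick
    raise ValueError("hash not covered by layout")
```

Here the per-directory variation comes from the hash input instead of from a
per-directory xattr, which is why nothing would need to be set on each
directory.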