[Gluster-devel] [RFC] Improved distribution

Jeff Darcy jdarcy at redhat.com
Tue Apr 17 14:38:24 UTC 2012


On Tue, 17 Apr 2012 01:57:35 +0200
Edward Shishkin <edward at redhat.com> wrote:

> Comment 2. There is a disadvantage: in this approach all files
> 
> /foo
> /dir1/foo
> /dir1/dir2/foo
> ...
> 
> will be accumulated on the same brick. However, it is possible to
> "salt" short file names with the gfid (or another id) of the respective
> directory before hashing, to avoid possible attacks.
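
For concreteness, salting could be as simple as hashing the parent
directory's gfid together with the file name.  A minimal, hypothetical
sketch (placeholder hash and gfid handling, not the proposal's or
GlusterFS's actual code):

	# Hypothetical "salting" of a name with the parent directory's
	# gfid before hashing; md5 here is just a placeholder hash.
	import hashlib

	def salted_hash(parent_gfid: bytes, name: str) -> int:
	    digest = hashlib.md5(parent_gfid + name.encode("utf-8")).digest()
	    return int.from_bytes(digest[:8], "big")   # keep 64 bits

	# /foo, /dir1/foo and /dir1/dir2/foo now hash differently,
	# because their parent directories have different gfids.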

That's the easy problem.  The harder problem is that the "only split
one brick" approach creates imbalances that are *permanent and
cumulative*.  In other words, each change is likely to take us further
from optimal distribution, so measures such as virtual nodes become
strictly necessary to preserve any semblance of proper load/capacity
balancing.  I implemented an approach based on moving files from only
one hash range instead of only one brick (a demonstrably superior
generalization of the idea), and it still exhibits this behavior.  You
had those results before you ever wrote this proposal.  We still need a
rebalance method that restores "perfect" distribution, even if it's not
the only one we use.

>                        2. Virtual nodes

Virtual nodes are a bad idea.  Even the people who included them in
the Dynamo design have said as much since.  The main problem is that
they accumulate infinitely.  If one relies on them too much to fix other
problems, the result is very large numbers of virtual node IDs and very
large lookup tables.  This is compounded in our case by the fact that
information about node IDs or ranges is contained in xattrs, so adding
too many will make fetching that information (a frequent operation)
less efficient.  At the very least, virtual node IDs need to be
aggressively pruned, even if that means incurring some data-movement
cost.  Even better, I think we should stay away from them entirely.  My
favorite alternative is multiple hash rings, as follows:

	ring_hash = hash(file_id)
	ring_to_use = ring_hash % num_rings
	node_hash = hash(ring_hash/num_rings)
	node_to_use = lookup(node_hash, lookup_table[ring_to_use])

This scheme approaches the same flexibility/efficiency as virtual node
IDs quite rapidly, even with small values for num_rings.  With careful
assignment of ranges within each ring, it can also ensure that the load
from a failed node is spread out across up to num_rings successors
instead of just one (as it would be with a single ring, even when using
virtual nodes).
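
To make that concrete, here is a runnable sketch of the pseudocode
above.  The details are assumptions for illustration only: a
placeholder 64-bit hash, made-up brick names, and identical range
tables in every ring (in practice each ring would get its own,
staggered range assignments, which is what spreads a failed node's
load across different successors).

	# Runnable sketch of the multi-ring lookup; placeholder code,
	# not GlusterFS's actual hashing or layout handling.
	import bisect
	import hashlib

	def h64(data: bytes) -> int:
	    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

	# One lookup table per ring: a sorted list of (boundary, node).
	# A hash maps to the first boundary that is >= the hash value.
	num_rings = 4
	lookup_table = [
	    [(2**62, "brick0"), (2**63, "brick1"), (2**64 - 1, "brick2")]
	    for _ in range(num_rings)
	]

	def lookup(node_hash: int, ring: list) -> str:
	    keys = [boundary for boundary, _ in ring]
	    idx = bisect.bisect_left(keys, node_hash) % len(ring)  # wrap around
	    return ring[idx][1]

	def node_for(file_id: bytes) -> str:
	    ring_hash = h64(file_id)
	    ring_to_use = ring_hash % num_rings
	    node_hash = h64(str(ring_hash // num_rings).encode())
	    return lookup(node_hash, lookup_table[ring_to_use])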
	
> To achieve high availability and durability we replicate files on
> multiple bricks. In our case replication can be implemented as a set
> of operations with the same ring R, so we don't create a separate
> translator for replication.

Yes, we do and we will.  Distribution and replication are both
extremely complex.  Our code for both represents years of accumulated
expertise handling all manner of difficult cases, so for basic
modularity/maintainability reasons they will remain separate.

That said, a more nuanced relationship between distribution sets and
replica sets would be welcome.  If we allowed replication across
arbitrary (and possibly overlapping) sets of R nodes, instead of
statically partitioning the bricks into fixed groups of R that DHT sees
as 1/R as many subvolumes, we'd gain at least the following benefits
(one way such per-file replica sets might be chosen is sketched after
the list).

(1) Ability to support multiple replication levels within a single
volume.

(2) Smoother addition and removal of bricks.

(3) Better distribution of load from a failed brick.
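
One hypothetical way to choose such per-file replica sets (with a
placeholder hash and made-up brick names, not the actual DHT/AFR
design) is to hash each file onto a single ring containing all bricks
and take the next R bricks as its replica set.  Sets then overlap, and
a failed brick's load spreads across its ring neighbors rather than
landing on one fixed partner.

	# Hypothetical per-file replica sets: the R successors of the
	# file's hash on a ring of all bricks.  Placeholder code only.
	import bisect
	import hashlib

	def h64(data: bytes) -> int:
	    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

	bricks = ["brick%d" % i for i in range(10)]
	ring = sorted((h64(b.encode()), b) for b in bricks)  # (position, name)

	def replica_set(file_id: bytes, R: int = 3) -> list:
	    keys = [pos for pos, _ in ring]
	    start = bisect.bisect_right(keys, h64(file_id))
	    # Walk clockwise, taking the next R bricks and wrapping around.
	    return [ring[(start + i) % len(ring)][1] for i in range(R)]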

Unfortunately, this requires that DHT be able to "spawn" AFR
translators dynamically, without them being represented directly in the
volfile.  I've written code to do this, verified that it actually
works (or at least did at that time), and published the results to the
cloudfs-devel mailing list.  It still requires a lot of work to make
sure that option changes and other kinds of reconfiguration are handled
correctly, but long term it's the direction we need to go.

>                              APPENDIX
> 
> 
> 
> ----------------------------------------------------------------------
> 
> On 3 distributed hash tables with different hashing techniques
> 
> . GlusterFS DHT translator (3.2.5)
> . 64-bit ring with phi based on md5, R=1 (no replication), S=7
> . 64-bit ring with phi based on md5, R=1 (no replication), S=20
> 
> we run the same scenario:
> 
> 1) Create 100 files ("file00", "file01", ..., "file99") in a volume
>     composed of 9 bricks:
> 
>     "host:/root/exp0",
>     "host:/root/exp1",
>     ...
> 
>     "host:/root/exp8".
> 
> 2) Add one brick "host:/root/exp9";
> 3) re-balance;

These results are barely usable.  When I pushed you to write up and
distribute this proposal instead of just beginning to hack on the code,
I also provided you with scripts that apply different rebalancing
methods and measure the results across different scenarios to generate
highly readable tables showing the data-movement effects.  Why did you
use a weaker methodology that generates less readable results?
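
For reference, that kind of measurement is easy to script.  The sketch
below is hypothetical (it is not the scripts mentioned above): it
simulates the 9-brick-plus-one scenario with a placeholder placement
function and reports the fraction of files that would move.

	# Hypothetical data-movement measurement: place 100 files on 9
	# bricks and again on 10, then count how many files moved.
	import hashlib

	def h64(data: bytes) -> int:
	    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

	def place(name: str, num_bricks: int) -> int:
	    # Placeholder placement policy (hash modulo brick count);
	    # substitute whatever rebalance method is being evaluated.
	    return h64(name.encode()) % num_bricks

	files = ["file%02d" % i for i in range(100)]
	before = dict((f, place(f, 9)) for f in files)
	after = dict((f, place(f, 10)) for f in files)
	moved = sum(1 for f in files if before[f] != after[f])
	print("moved %d/%d files (%.1f%%)" %
	      (moved, len(files), 100.0 * moved / len(files)))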

Also, please use code from master/3.3 for your testing and
development.  I don't think there have been any significant changes in
this particular area, but one should always compare results to the
version one proposes to replace.



