[Gluster-devel] DHT-based Replication

Tue Nov 22 16:14:42 UTC 2011

On Mon, 21 Nov 2011 21:34:30 -0200
Daniel van Ham Colchete <daniel.colchete at gmail.com> wrote:

> So, I have a suggestion that fixes this problem. I call it DHT-based
> replication. It is different from DHT-based distribution. I already
> implemented it internally, it already worked, at least here.

Wonderful.  Patches would be most welcome.

> Giving
> the amount of money and energy this idea saves, I think this idea is
> worth a million bucks (on your hands at least). Even though it is
> really simple. I'm giving it to you guys for free. Just please give
> credit if you are going to use it.
> 
> It is very simple: hash(name) to locate the primary server (as
> usual), and use another hash, like hash(name + "#copy2") for the
> second copy and so on. You just have to certify that it doesn't fall
> into the same server, but this is easy: hash(name + "#copy2/try2").

I've discussed this issue in some of my presentations, and there's a
whole range of possible approaches.

* Distribute across pairs (current approach).  You did a good job
  explaining some of the problems here.

* Overlapping pairs around a ring.  This is possible, and has the
  advantage that load for a failed server is distributed between its
  two neighbors instead of one partner (and similarly for N>2).
  Unfortunately, it's a bit incompatible with the way DHT currently
  works, with per-directory layout maps and an assumption that a hashed
  file's parent directory will exist on the same brick (essentially
  meaning directories must exist *everywhere*).  I have in the past
  implemented an alternative DHT translator that uses a Dynamo-style
  global ring and would be much more friendly to this kind of
  replication, but it's nowhere near production quality.  It also
  requires a more dynamic translator-graph infrastructure, because it
  would involve creating AFR translators (or their moral equivalent)
  that aren't explicit in the volfile, including when bricks are added.

* Iterative hashing (your suggestion).  This extends on both the
  advantages and drawbacks of the previous approach.  Because the
  number of AFR combinations now grows exponentially, it also requires
  that the translator-graph infrastructure be much more scalable as
  well as more dynamic.  BTW, Kademlia's XOR-based consistent hashing
  offers similar characteristics to iterative ring-based hashing, and
  might be preferable for reasons too academic to describe here.

As I'm sure you can see, there are a few technical issues that remain
before a simple idea can be turned into a workable reality.  One
compromise I've considered is to assign multiple layout maps per
directory, hashing a file first to a map and then to a position within
that map.  This would provide practically the same load-spreading
advantage of the other approaches, with only slight change to existing
(and working) code.