[Gluster-devel] DHT-based Replication
Jeff Darcy
jdarcy at redhat.com
Tue Nov 22 16:14:42 UTC 2011
On Mon, 21 Nov 2011 21:34:30 -0200
Daniel van Ham Colchete <daniel.colchete at gmail.com> wrote:
> So, I have a suggestion that fixes this problem. I call it DHT-based
> replication. It is different from DHT-based distribution. I already
> implemented it internally, it already worked, at least here.
Wonderful. Patches would be most welcome.
> Giving
> the amount of money and energy this idea saves, I think this idea is
> worth a million bucks (on your hands at least). Even though it is
> really simple. I'm giving it to you guys for free. Just please give
> credit if you are going to use it.
>
> It is very simple: hash(name) to locate the primary server (as
> usual), and use another hash, like hash(name + "#copy2") for the
> second copy and so on. You just have to certify that it doesn't fall
> into the same server, but this is easy: hash(name + "#copy2/try2").
I've discussed this issue in some of my presentations, and there's a
whole range of possible approaches.
* Distribute across pairs (current approach). You did a good job
explaining some of the problems here.
* Overlapping pairs around a ring. This is possible, and has the
advantage that load for a failed server is distributed between its
two neighbors instead of one partner (and similarly for N>2).
Unfortunately, it's a bit incompatible with the way DHT currently
works, with per-directory layout maps and an assumption that a hashed
file's parent directory will exist on the same brick (essentially
meaning directories must exist *everywhere*). I have in the past
implemented an alternative DHT translator that uses a Dynamo-style
global ring and would be much more friendly to this kind of
replication, but it's nowhere near production quality. It also
requires a more dynamic translator-graph infrastructure, because it
would involve creating AFR translators (or their moral equivalent)
that aren't explicit in the volfile, including when bricks are added.
* Iterative hashing (your suggestion). This extends on both the
advantages and drawbacks of the previous approach. Because the
number of AFR combinations now grows exponentially, it also requires
that the translator-graph infrastructure be much more scalable as
well as more dynamic. BTW, Kademlia's XOR-based consistent hashing
offers similar characteristics to iterative ring-based hashing, and
might be preferable for reasons too academic to describe here.
As I'm sure you can see, there are a few technical issues that remain
before a simple idea can be turned into a workable reality. One
compromise I've considered is to assign multiple layout maps per
directory, hashing a file first to a map and then to a position within
that map. This would provide practically the same load-spreading
advantage of the other approaches, with only slight change to existing
(and working) code.
More information about the Gluster-devel
mailing list