[Gluster-devel] DHT-based Replication

Daniel van Ham Colchete daniel.colchete at gmail.com
Mon Nov 21 23:34:30 UTC 2011

Good day everyone!

It is a pleasure for me to write to this list again, after so many years. I
tested GlusterFS a lot about four years ago because of a project I still
have today. It was 2007 and I worked with Gluster 1.2 and 1.3. I like to
think that I gave one or two good ideas that were used in the project. Amar
was a great friend back then (friend is the correct word as he was on my
Orkut's friends list). Unfortunately Gluster's performance wasn't good
enough, mostly because of the ASCII protocol and I had to go to the HA-NFS

By that time, a new clustering translator was created, it was called DHT.
In those four years that passed I studied DHT and even implemented a
DHT-based internal system here at my company to create a PostgreSQL cluster
(I didn't do any reconciliation thing though, that is the hard part for
me). It worked fine, it still does!

What is the beauty of DHT? It is easy to find where a file or a directory
is in the cluster. It is just a hash operation! You just create a cluster
of N servers and everything should be distributed evenly through them.
Every player out there got the distribution part of storage clusters right:
GlusterFS, Cassandra, MongoDB, etc.

What is not right in my opinion is how everybody does replication.
Everything works in pairs (or triples, or replica sets of any number). The
problem with pairs (and replica sets) is that in a 10 server cluster, if
one fails only its pair will have to handle the double of the usual load,
while the other ones will be working with the usual load. So we can assign
at most 50% of its capacity to any storage server. Only 50%. 50% efficiency
means we spend the double in hardware, rack space and energy (the worst
part). Even with RAID10 storage, the ones we use, have this problem too. If
a disk dies, one will get double read IOs while the other ones are 50% idle
(or should be).

So, I have a suggestion that fixes this problem. I call it DHT-based
replication. It is different from DHT-based distribution. I already
implemented it internally, it already worked, at least here. Giving the
amount of money and energy this idea saves, I think this idea is worth a
million bucks (on your hands at least). Even though it is really simple.
I'm giving it to you guys for free. Just please give credit if you are
going to use it.

It is very simple: hash(name) to locate the primary server (as usual), and
use another hash, like hash(name + "#copy2") for the second copy and so on.
You just have to certify that it doesn't fall into the same server, but
this is easy: hash(name + "#copy2/try2").

So, say we have 11 storage servers. Yeah, now we can have a prime number of
servers and everything will still work fine. Imagine that storage server
#11 dies. The secondary copy of all files at #11 are spread across all the
other 10 servers, so those servers are now only getting a load 10% bigger!
Wow! Now I can use 90% of what my storages can handle and still be up when
one fails! Had I done this the way things are today, my system would be
down because there is no way a storage can handle 180% of IO its capacity
(by definition). So now I have 90% efficiency! For exactly the same costs
to have 50k users I can have 90k users now! That is a 45% costs savings on
storage, with just a simple algorithm change.

So, what are your thoughts on the idea?

Thank you very much!

Best regards,
Daniel Colchete
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20111121/d2d17bab/attachment-0003.html>

More information about the Gluster-devel mailing list