[Gluster-devel] DHT-based Replication
ian.latter at midnightcode.org
Tue Nov 22 06:25:07 UTC 2011
I think what you're describing is functionally equivalent
to the Google File System;
And, by comparable design, Hadoop File System;
Google/Hadoop FS framed a set of principles that I
(and a friend of mine - Michael Welton) were hoping to
achieve by moving to GlusterFS - and I think it will still
get us all the way there once we get IPSEC underneath
it and it deals with WANs better.
In that context, I'm a fan already :)
However, I would like to add that if you're going to go
to the trouble of providing a hashed-block abstraction
layer, that you make it pluggable. I.e. before I was taken
ill in September 2009 I was looking at implementing a
de-dupe block for GlusterFS. Another friend (Chris
Kelada) and I came up with a full de-dupe model one
evening which I thought was quite simple. What stalled
me before I fell ill was an issue with descriptors that may
have been fixed during the BSD porting activity that I saw
on this list a week or two ago.
De-dupe needs a hash for a chunk of data. What we
proposed was an arbitrary sized chunk (say 1Mbyte)
hashed, then stored in the underlying layer via its hash -
i.e. if my block hashed to "thisismyhash" then it would be
stored in /.dedupe/thi/sis/myh/ash/data.bin with provision
for an index to allow multiple blocks of the same hash to
co-exist at the same location (i.e. either n x chunk_size
offset into the data file, or n data files).
This would have a high read/write performance penalty
(lots of iops in the underlying layer per iop in the de-dupe
layer) but it reaps the obvious space benefit.
Thus if a data hashing block were to be forged, its re-
application could be quite broad - but only if considered
beforehand; then I'd be all for it.
----- Original Message -----
>From: "Daniel van Ham Colchete" <daniel.colchete at gmail.com>
>To: "glusterfs" <gluster-devel at nongnu.org>
>Subject: [Gluster-devel] DHT-based Replication
>Date: Mon, 21 Nov 2011 21:34:30 -0200
> Good day everyone!
> It is a pleasure for me to write to this list again, after
so many years. I
> tested GlusterFS a lot about four years ago because of a
project I still
> have today. It was 2007 and I worked with Gluster 1.2 and
1.3. I like to
> think that I gave one or two good ideas that were used in
the project. Amar
> was a great friend back then (friend is the correct word
as he was on my
> Orkut's friends list). Unfortunately Gluster's performance
> enough, mostly because of the ASCII protocol and I had to
go to the HA-NFS
> By that time, a new clustering translator was created, it
was called DHT.
> In those four years that passed I studied DHT and even
> DHT-based internal system here at my company to create a
> (I didn't do any reconciliation thing though, that is the
hard part for
> me). It worked fine, it still does!
> What is the beauty of DHT? It is easy to find where a file
or a directory
> is in the cluster. It is just a hash operation! You just
create a cluster
> of N servers and everything should be distributed evenly
> Every player out there got the distribution part of
storage clusters right:
> GlusterFS, Cassandra, MongoDB, etc.
> What is not right in my opinion is how everybody does
> Everything works in pairs (or triples, or replica sets of
any number). The
> problem with pairs (and replica sets) is that in a 10
server cluster, if
> one fails only its pair will have to handle the double of
the usual load,
> while the other ones will be working with the usual load.
So we can assign
> at most 50% of its capacity to any storage server. Only
50%. 50% efficiency
> means we spend the double in hardware, rack space and
energy (the worst
> part). Even with RAID10 storage, the ones we use, have
this problem too. If
> a disk dies, one will get double read IOs while the other
ones are 50% idle
> (or should be).
> So, I have a suggestion that fixes this problem. I call it
> replication. It is different from DHT-based distribution.
> implemented it internally, it already worked, at least
here. Giving the
> amount of money and energy this idea saves, I think this
idea is worth a
> million bucks (on your hands at least). Even though it is
> I'm giving it to you guys for free. Just please give
credit if you are
> going to use it.
> It is very simple: hash(name) to locate the primary server
(as usual), and
> use another hash, like hash(name + "#copy2") for the
second copy and so on.
> You just have to certify that it doesn't fall into the
same server, but
> this is easy: hash(name + "#copy2/try2").
> So, say we have 11 storage servers. Yeah, now we can have
a prime number of
> servers and everything will still work fine. Imagine that
> #11 dies. The secondary copy of all files at #11 are
spread across all the
> other 10 servers, so those servers are now only getting a
load 10% bigger!
> Wow! Now I can use 90% of what my storages can handle and
still be up when
> one fails! Had I done this the way things are today, my
system would be
> down because there is no way a storage can handle 180% of
IO its capacity
> (by definition). So now I have 90% efficiency! For exactly
the same costs
> to have 50k users I can have 90k users now! That is a 45%
costs savings on
> storage, with just a simple algorithm change.
> So, what are your thoughts on the idea?
> Thank you very much!
> Best regards,
> Daniel Colchete
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
Late night coder ..
More information about the Gluster-devel