[Gluster-devel] DHT-based Replication

Daniel van Ham Colchete daniel.colchete at gmail.com
Tue Nov 22 22:15:48 UTC 2011


Hi Ian!

You see, there isn't much reason why we shouldn't do it! The Google folks
and the Hadoop folks already have this kind of efficiency and we don't! With
100 storage servers you can have N+5 redundancy and still be 94.7% IO
efficient, which is what the big clusters out there need.

Having that in GlusterFS would mean a stable, high-performance filesystem
becoming much more efficient.

Best!

On Tue, Nov 22, 2011 at 4:25 AM, Ian Latter <ian.latter at midnightcode.org> wrote:

> Hello,
>
>
>  I think what you're describing is functionally equivalent
> to the Google File System;
>    http://labs.google.com/papers/gfs.html
>    http://en.wikipedia.org/wiki/Google_File_System
>
>  And, by comparable design, Hadoop File System;
>
> http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
>
>  Google/Hadoop FS framed a set of principles that I
> (and a friend of mine - Michael Welton) were hoping to
> achieve by moving to GlusterFS - and I think it will still
> get us all the way there once we get IPSEC underneath
> it and it deals with WANs better.
>
>  In that context, I'm a fan already :)
>
>  However, I would like to add that if you're going to go
> to the trouble of providing a hashed-block abstraction
> layer, you should make it pluggable.  E.g. before I was taken
> ill in September 2009 I was looking at implementing a
> de-dupe block for GlusterFS.  Another friend (Chris
> Kelada) and I came up with a full de-dupe model one
> evening which I thought was quite simple.  What stalled
> me before I fell ill was an issue with descriptors that may
> have been fixed during the BSD porting activity that I saw
> on this list a week or two ago.
>  De-dupe needs a hash for a chunk of data.  What we
> proposed was an arbitrarily sized chunk (say 1 MByte)
> hashed, then stored in the underlying layer via its hash -
> i.e. if my block hashed to "thisismyhash" then it would be
> stored in /.dedupe/thi/sis/myh/ash/data.bin with provision
> for an index to allow multiple blocks of the same hash to
> co-exist at the same location (i.e. either n x chunk_size
> offset into the data file, or n data files).
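>
>  A minimal sketch of that layout, assuming SHA-1 for the chunk
> hash (the helper names and the exact split are only illustrative):
>
> import hashlib
> import os
>
> CHUNK_SIZE = 1024 * 1024       # say, 1 MByte per chunk
> DEDUPE_ROOT = "/.dedupe"
>
> def chunk_path(chunk: bytes, index: int = 0) -> str:
>     """Map a chunk of data to its on-disk location via its hash.
>     index lets several distinct chunks that share a hash co-exist
>     as n data files at the same location."""
>     digest = hashlib.sha1(chunk).hexdigest()
>     # "thisismyhash..." -> thi/sis/myh/ash, then the data file
>     dirs = [digest[i:i + 3] for i in range(0, 12, 3)]
>     name = "data.bin" if index == 0 else "data.%d.bin" % index
>     return os.path.join(DEDUPE_ROOT, *dirs, name)
>
> # chunk_path(b"hello world") -> "/.dedupe/2aa/e6c/35c/94f/data.bin"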
>
>  This would have a high read/write performance penalty
> (lots of iops in the underlying layer per iop in the de-dupe
> layer) but it reaps the obvious space benefit.
>
>  Thus, if a data-hashing block were to be built, its re-
> application could be quite broad - but only if that is considered
> beforehand; then I'd be all for it.
>
>
>
> Cheers,
>
>
> ----- Original Message -----
> >From: "Daniel van Ham Colchete" <daniel.colchete at gmail.com>
> >To: "glusterfs" <gluster-devel at nongnu.org>
> >Subject:  [Gluster-devel] DHT-based Replication
> >Date: Mon, 21 Nov 2011 21:34:30 -0200
> >
> > Good day everyone!
> >
> > It is a pleasure for me to write to this list again, after so many
> > years. I tested GlusterFS a lot about four years ago because of a
> > project I still have today. It was 2007 and I worked with Gluster 1.2
> > and 1.3. I like to think that I gave one or two good ideas that were
> > used in the project. Amar was a great friend back then (friend is the
> > correct word, as he was on my Orkut friends list). Unfortunately
> > Gluster's performance wasn't good enough, mostly because of the ASCII
> > protocol, and I had to go the HA-NFS way.
> >
> > Around that time, a new clustering translator was created; it was
> > called DHT. In the four years since, I have studied DHT and even
> > implemented a DHT-based internal system here at my company to create
> > a PostgreSQL cluster (I didn't do any reconciliation, though; that is
> > the hard part for me). It worked fine, and it still does!
> >
> > What is the beauty of DHT? It is easy to find where a file or a
> > directory is in the cluster: it is just a hash operation! You just
> > create a cluster of N servers and everything gets distributed evenly
> > across them. Every player out there got the distribution part of
> > storage clusters right: GlusterFS, Cassandra, MongoDB, etc.
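> >
> > Just to make the lookup concrete, here is a minimal sketch (the
> > hash and the server list are only illustrative, not GlusterFS's
> > actual DHT ranges):
> >
> > import hashlib
> >
> > SERVERS = ["storage%02d" % i for i in range(1, 11)]   # N = 10
> >
> > def locate(name: str) -> str:
> >     """The location is a pure function of the name, so no central
> >     lookup table is needed."""
> >     h = int(hashlib.md5(name.encode()).hexdigest(), 16)
> >     return SERVERS[h % len(SERVERS)]
> >
> > # locate("/home/daniel/file.txt") always returns the same server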
> >
> > What is not right, in my opinion, is how everybody does replication.
> > Everything works in pairs (or triples, or replica sets of any
> > number). The problem with pairs (and replica sets) is that in a
> > 10-server cluster, if one server fails, only its pair has to handle
> > double the usual load, while the other servers keep working at the
> > usual load. So we can load any storage server to at most 50% of its
> > capacity. Only 50%. 50% efficiency means we spend double on hardware,
> > rack space and energy (the worst part). Even RAID10 storage, the kind
> > we use, has this problem. If a disk dies, its mirror gets double the
> > read IOs while the other disks sit 50% idle (or should).
> >
> > So, I have a suggestion that fixes this problem. I call it DHT-based
> > replication, and it is different from DHT-based distribution. I have
> > already implemented it internally and it worked, at least here. Given
> > the amount of money and energy this idea saves, I think it is worth a
> > million bucks (in your hands at least), even though it is really
> > simple. I'm giving it to you guys for free. Just please give credit
> > if you are going to use it.
> >
> > It is very simple: hash(name) locates the primary server (as usual),
> > and another hash, like hash(name + "#copy2"), locates the second
> > copy, and so on. You just have to make sure a copy doesn't land on
> > the same server as another one, but that is easy: hash(name +
> > "#copy2/try2").
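> >
> > A minimal sketch of that placement rule (same illustrative hash as
> > before; it assumes fewer copies than servers):
> >
> > import hashlib
> >
> > def server_for(key: str, nservers: int) -> int:
> >     return int(hashlib.md5(key.encode()).hexdigest(), 16) % nservers
> >
> > def place_copies(name: str, nservers: int, copies: int = 2) -> list:
> >     """Primary at hash(name); copy k at hash(name + "#copyk"),
> >     retried with "/tryj" until it lands on an unused server."""
> >     placement = [server_for(name, nservers)]
> >     for k in range(2, copies + 1):
> >         key = "%s#copy%d" % (name, k)
> >         attempt, j = key, 2
> >         while server_for(attempt, nservers) in placement:
> >             attempt = "%s/try%d" % (key, j)
> >             j += 1
> >         placement.append(server_for(attempt, nservers))
> >     return placement
> >
> > # place_copies("/some/file", 11) -> two distinct server numbers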
> >
> > So, say we have 11 storage servers. Yeah, now we can have a prime
> > number of servers and everything will still work fine. Imagine that
> > storage server #11 dies. The secondary copies of all the files on #11
> > are spread across the other 10 servers, so each of those servers now
> > gets a load only 10% bigger! Wow! Now I can use 90% of what my
> > storage servers can handle and still be up when one fails! Had I done
> > this the way things are today, my system would be down, because there
> > is no way a server can handle 180% of its IO capacity (by
> > definition). So now I have 90% efficiency! For exactly the same cost
> > that served 50k users I can now serve 90k users! That is a 45% cost
> > saving on storage, with just a simple algorithm change.
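> >
> > Here is the load arithmetic spelled out as a quick sketch, under a
> > simplified model (uniform load and exactly one failed server):
> >
> > def peak_load_after_failure(nservers: int, paired: bool) -> float:
> >     """Worst-case load on a surviving server, as a multiple of the
> >     normal per-server load, when one server fails."""
> >     if paired:
> >         # the dead server's partner takes over all of its work
> >         return 2.0
> >     # DHT-based replication: the dead server's work spreads evenly
> >     # over the remaining (nservers - 1) machines
> >     return 1.0 + 1.0 / (nservers - 1)
> >
> > print(peak_load_after_failure(11, paired=True))    # 2.0 -> ~50% usable
> > print(peak_load_after_failure(11, paired=False))   # 1.1 -> ~90% usable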
> >
> > So, what are your thoughts on the idea?
> >
> > Thank you very much!
> >
> > Best regards,
> > Daniel Colchete
> >
>
>
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
>