[Gluster-devel] DHT-based Replication

Tue Nov 22 06:25:07 UTC 2011

Hello,

  I think what you're describing is functionally equivalent
to the Google File System:
    http://labs.google.com/papers/gfs.html
    http://en.wikipedia.org/wiki/Google_File_System

  And, by comparable design, the Hadoop File System:
    http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html

  Google/Hadoop FS framed a set of principles that I 
(and a friend of mine - Michael Welton) were hoping to 
achieve by moving to GlusterFS - and I think it will still 
get us all the way there once we get IPSEC underneath 
it and it deals with WANs better.

  In that context, I'm a fan already :)   

  However, I would like to add that if you're going to go
to the trouble of providing a hashed-block abstraction
layer, please make it pluggable.  For example, before I was
taken ill in September 2009 I was looking at implementing a
de-dupe block for GlusterFS.  Another friend (Chris
Kelada) and I came up with a full de-dupe model one
evening which I thought was quite simple.  What stalled
me before I fell ill was an issue with descriptors that may
have been fixed during the BSD porting activity that I saw
on this list a week or two ago.

  De-dupe needs a hash for a chunk of data.  What we
proposed was that an arbitrarily sized chunk (say 1 MByte)
be hashed, then stored in the underlying layer via its hash -
e.g. if my block hashed to "thisismyhash" then it would be
stored in /.dedupe/thi/sis/myh/ash/data.bin, with provision
for an index to allow multiple blocks with the same hash to
co-exist at the same location (i.e. either at an n x chunk_size
offset into the data file, or as n data files).
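
  As a rough illustration of the hash-to-path mapping described
above, here is a minimal Python sketch.  The function names, the
3-character directory width, and the use of SHA-256 are all
assumptions made for the example; only the /.dedupe root, the
data.bin file name, and the 1 MByte chunk size come from the
description.  It does not cover the index for colliding blocks.

```python
import hashlib
from pathlib import PurePosixPath

CHUNK_SIZE = 1024 * 1024                 # 1 MByte, the example size above
DEDUPE_ROOT = PurePosixPath("/.dedupe")  # root directory from the example


def hash_to_path(digest: str, group: int = 3) -> PurePosixPath:
    """Split a hash string into fixed-width directory levels (hypothetical helper)."""
    parts = [digest[i:i + group] for i in range(0, len(digest), group)]
    return DEDUPE_ROOT.joinpath(*parts, "data.bin")


def chunk_locations(data: bytes):
    """Yield (offset, path) per chunk; SHA-256 is an assumed hash choice."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        yield offset, hash_to_path(hashlib.sha256(chunk).hexdigest())
```

  With this sketch, hash_to_path("thisismyhash") gives
/.dedupe/thi/sis/myh/ash/data.bin, and two identical chunks map
to the same path, which is where the space saving comes from.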

  This would have a high read/write performance penalty 
(lots of iops in the underlying layer per iop in the de-dupe
layer) but it reaps the obvious space benefit.

  Thus if a data hashing block were to be built, it could
be re-used quite broadly - but only if that re-use is considered
beforehand.  If so, I'd be all for it.

Cheers,

----- Original Message -----
>From: "Daniel van Ham Colchete" <daniel.colchete at gmail.com>
>To: "glusterfs" <gluster-devel at nongnu.org>
>Subject:  [Gluster-devel] DHT-based Replication
>Date: Mon, 21 Nov 2011 21:34:30 -0200
>
> Good day everyone!
> 
> It is a pleasure for me to write to this list again, after so many
> years. I tested GlusterFS a lot about four years ago because of a
> project I still have today. It was 2007 and I worked with Gluster 1.2
> and 1.3. I like to think that I gave one or two good ideas that were
> used in the project. Amar was a great friend back then (friend is the
> correct word as he was on my Orkut's friends list). Unfortunately
> Gluster's performance wasn't good enough, mostly because of the ASCII
> protocol and I had to go to the HA-NFS way.
> 
> By that time, a new clustering translator was created, it was called
> DHT. In those four years that passed I studied DHT and even
> implemented a DHT-based internal system here at my company to create
> a PostgreSQL cluster (I didn't do any reconciliation thing though,
> that is the hard part for me). It worked fine, it still does!
> 
> What is the beauty of DHT? It is easy to find where a file or a
> directory is in the cluster. It is just a hash operation! You just
> create a cluster of N servers and everything should be distributed
> evenly through them. Every player out there got the distribution part
> of storage clusters right: GlusterFS, Cassandra, MongoDB, etc.
> 
> What is not right in my opinion is how everybody does replication.
> Everything works in pairs (or triples, or replica sets of any
> number). The problem with pairs (and replica sets) is that in a 10
> server cluster, if one server fails only its pair will have to handle
> double the usual load, while the other ones will be working with the
> usual load. So we can use at most 50% of the capacity of any storage
> server. Only 50%. 50% efficiency means we spend double on hardware,
> rack space and energy (the worst part). Even RAID10 storage, the kind
> we use, has this problem too. If a disk dies, its mirror will get
> double the read IOs while the other ones are 50% idle (or should be).
> 
> So, I have a suggestion that fixes this problem. I call it DHT-based
> replication. It is different from DHT-based distribution. I already
> implemented it internally, and it already works, at least here. Given
> the amount of money and energy this idea saves, I think it is worth
> a million bucks (in your hands at least), even though it is really
> simple. I'm giving it to you guys for free. Just please give credit
> if you are going to use it.
> 
> It is very simple: hash(name) to locate the primary server (as
> usual), and use another hash, like hash(name + "#copy2") for the
> second copy and so on. You just have to make sure that it doesn't
> fall on the same server, but this is easy: retry with
> hash(name + "#copy2/try2").
> 
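
  The placement rule quoted above could be sketched in Python
roughly as follows.  This is only an illustration: the function
names, the SHA-1 hash, and the plain "modulo number of servers"
mapping are assumptions made for the example, not something
stated in the proposal or taken from GlusterFS.

```python
import hashlib
from typing import List


def server_for(key: str, n_servers: int) -> int:
    """Map a key to a server index (SHA-1 modulo N is an assumption)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_servers


def replica_servers(name: str, n_servers: int, copies: int = 2) -> List[int]:
    """Pick one distinct server per copy, re-hashing when a server repeats."""
    if copies > n_servers:
        raise ValueError("cannot place more copies than servers")
    chosen = [server_for(name, n_servers)]          # primary: hash(name)
    for copy in range(2, copies + 1):
        key = f"{name}#copy{copy}"                  # e.g. name + "#copy2"
        candidate = server_for(key, n_servers)
        attempt = 1
        while candidate in chosen:                  # fell on a server already used
            attempt += 1
            candidate = server_for(f"{key}/try{attempt}", n_servers)
        chosen.append(candidate)
    return chosen
```

  For example, replica_servers("some/file", 11) returns two
different server indexes, and the result is the same on every
client because it depends only on the name and the server count.
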
> So, say we have 11 storage servers. Yeah, now we can have a prime
> number of servers and everything will still work fine. Imagine that
> storage server #11 dies. The secondary copies of all files on #11 are
> spread across all the other 10 servers, so those servers are now only
> getting a load 10% bigger! Wow! Now I can use 90% of what my storage
> servers can handle and still be up when one fails! Had I done this
> the way things are today, my system would be down because there is no
> way a storage server can handle 180% of its IO capacity (by
> definition). So now I have 90% efficiency! For exactly the same cost
> of having 50k users I can have 90k users now! That is a 45% cost
> saving on storage, with just a simple algorithm change.
> 
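
  The arithmetic in the paragraph above can be checked with a
few lines of Python.  This is a back-of-the-envelope sketch that
assumes load is perfectly even and that a failed server's load
is shared equally by the servers holding its copies; the
function names are made up for the example.

```python
def safe_utilisation(n_servers: int, spread: bool) -> float:
    """Highest steady-state load fraction that still survives one failure."""
    # Number of survivors that absorb the failed server's load:
    # all the others when copies are spread, just the partner when paired.
    absorbers = (n_servers - 1) if spread else 1
    # Each absorber goes from u to u * (1 + 1/absorbers); that must stay <= 1.
    return 1.0 / (1.0 + 1.0 / absorbers)


def cost_saving(paired: float, spread: float) -> float:
    """Relative hardware saving per user when utilisation rises."""
    return 1.0 - paired / spread


paired = safe_utilisation(11, spread=False)   # 0.5
spread = safe_utilisation(11, spread=True)    # 10/11, about 0.909
saving = cost_saving(paired, spread)          # about 0.45
```

  Under those assumptions the pair scheme tops out at 50%, the
spread scheme at about 91% for 11 servers, and the saving per
user comes to about 45%, which matches the figures quoted.
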
> So, what are your thoughts on the idea?
> 
> Thank you very much!
> 
> Best regards,
> Daniel Colchete

--
Ian Latter
Late night coder ..
http://midnightcode.org/