[Gluster-users] any configuration guidelines?

Wei Dong wdong.pku at gmail.com
Tue Jul 28 17:19:09 UTC 2009


Hi All,

We've been using GlusterFS 2.0.1 on our lab cluster to host a large 
number of small images for distributed processing with Hadoop and it has 
been working fine without human intervention for a couple of months.  
Thanks for the wonderful project -- it's the only freely available 
cluster filesystem that fits our needs.

What keeps bothering me is the extreme flexibility of GlusterFS.  
There are simply so many ways to achieve the same goal that I don't know 
which is best.  So I'm writing to ask whether there are some general 
configuration guidelines for improving both data safety and performance.

Specifically, we have 66 machines (in two racks) with 4 x 1.5TB disks / 
machine.  We want to aggregate all the available disk space into a 
single shared directory with 3-way replication.  The following are some 
of the potential configurations.

*  Each node exports 4 directories, so the client sees 66 x 4 = 264 
directories.  We then first group those directories into threes with 
AFR, making 88 replicated directories, and then aggregate them with 
DHT.  When configuring AFR, we can either place the three replicas on 
different machines, or two on the same machine and the third on another 
machine.

*  Each node first aggregates three disks (forget about the 4th for 
simplicity) and exports a replicated directory.  The client side then 
aggregates the 66 replicated directories into one.

* When grouping the aggregated directories on the client side, we can 
use some kind of hierarchy.  For example the 66 directories are first 
aggregated into groups of N each with DHT, and then the 66/N groups are 
again aggregated with DHT.

*  We don't do the grouping on the client side.  Rather, we use some 
intermediate server to first aggregate small groups of directories with 
DHT and export them as a single directory.

* We can also put AFR after DHT
......
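
To make the first option concrete, here is a minimal sketch in Python of 
one way to group the 264 exported directories into 88 three-way replica 
sets with every replica on a different machine.  It is only an 
illustration of the grouping and the resulting capacity, not a GlusterFS 
volume file; the names (replica_sets, DISK_TB, and so on) are made up 
for this example.

```python
# Hypothetical illustration: brick = (machine index, disk index).
MACHINES = 66
DISKS_PER_MACHINE = 4
REPLICAS = 3
DISK_TB = 1.5


def replica_sets(machines=MACHINES, disks=DISKS_PER_MACHINE, replicas=REPLICAS):
    """Group bricks into replica sets whose members sit on distinct machines."""
    if machines % replicas != 0:
        raise ValueError("machine count must be a multiple of the replica count")
    sets = []
    for disk in range(disks):
        # Take machines three at a time and use the same disk slot on each.
        for first in range(0, machines, replicas):
            sets.append([(first + i, disk) for i in range(replicas)])
    return sets


sets = replica_sets()
raw_tb = MACHINES * DISKS_PER_MACHINE * DISK_TB   # total disk space
usable_tb = len(sets) * DISK_TB                   # one disk's worth per set

print(len(sets), "replica sets,", raw_tb, "TB raw,", usable_tb, "TB usable")
```

Each of the 88 sets would be one AFR group, and DHT would then 
distribute files across the 88 groups.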

To make things more complicated, the 66 machines are separated into two 
racks with only a 4-gigabit inter-rack connection, so the directories 
exported by the servers are not all equally close to a given client.
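
One possible way to take the racks into account is sketched below in 
Python.  It assumes the machines are split evenly, 33 per rack (the 
actual split is not stated above), and builds replica sets so that every 
set keeps at least one copy in each rack.  The function name and the 
machine numbering are hypothetical, and this says nothing about how 
GlusterFS itself would pick which replica a client reads from.

```python
def rack_aware_sets(rack0, rack1):
    """Build 3-way replica sets from two equal-sized racks.

    Half of the sets place two copies in rack 0 and one in rack 1, the
    other half do the opposite, so each machine is used exactly once.
    """
    n = len(rack0)
    if n != len(rack1) or n % 3 != 0:
        raise ValueError("racks must be equal in size and a multiple of 3")
    k = n // 3
    sets = []
    for i in range(k):
        # Two copies in rack 0, one copy in rack 1.
        sets.append([rack0[2 * i], rack0[2 * i + 1], rack1[i]])
        # One copy in rack 0, two copies in rack 1.
        sets.append([rack0[2 * k + i], rack1[k + 2 * i], rack1[k + 2 * i + 1]])
    return sets


# Assumed layout: machines 0..32 in the first rack, 33..65 in the second.
rack0 = list(range(0, 33))
rack1 = list(range(33, 66))
sets_per_disk = rack_aware_sets(rack0, rack1)

print(len(sets_per_disk), "replica sets per disk slot")
```

Repeating this for each of the 4 disk slots again gives 88 replica sets, 
and losing a whole rack would still leave one copy of every file.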

I'm wondering if someone on the mailing list could provide me with some 
advice.

Thanks a lot.

- Wei


