[Gluster-devel] Fwd: questions about fault tolerance
Mark Brady
mark at baybrady.org
Tue Mar 27 21:16:19 UTC 2007
First, this looks like a great project. Hats off to the developers.
Like Pooya, who posted a few days ago, I would like to cluster
storage from several (10-20) machines, each with a few disks, to
serve up mostly static data (a lot of it, but not striped).
I understand with the current 1.3 release I can do this, and with
"type cluster/afr" I can specify where multiple copies should go, for
some redundancy safety.
But if a cluster member is not reliably up, this safety is actually
lost, right? E.g. if a member goes down and stays down, things are OK
because the replica(s) can be used, but if we allow that member to
come back without anything recognizing that it was absent, things
could get bad. Is anyone using 1.3 and handling this condition? Or are
people just playing with 1.3 and waiting for 1.4?
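To make the failure mode above concrete, here is a minimal Python sketch of a two-way replicated file where one member misses a write while it is down and then rejoins unnoticed. Everything in it is hypothetical: the `Replica` class, the per-file version counter, and both read functions are illustration only and are not how GlusterFS afr is implemented. It only shows why a returning member needs to be detected, under the assumption that a version number is kept per file.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One cluster member holding a copy of each file (hypothetical model)."""
    name: str
    up: bool = True
    files: dict = field(default_factory=dict)  # path -> (version, data)


def write(replicas, path, data):
    """Write to every replica that is up, bumping the version number."""
    latest = max((r.files.get(path, (0, None))[0] for r in replicas), default=0)
    for r in replicas:
        if r.up:
            r.files[path] = (latest + 1, data)


def read_naive(replicas, path):
    """Read from the first replica that is up, trusting whatever it holds."""
    for r in replicas:
        if r.up and path in r.files:
            return r.files[path][1]
    return None


def read_checked(replicas, path):
    """Read the copy with the highest version among replicas that are up."""
    live = [r.files[path] for r in replicas if r.up and path in r.files]
    return max(live, key=lambda vd: vd[0])[1] if live else None


a, b = Replica("a"), Replica("b")
write([a, b], "/f", "old")
a.up = False                 # member a goes down
write([a, b], "/f", "new")   # only b sees this write
a.up = True                  # a returns without resynchronizing

print(read_naive([a, b], "/f"))    # old
print(read_checked([a, b], "/f"))  # new
```

The naive read returns the stale copy simply because the returning member happens to be asked first. Comparing versions is one possible way to notice the gap; resynchronizing the member before it serves reads would be another.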
There are a few projects approaching distributed fault tolerance, or
making it manageable, it seems (ceph, lustre, gfarm, ...). What are
people using until then?
On another note, an alternative (ill-conceived?) way to implement afr
might be to remove the burden of detailing the replication from the
user and put it on the scheduler instead. E.g. allow the
"option replicate *:2" spec to go into the "type cluster/unify" block;
this way files would be spread out in an arbitrary fashion (compared
to the suggested 1-2 2-3 ... setups) and the loss of 2 machines
wouldn't eliminate 1/N of the storage.
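The arithmetic behind that last claim can be sketched in Python. The model is an assumption, not something GlusterFS provides: the "1-2 2-3 ..." setup is read as a ring of N machine pairs, the scheduler-driven alternative is read as spreading files evenly over every possible pair, and the function names are made up for this example.

```python
from fractions import Fraction
from itertools import combinations


def chained_pairs(n):
    """Replica pairs in a ring: (0,1), (1,2), ..., (n-1,0)."""
    return [frozenset((i, (i + 1) % n)) for i in range(n)]


def all_pairs(n):
    """Every possible two-machine pair, as a scheduler might spread files."""
    return [frozenset(p) for p in combinations(range(n), 2)]


def worst_case_loss(pairs, n):
    """Largest fraction of files lost when any two machines fail together.

    Assumes files are spread evenly over the given replica pairs.
    """
    worst = 0
    for failed in combinations(range(n), 2):
        # A pair's files are lost only if both of its machines failed.
        lost = sum(1 for p in pairs if p <= set(failed))
        worst = max(worst, lost)
    return Fraction(worst, len(pairs))


n = 10
print(worst_case_loss(chained_pairs(n), n))  # 1/10
print(worst_case_loss(all_pairs(n), n))      # 1/45
```

In this model the worst two-machine failure costs 1/N of the files with chained pairs but only 1/C(N,2) with arbitrary placement. The flip side, still within the model, is that with arbitrary placement every two-machine failure loses something, whereas in the ring only failures of adjacent machines do.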
Thanks again for the work so far.
Mark