[Gluster-devel] Fwd: questions about fault tolerance
Anand Avati
avati at zresearch.com
Wed Mar 28 09:24:07 UTC 2007
Hello Mark,
> First, this looks like a great project. Hats off to the developers.
Thanks, your words are encouraging!
> But in fact if a cluster member is not reliably up, this safety is
> actually lost, right? E.g. if a member goes down and stays down then
> things are OK b/c the replica(s) can be used, but if we allow that
> member to come back w/o recognizing its absence, then things could get
> bad. Is anyone using 1.3 and handling this condition? Or are people
> just playing with 1.3 and waiting for 1.4?
The 'recovery' is going to come as part of the 'self-heal' feature in
1.4. For 1.3 this has to be done out-of-band, with some management
scripts that copy files around between the bricks.
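As a rough illustration (a minimal sketch, not an official tool; the
brick paths are hypothetical and it assumes both bricks are mounted
locally), such a script could just walk the surviving replica and copy
over whatever the recovered node is missing:

  import os
  import shutil

  # hypothetical brick mount points -- adjust to your own volumes
  GOOD = "/mnt/brick-alive"       # the replica that stayed up
  STALE = "/mnt/brick-recovered"  # the replica that was down

  for dirpath, dirnames, filenames in os.walk(GOOD):
      rel = os.path.relpath(dirpath, GOOD)
      os.makedirs(os.path.join(STALE, rel), exist_ok=True)
      for name in filenames:
          src = os.path.join(dirpath, name)
          dst = os.path.join(STALE, rel, name)
          # copy anything missing or older on the recovered brick
          if not os.path.exists(dst) or \
                  os.path.getmtime(src) > os.path.getmtime(dst):
              shutil.copy2(src, dst)  # preserves mtimes/permissions

A real script would also have to deal with deletions and any metadata
glusterfs keeps on the bricks, which is exactly the kind of bookkeeping
self-heal will automate in 1.4.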
> There are a few projects approaching distributed fault tolerance, or
> making it manageable, it seems (ceph, lustre, gfarm, ...), what are
> people using until then?
Lustre's redundancy comes from replicating entire OSTs; glusterfs's
selective, file-based replication demands a slightly more detailed
recovery process. I'm not sure how ceph or gfarm handle redundancy.
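To make 'selective' concrete: the replication policy is expressed per
file name pattern in the spec file, something roughly like this
1.3-style snippet (the volume names, bricks, and patterns here are made
up, and the exact option syntax may differ, so treat it as a sketch):

  volume afr0
    type cluster/afr
    subvolumes brick1 brick2 brick3
    # per-pattern replica counts (patterns and counts are made up)
    option replicate *.db:3,*:2
  end-volume

Recovery therefore has to reason per file rather than per disk, since
different files on the same brick can have different replica counts.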
> On another note, it seems an alternate/ill-conceived(?) way to
> implement afr could be to remove the burden of detailing the
> replication from the user and put it instead on the scheduler? E.g.
> allow the "option replicate *:2" spec to go into the "type
> cluster/unify" block;
The approach you mention is less flexible/modular. The job of unify is
to aggregate the namespace; the job of afr is to replicate files. Each
of these features needs to be usable with any other translator, with or
without the other. Mixing afr into unify would make consistency
checking of the namespace more tricky (though still possible, of
course). As a matter of design principle we chose to keep different
things separate, which really pays off in the long run.
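To show what that separation buys, here is a rough sketch of how the
two translators compose in a 1.3-style spec file (the volume names,
scheduler, and namespace brick are placeholders):

  # each afr volume replicates independently...
  volume afr1
    type cluster/afr
    subvolumes brick1 brick2
  end-volume

  volume afr2
    type cluster/afr
    subvolumes brick3 brick4
  end-volume

  # ...and unify only aggregates the namespace over whatever is below it
  volume unify0
    type cluster/unify
    option scheduler rr         # placeholder scheduler
    option namespace brick-ns   # placeholder namespace volume
    subvolumes afr1 afr2
  end-volume

Either layer can be swapped out or dropped without touching the other,
which is the modularity being defended above.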
> this way files would be spread out in an
> arbitrary fashion (compared to suggested 1-2 2-3 ... setups) and the
> loss of 2 machines wouldn't eliminate 1/N of the storage.
The loss of 2 machines does not guarantee a surviving copy of every
file in either of the scenarios.
In the current setup the argument is that '2 out of the given 3 nodes
going down' has very low probability, while in your scheme the argument
is that 'a file being replicated on exactly the two nodes which went
down' has low probability. Overall neither can be proven to have better
redundancy than the other.
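A quick back-of-the-envelope check makes this concrete (a sketch under
stated assumptions: 2 replicas per file, uniform placement, the chained
pairs wrapping around into a ring, and exactly two nodes failing at
random; N and F are made-up numbers):

  from math import comb

  N = 10         # nodes (hypothetical)
  F = 100000     # files, each stored on 2 nodes
  pairs = comb(N, 2)

  # spread scheme: each file sits on a uniformly random pair of nodes,
  # so it is lost only if exactly its pair fails
  lost_spread = F / pairs

  # chained scheme (1-2, 2-3, ..., wrapping around): files are split
  # evenly over N fixed pairs; a pair's files all die if that pair fails
  p_pair_fails = N / pairs        # N chained pairs out of all pairs
  lost_chained = p_pair_fails * (F / N)

  print(lost_spread, lost_chained)  # identical: F / pairs

The distributions differ (the chained setup loses either nothing or a
whole pair's worth of files, while the spread setup loses a thin slice
of files more often), but the expected amount of data lost is
identical.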
regards,
avati
--
Shaw's Principle:
Build a system that even a fool can use,
and only a fool will want to use it.