[Gluster-devel] Gluster Recovery

Anand Avati avati at zresearch.com
Sun Apr 29 22:20:22 UTC 2007

Replies inline

On Sun, Apr 29, 2007 at 02:10:04PM +0200, Danson Michael Joseph wrote:
> Hi all,
> To come back to the issue of "self-healing" in an AFR situation.  
> Consider the rather complex situation below where A=AFR, S=stripe and 
> U=unify:
>                            /---- Server1 ----\       Client3
>                           S----- Server2 -----S     /
>                  /---U---<                     >---U----\
>                 /         S----- Server3 -----S          \
>                /           \---- Server4 ----/            \
> Client1 --A--<                                            >--A-- Client2
>                \           /---- Server5 ----\            /
>                 \         S----- Server6 -----S          /
>                  \---U---<                     >---U----/
>                     /     S----- Server7 -----S
>              Client4       \---- Server8 ----/
> Client1 and 2: AFR of two separate unions of two separate stripes
> Client3 and 4: Union of two separate stripes
> I think this is quite a complex arrangement and could probably account 
> for 80% of large installation cases.  The obvious question here is what 
> would the method of healing be for a server failure.  

In the above scenario, you are using the storage cluster with
different 'views' from different clients. this is basically 'not
supported' by glusterfs: you are expected to use the same spec file on
all clients. the configuration described above would work ONLY if you are
extremely careful and know what you are doing (which is why the
possibility is retained).

> Some thoughts:
> 1) As mentioned later on in this thread, the flexibility of gluster is 
> great but it is somewhat ridiculous to imagine that this flexibility 
> frees one from using good cluster design.  For instance, the following 
> configuration is probably of little use, the clients must have a useful 
> configuration, possibly like the larger one above:
>                       /---Server1---\
> Client1---S---<                     >---U---Client2
>                       \---Server2---/

we the developers are trying to push the image of glusterfs as a
'programmable filesystem', where the user is given a bunch of
functionalities as translators, and some glue code to mount and
daemonize. having a programmable system also implies having
responsibility. along with the strength you get from building a server
and client independently, you also get the other side of the coin:
if they are built differently and inconsistently, you risk ending up
with a useless setup.

(a rough analogy would be complaining that the programming language C
lets you dereference pointers without a validity check.
while it gives you tremendous power, it also comes with the risk that
if you do a (*NULL) your code segfaults)

> 2) If a server is replaced, healing must take place from any or all 
> clients, otherwise the distributed nature of the system is lost.

if each client has a different view of the cluster (and you have
somehow managed to keep the overall system sane) then yes, each client
(at least one from each view-type) should run a consistency check of its
view. otherwise just one client is sufficient.
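
A minimal sketch of the "at least one client per view-type" rule, assuming
a view is identified simply by the text of the client's spec file. The
function name and the spec strings are hypothetical, for illustration only:

```python
from collections import defaultdict


def pick_checkers(client_specs):
    """Return one client per distinct view (spec text) to run the check.

    client_specs maps a client name to the text of its spec file.
    Clients sharing an identical spec share a view, so one of them is enough.
    """
    by_view = defaultdict(list)
    for client, spec in sorted(client_specs.items()):
        by_view[spec].append(client)
    # The first client (alphabetically) of each view is picked arbitrarily.
    return sorted(clients[0] for clients in by_view.values())


# Hypothetical spec contents standing in for real spec files.
specs = {
    "client1": "afr over two unify volumes",
    "client2": "afr over two unify volumes",
    "client3": "unify over stripes (servers 1-4)",
    "client4": "unify over stripes (servers 5-8)",
}
checkers = pick_checkers(specs)
```

If every client uses the same spec file, the result has a single entry,
which matches the "just one client is sufficient" case.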

> 3) No client should exist below a striping such as:
>               Client2
>                         \    /---- Server1...
>                           U---- Server2...
> ...Client1---S---<                      
>                           U---- Server3....
>                            \---- Server4....
> Correct me if I'm wrong, but trying to read striped data as the above 
> drawing shows for client2 would not be very useful to client2.

the above configuration *could* exist. if for some strange reason you
want client2 to see only certain stripes of a file (the rest of the file
is seen as 'holes') the above configuration works. of course, the
assumption is that client2 knows it is seeing just a few stripes of
the entire file and adapts its usage of the file to that fact.
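
A rough sketch of what such a client would see, assuming simple
round-robin striping with a fixed block size. The layout, block size and
function name are assumptions made for illustration, not a description of
how the stripe translator actually places data:

```python
def read_partial_stripes(data, block_size, n_children, visible):
    """Return the file as seen by a client that reaches only some children.

    Blocks stored on children outside `visible` read back as zero-filled
    holes, so the file keeps its full length.
    """
    out = bytearray()
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        child = (offset // block_size) % n_children  # assumed round-robin
        out += block if child in visible else bytes(len(block))
    return bytes(out)


full = b"AAAABBBBCCCCDDDD"
# The client is connected to children 0 and 1 only, out of 4.
view = read_partial_stripes(full, block_size=4, n_children=4, visible={0, 1})
```

Here `view` has the same length as `full`, but the blocks held by children
2 and 3 are all zero bytes.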

> 4) A suggestion here is to have each AFR client with a self-heal 
> filter/translator. ONLY AFR clients should have self-healing for 
> replication.  Other clients such as the union clients can have 
> self-healing filters but for different filesystem health checks.  When a 
> server fails and is replaced, all AFR clients get stuck in and attempt 
> to reconstruct the data.  Thus in this situation, Clients 1 and 2 will 
> heal the system.  Clients 3 and 4 cannot because they don't have a full 
> set of data from which to work. 

Of course, self-heal is not a 'single entity'. each translator
_contributes_ a chunk of sanity checking (from its level of view) to the
overall filesystem check. AFR only checks for proper replication.
unify checks that the directory structure is uniform and that each file
resides on only one child, etc.
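
To illustrate the idea of each translator contributing its own check, here
is a small sketch. It models every child as a plain dict of file name to
content, which is an assumption for the example only; the function names
are hypothetical and the unify check covers only the "one child per file"
rule, not directory structure:

```python
def afr_check(children):
    """AFR level: every child should hold an identical copy of each file."""
    problems = []
    names = set().union(*(c.keys() for c in children))
    for name in sorted(names):
        copies = [c.get(name) for c in children]
        if any(copy != copies[0] for copy in copies):
            problems.append(f"afr: {name} is not replicated consistently")
    return problems


def unify_check(children):
    """Unify level: each file should reside on exactly one child."""
    problems = []
    names = set().union(*(c.keys() for c in children))
    for name in sorted(names):
        holders = sum(name in c for c in children)
        if holders != 1:
            problems.append(f"unify: {name} found on {holders} children")
    return problems


def merged_view(children):
    """What a unify volume presents upward: the union of its children."""
    view = {}
    for child in children:
        view.update(child)
    return view


unify_a = [{"hello.c": "v2"}, {"notes.txt": "v1"}]
unify_b = [{"hello.c": "v1"}, {"notes.txt": "v1"}]
report = (
    unify_check(unify_a)
    + unify_check(unify_b)
    + afr_check([merged_view(unify_a), merged_view(unify_b)])
)
```

Each level reports only what it can judge from its own position: both
unify volumes are fine here, while the AFR level flags `hello.c`.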

> 5) Who is the dominant reconstruction client?  A simple possible 
> solution is to have a "pre-healing" lock for each file to be 
> reconstructed.  For instance, Client1 finds "hello.c" in bad shape 
> because of the failure.  Client1 places a lock file in the directory 
> identifying itself with a timestamp.   Client2 also notices that 
> "hello.c" is in bad shape and moves to fix, but notices a lock file with 
> a timestamp on it, and so will move on to another file/folder.  If 
> Client2 notices that the timestamp has not been updated in 20s or 
> something reasonable, that means that Client1 has crashed or failed in 
> some manner and is no longer healing "hello.c".  Therefore Client2 will 
> continue to heal "hello.c".  Obviously, during healing, nothing else 
> should access the file for fear of further corruption.  Comments on that 
> may run far, but so be it. 

your suggestion is valid. noted.
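
The lock scheme from point 5 could be sketched roughly as below. This is
an in-memory stand-in, not glusterfs code: the class and method names are
hypothetical, and a real version would need the lock file to be created
atomically on the shared storage. The 20 second timeout is the figure
from the proposal:

```python
import time

STALE_AFTER = 20.0  # seconds without a timestamp update before takeover


class HealLocks:
    """In-memory stand-in for per-file heal lock files."""

    def __init__(self, clock=time.monotonic):
        self._locks = {}  # path -> (owner, last timestamp)
        self._clock = clock

    def try_acquire(self, path, client):
        """Take the lock unless another client holds a fresh one."""
        now = self._clock()
        holder = self._locks.get(path)
        if holder is not None:
            owner, stamp = holder
            if owner != client and now - stamp < STALE_AFTER:
                return False  # someone else is healing; move on
        self._locks[path] = (client, now)
        return True

    def refresh(self, path, client):
        """Update the timestamp while healing; fails if the lock was lost."""
        holder = self._locks.get(path)
        if holder is None or holder[0] != client:
            return False
        self._locks[path] = (client, self._clock())
        return True

    def release(self, path, client):
        """Drop the lock once healing is finished."""
        holder = self._locks.get(path)
        if holder is not None and holder[0] == client:
            del self._locks[path]
```

A healing client would call `try_acquire`, keep calling `refresh` while it
works, and `release` at the end. A second client that gets `False` back
moves on to another file and can retry later, taking over only once the
timestamp is older than the timeout.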

> 6) What it all comes down to is: 1) do not make the system's distributed 
> nature worthless; let all clients get stuck in as if they were all 
> trying to make breakfast.  If someone is making the eggs, don't make 
> eggs, go make the toast.  If the eggs start burning because the cook 
> went to the toilet, take over and finish the eggs.  Soon enough, with 
> clever co-operation, the breakfast will be done.

it's breakfast time for me now :)



deep_thought (void)
{
  sleep (years2secs (7500000));
  return 42;
}
