[Gluster-devel] Gluster Recovery

Sun Apr 29 12:10:04 UTC 2007

Hi all,

I'd like to come back to the issue of "self-healing" in an AFR situation.
Consider the rather complex situation below, where A=AFR, S=stripe and
U=unify:

                             /--- Server1 ---\        Client3
                            S---- Server2 ----S      /
                 /----U---<                     >---U----\
                /           S---- Server3 ----S           \
               /             \--- Server4 ---/             \
Client1 ---A--<                                             >--A--- Client2
               \             /--- Server5 ---\             /
                \           S---- Server6 ----S           /
                 \----U---<                     >---U----/
                     /      S---- Server7 ----S
              Client4        \--- Server8 ---/

Client1 and 2: AFR of two separate unions of two separate stripes
Client3 and 4: Union of two separate stripes

I think this is quite a complex arrangement and could probably account
for 80% of large installation cases.  The obvious question here is what
the method of healing would be after a server failure.  Some thoughts:

1) As mentioned later on in this thread, the flexibility of gluster is
great, but it is somewhat ridiculous to imagine that this flexibility
frees one from using good cluster design.  For instance, the following
configuration is probably of little use.  The clients must have a useful
configuration, possibly like the larger one above:
                       /---Server1---\
Client1---S---<                     >---U---Client2
                       \---Server2---/

2) If a server is replaced, healing must take place from any or all 
clients, otherwise the distributed nature of the system is lost.

3) No client should exist below a striping such as:
               Client2
                         \    /---- Server1...
                           U---- Server2...
...Client1---S---<                      
                           U---- Server3....
                            \---- Server4....
Correct me if I'm wrong, but trying to read striped data as the above 
drawing shows for client2 would not be very useful to client2.

4) A suggestion here is to have each AFR client with a self-heal 
filter/translator. ONLY AFR clients should have self-healing for 
replication.  Other clients such as the union clients can have 
self-healing filters but for different filesystem health checks.  When a 
server fails and is replaced, all AFR clients get stuck in and attempt 
to reconstruct the data.  Thus in this situation, Clients 1 and 2 will 
heal the system.  Clients 3 and 4 cannot because they don't have a full 
set of data from which to work. 
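
To make point 4 concrete, here is a small Python sketch of the large
layout above.  It is only a model: the Node class, the helper names and
the rule "a client can heal only if it sits on an AFR that still has an
untouched replica" are my own assumptions for illustration, not anything
taken from the GlusterFS code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "afr", "unify", "stripe" or "server"
    name: str = ""
    children: list = field(default_factory=list)


def servers(node):
    """Return the set of server names reachable below this node."""
    if node.kind == "server":
        return {node.name}
    found = set()
    for child in node.children:
        found |= servers(child)
    return found


def can_heal(node, failed):
    """True if some AFR at or below `node` has the failed server in one
    replica and at least one other replica that is untouched."""
    if node.kind == "afr":
        hit = [c for c in node.children if failed in servers(c)]
        clean = [c for c in node.children if failed not in servers(c)]
        if hit and clean:
            return True
    return any(can_heal(c, failed) for c in node.children)


def stripe(*names):
    """Build a stripe over the named servers (hypothetical helper)."""
    return Node("stripe", children=[Node("server", n) for n in names])


# The large layout: two unions, each of two two-server stripes.
union_a = Node("unify", children=[stripe("Server1", "Server2"),
                                  stripe("Server3", "Server4")])
union_b = Node("unify", children=[stripe("Server5", "Server6"),
                                  stripe("Server7", "Server8")])

clients = {
    "Client1": Node("afr", children=[union_a, union_b]),
    "Client2": Node("afr", children=[union_a, union_b]),
    "Client3": union_a,
    "Client4": union_b,
}

# Which clients could rebuild the data if Server1 were replaced?
healers = sorted(name for name, root in clients.items()
                 if can_heal(root, "Server1"))
print(healers)
```

With Server1 as the failed machine, only the two AFR clients qualify,
which matches the argument above: Clients 3 and 4 see a single copy of
the data and have nothing to rebuild from.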

5) Who is the dominant reconstruction client?  A simple possible 
solution is to have a "pre-healing" lock for each file to be 
reconstructed.  For instance, Client1 finds "hello.c" in bad shape
because of the failure.  Client1 places a lock file in the directory,
identifying itself with a timestamp that it keeps updating while it
works.  Client2 also notices that "hello.c" is in bad shape and moves
to fix it, but sees a lock file with a timestamp on it, and so moves on
to another file/folder.  If Client2 notices that the timestamp has not
been updated in 20s or something reasonable, that means that Client1
has crashed or failed in some manner and is no longer healing
"hello.c".  Client2 will therefore take over healing "hello.c".
Obviously, during healing, nothing else
should access the file for fear of further corruption.  Comments on that 
may run far, but so be it. 
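
A rough Python sketch of the "pre-healing" lock in point 5 might look
like the following.  Everything here is an assumption made for
illustration: the ".heal-lock" suffix, the JSON lock contents, the
function names and the 20 second timeout are hypothetical, and the code
works on an ordinary local directory rather than through any GlusterFS
translator.

```python
import json
import os
import time

STALE_AFTER = 20.0  # seconds; the "20s or something reasonable" above


def lock_path(file_path):
    return file_path + ".heal-lock"   # hypothetical naming scheme


def refresh(file_path, client_id, now=None):
    """Rewrite the timestamp so other clients see the healer is alive."""
    now = time.time() if now is None else now
    tmp = lock_path(file_path) + "." + client_id + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"client": client_id, "stamp": now}, f)
    os.replace(tmp, lock_path(file_path))   # swap the new lock into place


def try_claim(file_path, client_id, now=None):
    """Return True if this client may heal file_path, False if another
    client holds a lock that is still fresh."""
    now = time.time() if now is None else now
    path = lock_path(file_path)
    try:
        # O_EXCL makes "create only if absent" a single atomic step
        # on a local filesystem.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            with open(path) as f:
                holder = json.load(f)
        except (OSError, ValueError):
            return False   # lock half-written or just removed; retry later
        fresh = now - holder["stamp"] < STALE_AFTER
        if holder["client"] != client_id and fresh:
            return False   # someone else is already making the eggs
        # Stale (or our own) lock: take it over.  This step is NOT
        # atomic, so two clients could both take over a stale lock.
        refresh(file_path, client_id, now)
        return True
    with os.fdopen(fd, "w") as f:
        json.dump({"client": client_id, "stamp": now}, f)
    return True


def release(file_path):
    """Remove the lock once healing is finished."""
    os.remove(lock_path(file_path))
```

A healer would call try_claim() before touching a file, refresh() every
few seconds while it works, and release() when done.  The takeover of a
stale lock is the weak spot: a real implementation would need the
servers to arbitrate it, which this sketch does not attempt.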

6) What it all comes down to is: do not make the system's distributed
nature worthless; let all clients get stuck in as if they were all
trying to make breakfast.  If someone is making the eggs, don't make 
eggs, go make the toast.  If the eggs start burning because the cook 
went to the toilet, take over and finish the eggs.  Soon enough, with 
clever co-operation, the breakfast will be done.

Comments?

Regards,
Danson Joseph

Anand Avati wrote:
>> The concern here is the following though:
>>
>> Two separate clients are identically configured to use AFR to two identical server configurations as follows:
>>              Server1
>>            /         \
>> Client1 ---           ---Client2
>>            \         /
>>              Server2
>>
>> Client1 puts "hello.c" onto both Server1 and Server2 via AFR.  Client2 then changes hello.c in some way.
>> Server1 goes down; data lost, no chance of recovery and is replaced by Server3, a brand new server with fresh disks.
>> In this case, how does the data get reconstructed from the client's side because you mentioned that the automatic recovery was going to be on the glusterfs side.  Client1 believes hello.c is something different to what Client2 believes.  Which client will responsibly reconstruct the data?  Will the journaling of the remaining servers be used to reconstruct the data on the new server?
>>     
>
> 'changes' are done in sync on both server1 and server2 always
> (write()s are sent to all child nodes). when server3 comes in place
> of server1, the self-heal should detect that hello.c is missing on
> server3 and sync it from server2.
>
>
>
> regards,
> avati
>
>
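
For reference, the behaviour Avati describes above (hello.c is missing
on the replacement server3 and gets synced from server2) can be sketched
in Python as below.  This is only an illustration under my own
assumptions: heal_missing() is a made-up name, the two arguments stand
for the export directories of the healthy and the fresh server, and real
AFR self-heal would go through the translators rather than copy
directories like this.  It also only handles files that are missing
outright, not files that differ.

```python
import os
import shutil


def heal_missing(healthy_root, fresh_root):
    """Copy every file present under healthy_root but absent under
    fresh_root.  Returns the sorted relative paths that were copied."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(healthy_root):
        rel_dir = os.path.relpath(dirpath, healthy_root)
        target_dir = os.path.normpath(os.path.join(fresh_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)   # recreate the directory tree
        for name in filenames:
            dst = os.path.join(target_dir, name)
            if not os.path.exists(dst):
                # copy2 keeps modification times along with the data
                shutil.copy2(os.path.join(dirpath, name), dst)
                copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(copied)
```

Running it a second time over the same two directories copies nothing,
since every file is then present on both sides.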