[Gluster-users] How to re-sync

Wed Mar 10 17:45:54 UTC 2010

 > When the server dies suddenly it should tell people it dies suddenly??
 > Hmm, this needs some more thought...

Not the server that went down. The first client that notices (by timing out on the server) should send out the notifications.
I thought my note made that clear, my bad.
 >> (in this case to all the clients notifying them not to use the downed server [I hope this removes the small hang delay for clients 2-N], as well as via e-mail to the sysadmin).
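
To make the idea concrete, here is a rough sketch in Python of the behaviour I have in mind. Nothing here is GlusterFS code: Client, report_timeout and the alert_admin callback are hypothetical names used only for illustration, and a real implementation would have to send these messages over the network rather than call methods directly.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Client:
    """Hypothetical client that remembers which servers it must avoid."""
    name: str
    faulty_servers: Set[str] = field(default_factory=set)

    def mark_faulty(self, server: str) -> None:
        self.faulty_servers.add(server)


def report_timeout(reporter: Client, server: str, peers: List[Client],
                   alert_admin: Callable[[str], None]) -> bool:
    """Called by a client that has just timed out on `server`.

    Returns True if this call sent the notifications, or False if the
    reporter already knew the server was faulty (someone else noticed first).
    """
    if server in reporter.faulty_servers:
        return False
    reporter.mark_faulty(server)
    # Tell every other client so that clients 2-N can skip the timeout.
    for peer in peers:
        if peer is not reporter:
            peer.mark_faulty(server)
    # Assumption: alert_admin delivers the message, e.g. by e-mail.
    alert_admin(f"{server} flagged as faulty by {reporter.name}")
    return True
```

Only the first reporter triggers the notifications; a client that has already been told about the fault returns early, so the sysadmin gets one message per outage.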

^C


Ed W wrote:
> Hi
> 
>> 1. I think when a server goes down it should be "flagged as faulty" 
>> and send out notifications (in this case to all the clients notifying 
>> them not to use the downed server [I hope this removes the small hang 
>> delay for clients 2-N], as well as via e-mail to the sysadmin).
> 
> When the server dies suddenly it should tell people it dies suddenly?? 
> Hmm, this needs some more thought...
> 
> However, I guess it could be useful to have a way to take servers out of 
> service for scheduled reasons?  Perhaps this is what you meant?
> 
>> 3. Then when the down server comes back and starts glusterfsd it 
>> remains "faulty" and no client can use it.
> 
> I agree with where you are going, but if the autoheal works as 
> advertised then there is no reason to stop any client using it - it will 
> simply self heal as soon as someone requests a file which is stale (this 
> is at least what it's claimed to do...)
> 
>> 5. A sysadmin changes the "faulty flag" to a "resync flag" ("resync 
>> flag" tells the clients to write to the machine, but not read from it 
>> while it recovers).
>> 6. A sysadmin then runs a re-sync (ls -alR).
>> 7. Once the re-sync completes a sysadmin runs a "re-add" command 
>> removing the "faulty flag" and the clients can begin using the server 
>> again.
> 
> I do agree that it would be very helpful to have an idea of whether 
> servers are properly in sync or not though.
> 
> Consider the scenario of upgrading a cluster, i.e. take down S1, upgrade 
> it, then bring it up again, take down S2, upgrade it, then bring it up 
> again.  If you don't fully sync S1 and S2 in the middle then you have a 
> split brain situation which must lead to data loss...
> 
> Perhaps the ls -alR is 100% sufficient to guarantee the entire 
> filesystem is synced and hence is completely sufficient, but split brain 
> IS the major fear with clustered systems and it would be nice to have 
> even stronger guarantees of consistency...
> 
> 
>> I feel that this method removes the chance that a server goes down, 
>> gets out of sync, recovers on its own (or through automated tools), and 
>> starts providing services with some old data.
>> In the middle of the night if the server goes down, and nagios trips a 
>> reboot, then the server comes up, no sysadmin is logged in to run the 
>> "ls -alR" to get the server to re-sync.
> 
> Yeah, I agree that this scenario is scary.  Actually you missed out an 
> implied step which is if the *other* server dies before the resync 
> happens then you have a risk of split brain.
> 
> Arguably it's not necessary to fence the recovering server during the 
> recovery, but you definitely want to fence it if it cannot completely 
> resync for some reason...
> 
> 
> Ed W
> 
>
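
For reference, the flag workflow from the numbered steps quoted above (faulty, then resync, then re-added) can be sketched as a small state machine. This is only an illustration in Python, not anything GlusterFS provides: the Flag names, the ALLOWED table and the helper functions are all hypothetical. The RESYNC to FAULTY transition is an assumption added to cover Ed's point about fencing a server that cannot completely resync.

```python
from enum import Enum


class Flag(Enum):
    ACTIVE = "active"   # normal service: clients read and write
    FAULTY = "faulty"   # fenced: no client may use the server
    RESYNC = "resync"   # recovering: clients write to it but do not read


# Allowed (current, new) transitions, following the numbered steps.
ALLOWED = {
    (Flag.ACTIVE, Flag.FAULTY),  # step 1: a client times out on the server
    (Flag.FAULTY, Flag.RESYNC),  # step 5: sysadmin starts the recovery
    (Flag.RESYNC, Flag.ACTIVE),  # step 7: sysadmin re-adds the server
    (Flag.RESYNC, Flag.FAULTY),  # assumption: re-sync failed, fence again
}


def transition(current: Flag, new: Flag) -> Flag:
    """Return the new flag, or raise ValueError for a forbidden change."""
    if (current, new) not in ALLOWED:
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new


def can_read(flag: Flag) -> bool:
    return flag is Flag.ACTIVE


def can_write(flag: Flag) -> bool:
    return flag in (Flag.ACTIVE, Flag.RESYNC)
```

The key property is that there is no direct FAULTY to ACTIVE transition, so a server that reboots on its own in the middle of the night stays fenced until someone has moved it through the resync state.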