[Gluster-users] How to re-sync
Chad
ccolumbu at hotmail.com
Wed Mar 10 17:45:54 UTC 2010
> When the server dies suddenly it should tell people it dies suddenly??
> Hmm, this needs some more thought...
Not the server that went down: the first client that notices (i.e. times out on the server) should send out the notifications.
I thought the note below made that clear; my bad.
>> (in this case to all the clients notifying them not to use the downed
>> server [I hope this removes the small hang delay for clients 2-N], as
>> well as via e-mail to the sysadmin).
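
To make the idea concrete, here is a rough sketch of what I mean by "the first client that notices sends the notifications". This is not GlusterFS code: the Client and Cluster classes, report_timeout, and the admin_inbox list (standing in for e-mail to the sysadmin) are all hypothetical names invented for illustration.

```python
# Hypothetical sketch: none of these names are GlusterFS APIs.
from dataclasses import dataclass, field


@dataclass
class Client:
    name: str
    faulty_servers: set = field(default_factory=set)

    def on_server_faulty(self, server: str) -> None:
        # Stop using the server right away instead of waiting for our own timeout.
        self.faulty_servers.add(server)


@dataclass
class Cluster:
    clients: list
    admin_inbox: list = field(default_factory=list)  # stand-in for e-mail to the sysadmin

    def report_timeout(self, reporter: Client, server: str) -> None:
        # Called by a client whose request to `server` timed out.
        if server in reporter.faulty_servers:
            return  # already flagged; avoid duplicate notifications
        for client in self.clients:
            client.on_server_faulty(server)
        self.admin_inbox.append(f"{server} flagged faulty by {reporter.name}")


clients = [Client("c1"), Client("c2"), Client("c3")]
cluster = Cluster(clients)
cluster.report_timeout(clients[0], "S1")  # c1 notices first and tells everyone
cluster.report_timeout(clients[1], "S1")  # no-op: c2 already knows S1 is faulty
```

Only c1 pays the timeout; c2 and c3 learn about S1 from the notification, which is the "removes the small hang delay for clients 2-N" part.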
^C
Ed W wrote:
> Hi
>
>> 1. I think when a server goes down it should be "flagged as faulty"
>> and send out notifications (in this case to all the clients notifying
>> them not to use the downed server [I hope this removes the small hang
>> delay for clients 2-N], as well as via e-mail to the sysadmin).
>
> When the server dies suddenly it should tell people it dies suddenly??
> Hmm, this needs some more thought...
>
> However, I guess it could be useful to have a way to take servers out of
> service for scheduled reasons? Perhaps this is what you meant?
>
>> 3. Then when the down server comes back and starts glusterfsd it
>> remains "faulty" and no client can use it.
>
> I agree with where you are going, but if the autoheal works as
> advertised then there is no reason to stop any client using it - it will
> simply self heal as soon as someone requests a file which is stale (this
> is at least what it's claimed to do...)
>
>> 5. A sysadmin changes the "faulty flag" to a "resync flag" ("resync
>> flag" tells the clients to write to the machine, but not read from it
>> while it recovers).
>> 6. A sysadmin then runs a re-sync (ls -alR).
>> 7. Once the re-sync completes a sysadmin runs a "re-add" command
>> removing the "faulty flag" and the clients can begin using the server
>> again.
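
The flag lifecycle in the steps above can be sketched as a small state machine. This is only an illustration of the proposal, not anything GlusterFS provides: ServerState, TRANSITIONS, transition, can_read and can_write are hypothetical names, and allowing a server to fall back from "resync" to "faulty" is my own assumption.

```python
# Hypothetical sketch of the proposed "faulty" / "resync" flags.
from enum import Enum


class ServerState(Enum):
    HEALTHY = "healthy"  # clients read and write
    FAULTY = "faulty"    # clients neither read nor write
    RESYNC = "resync"    # clients write, but do not read


# Allowed transitions. Moves into FAULTY happen when a client notices a
# timeout; the other moves are explicit sysadmin actions.
TRANSITIONS = {
    ServerState.HEALTHY: {ServerState.FAULTY},
    ServerState.FAULTY: {ServerState.RESYNC},
    # Assumption: a server that dies again mid-resync goes back to FAULTY.
    ServerState.RESYNC: {ServerState.HEALTHY, ServerState.FAULTY},
}


def transition(current: ServerState, target: ServerState) -> ServerState:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def can_read(state: ServerState) -> bool:
    return state is ServerState.HEALTHY


def can_write(state: ServerState) -> bool:
    return state in (ServerState.HEALTHY, ServerState.RESYNC)


state = ServerState.HEALTHY
state = transition(state, ServerState.FAULTY)   # step 1: server goes down
# step 3: glusterfsd restarting does not change the state by itself
state = transition(state, ServerState.RESYNC)   # step 5: sysadmin sets the resync flag
state = transition(state, ServerState.HEALTHY)  # step 7: "re-add" after the re-sync
```

The point of the table is that there is no direct faulty -> healthy edge, so a server that reboots on its own in the middle of the night cannot start serving stale data.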
>
> I do agree that it would be very helpful to have an idea of whether
> servers are properly in sync or not though.
>
> Consider the scenario of upgrading a cluster, i.e. take down S1, upgrade
> it, then bring it up again, take down S2, upgrade it, then bring it up
> again. If you don't fully sync S1 and S2 in the middle then you have a
> split brain situation which must lead to data loss...
>
> Perhaps the ls -alR is 100% sufficient to guarantee the entire
> filesystem is synced and hence is completely sufficient, but split brain
> IS the major fear with clustered systems and it would be nice to have
> even stronger guarantees of consistency...
>
>
>> I feel that this method removes the chance that a server goes down,
>> gets out of sync, recovers on its own (or through automated tools), and
>> starts providing services with some old data.
>> In the middle of the night if the server goes down, and nagios trips a
>> reboot, then the server comes up, no sysadmin is logged in to run the
>> "ls -alR" to get the server to re-sync.
>
> Yeah, I agree that this scenario is scary. Actually you missed out an
> implied step which is if the *other* server dies before the resync
> happens then you have a risk of split brain.
>
> Arguably it's not necessary to fence the recovering server during the
> recovery, but you definitely want to fence it if it cannot completely
> resync for some reason...
>
>
> Ed W
>
>