[Gluster-users] Recovering out of sync nodes from input/output error

Robert Hajime Lanning lanning at lanning.cc
Thu Apr 12 15:09:03 UTC 2012

On 04/12/12 03:15, Alex Florescu wrote:
> What do you mean by "source of truth"?
> howareyou did not exist on .15, it was created only on .14,
> and ideally it should have been replicated to .15 when connectivity
> recovered.

Actually, because there is no transaction log (no history kept), at the 
time of recovery you have these scenarios (this is per file):

1) "howareyou" was created on .14 and never existed on .15
    (.14 wins and the file is copied to .15)
2) "howareyou" existed on both .14 and .15, then .15 deleted it
    (.15 wins and the file is deleted on .14)

Which state do you choose? (with no knowledge of how you got there)
You can choose only one.
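To make that concrete, here is a small Python sketch (the node names and event format are illustrative, not anything Gluster actually stores): two different histories end in the exact same final state, so a recovery process that sees only the end state cannot tell them apart.

```python
# Illustrative sketch: why the end state alone can't resolve a split-brain.
# Two different histories produce identical observable states, so a
# reconciler that only sees the final state cannot pick a winner.

def replay(events):
    """Apply create/delete events in order; return the surviving files."""
    files = set()
    for op, name in events:
        if op == "create":
            files.add(name)
        elif op == "delete":
            files.discard(name)
    return files

# History 1: "howareyou" created on .14 while .15 was unreachable.
history_create = {".14": [("create", "howareyou")],
                  ".15": []}

# History 2: the file existed on both, then .15 deleted it
# (and the delete never replicated to .14).
history_delete = {".14": [("create", "howareyou")],
                  ".15": [("create", "howareyou"), ("delete", "howareyou")]}

state1 = {node: replay(ev) for node, ev in history_create.items()}
state2 = {node: replay(ev) for node, ev in history_delete.items()}

# Both histories leave the same end state: file on .14, absent on .15.
assert state1 == state2 == {".14": {"howareyou"}, ".15": set()}
```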

> Actually, both states are correct.

They can't be (when looked at per file.)  In the end, only one state can
exist.

> Imagine /a is the document root of a website deployed on 2 servers. By
 > using DNS round robin, one request was balanced to .14 and created
 > the file howareyou, and the second request was balanced to .15 and
 > created the file hello. If gluster wasn't having connectivity issues,
 > everything would have been fine and the files would have been replicated
 > among the 2 servers. But gluster was having connectivity issues at that
 > moment, while the other services (apache) were not, and when gluster
 > connectivity recovered the split-brain occurred.

The split brain is not when the recovery happens, it is when the 
connectivity is broken.  Split-brain is when two halves of a cluster 
both think the other is down and operate independently.  You have gone 
from one synchronized state to two independent states.

The two independent states must be merged back into one synchronized 
state on recovery.

Without a transaction log on both sides that can be replayed onto the 
peer, the machine can't decide who wins *each* discrepancy.  Only you 
with your human knowledge and reasoning can make that decision. (And 
sometimes even you won't know.)  Or you tell the machine that state "A" 
always wins.
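For contrast, here is a hedged sketch of what a replayable transaction log would buy you (illustrative only; GlusterFS keeps no such log, and the timestamps are assumptions): with timestamped per-node logs, the merge becomes mechanical instead of a human judgment call.

```python
# Illustrative sketch: with replayable per-node logs, each discrepancy
# resolves mechanically by replaying both logs in time order.

def merge_logs(log_a, log_b):
    """Merge two (timestamp, op, name) logs; return the surviving files."""
    files = set()
    for ts, op, name in sorted(log_a + log_b):
        if op == "create":
            files.add(name)
        else:  # "delete"
            files.discard(name)
    return files

# During the partition, each side logged its own write:
log_14 = [(1, "create", "howareyou")]
log_15 = [(1, "create", "hello")]

# Replaying both logs yields one merged state for both replicas:
# both files survive, with no human intervention needed.
merged = merge_logs(log_14, log_15)
assert merged == {"howareyou", "hello"}
```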

This is why things like STONITH (Shoot The Other Node In The Head) 
exist for HA clustering.  When a failure happens, the secondary machine 
literally kills (powers off) the primary, to make sure that a 
split-brain does not happen.

For the new (3.3) version of GlusterFS, there will be a quorum 
implementation.  This will require 3 or more replicas.  With that, when 
a split-brain happens, the side with quorum (defined as 51%+ of the 
replicas still reachable) stays operational, while the side without 
quorum stops accepting writes.

>     Tar replica1 and untar on replica2.  Then delete everything on replica1.
>     Then self-heal should take care of the rest.
> This does not have any effect. As I cannot work anymore with the
> mountpoint /a, I am left with modifying only the local directory, named
> by me /local. Error persists.

As for why the error persists, I don't know.

Mr. Flibble
King of the Potato People
