[Gluster-users] HA replica

Thu Feb 18 04:03:19 UTC 2016

On 02/12/2016 12:08 PM, Mike Stump wrote:
> Ok. I’m a new user, I want to make an array with 10 machines. I want 
> to be able to suffer the loss of any one machine. I don’t mind 
> wasting 50% of the disk space to do this. I don’t want to suffer split 
> brain. I want the array to support both read and write access to data. 
> How do I achieve that? 

What is your acceptable annual downtime (typically outlined in an SLA or 
OLA)? That's a bit of information you should have when you're 
engineering a system.

Split-brain happens when your replication has been partitioned and 
writes have occurred in such a way that no valid copy can be discerned. 
For the sake of example, we're going to use a very simple file entitled 
"file.txt" with the contents of "The quick brown fox jumped over the 
lazy yellow dog." It exists on a replicated volume with no protection on 
a network where a server and client are in the west wing, and the 
replica server and another client are in the east wing. Somewhere in the 
middle, someone pulls the plug on the router. The west client can see 
the west server and the east client can see the east server.

The west client updates file.txt changing the word "brown" to "red". The 
east client updates the same file.txt and changes the word "brown" to 
"white".

The router recovers and the two servers try to synchronize any files 
that were changed. They both had changes to file.txt. Which one was right?

There's no way to determine that from the information given. That's 
split-brain.
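
To make that concrete, here is a toy sketch in plain shell. The three 
strings stand in for copies of file.txt, and classify_copies is a 
hypothetical helper I made up for illustration. It compares each side 
against the pre-split original, which is not how Gluster itself tracks 
pending changes:

```shell
#!/bin/sh
# Illustration only: these strings stand in for copies of file.txt.
original="The quick brown fox jumped over the lazy yellow dog."
west=$(printf '%s\n' "$original" | sed 's/brown/red/')
east=$(printf '%s\n' "$original" | sed 's/brown/white/')

# classify_copies ORIGINAL WEST EAST (hypothetical helper)
# Prints which side holds the only change, or "split-brain" when both
# sides changed the file in different ways.
classify_copies() {
    if [ "$2" = "$3" ]; then
        echo "in-sync"        # both sides hold identical content
    elif [ "$2" = "$1" ]; then
        echo "east-wins"      # only the east copy changed
    elif [ "$3" = "$1" ]; then
        echo "west-wins"      # only the west copy changed
    else
        echo "split-brain"    # both changed, no way to pick a winner
    fi
}

classify_copies "$original" "$west" "$east"   # prints "split-brain"
```

If only one side had written during the outage, the answer would be 
obvious. It's the two conflicting writes that leave nothing to go on.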

How can you combat split brain?

One solution is quorum. Have enough replicas that comparisons can be 
made. If two servers are in the west and only one in the east and they 
have the ability to determine quorum, the east server will not allow 
writes during the network split. It can tell that it's not safe because 
if all three voted on which change was right, the two in the west 
would win and data would be lost. The two in the west see that one 
server is lost, but they still have quorum. They allow the data to 
remain available, knowing that the out-of-quorum server is safe from 
changes.
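
The voting rule can be sketched as a strict majority test. This is a 
minimal sketch under that assumption, not Gluster's actual quorum code, 
and has_quorum is a name I invented for the example:

```shell
#!/bin/sh
# has_quorum REACHABLE TOTAL (hypothetical helper, strict majority rule)
# REACHABLE = replicas this server can see, counting itself
# TOTAL     = replicas in the whole set
has_quorum() {
    if [ $(($1 * 2)) -gt "$2" ]; then
        echo "allow-writes"
    else
        echo "refuse-writes"
    fi
}

# Three servers: two in the west, one in the east.
echo "west: $(has_quorum 2 3)"   # prints "west: allow-writes"
echo "east: $(has_quorum 1 3)"   # prints "east: refuse-writes"
```

Note that under this rule an even split of two servers, has_quorum 1 2, 
refuses writes on both sides. That's why a third vote is needed.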

Gluster has the ability to have a minimally participating quorum 
participant called an arbiter. Let's make the west client an arbiter. 
The net split happens. Only two data replicas exist, one in the west 
and the other in the east. The arbiter can see the west server but not 
the east. The east server can see neither the west server nor the 
arbiter. The east loses quorum but the west, seeing the arbiter, does 
still have quorum and remains available with the safe understanding 
that the east server, not having quorum, will not accept writes.
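
Here's the same scenario as a small model. Each call describes one side 
of the split as a list of the members that can still see each other. The 
member names, the side_status helper and the strict-majority rule are 
all assumptions for illustration, not Gluster internals:

```shell
#!/bin/sh
# Three voters in total: west server, east server, and the arbiter.
# The arbiter holds a vote but no file data.
TOTAL_VOTERS=3

# side_status "MEMBER MEMBER ..." (hypothetical helper)
# Reports whether one side of a network split still has a majority.
side_status() {
    side=$1
    set -- $side              # intentionally unquoted: split into words
    count=$#
    if [ $((count * 2)) -gt "$TOTAL_VOTERS" ]; then
        echo "$side: has quorum, accepts writes"
    else
        echo "$side: no quorum, refuses writes"
    fi
}

side_status "west-server arbiter"
side_status "east-server"
```

The west side holds two of three votes and stays writable, while the 
east side holds one and stops accepting writes.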

So with your 10 servers you could have a "replica 3 arbiter 1" volume, 
where one of every three replicas is an arbiter. The arbiter only uses 
space for file names and metadata, not actual data. If I were doing it, 
I would probably do it like this:

     gluster volume create myvol replica 3 arbiter 1 \
       server1:/brick1 server2:/brick1 server3:/arbiter \
       server3:/brick1 server4:/brick1 server5:/arbiter etc.

Notice how there's both a data directory (/brick1) and an arbiter 
directory (/arbiter) on servers 3, 5, 7, ... which allows the data 
"waste" that you're asking for while /mostly/ allowing the availability 
you seek. I say mostly because if your network partitions, something's 
got to give or you will lose data. There's absolutely no way for 
disconnected systems to coordinate binary changes to each other with 
today's technology.
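
If you'd rather not type the brick list by hand, something like this 
could generate it. It's only a sketch of the pattern in the command 
above: brick_list is a hypothetical helper, and wrapping the last set's 
arbiter around to server1 is my assumption about how to finish the 
pattern for an even number of servers. It prints the command instead of 
running it so you can check it first:

```shell
#!/bin/sh
# brick_list N (hypothetical helper): print brick arguments for N servers
# (N even) as repeating sets of data, data, arbiter, with each set
# starting two servers after the previous one.
# ASSUMPTION: the arbiter of the last set wraps around to server1.
brick_list() {
    n=$1
    i=1
    out=""
    while [ "$i" -lt "$n" ]; do
        arb=$(( (i + 1) % n + 1 ))   # server after the data pair, wrapping
        out="$out server$i:/brick1 server$((i + 1)):/brick1 server$arb:/arbiter"
        i=$((i + 2))
    done
    echo "${out# }"                  # drop the leading space
}

# Print, but do not execute, the resulting command for 10 servers.
echo "gluster volume create myvol replica 3 arbiter 1 $(brick_list 10)"
```

With 10 servers this yields five sets of three bricks. Every server 
carries one data brick and every file is stored twice, so usable space 
is about half of the raw total.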

Perhaps, one day, we will have quantum tunneling networks with 
superimposed particles able to teleport data without the need of 
networks, but that's not today. When that /is/ available, I expect 
rainbows and unicorns to be available as well.