[Gluster-devel] Split-brain present and future in afr

Tue May 20 15:56:53 UTC 2014

hi,

Thanks to Vijay Bellur for helping with the re-write of the draft I sent him :-).

Present:
Split-brains of files happen in afr today for two primary reasons:

1. Split-brains due to network partition or network split-brains

2. Split-brains due to servers in a replicated group being offline at different points in time without self-heal happening in the common period of time when the servers were online. For further discussion, this is referred to as split-brain over time.

To prevent the occurrence of split-brains, we have the following quorum implementations in place:

a> Client quorum - Driven by afr (client). Writes are allowed when a majority of bricks in a replica group are online. Majority is by default N/2 + 1, where N is the replication factor for files in a volume.

b> Server quorum - Driven by glusterd (server). Writes are allowed when a majority of peers are online. Majority is by default N/2 + 1, where N is the number of peers in a trusted storage pool.
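
As a rough illustration of the default majority rule above, here is a minimal Python sketch. The function names are made up for this example and are not afr or glusterd code; it also assumes N/2 means integer division.

```python
def majority(n):
    """Default majority: N/2 + 1, assuming integer division."""
    return n // 2 + 1


def quorum_met(online, n):
    """True when at least a majority of the n members are online."""
    return online >= majority(n)


# Replica 3 needs 2 bricks online; replica 2 needs both.
for n in (2, 3):
    print("N =", n, "majority =", majority(n))
```

With N = 2 the majority is 2, which is why plain client quorum on replica 2 blocks writes as soon as one brick goes offline.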

Both a> and b> primarily safeguard against network split-brains. They offer only limited protection against split-brain over time.
Let us consider how replica 3 and replica 2 can be protected against split-brains.

Replica 3:
Client quorum is quite effective in this case, as writes are only allowed when at least 2 of the 3 bricks that form a replica group are seen by afr/client. A recent fix for a corner case race in client quorum (http://review.gluster.org/7600) makes it very robust. This patch is now part of master and release-3.5. We plan to backport it to release-3.4 too.

Replica 2:
Majority for client quorum in a deployment with 2 bricks per replica group is 2. Hence availability becomes a problem with replica 2 when either of the bricks is offline. To provide better availability for replica 2, the first brick in a replica set is given higher weight, and quorum is met as long as the first brick is online. If the first brick is offline, then quorum is lost.

Let us consider the following cases with B1 and B2 forming a replicated set:
                        B1              B2              Quorum
                        Online          Online          Met
                        Online          Offline         Met
                        Offline         Online          Not Met
                        Offline         Offline         Not Met
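
The weighting rule described above for replica 2 can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this mail; the function name is hypothetical and this is not how afr implements it.

```python
def replica2_quorum_met(b1_online, b2_online):
    """Replica 2 with the first brick weighted higher.

    Quorum follows the first brick (B1): it is met whenever B1 is
    online, and lost whenever B1 is offline, regardless of B2.
    """
    return b1_online


# Enumerate all four combinations of B1 and B2.
for b1 in (True, False):
    for b2 in (True, False):
        state = "Met" if replica2_quorum_met(b1, b2) else "Not Met"
        print("B1 online:", b1, "B2 online:", b2, "Quorum:", state)
```

Note that B2 does not influence the result at all, which is the availability gap the rest of this mail is about: a healthy B2 cannot take writes while B1 is down.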

Though client quorum provides better availability in replica 2 scenarios, it is not optimal, and hence an improvement in behavior seems desirable.

Future:

Our focus in afr going forward would be to solve three problems, to provide better protection against split-brains and better ways of resolving them:

1. Better protection for split-brain over time.
2. Policy based split-brain resolution.
3. Provide better availability with client quorum and replica 2.

For 1, implementation of outcasting logic will address the problem:
   - An outcast is a copy of a file on which writes have been performed only when quorum is met.
   - When a brick goes down and comes back up, the self-heal daemon will mark the affected files on the brick that just came back up as outcasts. The outcast marking can be done even before the brick is declared available to regular clients. Once a copy of a file is marked as needing self-heal (or as an outcast), writes from clients will not land on that copy till self-heal is completed and the outcast tag is removed.
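
To make the intended write path concrete, here is a small Python sketch of the outcast idea. Everything in it (the Copy class, client_write, self_heal) is hypothetical and only models the behavior described above: writes skip copies tagged as outcasts until self-heal clears the tag.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    """One replica of a file on a brick (hypothetical model)."""
    brick: str
    data: bytes = b""
    outcast: bool = False


def client_write(copies, payload):
    """Append payload to every copy that is not an outcast.

    Returns the names of the bricks the write landed on.
    """
    landed = []
    for copy in copies:
        if copy.outcast:
            continue  # writes must not land on an outcast copy
        copy.data += payload
        landed.append(copy.brick)
    return landed


def self_heal(source, sink):
    """Bring the sink in sync with the source, then drop the outcast tag."""
    sink.data = source.data
    sink.outcast = False


b1, b2 = Copy("B1"), Copy("B2")
b2.outcast = True                      # B2 just came back up and was marked
print(client_write([b1, b2], b"x"))    # lands only on B1
self_heal(b1, b2)
print(client_write([b1, b2], b"y"))    # lands on both bricks again
```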

For 2, we plan to provide commands that can heal based on user configurable policies. Examples of policies would be:
   - Pick the largest file as the winner when resolving a self-heal.
   - Choose brick foo as the winner when resolving split-brains.
   - Pick the file with the latest version as the winner (when versioning for files is available).
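
The example policies above could be modeled as interchangeable functions that each pick a winner from the available copies. The Python sketch below is purely illustrative: the Copy fields, the policy names and the version attribute are assumptions made for this example, not a proposed command or xattr layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Copy:
    """One copy of a split-brained file (hypothetical model)."""
    brick: str
    size: int
    version: Optional[int] = None  # only set if file versioning exists


def largest_file(copies):
    """Policy: the largest copy wins."""
    return max(copies, key=lambda c: c.size)


def prefer_brick(name):
    """Policy factory: the copy on the named brick wins."""
    def policy(copies):
        for copy in copies:
            if copy.brick == name:
                return copy
        raise LookupError("no copy on brick " + name)
    return policy


def latest_version(copies):
    """Policy: the highest version wins; needs versioning on every copy."""
    if any(c.version is None for c in copies):
        raise ValueError("versioning is not available for all copies")
    return max(copies, key=lambda c: c.version)


copies = [Copy("foo", size=100, version=3), Copy("bar", size=250, version=2)]
for policy in (largest_file, prefer_brick("foo"), latest_version):
    print(policy(copies).brick)
```

Keeping each policy as a plain function from copies to a winner would let a heal command accept the policy as a parameter.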

For 3, we are planning to introduce arbiter bricks that can be used to determine quorum. The arbiter bricks will be dummy bricks that host only files that will be updated from multiple clients. This will be achieved by bringing about a variable replication count for a configurable class of files within a volume.
In the case of a replicated volume with one arbiter brick per replica group, certain files that are prone to split-brain will be in 3 bricks (2 data bricks + 1 arbiter brick). All other files will be present only in the regular data bricks. For example, when oVirt VM disks are hosted on a replica 2 volume, sanlock is used by oVirt for arbitration. sanlock lease files will be written by all clients, and VM disks are written by only a single client at any given point of time. In this scenario, we can place sanlock lease files on 2 data + 1 arbiter bricks. The VM disk files will only be present on the 2 data bricks. Client quorum is now determined by looking at 3 bricks instead of 2, and we have better protection when network split-brains happen.
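
A small Python sketch of the placement and quorum check described above, for the sanlock lease file case. The brick names and functions are hypothetical, and the sketch assumes the default N/2 + 1 majority is applied to the three bricks holding a contended file.

```python
DATA_BRICKS = ["data-1", "data-2"]   # hypothetical brick names
ARBITER_BRICK = "arbiter-1"          # hypothetical arbiter brick


def bricks_for(contended):
    """Contended files (e.g. lease files) also go to the arbiter brick."""
    if contended:
        return DATA_BRICKS + [ARBITER_BRICK]
    return list(DATA_BRICKS)


def quorum_met(replicas, reachable):
    """Majority (N/2 + 1) of the bricks holding the file must be reachable."""
    seen = sum(1 for brick in replicas if brick in reachable)
    return seen >= len(replicas) // 2 + 1


lease_bricks = bricks_for(contended=True)

# Network split: one client sees data-1 and the arbiter, another only data-2.
print(quorum_met(lease_bricks, {"data-1", "arbiter-1"}))
print(quorum_met(lease_bricks, {"data-2"}))
```

Under these assumptions only one side of a network split can hold a majority of the three bricks, so two clients cannot both write the same lease file.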

A combination of 1 and 3 does seem like a solid foundation for preventing split-brains in use cases where there are multiple contending writers to the same file, even in a replica 2 scenario.

Look forward to your thoughts and comments on the future proposal.

Pranith