[Gluster-devel] Split-brain present and future in afr

Pranith Kumar Karampuri pkarampu at redhat.com
Fri May 23 09:17:00 UTC 2014



----- Original Message -----
> From: "Jeff Darcy" <jdarcy at redhat.com>
> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Tuesday, May 20, 2014 10:08:12 PM
> Subject: Re: [Gluster-devel] Split-brain present and future in afr
> 
> > 1. Better protection for split-brain over time.
> > 2. Policy based split-brain resolution.
> > 3. Provide better availability with client quorum and replica 2.
> 
> I would add the following:
> 
> (4) Quorum enforcement - any kind - on by default.

For replica 3 we can do that. For replica 2, the quorum implementation at the moment is not good enough, so until we fix it properly we should probably leave it off by default. We can revisit that decision once we come up with a better solution for replica 2.
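
To make the problem concrete, here is a rough sketch (illustrative Python, not the actual afr code) of how "auto" client quorum behaves today, assuming the usual rule that an even replica count needs either a strict majority of bricks or exactly half of them including the first one:

def have_client_quorum(up_bricks, replica_count):
    # up_bricks: set of 0-based brick indices currently reachable
    if len(up_bricks) * 2 > replica_count:            # strict majority
        return True
    if replica_count % 2 == 0 and len(up_bricks) * 2 == replica_count:
        return 0 in up_bricks                         # tie broken by the first brick
    return False

# replica 2: losing brick 1 keeps quorum, but losing brick 0 does not,
# so the first brick becomes a single point of failure for writes.
assert have_client_quorum({0}, 2)
assert not have_client_quorum({1}, 2)
# replica 3: any 2 of the 3 bricks are enough.
assert have_client_quorum({1, 2}, 3)

That asymmetry is why turning it on by default for replica 2 doesn't buy us much yet.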

> 
> (5) Fix the problem of volumes losing quorum because unrelated nodes
>     went down (i.e. implement volume-level quorum).
> 
> (6) Better tools for users to resolve split brain themselves.

Agreed. Already in plan for 3.6.
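
Just to spell out the problem in (5) with a sketch (illustrative Python, made-up names): server quorum today counts every peer in the trusted pool, not just the peers that host the volume's bricks.

def cluster_level_quorum(up_peers, all_peers):
    # today: every peer in the trusted pool counts
    return len(up_peers) * 2 > len(all_peers)

def volume_level_quorum(up_peers, volume_peers):
    # proposed: only peers that actually host bricks of this volume count
    return len(up_peers & volume_peers) * 2 > len(volume_peers)

pool = {"n1", "n2", "n3", "n4", "n5"}
vol_a_peers = {"n1", "n2"}            # vol-a's bricks live only on n1 and n2
up = {"n1", "n2"}                     # n3, n4, n5 (unrelated to vol-a) are down

assert not cluster_level_quorum(up, pool)     # vol-a gets blocked today
assert volume_level_quorum(up, vol_a_peers)   # even though its own nodes are fine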

> 
> > For 3, we are planning to introduce arbiter bricks that can be used to
> > determine quorum. The arbiter bricks will be dummy bricks that host only
> > files that will be updated from multiple clients. This will be achieved by
> > bringing about variable replication count for configurable class of files
> > within a volume.
> >  In the case of a replicated volume with one arbiter brick per replica
> >  group,
> >  certain files that are prone to split-brain will be in 3 bricks (2 data
> >  bricks + 1 arbiter brick).  All other files will be present in the regular
> >  data bricks. For example, when oVirt VM disks are hosted on a replica 2
> >  volume, sanlock is used by oVirt for arbitration. sanlock lease files
> >  will
> >  be written by all clients and VM disks are written by only a single client
> >  at any given point of time. In this scenario, we can place sanlock lease
> >  files on 2 data + 1 arbiter bricks. The VM disk files will only be present
> >  on the 2 data bricks. Client quorum is now determined by looking at 3
> >  bricks instead of 2 and we have better protection when network
> >  split-brains
> >  happen.
> 
> Constantly filtering requests to use either N or N+1 bricks is going to be
> complicated and hard to debug.  Every data-structure allocation or loop
> based on replica count will have to be examined, and many will have to be
> modified.  That's a *lot* of places.  This also overlaps significantly
> with functionality that can be achieved with data classification (i.e.
> supporting multiple replica levels within the same volume).  What use case
> requires that it be implemented within AFR instead of more generally and
> flexibly?

1) That still wouldn't bring in an arbiter for replica 2.
2) It would also need more bricks, more processes, and more ports.
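
To make the arbiter idea above concrete, here is a rough sketch (illustrative Python with made-up names, not afr internals) of how client quorum would be decided for the "2 data + 1 arbiter" layout:

DATA_BRICKS = {"data-0", "data-1"}
ARBITER_BRICKS = {"arbiter-0"}

def have_quorum_with_arbiter(up_bricks):
    # the arbiter holds no VM-disk data but still counts as a vote,
    # so quorum is a plain majority of 3, i.e. any 2 bricks
    total = len(DATA_BRICKS) + len(ARBITER_BRICKS)
    return len(up_bricks) * 2 > total

def can_do_io(up_bricks):
    # actual file data lives only on the data bricks
    return have_quorum_with_arbiter(up_bricks) and bool(up_bricks & DATA_BRICKS)

# either data brick may go down without stopping I/O:
assert can_do_io({"data-0", "arbiter-0"})
assert can_do_io({"data-1", "arbiter-0"})
# a client partitioned off with only one brick loses quorum,
# which is the protection against split-brain we are after:
assert not can_do_io({"data-0"})

With plain replica 2 the same partition forces a choice between availability and consistency; the third vote from the arbiter removes that trade-off without storing the VM disks three times.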

> 
> 

Pranith

