[Gluster-users] split-brain on glusterfs running with quorum on server and client

Pranith Kumar Karampuri pkarampu at redhat.com
Sat Sep 6 14:01:07 UTC 2014


On 09/06/2014 04:53 AM, Jeff Darcy wrote:
>> I have a replicated glusterfs setup on 3 bricks (replica = 3). I have
>> client and server quorum turned on. I rebooted one of the 3 bricks. When it
>> came back up, the client started throwing errors saying that one of the
>> files had gone into split-brain.
> This is a good example of how split brain can happen even with all kinds of
> quorum enabled.  Let's look at those xattrs.  BTW, thank you for a very
> nicely detailed bug report which includes those.
>
>> BRICK1
>> ========
>> [root@ip-172-31-38-189 ~]# getfattr -d -m . -e hex
>> /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>> getfattr: Removing leading '/' from absolute path names
>> # file:
>> data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>> trusted.afr.PL2-client-0=0x000000000000000000000000
>> trusted.afr.PL2-client-1=0x000000010000000000000000
>> trusted.afr.PL2-client-2=0x000000010000000000000000
>> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>>
>> BRICK 2
>> =======
>> [root@ip-172-31-16-220 ~]# getfattr -d -m . -e hex
>> /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>> getfattr: Removing leading '/' from absolute path names
>> # file:
>> data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>> trusted.afr.PL2-client-0=0x00000d460000000000000000
>> trusted.afr.PL2-client-1=0x000000000000000000000000
>> trusted.afr.PL2-client-2=0x000000000000000000000000
>> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>> BRICK 3
>> =========
>> [root@ip-172-31-12-218 ~]# getfattr -d -m . -e hex
>> /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>> getfattr: Removing leading '/' from absolute path names
>> # file:
>> data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>> trusted.afr.PL2-client-0=0x00000d460000000000000000
>> trusted.afr.PL2-client-1=0x000000000000000000000000
>> trusted.afr.PL2-client-2=0x000000000000000000000000
>> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
> Here, we see that brick 1 shows a single pending operation for the other
> two, while they show 0xd46 (3398) pending operations for brick 1.
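
For reference, each trusted.afr value above packs three big-endian
32-bit counters: data, metadata and entry pending operations, so only
the first 8 hex digits matter here. A quick way to decode one in bash,
e.g. brick2's count for PL2-client-0 (the variable name is just for
illustration):

val=00000d460000000000000000
printf 'data=%d metadata=%d entry=%d\n' \
    "0x${val:0:8}" "0x${val:8:8}" "0x${val:16:8}"
# data=3398 metadata=0 entry=0
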
> Here's how this can happen.
>
> (1) There is exactly one pending operation.
>
> (2) Brick1 completes the write first, and says so.
>
> (3) Client sends messages to all three, saying to decrement brick1's
> count.
>
> (4) All three bricks receive and process that message.
>
> (5) Brick1 fails.
>
> (6) Brick2 and brick3 complete the write, and say so.
>
> (7) Client tells all bricks to decrement remaining counts.
>
> (8) Brick2 and brick3 receive and process that message.
>
> (9) Brick1 is dead, so its counts for brick2/3 stay at one.
>
> (10) Brick2 and brick3 have quorum, with all-zero pending counters.
>
> (11) Client sends 0xd46 more writes to brick2 and brick3.
>
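
To make the bookkeeping concrete, here is a rough bash sketch of the
eleven steps above. It is illustrative only (not GlusterFS code, all
names are made up), but it ends with exactly the counters shown in the
getfattr output:

#!/bin/bash
# needs bash 4+ for associative arrays
# p[X,Y] is the count brick X keeps for peer Y, i.e. what getfattr
# shows on brick X as trusted.afr.PL2-client-<Y-1>.
declare -A p
for b in 1 2 3; do for peer in 1 2 3; do p[$b,$peer]=0; done; done

# bump <brick> <peer> <delta>: adjust brick <brick>'s count for <peer>
bump() { p[$1,$2]=$(( ${p[$1,$2]} + $3 )); }

# (1) one write is issued: every brick marks every replica as pending
for b in 1 2 3; do for peer in 1 2 3; do bump $b $peer 1; done; done
# (2)-(4) brick1 completes first; all bricks decrement brick1's count
for b in 1 2 3; do bump $b 1 -1; done
# (5)-(9) brick1 dies; only bricks 2 and 3 decrement the remaining
# counts, so brick1's counts for its peers stay at one
for b in 2 3; do bump $b 2 -1; bump $b 3 -1; done
# (10)-(11) 0xd46 more writes succeed on bricks 2 and 3 only, so their
# pending count for the dead brick1 keeps growing
for b in 2 3; do bump $b 1 $(( 0xd46 )); done

for b in 1 2 3; do
    echo "brick$b: client-0=${p[$b,1]}" \
         "client-1=${p[$b,2]} client-2=${p[$b,3]}"
done
# brick1: client-0=0 client-1=1 client-2=1
# brick2: client-0=3398 client-1=0 client-2=0
# brick3: client-0=3398 client-1=0 client-2=0
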
> Note that at no point did we lose quorum. Note also the tight timing
> required.  If brick1 had failed an instant earlier, it would not have
> decremented its own counter.  If it had failed an instant later, it
> would have decremented brick2's and brick3's as well.  If brick1 had not
> finished first, we'd be in yet another scenario.  If delayed changelog
> had been operative, the messages at (3) and (7) would have been combined
> to leave us in yet another scenario.  As far as I can tell, we would
> have been able to resolve the conflict in all those cases.
> *** Key point: quorum enforcement does not totally eliminate split
> brain.  It only makes the frequency a few orders of magnitude lower. ***

Not quite right. After we fixed bug 
https://bugzilla.redhat.com/show_bug.cgi?id=1066996, there are only 
two ways left to introduce split-brain:
1) an implementation bug in the changelog xattr marking, which I 
believe is what happened here;
2) keep writing to the file from the mount, then:
a) take brick1 down and wait until at least one write succeeds;
b) bring brick1 back up and take brick2 down (before self-heal gets a 
chance to run), then wait until at least one write succeeds;
c) bring brick2 back up and take brick3 down (again before self-heal 
runs), then wait until at least one write succeeds (sketched below).
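
For illustration, assuming the volume is named PL2 and mounted at
/mnt/PL2 on a client (both guesses from the xattrs above), case 2 can
be driven roughly like this; how you kill and restart brick processes
will vary per setup:

# in one terminal, keep writing from the client mount
while true; do date >> /mnt/PL2/testfile; sleep 1; done

# on the servers, repeat for brick1, then brick2, then brick3:
gluster volume status PL2         # note the brick's PID
kill <PID-of-that-brick>          # take the brick down
# ...wait until at least one write succeeds on the other bricks...
gluster volume start PL2 force    # restart the killed brick process
# and take the next brick down right away, before self-heal runs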

Once the outcast feature is implemented, case 2 will also be immune to 
split-brain.

At that point the only remaining source of split-brain in AFR is 
implementation errors in the changelog marking. If we test that 
thoroughly and fix such problems, we can make AFR immune to 
split-brain :-).

Pranith
> So, is there any way to prevent this completely?  Some AFR enhancements,
> such as the oft-promised "outcast" feature[1], might have helped.
> NSR[2] is immune to this particular problem.  "Policy based split brain
> resolution"[3] might have resolved it automatically instead of merely
> flagging it.  Unfortunately, those are all in the future.  For now, I'd
> say the best approach is to resolve the conflict manually and try to
> move on.  Unless there's more going on than meets the eye, recurrence
> should be very unlikely.
>
> [1] http://www.gluster.org/community/documentation/index.php/Features/outcast
>
> [2] http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
>
> [3] http://www.gluster.org/community/documentation/index.php/Features/pbspbr
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
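
Regarding resolving the conflict manually (as suggested above): since
bricks 2 and 3 agree that brick1 missed ~3398 writes, the usual
approach is to discard brick1's copy, i.e. delete both the file and
its gfid hard link on that brick, and let self-heal rebuild it from
the good copies. A rough sketch, assuming /data/vol2/gluster-data is
the brick root on brick1 and the volume is named PL2:

# confirm which files are in split-brain
gluster volume heal PL2 info split-brain

# on brick1 only, which holds the stale copy:
cd /data/vol2/gluster-data
rm apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
rm .glusterfs/ea/95/ea950263-977e-46bf-89a0-ef631ca139c2
# then trigger self-heal (or simply stat the file from a client mount)
gluster volume heal PL2

Double-check the xattrs and file contents before deleting anything,
and keep a backup of the copy you discard if in doubt.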


