[Gluster-users] split-brain on glusterfs running with quorum on server and client

Ramesh Natarajan ramesh25@gmail.com
Fri Sep 19 16:28:17 UTC 2014


I was able to run another set of tests this week and reproduced the issue
again. Going by the extended attributes, I think I ran into the same issue
I saw earlier.

Do you think I need to open a bug report?

Brick 1:

trusted.afr.PL2-client-0=0x000000000000000000000000
trusted.afr.PL2-client-1=0x000000010000000000000000
trusted.afr.PL2-client-2=0x000000010000000000000000
trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c

Brick 2:

trusted.afr.PL2-client-0=0x0000125c0000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c

Brick 3:

trusted.afr.PL2-client-0=0x0000125c0000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c
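
If I am reading the AFR changelog format correctly, each trusted.afr value
above is three 4-byte counters: data, metadata and entry operations, in
that order. So brick 1 blames bricks 2 and 3 for one pending data operation
each, while bricks 2 and 3 both blame brick 1 for 0x125c (4700) pending
data operations. Each side accuses the other, hence the split-brain. The
values were collected the same way as in my earlier report, i.e.

getfattr -d -m . -e hex /data/vol2/gluster-data/<path-to-affected-file>

where <path-to-affected-file> is a placeholder for the file in question.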


[root@ip-172-31-12-218 ~]# gluster volume info

Volume Name: PL1
Type: Replicate
Volume ID: bd351bae-d467-4e8c-bbd2-6a0fe99c346a
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.31.38.189:/data/vol1/gluster-data
Brick2: 172.31.16.220:/data/vol1/gluster-data
Brick3: 172.31.12.218:/data/vol1/gluster-data
Options Reconfigured:
cluster.server-quorum-type: server
network.ping-timeout: 12
nfs.addr-namelookup: off
performance.cache-size: 2147483648
cluster.quorum-type: auto
performance.read-ahead: off
performance.client-io-threads: on
performance.io-thread-count: 64
cluster.eager-lock: on
cluster.server-quorum-ratio: 51%

Volume Name: PL2
Type: Replicate
Volume ID: e6ad8787-05d8-474b-bc78-748f8c13700f
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.31.38.189:/data/vol2/gluster-data
Brick2: 172.31.16.220:/data/vol2/gluster-data
Brick3: 172.31.12.218:/data/vol2/gluster-data
Options Reconfigured:
nfs.addr-namelookup: off
cluster.server-quorum-type: server
network.ping-timeout: 12
performance.cache-size: 2147483648
cluster.quorum-type: auto
performance.read-ahead: off
performance.client-io-threads: on
performance.io-thread-count: 64
cluster.eager-lock: on
cluster.server-quorum-ratio: 51%
[root@ip-172-31-12-218 ~]#
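
For reference, the non-default options above were applied with the usual
volume-set commands, along the lines of:

gluster volume set PL2 cluster.quorum-type auto
gluster volume set PL2 cluster.server-quorum-type server
gluster volume set all cluster.server-quorum-ratio 51%

(cluster.server-quorum-ratio is a cluster-wide option, hence "all" instead
of a volume name.)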

*Mount command*

Client

mount -t glusterfs \
  -o defaults,enable-ino32,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log,backupvolfile-server=172.31.38.189,backupvolfile-server=172.31.12.218,background-qlen=256 \
  172.31.16.220:/PL2 /mnt/vm
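
The equivalent /etc/fstab entry on the client would look something like the
following (same options; _netdev added so the mount waits for networking,
and I have not tested it in exactly this form):

172.31.16.220:/PL2  /mnt/vm  glusterfs  defaults,enable-ino32,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log,backupvolfile-server=172.31.38.189,backupvolfile-server=172.31.12.218,background-qlen=256,_netdev  0 0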

Server

/dev/xvdf   /data/vol1   xfs   defaults,inode64,noatime   1 2
/dev/xvdg   /data/vol2   xfs   defaults,inode64,noatime   1 2

*Packages*

Client

rpm -qa | grep gluster
glusterfs-fuse-3.5.2-1.el6.x86_64
glusterfs-3.5.2-1.el6.x86_64
glusterfs-libs-3.5.2-1.el6.x86_64

Server

[root@ip-172-31-12-218 ~]# rpm -qa | grep gluster
glusterfs-3.5.2-1.el6.x86_64
glusterfs-fuse-3.5.2-1.el6.x86_64
glusterfs-api-3.5.2-1.el6.x86_64
glusterfs-server-3.5.2-1.el6.x86_64
glusterfs-libs-3.5.2-1.el6.x86_64
glusterfs-cli-3.5.2-1.el6.x86_64
[root@ip-172-31-12-218 ~]#
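
For anyone trying to reproduce this: besides the client-side log messages,
the affected file also shows up on the servers with the heal command, e.g.

gluster volume heal PL2 info split-brain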


On Sat, Sep 6, 2014 at 9:01 AM, Pranith Kumar Karampuri <pkarampu@redhat.com> wrote:

>
> On 09/06/2014 04:53 AM, Jeff Darcy wrote:
>
>>> I have a replicated glusterfs setup on 3 bricks (replica = 3). I have
>>> client and server quorum turned on. I rebooted one of the 3 bricks. When
>>> it came back up, the client started throwing error messages that one of
>>> the files went into split brain.
>>>
>> This is a good example of how split brain can happen even with all kinds
>> of quorum enabled.  Let's look at those xattrs.  BTW, thank you for a
>> very nicely detailed bug report which includes those.
>>
>>> BRICK 1
>>> =======
>>> [root@ip-172-31-38-189 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>>> trusted.afr.PL2-client-0=0x000000000000000000000000
>>> trusted.afr.PL2-client-1=0x000000010000000000000000
>>> trusted.afr.PL2-client-2=0x000000010000000000000000
>>> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>>>
>>> BRICK 2
>>> =======
>>> [root@ip-172-31-16-220 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>>> trusted.afr.PL2-client-0=0x00000d460000000000000000
>>> trusted.afr.PL2-client-1=0x000000000000000000000000
>>> trusted.afr.PL2-client-2=0x000000000000000000000000
>>> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>>>
>>> BRICK 3
>>> =======
>>> [root@ip-172-31-12-218 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
>>> trusted.afr.PL2-client-0=0x00000d460000000000000000
>>> trusted.afr.PL2-client-1=0x000000000000000000000000
>>> trusted.afr.PL2-client-2=0x000000000000000000000000
>>> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>>>
>> Here, we see that brick 1 shows a single pending operation for the other
>> two, while they show 0xd46 (3398) pending operations for brick 1.
>> Here's how this can happen.
>>
>> (1) There is exactly one pending operation.
>>
>> (2) Brick1 completes the write first, and says so.
>>
>> (3) Client sends messages to all three, saying to decrement brick1's
>> count.
>>
>> (4) All three bricks receive and process that message.
>>
>> (5) Brick1 fails.
>>
>> (6) Brick2 and brick3 complete the write, and say so.
>>
>> (7) Client tells all bricks to decrement remaining counts.
>>
>> (8) Brick2 and brick3 receive and process that message.
>>
>> (9) Brick1 is dead, so its counts for brick2/3 stay at one.
>>
>> (10) Brick2 and brick3 have quorum, with all-zero pending counters.
>>
>> (11) Client sends 0xd46 more writes to brick2 and brick3.
>>
>> Note that at no point did we lose quorum. Note also the tight timing
>> required.  If brick1 had failed an instant earlier, it would not have
>> decremented its own counter.  If it had failed an instant later, it
>> would have decremented brick2's and brick3's as well.  If brick1 had not
>> finished first, we'd be in yet another scenario.  If delayed changelog
>> had been operative, the messages at (3) and (7) would have been combined
>> to leave us in yet another scenario.  As far as I can tell, we would
>> have been able to resolve the conflict in all those cases.
>> *** Key point: quorum enforcement does not totally eliminate split
>> brain.  It only makes the frequency a few orders of magnitude lower. ***
>>
>
> Not quite right. After we fixed the bug
> https://bugzilla.redhat.com/show_bug.cgi?id=1066996, the only two possible
> ways to introduce split-brain are:
> 1) an implementation bug in changelog xattr marking, which I believe to be
> the case here.
> 2) keep writing to the file from the mount, then:
> a) take brick 1 down; wait until at least one write succeeds
> b) bring brick 1 back up and take brick 2 down (self-heal should not
> happen); wait until at least one write succeeds
> c) bring brick 2 back up and take brick 3 down (self-heal should not
> happen); wait until at least one write succeeds
>
> With the outcast implementation, case 2 will also be immune to split-brain
> errors.
>
> Then the only way to get split-brain in AFR is through implementation
> errors in changelog marking. If we test it thoroughly and fix such
> problems, we can make it immune to split-brain :-).
>
> Pranith
>
>> So, is there any way to prevent this completely?  Some AFR enhancements,
>> such as the oft-promised "outcast" feature[1], might have helped.
>> NSR[2] is immune to this particular problem.  "Policy based split brain
>> resolution"[3] might have resolved it automatically instead of merely
>> flagging it.  Unfortunately, those are all in the future.  For now, I'd
>> say the best approach is to resolve the conflict manually and try to
>> move on.  Unless there's more going on than meets the eye, recurrence
>> should be very unlikely.
>>
>> [1] http://www.gluster.org/community/documentation/index.php/Features/outcast
>>
>> [2] http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
>>
>> [3] http://www.gluster.org/community/documentation/index.php/Features/pbspbr