[Gluster-users] split-brain on glusterfs running with quorum on server and client
Pranith Kumar Karampuri
pkarampu at redhat.com
Sat Sep 20 03:19:26 UTC 2014
On 09/19/2014 09:58 PM, Ramesh Natarajan wrote:
> I was able to run another set of tests this week and I reproduced the
> issue again. Going by the extended attributes, I think I ran into the
> same issue I saw earlier.
>
> Do you think I need to open up a bug report?
hi Ramesh,
I already fixed this bug: http://review.gluster.org/8757. The fix
should be in the next 3.5.x release, I believe.
Pranith
>
> Brick 1:
>
> trusted.afr.PL2-client-0=0x000000000000000000000000
> trusted.afr.PL2-client-1=0x000000010000000000000000
> trusted.afr.PL2-client-2=0x000000010000000000000000
> trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c
>
> Brick 2
>
> trusted.afr.PL2-client-0=0x0000125c0000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c
>
> Brick 3
>
> trusted.afr.PL2-client-0=0x0000125c0000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c
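>
> (Aside for the archive: a minimal bash sketch, not from the original
> report, for decoding a trusted.afr.* value, assuming the usual AFR
> layout of three big-endian 32-bit counters -- data, metadata and entry
> operations pending against the named brick:)
>
> val=0x0000125c0000000000000000   # what bricks 2 and 3 hold for client-0
> hex=${val#0x}
> printf 'data=%d metadata=%d entry=%d\n' \
>     $((16#${hex:0:8})) $((16#${hex:8:8})) $((16#${hex:16:8}))
> # prints: data=4700 metadata=0 entry=0
> # i.e. bricks 2 and 3 accuse brick 1 of missing 4700 data operations,
> # while brick 1 accuses each of them of one -- a split-brain pattern.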
>
>
> [root@ip-172-31-12-218 ~]# gluster volume info
> Volume Name: PL1
> Type: Replicate
> Volume ID: bd351bae-d467-4e8c-bbd2-6a0fe99c346a
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 172.31.38.189:/data/vol1/gluster-data
> Brick2: 172.31.16.220:/data/vol1/gluster-data
> Brick3: 172.31.12.218:/data/vol1/gluster-data
> Options Reconfigured:
> cluster.server-quorum-type: server
> network.ping-timeout: 12
> nfs.addr-namelookup: off
> performance.cache-size: 2147483648
> cluster.quorum-type: auto
> performance.read-ahead: off
> performance.client-io-threads: on
> performance.io-thread-count: 64
> cluster.eager-lock: on
> cluster.server-quorum-ratio: 51%
> Volume Name: PL2
> Type: Replicate
> Volume ID: e6ad8787-05d8-474b-bc78-748f8c13700f
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 172.31.38.189:/data/vol2/gluster-data
> Brick2: 172.31.16.220:/data/vol2/gluster-data
> Brick3: 172.31.12.218:/data/vol2/gluster-data
> Options Reconfigured:
> nfs.addr-namelookup: off
> cluster.server-quorum-type: server
> network.ping-timeout: 12
> performance.cache-size: 2147483648
> cluster.quorum-type: auto
> performance.read-ahead: off
> performance.client-io-threads: on
> performance.io-thread-count: 64
> cluster.eager-lock: on
> cluster.server-quorum-ratio: 51%
> [root@ip-172-31-12-218 ~]#
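>
> (Aside for the archive: the quorum options shown above would typically
> be set like this -- a sketch using the volume name from this report;
> cluster.server-quorum-ratio is cluster-wide, hence "all":)
>
> gluster volume set PL2 cluster.quorum-type auto            # client-side quorum
> gluster volume set PL2 cluster.server-quorum-type server   # server-side quorum
> gluster volume set all cluster.server-quorum-ratio 51%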
>
> *Mount command*
>
> Client
>
> mount -t glusterfs -o
> defaults,enable-ino32,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log,backupvolfile-server=172.31.38.189,backupvolfile-server=172.31.12.218,background-qlen=256
> 172.31.16.220:/PL2 /mnt/vm
>
> Server
>
> /dev/xvdf /data/vol1 xfs defaults,inode64,noatime 1 2
> /dev/xvdg /data/vol2 xfs defaults,inode64,noatime 1 2
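>
> (Aside: a sketch of the equivalent client-side /etc/fstab entry,
> assuming the same server, options and mount point as the mount command
> above; _netdev is my addition so the mount waits for the network:)
>
> 172.31.16.220:/PL2  /mnt/vm  glusterfs  _netdev,enable-ino32,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log,backupvolfile-server=172.31.38.189,background-qlen=256  0 0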
>
> *Packages*
>
> Client
>
> rpm -qa | grep gluster
> glusterfs-fuse-3.5.2-1.el6.x86_64
> glusterfs-3.5.2-1.el6.x86_64
> glusterfs-libs-3.5.2-1.el6.x86_64
>
> Server
>
> [root@ip-172-31-12-218 ~]# rpm -qa | grep gluster
> glusterfs-3.5.2-1.el6.x86_64
> glusterfs-fuse-3.5.2-1.el6.x86_64
> glusterfs-api-3.5.2-1.el6.x86_64
> glusterfs-server-3.5.2-1.el6.x86_64
> glusterfs-libs-3.5.2-1.el6.x86_64
> glusterfs-cli-3.5.2-1.el6.x86_64
> [root@ip-172-31-12-218 ~]#
>
>
> On Sat, Sep 6, 2014 at 9:01 AM, Pranith Kumar Karampuri
> <pkarampu at redhat.com> wrote:
>
>
> On 09/06/2014 04:53 AM, Jeff Darcy wrote:
>
> I have a replicate glusterfs setup on 3 bricks (replicate = 3). I have
> client and server quorum turned on. I rebooted one of the 3 bricks.
> When it came back up, the client started throwing error messages that
> one of the files went into split brain.
>
> This is a good example of how split brain can happen even with all
> kinds of quorum enabled. Let's look at those xattrs. BTW, thank you
> for a very nicely detailed bug report which includes those.
>
> BRICK1
> ========
> [root@ip-172-31-38-189 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x000000000000000000000000
> trusted.afr.PL2-client-1=0x000000010000000000000000
> trusted.afr.PL2-client-2=0x000000010000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>
> BRICK 2
> =======
> [root@ip-172-31-16-220 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x00000d460000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
> BRICK 3
> =========
> [root@ip-172-31-12-218 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x00000d460000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>
> Here, we see that brick 1 shows a single pending operation for the
> other two, while they show 0xd46 (3398) pending operations for brick 1.
> Here's how this can happen.
>
> (1) There is exactly one pending operation.
>
> (2) Brick1 completes the write first, and says so.
>
> (3) Client sends messages to all three, saying to decrement brick1's count.
>
> (4) All three bricks receive and process that message.
>
> (5) Brick1 fails.
>
> (6) Brick2 and brick3 complete the write, and say so.
>
> (7) Client tells all bricks to decrement remaining counts.
>
> (8) Brick2 and brick3 receive and process that message.
>
> (9) Brick1 is dead, so its counts for brick2/3 stay at one.
>
> (10) Brick2 and brick3 have quorum, with all-zero pending counters.
>
> (11) Client sends 0xd46 more writes to brick2 and brick3.
>
> Note that at no point did we lose quorum. Note also the tight timing
> required. If brick1 had failed an instant earlier, it would not have
> decremented its own counter. If it had failed an instant later, it
> would have decremented brick2's and brick3's as well. If brick1 had
> not finished first, we'd be in yet another scenario. If delayed
> changelog had been operative, the messages at (3) and (7) would have
> been combined to leave us in yet another scenario. As far as I can
> tell, we would have been able to resolve the conflict in all those
> cases.
>
> *** Key point: quorum enforcement does not totally eliminate split
> brain. It only makes the frequency a few orders of magnitude lower. ***
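>
> (Aside for the archive: a hedged way to check whether a replica file
> has actually diverged -- it assumes root ssh to the brick hosts named
> in this report; the heal command below exists in 3.5:)
>
> F=/data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> for h in 172.31.38.189 172.31.16.220 172.31.12.218; do
>     echo "== $h =="; ssh root@$h getfattr -d -m . -e hex "$F"
> done
> # Split-brain shows up as two bricks accusing each other with non-zero
> # counters; `gluster volume heal PL2 info split-brain` lists such files.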
>
>
> Not quite right. After we fixed the bug
> https://bugzilla.redhat.com/show_bug.cgi?id=1066996, there are only two
> ways left to introduce split-brain:
> 1) an implementation bug in changelog xattr marking, which I believe
>    is the case here.
> 2) keep writing to the file from the mount, then:
>    a) take brick 1 down, wait until at least one write succeeds
>    b) bring brick 1 back up and take brick 2 down (self-heal should not
>       happen in between), wait until at least one write succeeds
>    c) bring brick 2 back up and take brick 3 down (self-heal should not
>       happen in between), wait until at least one write succeeds
>
> With the outcast implementation, case 2 will also become immune to
> split-brain errors.
>
> Then the only way left to get split-brain in AFR is an implementation
> error in changelog marking. If we test it thoroughly and fix such
> problems, we can make it immune to split-brain :-).
>
> Pranith
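>
> (Aside for the archive: a rough, hypothetical reproduction sketch for
> scenario 2 above. It assumes a disposable volume named PL2 mounted at
> /mnt/vm, that brick PIDs are read from `gluster volume status PL2`, and
> that the self-heal daemon does not get to run in between; the <pid-...>
> placeholders are mine:)
>
> ( while :; do date >> /mnt/vm/split-test; sleep 1; done ) &   # keep writing
> kill -9 <pid-of-brick-1>; sleep 5        # a) brick 1 down, let writes land
> gluster volume start PL2 force           # restart dead bricks
> kill -9 <pid-of-brick-2>; sleep 5        # b) brick 2 down, let writes land
> gluster volume start PL2 force
> kill -9 <pid-of-brick-3>; sleep 5        # c) brick 3 down, let writes land
> gluster volume start PL2 force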
>
> So, is there any way to prevent this completely? Some AFR enhancements,
> such as the oft-promised "outcast" feature[1], might have helped.
> NSR[2] is immune to this particular problem. "Policy based split brain
> resolution"[3] might have resolved it automatically instead of merely
> flagging it. Unfortunately, those are all in the future. For now, I'd
> say the best approach is to resolve the conflict manually and try to
> move on. Unless there's more going on than meets the eye, recurrence
> should be very unlikely.
>
> [1]
> http://www.gluster.org/community/documentation/index.php/Features/outcast
>
> [2]
> http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
>
> [3]
> http://www.gluster.org/community/documentation/index.php/Features/pbspbr
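>
> (Aside for the archive: a hedged sketch of the manual resolution Jeff
> suggests above, as usually done on 3.5 -- assuming brick 1 holds the
> bad copy, remove that copy and its gfid hard link on the brick itself,
> then let self-heal copy it back from the good bricks; the paths are
> built from the xattrs quoted earlier in this thread:)
>
> B=/data/vol2/gluster-data
> F=apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> GFID=ea950263-977e-46bf-89a0-ef631ca139c2    # from trusted.gfid above
> rm -f "$B/$F"
> rm -f "$B/.glusterfs/ea/95/$GFID"            # the gfid hard link
> gluster volume heal PL2 full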
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>
>
>