[Gluster-users] split-brain on glusterfs running with quorum on server and client
Pranith Kumar Karampuri
pkarampu at redhat.com
Sat Sep 20 03:19:26 UTC 2014
On 09/19/2014 09:58 PM, Ramesh Natarajan wrote:
> I was able to run another set of tests this week and I reproduced the
> issue again. Going by the extended attributes, I think I ran into the
> same issue I saw earlier.
>
> Do you think I need to open up a bug report?
hi Ramesh,
I already fixed this bug: http://review.gluster.org/8757. The fix
should be in the next 3.5.x release, I believe.
Pranith
>
> Brick 1:
>
> trusted.afr.PL2-client-0=0x000000000000000000000000
> trusted.afr.PL2-client-1=0x000000010000000000000000
> trusted.afr.PL2-client-2=0x000000010000000000000000
> trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c
>
> Brick 2
>
> trusted.afr.PL2-client-0=0x0000125c0000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c
>
> Brick 3
>
> trusted.afr.PL2-client-0=0x0000125c0000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c
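>
> (Aside for the archive: a minimal bash sketch, not from the original
> report, for decoding a trusted.afr.* value, assuming the usual AFR
> layout of three big-endian 32-bit counters -- data, metadata and entry
> operations pending against the named brick:)
>
> val=0x0000125c0000000000000000   # what bricks 2 and 3 hold for client-0
> hex=${val#0x}
> printf 'data=%d metadata=%d entry=%d\n' \
>     $((16#${hex:0:8})) $((16#${hex:8:8})) $((16#${hex:16:8}))
> # prints: data=4700 metadata=0 entry=0
> # i.e. bricks 2 and 3 accuse brick 1 of missing 4700 data operations,
> # while brick 1 accuses each of them of one -- a split-brain pattern.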
>
>
> [root@ip-172-31-12-218 ~]# gluster volume info
> Volume Name: PL1
> Type: Replicate
> Volume ID: bd351bae-d467-4e8c-bbd2-6a0fe99c346a
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 172.31.38.189:/data/vol1/gluster-data
> Brick2: 172.31.16.220:/data/vol1/gluster-data
> Brick3: 172.31.12.218:/data/vol1/gluster-data
> Options Reconfigured:
> cluster.server-quorum-type: server
> network.ping-timeout: 12
> nfs.addr-namelookup: off
> performance.cache-size: 2147483648
> cluster.quorum-type: auto
> performance.read-ahead: off
> performance.client-io-threads: on
> performance.io-thread-count: 64
> cluster.eager-lock: on
> cluster.server-quorum-ratio: 51%
> Volume Name: PL2
> Type: Replicate
> Volume ID: e6ad8787-05d8-474b-bc78-748f8c13700f
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 172.31.38.189:/data/vol2/gluster-data
> Brick2: 172.31.16.220:/data/vol2/gluster-data
> Brick3: 172.31.12.218:/data/vol2/gluster-data
> Options Reconfigured:
> nfs.addr-namelookup: off
> cluster.server-quorum-type: server
> network.ping-timeout: 12
> performance.cache-size: 2147483648
> cluster.quorum-type: auto
> performance.read-ahead: off
> performance.client-io-threads: on
> performance.io-thread-count: 64
> cluster.eager-lock: on
> cluster.server-quorum-ratio: 51%
> [root@ip-172-31-12-218 ~]#
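>
> (Aside for the archive: the quorum options shown above would typically
> be set like this -- a sketch using the volume name from this report;
> cluster.server-quorum-ratio is cluster-wide, hence "all":)
>
> gluster volume set PL2 cluster.quorum-type auto            # client-side quorum
> gluster volume set PL2 cluster.server-quorum-type server   # server-side quorum
> gluster volume set all cluster.server-quorum-ratio 51%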
>
> *Mount command*
>
> Client
>
> mount -t glusterfs -o
> defaults,enable-ino32,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log,backupvolfile-server=172.31.38.189,backupvolfile-server=172.31.12.218,background-qlen=256
> 172.31.16.220:/PL2 /mnt/vm
>
> Server
>
> /dev/xvdf /data/vol1 xfs defaults,inode64,noatime 1 2
> /dev/xvdg /data/vol2 xfs defaults,inode64,noatime 1 2
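>
> (Aside: a sketch of the equivalent client-side /etc/fstab entry,
> assuming the same server, options and mount point as the mount command
> above; _netdev is my addition so the mount waits for the network:)
>
> 172.31.16.220:/PL2  /mnt/vm  glusterfs  _netdev,enable-ino32,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log,backupvolfile-server=172.31.38.189,background-qlen=256  0 0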
>
> *Packages*
>
> Client
>
> rpm -qa | grep gluster
> glusterfs-fuse-3.5.2-1.el6.x86_64
> glusterfs-3.5.2-1.el6.x86_64
> glusterfs-libs-3.5.2-1.el6.x86_64
>
> Server
>
> [root@ip-172-31-12-218 ~]# rpm -qa | grep gluster
> glusterfs-3.5.2-1.el6.x86_64
> glusterfs-fuse-3.5.2-1.el6.x86_64
> glusterfs-api-3.5.2-1.el6.x86_64
> glusterfs-server-3.5.2-1.el6.x86_64
> glusterfs-libs-3.5.2-1.el6.x86_64
> glusterfs-cli-3.5.2-1.el6.x86_64
> [root@ip-172-31-12-218 ~]#
>
>
> On Sat, Sep 6, 2014 at 9:01 AM, Pranith Kumar Karampuri
> <pkarampu at redhat.com> wrote:
>
>
> On 09/06/2014 04:53 AM, Jeff Darcy wrote:
>
> I have a replicate glusterfs setup on 3 bricks (replicate = 3). I have
> client and server quorum turned on. I rebooted one of the 3 bricks.
> When it came back up, the client started throwing error messages that
> one of the files went into split brain.
>
> This is a good example of how split brain can happen even with all
> kinds of quorum enabled. Let's look at those xattrs. BTW, thank you
> for a very nicely detailed bug report which includes those.
>
> BRICK1
> ========
> [root@ip-172-31-38-189 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x000000000000000000000000
> trusted.afr.PL2-client-1=0x000000010000000000000000
> trusted.afr.PL2-client-2=0x000000010000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>
> BRICK 2
> =======
> [root@ip-172-31-16-220 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x00000d460000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
> BRICK 3
> =========
> [root@ip-172-31-12-218 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x00000d460000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>
> Here, we see that brick 1 shows a single pending operation for the
> other two, while they show 0xd46 (3398) pending operations for brick 1.
> Here's how this can happen.
>
> (1) There is exactly one pending operation.
>
> (2) Brick1 completes the write first, and says so.
>
> (3) Client sends messages to all three, saying to decrement brick1's count.
>
> (4) All three bricks receive and process that message.
>
> (5) Brick1 fails.
>
> (6) Brick2 and brick3 complete the write, and say so.
>
> (7) Client tells all bricks to decrement remaining counts.
>
> (8) Brick2 and brick3 receive and process that message.
>
> (9) Brick1 is dead, so its counts for brick2/3 stay at one.
>
> (10) Brick2 and brick3 have quorum, with all-zero pending counters.
>
> (11) Client sends 0xd46 more writes to brick2 and brick3.
>
> Note that at no point did we lose quorum. Note also the tight timing
> required. If brick1 had failed an instant earlier, it would not have
> decremented its own counter. If it had failed an instant later, it
> would have decremented brick2's and brick3's as well. If brick1 had
> not finished first, we'd be in yet another scenario. If delayed
> changelog had been operative, the messages at (3) and (7) would have
> been combined to leave us in yet another scenario. As far as I can
> tell, we would have been able to resolve the conflict in all those
> cases.
>
> *** Key point: quorum enforcement does not totally eliminate split
> brain. It only makes the frequency a few orders of magnitude lower. ***
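>
> (Aside for the archive: a hedged way to check whether a replica file
> has actually diverged -- it assumes root ssh to the brick hosts named
> in this report; the heal command below exists in 3.5:)
>
> F=/data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> for h in 172.31.38.189 172.31.16.220 172.31.12.218; do
>     echo "== $h =="; ssh root@$h getfattr -d -m . -e hex "$F"
> done
> # Split-brain shows up as two bricks accusing each other with non-zero
> # counters; `gluster volume heal PL2 info split-brain` lists such files.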
>
>
> Not quite right. After we fixed the bug
> https://bugzilla.redhat.com/show_bug.cgi?id=1066996, there are only two
> ways left to introduce split-brain:
> 1) an implementation bug in changelog xattr marking, which I believe
>    is the case here.
> 2) keep writing to the file from the mount, then:
>    a) take brick 1 down, wait until at least one write succeeds
>    b) bring brick 1 back up and take brick 2 down (self-heal should not
>       happen in between), wait until at least one write succeeds
>    c) bring brick 2 back up and take brick 3 down (self-heal should not
>       happen in between), wait until at least one write succeeds
>
> With the outcast implementation, case 2 will also become immune to
> split-brain errors.
>
> Then the only way left to get split-brain in AFR is an implementation
> error in changelog marking. If we test it thoroughly and fix such
> problems, we can make it immune to split-brain :-).
>
> Pranith
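>
> (Aside for the archive: a rough, hypothetical reproduction sketch for
> scenario 2 above. It assumes a disposable volume named PL2 mounted at
> /mnt/vm, that brick PIDs are read from `gluster volume status PL2`, and
> that the self-heal daemon does not get to run in between; the <pid-...>
> placeholders are mine:)
>
> ( while :; do date >> /mnt/vm/split-test; sleep 1; done ) &   # keep writing
> kill -9 <pid-of-brick-1>; sleep 5        # a) brick 1 down, let writes land
> gluster volume start PL2 force           # restart dead bricks
> kill -9 <pid-of-brick-2>; sleep 5        # b) brick 2 down, let writes land
> gluster volume start PL2 force
> kill -9 <pid-of-brick-3>; sleep 5        # c) brick 3 down, let writes land
> gluster volume start PL2 force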
>
> So, is there any way to prevent this completely? Some AFR enhancements,
> such as the oft-promised "outcast" feature[1], might have helped.
> NSR[2] is immune to this particular problem. "Policy based split brain
> resolution"[3] might have resolved it automatically instead of merely
> flagging it. Unfortunately, those are all in the future. For now, I'd
> say the best approach is to resolve the conflict manually and try to
> move on. Unless there's more going on than meets the eye, recurrence
> should be very unlikely.
>
> [1]
> http://www.gluster.org/community/documentation/index.php/Features/outcast
>
> [2]
> http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
>
> [3]
> http://www.gluster.org/community/documentation/index.php/Features/pbspbr
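>
> (Aside for the archive: a hedged sketch of the manual resolution Jeff
> suggests above, as usually done on 3.5 -- assuming brick 1 holds the
> bad copy, remove that copy and its gfid hard link on the brick itself,
> then let self-heal copy it back from the good bricks; the paths are
> built from the xattrs quoted earlier in this thread:)
>
> B=/data/vol2/gluster-data
> F=apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> GFID=ea950263-977e-46bf-89a0-ef631ca139c2    # from trusted.gfid above
> rm -f "$B/$F"
> rm -f "$B/.glusterfs/ea/95/$GFID"            # the gfid hard link
> gluster volume heal PL2 full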
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>
>
>