[Bugs] [Bug 1496321] New: [afr] split-brain observed on T files post hardlink and rename in x3 volume

bugzilla at redhat.com bugzilla at redhat.com
Wed Sep 27 04:18:33 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1496321

            Bug ID: 1496321
           Summary: [afr] split-brain observed on T files post hardlink
                    and rename in x3 volume
           Product: GlusterFS
           Version: 3.10
         Component: replicate
          Keywords: Triaged
          Severity: urgent
          Priority: medium
          Assignee: bugs at gluster.org
          Reporter: ravishankar at redhat.com
                CC: bugs at gluster.org, nchilaka at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1482812, 1491670
            Blocks: 1496317



+++ This bug was initially created as a clone of Bug #1491670 +++

+++ This bug was initially created as a clone of Bug #1482812 +++

Description of problem:
=======================

I have a 4x3 volume where bricks were brought down in random order, ensuring that
at least 2 bricks of each replica set stayed online at all times. However, I see a
lot of split-brains on the system. When I looked into one of the files, the
split-brain appears to be on the hashed link-to copy of a hardlinked file, as shown
below (all bricks blaming each other). Also, the files are accessible (ls, stat,
cat) from the mount and no EIO is seen.

getfattr from hashed subvolume:
===============================

[root at dhcp42-79 ~]#
[root at dhcp42-79 ~]# getfattr -d -e hex -m .
/rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file:
rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-1=0x000000020000000300000000
trusted.afr.master-client-8=0x000000010000000200000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000bdbea
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root at dhcp42-79 ~]#

[root at dhcp42-79 ~]# getfattr -d -e hex -m .
/rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file:
rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-0=0x000000000000000000000000
trusted.afr.master-client-1=0x000000020000000300000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x599593a600020657
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root at dhcp42-79 ~]#


[root at dhcp43-210 ~]# getfattr -d -e hex -m .
/rhs/brick1/b2/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file:
rhs/brick1/b2/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-0=0x000000010000000200000000
trusted.afr.master-client-8=0x000000010000000200000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000befab
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root at dhcp43-210 ~]#

[root at dhcp42-79 ~]# ls -l
/rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
---------T. 4 root root 0 Aug 17 12:16
/rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
[root at dhcp42-79 ~]#
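
For reference, each trusted.afr.<client> value above packs three big-endian 32-bit
counters: pending data, metadata and entry operations that this copy holds against
the brick represented by the named client. A rough bash sketch to decode one of the
values shown above:

# decode trusted.afr.master-client-1 from the b1 copy
val=000000020000000300000000
echo "data=$((16#${val:0:8})) metadata=$((16#${val:8:8})) entry=$((16#${val:16:8}))"
# prints: data=2 metadata=3 entry=0, i.e. this copy blames client-1 for pending
# data and metadata operations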



getfattr from cached subvolume:
===============================


Note: The files are accessible (ls, stat, cat) from the mount and no EIO is seen.
If a file is in split-brain, shouldn't it be reported as EIO? This is probably
because the split-brains are on the hashed link-to files of the hardlinks, while
the actual cached files of the hardlinks are not in split-brain.


[root at dhcp41-217 ~]# ls -l
/rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
-rw-r--r--. 6 root root 9537 Aug 17 12:14
/rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
[root at dhcp41-217 ~]# getfattr -d -e hex -m .
/rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file:
rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000bb080

[root at dhcp41-217 ~]#
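
Note that the trusted.glusterfs.dht.linkto value on the hashed copies decodes to
"master-replicate-3", the replica set that holds the cached copy on b8 shown above.
A quick way to confirm that a brick file is a DHT link-to (T) file (sketch; run on
the brick host, path taken from the dumps above):

# a link-to file has mode ---------T and carries the dht.linkto xattr;
# the cached copy on b8 has the real data and no dht.linkto xattr
ls -l /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr -n trusted.glusterfs.dht.linkto -e text \
    /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN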


IO Pattern while the bricks were brought down:
==============================================

for i in create chmod hardlink chgrp symlink hardlink truncate hardlink rename \
         hardlink symlink hardlink chown create hardlink hardlink symlink; do
    crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text \
        --fop=$i <mnt>
    sleep 10
done

Order of bringing the bricks down:
==================================

Subvolume 0: bricks {1,2,9}
Subvolume 1: bricks {3,4,10}
Subvolume 2: bricks {5,6,11}
Subvolume 3: bricks {7,8,12}

=> Bricks were brought down: 1, 11, 4, 12 => one from each subvolume while IO is
in progress
=> Bring the bricks back and wait for heal to complete
=> Bring the other set of bricks down: 5, 2, 10, 8 => one from each subvolume
while IO is in progress
=> Bring the bricks back but do not wait for heal to complete
=> Bring the final set of bricks down: 3, 6, 7, 9 => one from each subvolume
while IO is in progress (a rough sketch of one such down/up cycle is given below)
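
A rough sketch of one such down/up cycle (volume name is from the volume info in
the comment below; <brick-pid> is a placeholder, and killing the brick process is
assumed to be how a brick is brought down):

# bring brick 1 (b1) down while IO is running: find its PID and kill it
gluster volume status master | grep '/rhs/brick1/b1'
kill -9 <brick-pid>

# bring the killed bricks back and, where required, wait for heal to finish
gluster volume start master force
gluster volume heal master info    # repeat until all bricks show "Number of entries: 0"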


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-41.el7rhgs.x86_64



How reproducible:
=================

2/2

--- Additional comment from Ravishankar N on 2017-08-18 04:29:32 EDT ---

Volume Name: master
Type: Distributed-Replicate
Volume ID: c9a04941-4045-4bc1-bb26-131f5634a792
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp42-79.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick2: dhcp43-210.lab.eng.blr.redhat.com:/rhs/brick1/b2
Brick3: dhcp42-79.lab.eng.blr.redhat.com:/rhs/brick3/b9
Brick4: dhcp42-74.lab.eng.blr.redhat.com:/rhs/brick1/b3
Brick5: dhcp41-217.lab.eng.blr.redhat.com:/rhs/brick1/b4
Brick6: dhcp43-210.lab.eng.blr.redhat.com:/rhs/brick3/b10
Brick7: dhcp42-79.lab.eng.blr.redhat.com:/rhs/brick2/b5
Brick8: dhcp43-210.lab.eng.blr.redhat.com:/rhs/brick2/b6
Brick9: dhcp42-74.lab.eng.blr.redhat.com:/rhs/brick3/b11
Brick10: dhcp42-74.lab.eng.blr.redhat.com:/rhs/brick2/b7
Brick11: dhcp41-217.lab.eng.blr.redhat.com:/rhs/brick2/b8
Brick12: dhcp41-217.lab.eng.blr.redhat.com:/rhs/brick3/b12
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
nfs.disable: off
cluster.enable-shared-storage: enable

--- Additional comment from Ravishankar N on 2017-08-18 11:07:39 EDT ---

I was able to hit this issue (T files ending up in split-brain) like so:

1. Create a 2 x 3 volume and disable all heals:
Brick1: 127.0.0.2:/home/ravi/bricks/brick1
Brick2: 127.0.0.2:/home/ravi/bricks/brick2
Brick3: 127.0.0.2:/home/ravi/bricks/brick3

Brick4: 127.0.0.2:/home/ravi/bricks/brick4
Brick5: 127.0.0.2:/home/ravi/bricks/brick5
Brick6: 127.0.0.2:/home/ravi/bricks/brick6

2. Create a file and 3 hardlinks to it from fuse mount.
#tree /mnt/fuse_mnt/
/mnt/fuse_mnt/
├── FILE
├── HLINK1
├── HLINK3
└── HLINK7

All of these files hashed to the first dht subvol, i.e. replicate-0.

3. Kill brick4, rename HLINK1 to an appropriate name so that it gets hashed to
replicate-1 and a T file is created there.

4. Likewise rename HLINK3 and HLINK7 as well, killing brick5 and brick6
respectively each time, i.e. a different brick of the 2nd replica set is down each
time.

5. Now enable the self-heal daemon (shd) and let self-heals complete (a condensed
sketch of steps 1-5 follows the xattr output below).

6. File names from the mount after rename:
[root at tuxpad ravi]# tree /mnt/fuse_mnt/
/mnt/fuse_mnt/
├── FILE
├── NEW-HLINK1
├── NEW-HLINK3-NEW
└── NEW-HLINK7-NEW

7. The T files are now in split-brain:
[root at tuxpad ravi]# ll /home/ravi/bricks/brick{4..6}/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:59 /home/ravi/bricks/brick4/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:58 /home/ravi/bricks/brick5/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:58 /home/ravi/bricks/brick6/NEW-HLINK1
[root at tuxpad ravi]#
[root at tuxpad ravi]# getfattr -d -m . -e hex
/home/ravi/bricks/brick{4..6}/NEW-HLINK1
getfattr: Removing leading '/' from absolute path names
# file: home/ravi/bricks/brick4/NEW-HLINK1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-4=0x000000010000000200000000
trusted.afr.testvol-client-5=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

# file: home/ravi/bricks/brick5/NEW-HLINK1
security.selinux=0x73797374656d5f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-3=0x000000010000000200000000
trusted.afr.testvol-client-5=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

# file: home/ravi/bricks/brick6/NEW-HLINK1
security.selinux=0x73797374656d5f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-3=0x000000010000000200000000
trusted.afr.testvol-client-4=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

Heal-info also shows the T files to be in split-brain.
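
A condensed sketch of steps 1-5 above (volume name "testvol" is taken from the
xattrs; <pid-of-brickN> are placeholders; heals are assumed to be disabled via the
standard cluster.*-self-heal options; each killed brick is assumed to be restarted
with 'volume start force' before the next rename; and the new names are assumed to
hash to replicate-1 as described in step 3):

# 1. disable self-heals so the T files stay un-healed during the renames
gluster volume set testvol cluster.self-heal-daemon off
gluster volume set testvol cluster.data-self-heal off
gluster volume set testvol cluster.metadata-self-heal off
gluster volume set testvol cluster.entry-self-heal off

# 2. create a file and three hardlinks from the fuse mount (all hash to replicate-0)
cd /mnt/fuse_mnt
touch FILE
ln FILE HLINK1; ln FILE HLINK3; ln FILE HLINK7

# 3./4. with a different replicate-1 brick down each time, rename one hardlink
#       to a name that hashes to replicate-1, creating a T file there
kill -9 <pid-of-brick4>; mv HLINK1 NEW-HLINK1;     gluster volume start testvol force
kill -9 <pid-of-brick5>; mv HLINK3 NEW-HLINK3-NEW; gluster volume start testvol force
kill -9 <pid-of-brick6>; mv HLINK7 NEW-HLINK7-NEW; gluster volume start testvol force

# 5. turn the self-heal daemon back on and trigger heals
gluster volume set testvol cluster.self-heal-daemon on
gluster volume heal testvol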

--- Additional comment from Worker Ant on 2017-09-14 08:11:13 EDT ---

REVIEW: https://review.gluster.org/18283 (afr: auto-resolve split-brains for
zero-byte files) posted (#1) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2017-09-16 00:06:04 EDT ---

REVIEW: https://review.gluster.org/18283 (afr: auto-resolve split-brains for
zero-byte files) posted (#2) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2017-09-26 00:04:23 EDT ---

COMMIT: https://review.gluster.org/18283 committed in master by Ravishankar N
(ravishankar at redhat.com) 
------
commit 1719cffa911c5287715abfdb991bc8862f0c994e
Author: Ravishankar N <ravishankar at redhat.com>
Date:   Thu Sep 14 11:29:15 2017 +0530

    afr: auto-resolve split-brains for zero-byte files

    Problems:
    As described in BZ 1491670, renaming hardlinks can result in data/mdata
    split-brain of the DHT link-to files (T files) without any mismatch of
    data and metadata.

    As described in BZ 1486063, for a zero-byte file with only dirty bits
    set, arbiter brick will likely be chosen as the source brick.

    Fix:
    For zero byte files in split-brain, pick first brick as
    a) data source if file size is zero on all bricks.
    b) metadata source if metadata is the same on all bricks

    In arbiter case, if file size is zero on all bricks and there are no
    pending afr xattrs, pick 1st brick as data source.

    Change-Id: I0270a9a2f97c3b21087e280bb890159b43975e04
    BUG: 1491670
    Signed-off-by: Ravishankar N <ravishankar at redhat.com>
    Reported-by: Rahul Hinduja <rhinduja at redhat.com>
    Reported-by: Mabi <mabi at protonmail.ch>
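
With this in place, the zero-byte link-to files should get auto-resolved on the
next heal. A rough way to verify on this setup (volume name and brick path taken
from the report above):

# trigger a heal and confirm the T files are no longer reported in split-brain
gluster volume heal master
gluster volume heal master info split-brain

# the pending trusted.afr.* xattrs on the link-to file should end up cleared
getfattr -d -e hex -m trusted.afr \
    /rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN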

--- Additional comment from Ravishankar N on 2017-09-26 04:36:53 EDT ---

Sending an addendum to the patch in comment #3.

--- Additional comment from Worker Ant on 2017-09-26 04:37:24 EDT ---

REVIEW: https://review.gluster.org/18391 (afr: don't check for file size in
afr_mark_source_sinks_if_file_empty) posted (#1) for review on master by
Ravishankar N (ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2017-09-26 04:41:26 EDT ---

REVIEW: https://review.gluster.org/18391 (afr: don't check for file size in
afr_mark_source_sinks_if_file_empty) posted (#2) for review on master by
Ravishankar N (ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2017-09-26 23:03:44 EDT ---

COMMIT: https://review.gluster.org/18391 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com) 
------
commit 24637d54dcbc06de8a7de17c75b9291fcfcfbc84
Author: Ravishankar N <ravishankar at redhat.com>
Date:   Tue Sep 26 14:03:52 2017 +0530

    afr: don't check for file size in afr_mark_source_sinks_if_file_empty

    ... for AFR_METADATA_TRANSACTION and just mark source and sinks if
    metadata is the same.

    Change-Id: I69e55d3c842c7636e3538d1b57bc4deca67bed05
    BUG: 1491670
    Signed-off-by: Ravishankar N <ravishankar at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1482812
[Bug 1482812] [afr] split-brain observed on T files post hardlink and
rename in x3 volume
https://bugzilla.redhat.com/show_bug.cgi?id=1491670
[Bug 1491670] [afr] split-brain observed on T files post hardlink and
rename in x3 volume
https://bugzilla.redhat.com/show_bug.cgi?id=1496317
[Bug 1496317] [afr] split-brain observed on T files post hardlink and
rename in x3 volume