[Gluster-users] Gluster replicate 3 arbiter 1 in split brain. gluster cli seems unaware

Henrik Juul Pedersen hjp at liab.dk
Wed Dec 20 18:26:37 UTC 2017


Hi,

I have the following volume:

Volume Name: virt_images
Type: Replicate
Volume ID: 9f3c8273-4d9d-4af2-a4e7-4cb4a51e3594
Status: Started
Snapshot Count: 2
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: virt3:/data/virt_images/brick
Brick2: virt2:/data/virt_images/brick
Brick3: printserver:/data/virt_images/brick (arbiter)
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.barrier: disable
features.scrub: Active
features.bitrot: on
nfs.rpc-auth-allow: on
server.allow-insecure: on
user.cifs: off
features.shard: off
cluster.shd-wait-qlength: 10000
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
nfs.disable: on
transport.address-family: inet
server.outstanding-rpc-limit: 512

After a server reboot (brick 1) a single file has become unavailable:
# touch fedora27.qcow2
touch: setting times of 'fedora27.qcow2': Input/output error

Looking at the split brain status from the client side cli:
# getfattr -n replica.split-brain-status fedora27.qcow2
# file: fedora27.qcow2
replica.split-brain-status="The file is not under data or metadata split-brain"

However, in the client side log, a split brain is mentioned:
[2017-12-20 18:05:23.570762] E [MSGID: 108008]
[afr-transaction.c:2629:afr_write_txn_refresh_done]
0-virt_images-replicate-0: Failing SETATTR on gfid
7a36937d-52fc-4b55-a932-99e2328f02ba: split-brain observed.
[Input/output error]
[2017-12-20 18:05:23.576046] W [MSGID: 108027]
[afr-common.c:2733:afr_discover_done] 0-virt_images-replicate-0: no
read subvols for /fedora27.qcow2
[2017-12-20 18:05:23.578149] W [fuse-bridge.c:1153:fuse_setattr_cbk]
0-glusterfs-fuse: 182: SETATTR() /fedora27.qcow2 => -1 (Input/output
error)
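
For cross-checking on the servers: the gfid in the log message can be mapped to a path under a brick's .glusterfs directory, where GlusterFS keeps a hard link to each regular file, named after its gfid. A minimal Python sketch, assuming the brick root of this volume; the function name is made up for illustration:

```python
import posixpath
import uuid


def gfid_backend_path(brick_root, gfid):
    """Return the .glusterfs path for a gfid on a brick.

    Layout: <brick>/.glusterfs/<first 2 hex chars>/<next 2 hex chars>/<gfid>
    """
    # uuid.UUID accepts both the dashed form from the log and the plain
    # 32-digit hex form of the trusted.gfid xattr (without the "0x" prefix).
    canonical = str(uuid.UUID(gfid))
    return posixpath.join(brick_root, ".glusterfs",
                          canonical[0:2], canonical[2:4], canonical)


if __name__ == "__main__":
    print(gfid_backend_path("/data/virt_images/brick",
                            "7a36937d-52fc-4b55-a932-99e2328f02ba"))
```

The resulting path can be passed to getfattr or stat on a brick host in the same way as the file name itself.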

= Server side

No mention of a possible split brain:
# gluster volume heal virt_images info split-brain
Brick virt3:/data/virt_images/brick
Status: Connected
Number of entries in split-brain: 0

Brick virt2:/data/virt_images/brick
Status: Connected
Number of entries in split-brain: 0

Brick printserver:/data/virt_images/brick
Status: Connected
Number of entries in split-brain: 0

The info command shows the file:
# gluster volume heal virt_images info
Brick virt3:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1

Brick virt2:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1

Brick printserver:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1


The heal and heal full commands do nothing, and I can't find
anything in the logs about them trying and failing to fix the file.

Trying to manually resolve the split brain from cli gives the following:
# gluster volume heal virt_images split-brain source-brick
virt3:/data/virt_images/brick /fedora27.qcow2
Healing /fedora27.qcow2 failed: File not in split-brain.
Volume heal failed.

The attrs from virt2 and virt3 are as follows:
[root@virt2 brick]# getfattr -d -m . -e hex fedora27.qcow2
# file: fedora27.qcow2
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.virt_images-client-1=0x000002280000000000000000
trusted.afr.virt_images-client-3=0x000000000000000000000000
trusted.bit-rot.version=0x1d000000000000005a3aa0db000c6563
trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri.1=0x00000000a49eb0000000000000000001
trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001

# file: fedora27.qcow2
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.virt_images-client-2=0x000003ef0000000000000000
trusted.afr.virt_images-client-3=0x000000000000000000000000
trusted.bit-rot.version=0x19000000000000005a3a9f82000c382a
trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri.1=0x00000000a2fbe0000000000000000001
trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
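
To make the trusted.afr.* values above easier to read: each one is 12 bytes, which AFR interprets as three big-endian 32-bit counters of pending data, metadata and entry operations that the brick holds against the named client. A short Python sketch that decodes the values from the two listings (the helper and tuple names are hypothetical):

```python
import struct
from collections import namedtuple

# Field order follows the AFR changelog layout: data, metadata, entry.
Pending = namedtuple("Pending", "data metadata entry")


def decode_afr(hex_value):
    """Decode a trusted.afr.* xattr as printed by `getfattr -e hex`."""
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    # Three unsigned 32-bit integers in network (big-endian) byte order.
    return Pending(*struct.unpack(">III", raw))


if __name__ == "__main__":
    samples = [
        ("first listing, client-1", "0x000002280000000000000000"),
        ("first listing, client-3", "0x000000000000000000000000"),
        ("second listing, client-2", "0x000003ef0000000000000000"),
        ("second listing, client-3", "0x000000000000000000000000"),
    ]
    for label, value in samples:
        print(label, decode_afr(value))
```

Decoded this way, the first listing records 552 pending data operations against client-1 and the second records 1007 against client-2; neither records anything pending against client-3, and all metadata and entry counters are zero.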

I don't know how to find similar information from the arbiter...

Versions are the same on all three systems:
# glusterd --version
glusterfs 3.12.2

# gluster volume get all cluster.op-version
Option                                  Value
------                                  -----
cluster.op-version                      31202
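
The op-version value appears to encode the release number as major * 10000 + minor * 100 + patch, so 31202 lines up with the 3.12.2 reported by glusterd above. A small Python sketch, assuming that encoding holds (the function name is hypothetical):

```python
def op_version_to_release(op_version):
    """Split an op-version integer into a dotted release string.

    Assumes the encoding major * 10000 + minor * 100 + patch.
    """
    major, rest = divmod(op_version, 10000)
    minor, patch = divmod(rest, 100)
    return "%d.%d.%d" % (major, minor, patch)


if __name__ == "__main__":
    print(op_version_to_release(31202))
```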

I might try upgrading to version 3.13.0 tomorrow, but I wanted to ask
here first.

How do I fix this? Do I have to manually change the file attributes?

Also, the guides for manual resolution through setfattr list an
attribute of the form "trusted.afr.<volume>-client-<brick>" for every
brick. On my system (as can be seen above), each brick only shows
attributes for the other bricks. So which attributes should be
changed, and to what values?



I hope someone might know a solution. If you need any more information
I'll try and provide it. I can probably change the virtual machine to
another image for now.

Best regards,
Henrik Juul Pedersen
LIAB ApS

