[Gluster-users] easily provoked unrecoverable split brain
Alexis Huxley
alexishuxley at gmail.com
Sun Jun 19 15:31:08 UTC 2016
As per the quickstart guide, I'm setting up a replicated volume on
two test (KVM) VMs, fiori2 and torchio2, as follows:
mkfs -t xfs -i size=512 -f /dev/vdb1            # on both
mkdir -p /vol/brick0                            # on both
mount /dev/vdb1 /vol/brick0                     # on both
gluster peer probe torchio2                     # on fiori2
gluster peer probe fiori2                       # on torchio2
mkdir /vol/brick0/vmimages                      # on both
gluster volume create vmimages replica 2 \
    torchio2:/vol/brick0/vmimages \
    fiori2:/vol/brick0/vmimages                 # on fiori2
gluster volume start vmimages                   # on fiori2
mount -t glusterfs fiori2:/vmimages /mnt        # on both
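For reference, a quick sanity check before pulling the cable, to
confirm the volume is started and both bricks are online:

fiori2# gluster volume info vmimages      # expect "Status: Started"
fiori2# gluster volume status vmimages    # expect Online "Y" for both bricks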
Then I pull the virtual network cable out of one host (with 'virsh
domif-setlink fiori2 vnet10 down') and then run:
ls /mnt # on both (wait for timeouts to elapse)
uname -n > /mnt/hostname # on both (create conflict)
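(The wait is for the client's network.ping-timeout, which defaults
to 42 seconds; for quicker test iterations it can be shortened, e.g.:

gluster volume set vmimages network.ping-timeout 10   # on fiori2
)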
Then I put the cable back, wait a bit and then run:
torchio2# cat /mnt/hostname
cat: /mnt/hostname: Input/output error
torchio2#
I'm deliberately trying to provoke split-brain, so this I/O error
is no surprise.
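One can confirm the divergence by reading the copies straight off
the bricks, bypassing the FUSE mount:

torchio2# cat /vol/brick0/vmimages/hostname   # expect "torchio2"
fiori2# cat /vol/brick0/vmimages/hostname     # expect "fiori2"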
The real problem comes when I try to recover from it:
fiori2# gluster volume heal vmimages info
Brick torchio2:/vol/brick0/vmimages
/ - Is in split-brain
/hostname
Number of entries: 2
Brick fiori2:/vol/brick0/vmimages
/ - Is in split-brain
/hostname
Number of entries: 2
fiori2# gluster volume heal vmimages split-brain source-brick torchio2:/vol/brick0/vmimages
'source-brick' option used on a directory (gfid:00000000-0000-0000-0000-000000000001). Performing conservative merge.
Healing gfid:00000000-0000-0000-0000-000000000001 failed:Operation not permitted.
Healing gfid:73dce70e-bb3e-40a2-bec9-4741399b6b72 failed:Transport endpoint is not connected.
Number of healed entries: 0
fiori2#
and the I/O error remains.
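For what it's worth, the docs suggest split-brain can also be
resolved per-file from the CLI; I would have expected one of these
forms to work here (untested beyond the above):

fiori2# gluster volume heal vmimages split-brain bigger-file /hostname
fiori2# gluster volume heal vmimages split-brain source-brick \
            torchio2:/vol/brick0/vmimages /hostname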
I've also tried the manual getfattr/setfattr way, but that also
produces I/O errors:
fiori2# getfattr -d -m . -e hex /mnt/hostname
getfattr: /mnt/hostname: Input/output error
fiori2#
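My understanding from the split-brain docs is that the manual method
operates on the brick paths rather than on the mount; roughly
(assuming the usual trusted.afr.vmimages-client-N changelog xattr
names, where client-0 is the first brick, torchio2):

getfattr -d -m . -e hex /vol/brick0/vmimages/hostname   # on both, compare changelogs
setfattr -n trusted.afr.vmimages-client-0 \
    -v 0x000000000000000000000000 \
    /vol/brick0/vmimages/hostname                       # on fiori2, to keep torchio2's copy
gluster volume heal vmimages                            # trigger the heal

But I'd rather understand why the CLI method above fails before
resorting to that.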
I've done some googling but haven't turned up any references to
split-brain with "Operation not permitted" or "Transport endpoint is
not connected". Am I doing something wrong? Is this a known bug?
Is there a workaround?
For info, I'm using:
fiori2# cat /etc/issue
Ubuntu 16.04 LTS \n \l
fiori2# uname -a
Linux fiori2 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
fiori2# dpkg -l | grep gluster
ii glusterfs-client 3.7.6-1ubuntu1 amd64 clustered file-system (client package)
ii glusterfs-common 3.7.6-1ubuntu1 amd64 GlusterFS common libraries and translator modules
ii glusterfs-server 3.7.6-1ubuntu1 amd64 clustered file-system (server package)
fiori2#
I understand that two nodes are not optimal; occasional split-brain
is acceptable so long as I can recover from it. Up to now, for
a clustered filesystem on my VM servers, I've been using DRBD+OCFS2,
but the NFS3 interaction has been glitchy, so now I'm doing some
tests with GlusterFS.
Any advice gratefully received! Thanks!
Alexis