[Gluster-users] Failover problems with gluster 3.8.8-1 (latest Debian stable)

Tue Feb 13 13:33:44 UTC 2018

I'm using gluster for a virt-store with 3x2 distributed/replicated
servers for 16 qemu/kvm/libvirt virtual machines using image files
stored in gluster and accessed via libgfapi.  Eight of these disk images
are standalone, while the other eight are qcow2 images which all share a
single backing file.

For the most part, this is all working very well.  However, one of the
gluster servers (azathoth) causes three of the standalone VMs and all
eight of the shared-backing-image VMs to fail if it goes down.  Any of
the other gluster servers can go down with no problems; only azathoth
causes issues.

In addition, the kvm hosts have the gluster volume FUSE-mounted, and one
of them (out of five) detects an error on the gluster volume and puts
the FUSE mount into read-only mode if azathoth goes down.  libgfapi
connections to the VM images continue to work normally from this host
despite this, and the other four kvm hosts are unaffected.

It initially seemed relevant that I have the libgfapi URIs specified as
gluster://azathoth/..., but I've tried changing them to make the initial
connection via other gluster hosts and it had no effect on the problem.
Losing azathoth still took them out.

In addition to changing the mount URI, I've also manually run a heal and
rebalance on the volume, enabled the bitrot daemons (then turned them
back off a week later, since they reported no activity in that time),
and copied one of the standalone images to a new file in case it was a
problem with the file itself.  As far as I can tell, none of these
attempts changed anything.
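
For concreteness, the maintenance steps above correspond roughly to
commands of the following form.  This is a sketch rather than a
transcript: the exact invocations and options are assumptions, and the
`run` helper and `DRY_RUN` variable are hypothetical names used only so
that the script prints each command instead of executing it unless
DRY_RUN is explicitly set to 0.

```shell
#!/bin/sh
# Volume name taken from the status output in this post.
VOLUME=palantir

# Hypothetical helper: print the command, and execute it only when
# DRY_RUN has been explicitly set to 0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Assumed form of the steps described above; options may have differed.
maintenance_steps() {
    run gluster volume heal "$VOLUME"              # trigger a heal
    run gluster volume heal "$VOLUME" info         # list entries still pending
    run gluster volume rebalance "$VOLUME" start
    run gluster volume rebalance "$VOLUME" status
    run gluster volume bitrot "$VOLUME" enable
    run gluster volume bitrot "$VOLUME" disable    # turned off again a week later
}

maintenance_steps
```

With the default dry-run setting this only prints the six commands,
which makes it safe to review before running anything against a live
volume.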

So I'm at a loss.  Is this a known type of problem?  If so, how do I fix
it?  If not, what's the next step to troubleshoot it?

# gluster --version
glusterfs 3.8.8 built on Jan 11 2017 14:07:11
Repository revision: git://git.gluster.com/glusterfs.git

# gluster volume status
Status of volume: palantir
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick saruman:/var/local/brick0/data        49154     0          Y       10690
Brick gandalf:/var/local/brick0/data        49155     0          Y       18732
Brick azathoth:/var/local/brick0/data       49155     0          Y       9507
Brick yog-sothoth:/var/local/brick0/data    49153     0          Y       39559
Brick cthulhu:/var/local/brick0/data        49152     0          Y       2682
Brick mordiggian:/var/local/brick0/data     49152     0          Y       39479
Self-heal Daemon on localhost               N/A       N/A        Y       9614
Self-heal Daemon on saruman.lub.lu.se       N/A       N/A        Y       15016
Self-heal Daemon on cthulhu.lub.lu.se       N/A       N/A        Y       9756
Self-heal Daemon on gandalf.lub.lu.se       N/A       N/A        Y       5962
Self-heal Daemon on mordiggian.lub.lu.se    N/A       N/A        Y       8295
Self-heal Daemon on yog-sothoth.lub.lu.se   N/A       N/A        Y       7588

Task Status of Volume palantir
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : c38e11fe-fe1b-464d-b9f5-1398441cc229
Status               : completed           
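
For reference, here is a minimal sketch of how the six bricks would
group into the three replica pairs of the 3x2 layout.  It assumes that
consecutive bricks in the volume's brick order form a replica set, and
that the status output above lists the bricks in that order; I have not
verified either here, and `gluster volume info palantir` would be the
place to confirm it.  The `replica_sets` function is a hypothetical
helper, not part of gluster.

```shell
#!/bin/sh
# Brick hosts in the order shown by `gluster volume status` above.
BRICKS="saruman gandalf azathoth yog-sothoth cthulhu mordiggian"
REPLICA=2   # from the "3x2" layout: 3 distribute subvolumes x 2 replicas

# Print one line per replica set by grouping consecutive bricks.
# Usage: replica_sets COUNT BRICK...
replica_sets() {
    count=$1
    shift
    set_no=0
    while [ "$#" -gt 0 ]; do
        line="set $set_no:"
        i=0
        while [ "$i" -lt "$count" ] && [ "$#" -gt 0 ]; do
            line="$line $1"
            shift
            i=$((i + 1))
        done
        echo "$line"
        set_no=$((set_no + 1))
    done
}

# Word splitting of $BRICKS is intended here.
replica_sets "$REPLICA" $BRICKS
```

If those assumptions hold, azathoth shares a replica set with
yog-sothoth, so the files that become unavailable when azathoth goes
down would be the ones that should still have a good copy on
yog-sothoth.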

-- 
Dave Sherohman