[Gluster-users] Trying to debug an IO hanging situation

Wed Apr 20 14:13:23 UTC 2016

Hi.

I've been trying to find out what's going on for several days now, but
can't find anything myself, so I'm asking the GlusterFS experts for
some help ;-)

I'm running 3 replicated gluster volumes between 2 nodes (each node
hosting 3 bricks: one per volume). Components involved:

- CentOS 7.0 x86_64 / 3.10.0-123.20.1
- GlusterFS 3.5.3

(yes, I should upgrade, I know).

This is used to host qemu-kvm VMs (1 GlusterFS volume for VM images, 1
for libvirt locks, 1 for VM states, e.g. a VM saved with virsh save vm1
can be restored on the other node). The VMs are hosted on the GlusterFS
servers themselves (each node fuse-mounts the storage volume on
/var/lib/libvirt/images), so the nodes are both GlusterFS server and
client. VMs run only on the first node (but can be live migrated to the
second one in case of a problem).

The 3 volumes (vmstore, save and locks) have the same configuration:

[root@master1 ~]# gluster vol info vmstore

Volume Name: vmstore
Type: Replicate
Volume ID: 7ed967f1-3b33-46d7-8908-0bb78c6e9199
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: master1:/mnt/bricks/vmstore
Brick2: master2:/mnt/bricks/vmstore
Options Reconfigured:
diagnostics.client-log-level: DEBUG
diagnostics.brick-log-level: INFO
cluster.eager-lock: on
network.frame-timeout: 300
network.ping-timeout: 20
nfs.disable: on

This setup worked well for more than a year, but had a big failure 3
months ago: all my VMs had a kernel panic because they couldn't access
their storage anymore. Looking at my logs, I saw that the gluster fuse
client lost its connection to both bricks because they had not responded
for more than 5 sec (which was the network.ping-timeout at that time). I
don't really understand how this could happen, as the network was OK, and
anyway, one of the bricks is running on 127.0.0.1, so it's definitely not
a network issue. I've increased network.ping-timeout to 20 sec, which
allowed all my VMs to be started again without the connection to the
bricks being lost.
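
For reference, a change like the one described above can be applied with
gluster volume set. The snippet below is only a sketch: the function name
and the GLUSTER override variable are made up for illustration (the
override just makes a dry run with echo possible), while the volume names
and the option are the ones from this post.

```shell
#!/bin/sh
# Sketch only: set network.ping-timeout on one or more volumes.
# set_ping_timeout and the GLUSTER variable are illustrative names,
# not part of GlusterFS. GLUSTER=echo prints the arguments instead
# of running the real CLI.
set_ping_timeout() {
    timeout="$1"
    shift
    for vol in "$@"; do
        "${GLUSTER:-gluster}" volume set "$vol" network.ping-timeout "$timeout" || return 1
    done
}
```

Usage would be something like: set_ping_timeout 20 vmstore save locks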

Now things are working, but since that day I have had random IO hangs
from time to time. When the problem occurs, all IO in all the VMs
hangs, and the load on the hypervisor (which is also the GlusterFS client
and hosts one of the bricks) goes crazy (I've seen up to ~120). The load
goes so high that I can't do anything on the hypervisor; I lose my SSH
access, which doesn't respond anymore. The problem lasts for 5 or 10
minutes, then everything starts working again (some VMs don't like being
stuck for that long and need to be restarted).
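
Because SSH is unusable while the hang lasts, one way to capture data is
to leave a small watcher running beforehand. The following is a minimal
sketch, not something taken from this setup: the function names, the
threshold of 20, the 5 second interval and the log path are all
assumptions. It records processes in uninterruptible sleep (state D),
which is typically where blocked IO shows up.

```shell
#!/bin/sh
# Sketch only: all names, the threshold and the log path are assumptions.

# Print the 1-minute load average from a /proc/loadavg style file.
load1() {
    cut -d' ' -f1 "${1:-/proc/loadavg}"
}

# Succeed when load ($1) is strictly above threshold ($2).
# awk is used because sh cannot compare floating point values.
load_above() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l + 0 > t + 0) }'
}

# Print a timestamp, the load averages and all processes in state D.
snapshot() {
    date '+%F %T'
    cat /proc/loadavg
    ps -eo state,pid,wchan:32,cmd | awk 'NR == 1 || $1 ~ /^D/'
}

# Loop forever, appending a snapshot to the log while the load is high.
watch_load() {
    threshold="${1:-20}"
    logfile="${2:-/var/tmp/io-hang.log}"
    while sleep 5; do
        if load_above "$(load1)" "$threshold"; then
            snapshot >> "$logfile" 2>&1
        fi
    done
}
```

It would be started in the background (for example: watch_load 20 &) and
the log read once the hypervisor responds again. If the log lives on a
filesystem affected by the hang, the writes will of course block too.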

The problem is very random: it can happen every 2 days, or everything can
work without a single issue for more than 3 weeks. It doesn't depend on
the load, nor on the access pattern.

I suspect something in Gluster to be the culprit, but I can't find
anything. I've enabled DEBUG logging on the client (but not on the bricks,
as it's just too verbose), and will see if I can get more info next time
the issue happens.

I first noticed that the problem always happened when I executed a
monitoring script (which runs several gluster commands and parses their
output to check the status of the different volumes; the script is
available here [1] if anyone is interested), but I've now completely
disabled monitoring, and I still have this random issue.

A strange thing I've noticed is that the main volume (the one storing
the VM images) continuously shows files being healed if I look at:

gluster vol heal vmstore info healed

Every 10 minutes (exactly 10), I see a few VM images being healed. But
nothing in the client logs or in the system load indicates a heal taking
place.
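
To correlate these heals with the hangs, the healed count could be logged
with a timestamp at regular intervals. This is a rough sketch under
assumptions: it assumes the heal info output contains one
"Number of entries: N" line per brick, and the function names and the
GLUSTER override variable are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes one "Number of entries: N" line per brick in
# the heal info output. Function names and the GLUSTER variable are
# illustrative, not part of GlusterFS.

# Sum the "Number of entries: N" lines read from stdin.
count_entries() {
    awk -F': *' '/^Number of entries:/ { n += $2 } END { print n + 0 }'
}

# Print "<timestamp> <total healed entries>" for one volume.
log_healed() {
    vol="${1:-vmstore}"
    printf '%s %s\n' "$(date '+%F %T')" \
        "$("${GLUSTER:-gluster}" volume heal "$vol" info healed | count_entries)"
}
```

Calling log_healed vmstore from cron every minute and appending the
output to a file would show whether the 10 minute pattern lines up with
the periods where IO hangs.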

I'm lost and don't know where to look. I'd really appreciate some help :-)

(we're ready to hire a GlusterFS expert to help us sort this out if
necessary; this is a critical installation for us)

[1]:
https://gitweb.firewall-services.com/?p=zabbix-agent-addons;a=blob_plain;f=zabbix_scripts/check_gluster_sudo;hb=HEAD
-- 

	*Daniel Berteaud*

FIREWALL-SERVICES SAS.
Société de Services en Logiciels Libres
Tel : 05 56 64 15 32
Visio : http://vroom.im/dani
/www.firewall-services.com/