[Gluster-users] info healed show files being healed all the time

Mon Apr 18 16:42:37 UTC 2016

Hi there.

I'm running a replicated volume with 2 bricks, using GlusterFS 3.5.3.
Both servers are running CentOS 7.0.

The GlusterFS volume is only used to store VM disk images (not a lot of
files, but big files). I'm trying to debug a problem where, sometimes, I/O
from the VMs' point of view is completely stuck for several minutes
(usually ~10 minutes; I'll open a new thread for this). While looking at
the volume status, I see something strange:

gluster vol heal vmstore info healed

This command should only show healed files, but I see tons of entries
(1024, which I think is the limit). Every file in use is listed, as if
self-heal were running continuously, e.g.:

[...]
2016-04-18 13:13:19 /qual/tse2k12_sys.qcow2
2016-04-18 13:23:19 /prod/compta2015_sys.qcow2
2016-04-18 13:23:19 /qual/tse2k12_sys.qcow2
2016-04-18 13:33:19 /prod/syslog_sys.qcow2
2016-04-18 13:33:19 /prod/ipasserelle_data.qcow2
2016-04-18 13:33:19 /qual/tse2k12_sys.qcow2
2016-04-18 13:43:19 /qual/wintest_sys.qcow2
2016-04-18 13:43:19 /qual/tse2k12_sys.qcow2
2016-04-18 13:43:20 /prod/ipasserelle_sys.qcow2
2016-04-18 13:53:19 /qual/tse2k12_sys.qcow2
2016-04-18 13:53:20 /qual/wintest_sys.qcow2
2016-04-18 14:03:20 /prod/tel_var.qcow2
2016-04-18 14:03:20 /qual/tse2k12_sys.qcow2
2016-04-18 14:13:19 /qual/tse2k12_sys.qcow2
2016-04-18 14:13:20 /prod/ipasserelle_data.qcow2
2016-04-18 14:13:20 /prod/ipasserelle_sys.qcow2
2016-04-18 14:13:21 /qual/report_sys.qcow2
2016-04-18 14:23:19 /qual/tse2k12_sys.qcow2
2016-04-18 14:23:21 /prod/ipasserelle_sys.qcow2
2016-04-18 14:33:20 /qual/parana_sys.qcow2
2016-04-18 14:33:21 /prod/ipasserelle_sys.qcow2
2016-04-18 14:33:22 /qual/tse2k12_sys.qcow2
2016-04-18 14:53:19 /prod/ipasserelle_var.qcow2
2016-04-18 14:53:19 /qual/tse2k12_sys.qcow2
2016-04-18 15:03:20 /qual/tse2k12_sys.qcow2
2016-04-18 15:03:20 /prod/tel_var.qcow2
2016-04-18 15:13:20 /prod/ipasserelle_data.qcow2
2016-04-18 15:13:20 /prod/vigo_sys.qcow2
2016-04-18 15:23:20 /prod/ipasserelle_data.qcow2
2016-04-18 15:33:19 /qual/tse2k12_sys.qcow2
2016-04-18 15:43:21 /prod/tel_var.qcow2
2016-04-18 15:53:20 /qual/tse2k12_sys.qcow2
2016-04-18 15:53:21 /qual/print_sys.qcow2
2016-04-18 16:03:20 /prod/ged_sys.qcow2
2016-04-18 16:13:21 /prod/ged_sys.qcow2
2016-04-18 16:13:22 /qual/tse2k12_sys.qcow2

Note: healed entries are not the same on brick1 and brick2

But gluster vol heal vmstore info never shows any heal in progress (I
run it every 15 minutes). As most of those files are big (between 2GB
and 2.8TB), if a heal took place, I'd notice it, as it would take several
hours to complete. I see nothing in /var/log/gluster related to a heal
happening.
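
One way to double-check that no heal is actually pending is to look at the
AFR changelog extended attributes on the brick copies of a file (for
example with getfattr -d -m . -e hex on the file's path inside the brick).
The sketch below decodes such a value, assuming the usual layout of a
trusted.afr.* attribute: 12 bytes holding three 32-bit big-endian counters
for pending data, metadata and entry operations. The function name and the
sample values are illustrative only; they are not taken from the volume
described here.

```python
import struct


def decode_afr_changelog(hex_value):
    """Decode a trusted.afr.* value as printed by `getfattr -e hex`.

    Assumption: the value is 12 bytes, i.e. three unsigned 32-bit
    big-endian counters (data, metadata, entry).
    """
    text = hex_value.strip()
    if text.lower().startswith("0x"):
        text = text[2:]
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(decode_afr_changelog("0x000000000000000000000000"))
    print(decode_afr_changelog("0x000000020000000100000000"))
```

All-zero counters on both bricks would suggest that nothing is waiting to
be healed for that file; a non-zero counter would point at the kind of
operation (data, metadata or entry) still marked as pending.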

It seems to be very regular, at a 10-minute interval, and each time a
different set of files is marked as healed.
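
To check the cadence rather than eyeball it, the listing can be parsed and
grouped into batches. This is a minimal sketch, assuming each line has the
"date time path" shape shown above; the function names and the 30-second
batching window are arbitrary choices for illustration, and the sample is
just the first few lines of the output quoted earlier.

```python
from collections import Counter
from datetime import datetime

# First lines of the "info healed" output quoted in this message.
SAMPLE = """\
[...]
2016-04-18 13:13:19 /qual/tse2k12_sys.qcow2
2016-04-18 13:23:19 /prod/compta2015_sys.qcow2
2016-04-18 13:23:19 /qual/tse2k12_sys.qcow2
2016-04-18 13:33:19 /prod/syslog_sys.qcow2
2016-04-18 13:33:19 /prod/ipasserelle_data.qcow2
2016-04-18 13:33:19 /qual/tse2k12_sys.qcow2
2016-04-18 13:43:19 /qual/wintest_sys.qcow2
"""


def parse_healed(text):
    """Return (timestamp, path) tuples; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) != 3:
            continue
        try:
            ts = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
        entries.append((ts, parts[2]))
    return entries


def batch_gaps_minutes(entries, window_seconds=30):
    """Group entries that start within window_seconds of a batch start,
    then return the gaps between batch starts, rounded to minutes."""
    starts = []
    for ts, _path in sorted(entries):
        if not starts or (ts - starts[-1]).total_seconds() > window_seconds:
            starts.append(ts)
    return [round((b - a).total_seconds() / 60) for a, b in zip(starts, starts[1:])]


def counts_per_file(entries):
    """Count how many times each path appears in the listing."""
    return Counter(path for _ts, path in entries)


if __name__ == "__main__":
    parsed = parse_healed(SAMPLE)
    print(batch_gaps_minutes(parsed))
    print(counts_per_file(parsed).most_common(3))
```

Running the same functions on the full output of each brick would show
whether the gaps stay at 10 minutes and which files recur most often.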

I've dug a bit into what is running at these times, and I see that I have
a monitoring script which queries S.M.A.R.T. data at exactly these times.
I can't see any reason this could be linked to GlusterFS reporting healed
files while no heal took place, but that's the only thing I see running
at this interval.

Any idea what this could be? What can make gluster think files are
healed while nothing is logged and no data heal took place?

Cheers
Daniel

-- 

	*Daniel Berteaud*

FIREWALL-SERVICES SAS.
Société de Services en Logiciels Libres
Tel : 05 56 64 15 32
Visio : http://vroom.im/dani
/www.firewall-services.com/
