[Bugs] [Bug 1322850] New: Healing queue rarely empty

bugzilla at redhat.com bugzilla at redhat.com
Thu Mar 31 12:38:28 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1322850

            Bug ID: 1322850
           Summary: Healing queue rarely empty
           Product: GlusterFS
           Version: mainline
         Component: replicate
          Keywords: Triaged
          Severity: medium
          Priority: medium
          Assignee: pkarampu at redhat.com
          Reporter: pkarampu at redhat.com
                CC: bugs at gluster.org, hgowtham at redhat.com,
                    nicolas at ecarnot.net
        Depends On: 1294675



+++ This bug was initially created as a clone of Bug #1294675 +++

Description of problem:
From the command line of each host, and now constantly monitored by our
Nagios/Centreon setup, we see that our 3-node replica-3 Gluster storage volume
is healing files very frequently, if not constantly.

Version-Release number of selected component (if applicable):
Our setup: 3 CentOS 7.2 nodes, with GlusterFS 3.7.6 in replica-3, used as
storage+compute for an oVirt 3.5.6 DC.

How reproducible:
Install oVirt on 3 nodes with GlusterFS as direct Gluster storage.
We have only 3 VMs running on it, so approximately no more than 8 files (yes:
only 8 files, the VM qemu files).

Steps to Reproduce:
1. Just run it and watch: everything looks fine
2. Run "gluster volume heal some_vol info" on random nodes (see the sketch below)
3. Observe that more than zero files are getting healed
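
For reference, a minimal sketch of such a check, relying only on the "Number of
entries:" lines that "heal info" prints per brick (the volume name "some_vol" is
just an example):

    #!/bin/sh
    # Sum the entries still queued for heal across all bricks of the volume.
    VOL=${1:-some_vol}
    gluster volume heal "$VOL" info | awk '
        /^Number of entries:/ { total += $NF }
        END { print "pending heal entries:", total+0; exit (total > 0) }
    '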

Actual results:
More than zero files are getting healed

Expected results:
I expected the "Number of entries" of every node to appear in the graph as a
flat zero line most of the time, except for the rare case of a node reboot,
after which healing is launched and takes some minutes (sometimes hours) but
completes fine.

Additional info:
At first, I found out that I had forgotten to bump up the cluster.op-version,
but this has since been done, and everything has been rebooted and is back up.
But this DC is very lightly used, and I'm sure the Gluster clients (which are
the Gluster nodes themselves) should read and write synchronously and properly,
without creating any need for healing.
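
For reference, the op-version bump mentioned above is a single command; the
value below is only an example for a 3.7.6 cluster:

    # Raise the cluster-wide op-version (the exact number depends on the release).
    gluster volume set all cluster.op-version 30706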

Please see:
https://www.mail-archive.com/gluster-users@gluster.org/msg22890.html

--- Additional comment from Pranith Kumar K on 2016-01-11 04:45:59 EST ---

hi Nicolas Ecarnot,
      Thanks for raising the bug. "gluster volume heal <volname> info" is
designed to be run as a single instance per volume. If we run multiple
processes, it may lead to "Possibly undergoing heal" messages, as the two try
to take the same locks and fail.

Pranith
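
One way to enforce the "single heal info per volume" rule on the monitoring side
is to wrap the check in a lock. A minimal sketch, assuming util-linux flock is
available on the nodes (script name, lock path and timeout are illustrative):

    #!/bin/sh
    # heal-check.sh: run "heal info" under a lock so that at most one
    # instance runs per volume on this node at any given time.
    VOL=${1:?usage: heal-check.sh <volume>}
    exec flock -w 30 "/var/lock/gluster-heal-info-$VOL.lock" \
        gluster volume heal "$VOL" info

Note that this only serializes checks launched from the same node; checks fired
from different nodes still need to be offset in time.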

--- Additional comment from Nicolas Ecarnot on 2016-01-11 04:48:11 EST ---

(In reply to Pranith Kumar K from comment #1)
> hi Nicolas Ecarnot,
>       Thanks for raising the bug. "gluster volume heal <volname> info" is
> designed to be run as a single instance per volume. If we run multiple
> processes, it may lead to "Possibly undergoing heal" messages, as the two try
> to take the same locks and fail.
> 
> Pranith

Thank you Pranith for your answer.

Do you advise us to set up our Nagios/Centreon to run only *ONE* check per
volume?
If so, please don't close this bug; let us change the setup, wait one week, and
I'll report the results here.

Tell me.

--- Additional comment from Pranith Kumar K on 2016-01-18 05:56:32 EST ---

hi Nicolas Ecarnot,
      Sorry for the delay. Sure, doing that will definitely help us. There could
still be one corner case of the self-heal daemon and heal info conflicting for
the same locks. But I would like to hear more from you.

Pranith

--- Additional comment from Nicolas Ecarnot on 2016-01-18 08:35:28 EST ---

(In reply to Pranith Kumar K from comment #3)
> hi Nicolas Ecarnot,
>       Sorry for the delay. Sure, doing that will definitely help us. There
> could still be one corner case of the self-heal daemon and heal info
> conflicting for the same locks. But I would like to hear more from you.
> 
> Pranith

On January 12, 2016, we modified our Nagios/Centreon to offset the checks of
our 3 nodes' healing status.
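
Just as an illustration (not our exact Centreon configuration), the offset
amounts to giving each node a slightly different schedule, e.g. in /etc/cron.d
form:

    # node1: check at minutes 0,5,10,...
    */5 * * * *     root  /usr/local/bin/heal-check.sh some_vol
    # node2: shifted by one minute (1,6,11,...)
    1-59/5 * * * *  root  /usr/local/bin/heal-check.sh some_vol
    # node3: shifted by two minutes (2,7,12,...)
    2-59/5 * * * *  root  /usr/local/bin/heal-check.sh some_vol

(heal-check.sh stands here for whatever wrapper actually runs the heal info
check.)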

Two weeks later, the graphs are showing a great decrease in healing cases,
though not down to zero.
This sounds encouraging.

Having recently been told about sharding, this is the next feature to try, to
see whether it could improve the healing situation (a sketch of how we would
enable it is at the end of this comment).
I'll let you decide whether this is enough to close this bug; my opinion is
that I'm still surprised that the number of healing cases is *not* constantly
zero, but it's your call.
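
For the record, a sketch of how we would turn sharding on, with option names as
per GlusterFS 3.7 (the block size is purely illustrative, and as far as we
understand sharding only applies to files created after it is enabled):

    # Enable sharding on the volume.
    gluster volume set some_vol features.shard on
    # Optionally pick a shard block size suited to large VM images.
    gluster volume set some_vol features.shard-block-size 64MB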


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1294675
[Bug 1294675] Healing queue rarely empty