[Gluster-users] Healing queue rarely empty

Tue Dec 29 17:08:50 UTC 2015

-Atin
Sent from one plus one
On Dec 17, 2015 3:21 PM, "Nicolas Ecarnot" <nicolas at ecarnot.net> wrote:
>
> On 17/12/2015 10:10, Nicolas Ecarnot wrote:
>>
>> Hello,
>>
>> Our setup: 3 CentOS 7.2 nodes, with gluster 3.7.6 in replica-3, used as
>> storage+compute for an oVirt 3.5.6 DC.
>>
>> Two days ago, we added some nagios/centreon monitoring that checks the
>> state of the heal queue every 5 minutes
>> (something like "gluster volume heal some_vol info" with the adequate
>> grep).
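
For reference, a check like the one described above could be sketched as follows. This is only a minimal illustration, not the poster's actual plugin: the function name, the brick paths and the sample text are made up, and it assumes the "Brick ..." / "Number of entries: N" layout that the heal info output uses.

```python
def parse_heal_info(output):
    """Return {brick: pending_entries} from `gluster volume heal VOL info` text."""
    counts = {}
    brick = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            # Remember which brick the following entries belong to.
            brick = line[len("Brick "):]
        elif line.startswith("Number of entries:") and brick is not None:
            value = line.split(":", 1)[1].strip()
            # A non-numeric value (e.g. "-") is reported as unknown (None).
            counts[brick] = int(value) if value.isdigit() else None
            brick = None
    return counts


# Hypothetical sample output, for illustration only.
SAMPLE = """\
Brick node1:/gluster/data/brick
/images/vm1.img
/images/vm2.img
Number of entries: 2

Brick node2:/gluster/data/brick
Number of entries: 0

Brick node3:/gluster/data/brick
/images/vm1.img
Number of entries: 1
"""

if __name__ == "__main__":
    for name, pending in parse_heal_info(SAMPLE).items():
        print(name, pending)
```

In a real check the text would come from the CLI instead of SAMPLE, for example via subprocess.run(["gluster", "volume", "heal", "some_vol", "info"], capture_output=True, text=True).stdout.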
>>
>> I expected the "Number of entries" of every node to appear in the graph
>> as a flat zero line most of the time, except for the rare cases of a
>> node reboot, after which healing is launched and takes some minutes
>> (sometimes hours) but completes fine.
>>
>> Instead, we see the heal queue holding 2 or 3 files to heal about
>> 4 times an hour. All day long.
>>
>> Our DC is a small one with few VMs, so no more than 8 big
>> files are stored in glusterfs.
>> I'm very surprised to see that these files constantly need healing, as I
>> thought I had understood that reads/writes were synchronous at all times,
>> and that replica-3 meant that all files were fully synced and committed
>> at all times.
>>
>> I've also read about the self-heal daemon's cron-like job that runs every
>> 10 minutes, which we use with its default settings, but that is a
>> secondary point.
>>
>> The first point leads to these questions:
>> - Why do we see such frequent desynchronizations between nodes?
>> - Which logs can I read to confirm that?
>> - What must I check?
>>
>
> Replying to myself, since I found:
> https://www.mail-archive.com/gluster-users%40gluster.org/msg20611.html
>
> Does it make sense to be surprised to see this:
>
> gluster volume get data cluster.op-version
> Option                                  Value
> ------                                  -----
> cluster.op-version                      30600
>
> in a 3.7.6 gluster cluster?
That's normal: after an upgrade, the op-version has to be bumped explicitly.
In this case, it was never bumped after 3.6.
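
As a rough sketch of what the bump would look like: the helper names below are hypothetical, and the sketch assumes the XYYZZ numbering suggested by the value shown above (3.6.0 -> 30600), so please confirm the right number for your release in its release notes first. It only builds the command and does not run it. Every node must already run the target release, and as far as I know the op-version cannot be lowered again afterwards.

```python
def op_version_for(release):
    """Map a release string such as "3.7.6" to an op-version integer.

    Assumption: XYYZZ numbering, e.g. 3.6.0 -> 30600.
    """
    major, minor, patch = (int(part) for part in release.split("."))
    return major * 10000 + minor * 100 + patch


def bump_command(release):
    """Build (but do not run) the CLI call that raises the cluster op-version."""
    return ["gluster", "volume", "set", "all",
            "cluster.op-version", str(op_version_for(release))]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(" ".join(bump_command("3.7.6")))
```

Run the printed command once, from any one node, then check the result with the same "gluster volume get data cluster.op-version" used above.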
>
> I have absolutely no idea what this means or how it changes anything.
> But I see many things in my logs like:
>
> Server and Client lk-version numbers are not same, reopening the fds
This has nothing to do with op-version or glusterd.
>
> and
>
> many, many errors in etc-glusterfs-glusterd.vol.log about
> missing options, other things like 'Unable to release lock', and very
> frequent vol reqs:
> http://pastebin.com/e6nQfeLx
Again, this is expected if concurrent commands on the same volume are executed
from different CLIs.
>
> What is op-version used for?
To be precise, it indicates the version at which the entire cluster can operate.
>
>
> --
> Nicolas ECARNOT