[Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

Joe Julian joe at julianfamily.org
Fri Nov 21 06:36:27 UTC 2014

On November 20, 2014 10:01:45 PM PST, Anuradha Talur <atalur at redhat.com> wrote:
>> From: "Vince Loschiavo" <vloschiavo at gmail.com>
>> To: "gluster-users at gluster.org" <Gluster-users at gluster.org>
>> Sent: Wednesday, November 19, 2014 9:50:50 PM
>> Subject: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios
>> Hello Gluster Community,
>> I have been using the Nagios monitoring scripts, mentioned in the
>> thread, on 3.5.2 with great success. The most useful of these is the
>> heal.
>> However, I've just upgraded to 3.6.1 on the lab and the self heal daemon has
>> become quite aggressive. I continually get alerts/warnings on 3.6.1
>> virt disk images need self heal, then they clear. This is not the case on
>> 3.5.2. This
>> Configuration:
>> 2 node, 2 brick replicated volume with 2x1GB LAG network between the
>> using this volume as a QEMU/KVM virt image store through the fuse mount on
>> CentOS 6.5.
>> Example:
>> on 3.5.2:
>> gluster volume heal volumename info: shows the bricks and number of
>> to be healed: 0
>> On v3.5.2 - During normal gluster operations, I can run this command over and
>> over again, 2-4 times per second, and it will always show 0 entries to be
>> healed. I've used this as an indicator that the bricks are
>> Last night, I upgraded to 3.6.1 in lab and I'm seeing different
>> Running gluster volume heal volumename info, during normal operations, will
>> show a file out-of-sync, seemingly between every block written to disk then
>> synced to the peer. I can run the command over and over again, 2-4 times per
>> second, and it will almost always show something out of sync. The
>> files change, meaning:
>> Example:
>> 1st Run: shows file1 out of sync
>> 2nd run: shows file 2 and file 3 out of sync but file 1 is now in sync (not
>> in the list)
>> 3rd run: shows file 3 and file 4 out of sync but file 1 and 2 are in sync
>> (not in the list).
>> ...
>> nth run: shows 0 files out of sync
>> nth+1 run: shows file 3 and 12 out of sync.
>> From looking at the virtual machines running off this gluster volume,
>> obvious that gluster is working well. However, this obviously plays
>> with Nagios and alerts. Nagios will run the heal info and get different and
>> non-useful results each time, and will send alerts.
>> Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way to tune the
>> settings or change the monitoring method to get better results into
>In 3.6.1 the way the heal info command works is different from that in
>3.5.2. In 3.6.1, it is the self-heal daemon that gathers the entries that
>might need healing. Currently, in 3.6.1, there isn't a method to
>distinguish between a file that is being healed and a file with
>on-going I/O while listing. Hence you also see files undergoing normal
>operations listed in the output of the heal info command.

How did that regression pass?!
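
Until heal info can tell the two cases apart, one possible workaround on the monitoring side is to sample the command several times and alert only on entries that show up in every sample, since files with ordinary in-flight I/O should drop out of the list between runs. The following is only a rough sketch, not a tested plugin. It assumes the heal info output consists of "Brick <host>:<path>" header lines followed by entry lines starting with "/" or "<gfid:", which may differ between releases. The function names, sample count and interval are made up for illustration. The exit codes follow the usual Nagios plugin convention (0 = OK, 1 = WARNING).

```python
import subprocess
import time


def parse_heal_info(text):
    """Return the set of (brick, entry) pairs found in heal info output.

    Assumption: bricks are introduced by lines starting with "Brick ",
    and entries are lines starting with "/" or "<gfid:".
    """
    entries = set()
    brick = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
        elif brick is not None and line.startswith(("/", "<gfid:")):
            entries.add((brick, line))
    return entries


def persistent_entries(samples):
    """Return the entries that are present in every sample."""
    parsed = [parse_heal_info(s) for s in samples]
    if not parsed:
        return set()
    return set.intersection(*parsed)


def nagios_status(samples):
    """Map samples to a (exit_code, message) pair in Nagios plugin style."""
    stuck = persistent_entries(samples)
    if stuck:
        return 1, "WARNING: %d entries listed in all %d samples" % (
            len(stuck), len(samples))
    return 0, "OK: no entry persisted across %d samples" % len(samples)


def collect_samples(volume, count=5, interval=2.0):
    """Run 'gluster volume heal <volume> info' several times.

    count and interval are arbitrary illustrative defaults.
    """
    samples = []
    for i in range(count):
        out = subprocess.check_output(
            ["gluster", "volume", "heal", volume, "info"])
        samples.append(out.decode())
        if i < count - 1:
            time.sleep(interval)
    return samples
```

A check script would call nagios_status(collect_samples("volumename")), print the message and exit with the returned code. The trade-off is a slower check, and a file that is under constant heavy write load could still appear in every sample, so the sample count and interval would need tuning per workload.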

