[Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

Fri Nov 21 06:36:27 UTC 2014

On November 20, 2014 10:01:45 PM PST, Anuradha Talur <atalur at redhat.com> wrote:
>
>
>----- Original Message -----
>> From: "Vince Loschiavo" <vloschiavo at gmail.com>
>> To: "gluster-users at gluster.org" <Gluster-users at gluster.org>
>> Sent: Wednesday, November 19, 2014 9:50:50 PM
>> Subject: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
>> 
>> 
>> Hello Gluster Community,
>> 
>> I have been using the Nagios monitoring scripts, mentioned in the below
>> thread, on 3.5.2 with great success. The most useful of these is the
>> self heal.
>> 
>> However, I've just upgraded to 3.6.1 in the lab and the self heal daemon
>> has become quite aggressive. I continually get alerts/warnings on 3.6.1
>> that virt disk images need self heal, then they clear. This is not the
>> case on 3.5.2.
>> 
>> Configuration:
>> 2 node, 2 brick replicated volume with 2x1GB LAG network between the
>> peers, using this volume as a QEMU/KVM virt image store through the fuse
>> mount on CentOS 6.5.
>> 
>> Example:
>> on 3.5.2:
>> gluster volume heal volumename info: shows the bricks and number of
>> entries to be healed: 0
>> 
>> On v3.5.2 - During normal gluster operations, I can run this command over
>> and over again, 2-4 times per second, and it will always show 0 entries
>> to be healed. I've used this as an indicator that the bricks are
>> synchronized.
>> 
>> Last night, I upgraded to 3.6.1 in lab and I'm seeing different behavior.
>> Running gluster volume heal volumename info, during normal operations,
>> will show a file out-of-sync, seemingly between every block written to
>> disk then synced to the peer. I can run the command over and over again,
>> 2-4 times per second, and it will almost always show something out of
>> sync. The individual files change, meaning:
>> 
>> Example:
>> 1st Run: shows file1 out of sync
>> 2nd run: shows file 2 and file 3 out of sync but file 1 is now in sync
>> (not in the list)
>> 3rd run: shows file 3 and file 4 out of sync but file 1 and 2 are in sync
>> (not in the list).
>> ...
>> nth run: shows 0 files out of sync
>> nth+1 run: shows file 3 and 12 out of sync.
>> 
>> From looking at the virtual machines running off this gluster volume,
>> it's obvious that gluster is working well. However, this obviously plays
>> havoc with Nagios and alerts. Nagios will run the heal info and get
>> different and non-useful results each time, and will send alerts.
>> 
>> Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way to tune
>> the settings or change the monitoring method to get better results into
>> Nagios?
>> 
>In 3.6.1 the heal info command works differently from the way it does in
>3.5.2. In 3.6.1, it is the self-heal daemon that gathers the entries that
>might need healing. Currently, in 3.6.1, there is no way to distinguish
>between a file that is being healed and a file with ongoing I/O while
>listing. Hence files under normal operation are also listed in the
>output of the heal info command.
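
One possible way to adapt the Nagios check to this behavior, offered as a
sketch and not as a tested plugin: sample heal info several times and warn
only about entries that appear in every sample, since files with ordinary
in-flight I/O tend to drop out of the list between runs (as in the
file1/file2/file3 example above). The function names, the sample count, and
the interval are illustrative assumptions. The parser also assumes that
entries are printed one per line and start with "/" or "<gfid:", which
should be checked against the actual 3.6.1 output. A file under continuous
write may still show up in every sample, so the values would need tuning.

```python
import subprocess
import time

# Standard Nagios plugin exit codes.
OK, WARNING, UNKNOWN = 0, 1, 3


def parse_heal_info(output):
    """Collect entries listed by heal info (assumed format: one per line,
    starting with "/" or "<gfid:"; brick and count lines are ignored)."""
    entries = set()
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("/") or line.startswith("<gfid:"):
            entries.add(line)
    return entries


def persistent_entries(samples):
    """Return entries present in every sample; transient ones drop out."""
    if not samples:
        return set()
    result = set(samples[0])
    for sample in samples[1:]:
        result &= sample
    return result


def run_heal_info(volume):
    """Run the heal info command and return its standard output."""
    cmd = ["gluster", "volume", "heal", volume, "info"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def check_heal(volume, samples=3, interval=2.0, run=run_heal_info):
    """Sample heal info `samples` times and return a Nagios exit code."""
    seen = []
    try:
        for i in range(samples):
            seen.append(parse_heal_info(run(volume)))
            if i < samples - 1:
                time.sleep(interval)
    except (OSError, subprocess.CalledProcessError) as exc:
        print("UNKNOWN: heal info failed: %s" % exc)
        return UNKNOWN
    stuck = persistent_entries(seen)
    if stuck:
        print("WARNING: %d entries listed in all %d samples" % (len(stuck), samples))
        return WARNING
    print("OK: no entry listed in all %d samples" % samples)
    return OK
```

A wrapper script would call something like
sys.exit(check_heal("volumename")) from the Nagios command definition.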

How did that regression pass?!