[Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

Fri Nov 21 07:24:37 UTC 2014


----- Original Message -----
> From: "Joe Julian" <joe at julianfamily.org>
> To: "Anuradha Talur" <atalur at redhat.com>, "Vince Loschiavo" <vloschiavo at gmail.com>
> Cc: "gluster-users at gluster.org" <Gluster-users at gluster.org>
> Sent: Friday, November 21, 2014 12:06:27 PM
> Subject: Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
> 
> 
> 
> On November 20, 2014 10:01:45 PM PST, Anuradha Talur <atalur at redhat.com>
> wrote:
> >
> >
> >----- Original Message -----
> >> From: "Vince Loschiavo" <vloschiavo at gmail.com>
> >> To: "gluster-users at gluster.org" <Gluster-users at gluster.org>
> >> Sent: Wednesday, November 19, 2014 9:50:50 PM
> >> Subject: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
> >> 
> >> 
> >> Hello Gluster Community,
> >> 
> >> I have been using the Nagios monitoring scripts, mentioned in the
> >> thread below, on 3.5.2 with great success. The most useful of these is
> >> the self heal check.
> >> 
> >> However, I've just upgraded to 3.6.1 in the lab and the self-heal
> >> daemon has become quite aggressive. I continually get alerts/warnings
> >> on 3.6.1 that virt disk images need self heal, then they clear. This is
> >> not the case on 3.5.2.
> >> 
> >> Configuration:
> >> 2 node, 2 brick replicated volume with 2x1GB LAG network between the
> >> peers, using this volume as a QEMU/KVM virt image store through the
> >> fuse mount on CentOS 6.5.
> >> 
> >> Example:
> >> on 3.5.2:
> >> gluster volume heal volumename info: shows the bricks and number of
> >> entries to be healed: 0
> >> 
> >> On v3.5.2 - During normal gluster operations, I can run this command
> >> over and over again, 2-4 times per second, and it will always show 0
> >> entries to be healed. I've used this as an indicator that the bricks
> >> are synchronized.
> >> 
> >> Last night, I upgraded to 3.6.1 in the lab and I'm seeing different
> >> behavior. Running gluster volume heal volumename info, during normal
> >> operations, will show a file out of sync, seemingly between every block
> >> written to disk then synced to the peer. I can run the command over and
> >> over again, 2-4 times per second, and it will almost always show
> >> something out of sync. The individual files change, meaning:
> >> 
> >> Example:
> >> 1st run: shows file 1 out of sync
> >> 2nd run: shows file 2 and file 3 out of sync but file 1 is now in sync
> >> (not in the list)
> >> 3rd run: shows file 3 and file 4 out of sync but file 1 and 2 are in
> >> sync (not in the list).
> >> ...
> >> nth run: shows 0 files out of sync
> >> nth+1 run: shows file 3 and 12 out of sync.
> >> 
> >> From looking at the virtual machines running off this gluster volume,
> >> it's obvious that gluster is working well. However, this obviously
> >> plays havoc with Nagios and alerts. Nagios will run the heal info and
> >> get different and non-useful results each time, and will send alerts.
> >> 
> >> Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way to
> >> tune the settings or change the monitoring method to get better
> >> results into Nagios?
> >> 
> >In 3.6.1 the heal info command works differently from how it did in
> >3.5.2. In 3.6.1, it is the self-heal daemon that gathers the entries
> >that might need healing. Currently, in 3.6.1, there isn't a method to
> >distinguish, while listing, between a file that is being healed and a
> >file with ongoing I/O. Hence files under normal operation are also
> >listed in the output of the heal info command.
> 
> How did that regression pass?!
Test cases to check this condition were not written in the regression tests.
> 
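
On the monitoring side, one possible workaround until heal info can tell
these two cases apart is to sample the command several times and alert only
on entries that show up in every sample, since files with ongoing I/O seem
to drop out of the list between runs. Below is a minimal sketch of that
idea, not a tested Nagios plugin. It assumes the output consists of
"Brick <host>:<path>" headers, one entry per line, and a "Number of
entries: N" trailer per brick; the function names and the count/delay
defaults are hypothetical and are not part of gluster or of the existing
Nagios scripts.

```python
import subprocess
import time


def parse_heal_info(text):
    """Return the set of (brick, entry) pairs found in heal info output.

    Assumed format: a "Brick <host>:<path>" header, then one entry per
    line, then a "Number of entries: N" trailer for each brick.
    """
    entries = set()
    brick = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
        elif line.startswith(("Number of entries:", "Status:")):
            continue  # per-brick summary lines, not entries
        elif brick is not None:
            entries.add((brick, line))
    return entries


def persistent_entries(samples):
    """Return only the entries that are present in every sample."""
    if not samples:
        return set()
    result = set(samples[0])
    for sample in samples[1:]:
        result &= sample
    return result


def sample_heal_info(volume, count=5, delay=2.0):
    """Run 'gluster volume heal <volume> info' count times, delay seconds apart."""
    samples = []
    for i in range(count):
        out = subprocess.check_output(
            ["gluster", "volume", "heal", volume, "info"],
            universal_newlines=True,
        )
        samples.append(parse_heal_info(out))
        if i < count - 1:
            time.sleep(delay)
    return samples


def main(argv):
    """Return a Nagios-style exit code: 0 for OK, 1 for WARNING."""
    stuck = persistent_entries(sample_heal_info(argv[1]))
    if stuck:
        print("WARNING: %d entries listed in every heal info sample" % len(stuck))
        return 1
    print("OK: no entry stayed in heal info across all samples")
    return 0
```

A wrapper script would call sys.exit(main(sys.argv)) with the volume name as
the first argument. If heal info appends a status note to some entries, the
same file could compare as different between samples, so the parser would
need to strip that note first.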

-- 
Thanks,
Anuradha.