[Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

Vince Loschiavo vloschiavo at gmail.com
Sat Nov 22 16:42:22 UTC 2014


Thank you for that information.

Are there plans to restore the previous functionality in a later release of
3.6.x? Or is this what we should expect going forward?



On Thu, Nov 20, 2014 at 11:24 PM, Anuradha Talur <atalur at redhat.com> wrote:

>
>
> ----- Original Message -----
> > From: "Joe Julian" <joe at julianfamily.org>
> > To: "Anuradha Talur" <atalur at redhat.com>, "Vince Loschiavo" <
> vloschiavo at gmail.com>
> > Cc: "gluster-users at gluster.org" <Gluster-users at gluster.org>
> > Sent: Friday, November 21, 2014 12:06:27 PM
> > Subject: Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios
> related)
> >
> >
> >
> > On November 20, 2014 10:01:45 PM PST, Anuradha Talur <atalur at redhat.com>
> > wrote:
> > >
> > >
> > >----- Original Message -----
> > >> From: "Vince Loschiavo" <vloschiavo at gmail.com>
> > >> To: "gluster-users at gluster.org" <Gluster-users at gluster.org>
> > >> Sent: Wednesday, November 19, 2014 9:50:50 PM
> > >> Subject: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios
> > >related)
> > >>
> > >>
> > >> Hello Gluster Community,
> > >>
> > >> I have been using the Nagios monitoring scripts, mentioned in the
> > >> thread below, on 3.5.2 with great success. The most useful of these
> > >> is the self-heal check.
> > >>
> > >> However, I've just upgraded to 3.6.1 in the lab and the self-heal
> > >> daemon has become quite aggressive. I continually get alerts/warnings
> > >> on 3.6.1 that virt disk images need self-heal, then they clear. This
> > >> is not the case on 3.5.2.
> > >>
> > >> Configuration:
> > >> 2-node, 2-brick replicated volume with a 2x1GB LAG network between
> > >> the peers, using this volume as a QEMU/KVM virt image store through
> > >> the fuse mount on CentOS 6.5.
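> > >>
> > >> For context, a volume like this would be created and mounted roughly
> > >> as follows (a minimal sketch; the hostnames, brick paths, and mount
> > >> point are placeholders, not my actual names):
> > >>
> > >>   gluster volume create volumename replica 2 \
> > >>       node1:/bricks/volumename node2:/bricks/volumename
> > >>   gluster volume start volumename
> > >>   mount -t glusterfs node1:/volumename /var/lib/libvirt/images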
> > >>
> > >> Example:
> > >> On 3.5.2:
> > >> "gluster volume heal volumename info" shows the bricks and the number
> > >> of entries to be healed: 0
> > >>
> > >> On v3.5.2 - During normal gluster operations, I can run this command
> > >> over and over again, 2-4 times per second, and it will always show 0
> > >> entries to be healed. I've used this as an indicator that the bricks
> > >> are synchronized.
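> > >>
> > >> The check is essentially just counting those entries. A minimal
> > >> sketch of that kind of check (the volume name and the OK/WARNING
> > >> handling here are placeholders, not the exact script from the thread
> > >> below):
> > >>
> > >>   #!/bin/bash
> > >>   # Nagios-style check sketch: sum the "Number of entries:" lines
> > >>   # reported by heal info and warn if anything is pending heal.
> > >>   VOLUME="volumename"
> > >>   ENTRIES=$(gluster volume heal "$VOLUME" info \
> > >>             | awk '/Number of entries:/ {sum += $NF} END {print sum + 0}')
> > >>   if [ "$ENTRIES" -eq 0 ]; then
> > >>       echo "OK - 0 entries awaiting heal on $VOLUME"
> > >>       exit 0
> > >>   else
> > >>       echo "WARNING - $ENTRIES entries awaiting heal on $VOLUME"
> > >>       exit 1
> > >>   fi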
> > >>
> > >> Last night, I upgraded to 3.6.1 in the lab and I'm seeing different
> > >> behavior. Running "gluster volume heal volumename info" during normal
> > >> operations will show a file out-of-sync, seemingly between every block
> > >> written to disk and then synced to the peer. I can run the command
> > >> over and over again, 2-4 times per second, and it will almost always
> > >> show something out of sync. The individual files change, meaning:
> > >>
> > >> Example:
> > >> 1st run: shows file 1 out of sync
> > >> 2nd run: shows file 2 and file 3 out of sync, but file 1 is now in
> > >> sync (not in the list)
> > >> 3rd run: shows file 3 and file 4 out of sync, but file 1 and file 2
> > >> are in sync (not in the list)
> > >> ...
> > >> nth run: shows 0 files out of sync
> > >> nth+1 run: shows file 3 and file 12 out of sync
> > >>
> > >> From looking at the virtual machines running off this gluster volume,
> > >> it's obvious that gluster is working well. However, this plays havoc
> > >> with Nagios alerting: Nagios will run heal info, get different and
> > >> non-useful results each time, and send alerts.
> > >>
> > >> Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way to
> > >> tune the settings or change the monitoring method to get better
> > >> results into Nagios?
> > >>
> > >In 3.6.1, the way the heal info command works is different from that in
> > >3.5.2. In 3.6.1, it is the self-heal daemon that gathers the entries
> > >that might need healing. Currently, in 3.6.1, there isn't a way to
> > >distinguish, while listing, between a file that is being healed and a
> > >file with on-going I/O. Hence files under normal operation are also
> > >listed in the output of the heal info command.
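> > >
> > >Until there is a way to distinguish the two, one possible workaround on
> > >the monitoring side (only a sketch, not a gluster feature) is to run
> > >heal info a few times and report only the entries that appear in every
> > >run, so files with merely in-flight I/O drop out:
> > >
> > >  #!/bin/bash
> > >  # Sketch: poll heal info several times and keep only entries that
> > >  # persist across every poll. Volume name, poll count and interval
> > >  # are placeholders.
> > >  VOLUME="volumename"
> > >  POLLS=3
> > >  INTERVAL=5
> > >  PERSISTENT=""
> > >  for i in $(seq 1 "$POLLS"); do
> > >      # Keep only the entry lines (paths/gfids), drop the headers.
> > >      CURRENT=$(gluster volume heal "$VOLUME" info \
> > >                | grep -vE '^(Brick|Number of entries|Status|$)' | sort -u)
> > >      if [ "$i" -eq 1 ]; then
> > >          PERSISTENT="$CURRENT"
> > >      else
> > >          PERSISTENT=$(comm -12 <(echo "$PERSISTENT") <(echo "$CURRENT"))
> > >      fi
> > >      sleep "$INTERVAL"
> > >  done
> > >  COUNT=$(echo "$PERSISTENT" | grep -c .)
> > >  if [ "$COUNT" -eq 0 ]; then
> > >      echo "OK - no persistent heal entries on $VOLUME"; exit 0
> > >  else
> > >      echo "WARNING - $COUNT persistent heal entries on $VOLUME"; exit 1
> > >  fi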
> >
> > How did that regression pass?!
> Test cases to check this condition were not written in the regression tests.
> >
>
> --
> Thanks,
> Anuradha.
>



-- 
-Vince Loschiavo