[Gluster-users] monitoring gluster replication status

Wed Jul 31 17:29:40 UTC 2013

We've been monitoring gluster with an icinga/nagios plugin we wrote that looks
at gluster volume status output (using --xml) to check that the volume and all
bricks are in the 'started' state.  Additionally, for 'replicate' volumes, we
look at 'gluster volume heal <vol> info' and 'gluster volume heal <vol> info
heal-failed' output.  We count the number of files waiting to be healed and
currently alert if the count stays above 0 for 4 minutes.  For heal-failed,
we look at the list of failed entries and alert if any entry in the list is
from the previous 300 seconds (this list is a running log, rather than a
real-time list of files that are not healed).  We were also going to check
the 'heal <vol> info split-brain' output, but that appears to be just a list
of files, without timestamps, that doesn't get cleaned up once the
split-brain has been resolved and the file successfully healed.  There's no
way to know if the issue is current or was from last month and has already
been resolved.
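
A minimal sketch of the two heal checks described above, in Python.  This is
not the actual plugin; the function names are made up for illustration, and it
assumes that 'heal <vol> info' prints one 'Number of entries: N' line per
brick and that 'heal <vol> info heal-failed' prints each entry as
'YYYY-MM-DD HH:MM:SS <path>'.  Check both assumptions against the output of
your gluster version before relying on it.

```python
import re
from datetime import datetime, timedelta

# Assumed format: one "Number of entries: N" line per brick.
ENTRY_COUNT = re.compile(r"^Number of entries:[ \t]*(\d+)[ \t]*$", re.MULTILINE)

# Assumed format of a heal-failed entry: "YYYY-MM-DD HH:MM:SS <path>".
FAILED_ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[ \t]+(\S.*)$", re.MULTILINE
)


def pending_heal_count(heal_info_output):
    """Sum the per-brick entry counts from 'heal <vol> info' output."""
    return sum(int(n) for n in ENTRY_COUNT.findall(heal_info_output))


def update_pending_since(count, pending_since, now):
    """Return the time the count first became non-zero, or None if it is 0.

    The caller has to persist this value between plugin runs (for example
    in a small state file); that part is not shown here.
    """
    if count == 0:
        return None
    return pending_since if pending_since is not None else now


def pending_alert(pending_since, now, grace=timedelta(minutes=4)):
    """Alert once files have been waiting to heal for at least `grace`."""
    return pending_since is not None and now - pending_since >= grace


def recent_heal_failures(heal_failed_output, now, window=timedelta(seconds=300)):
    """Return (timestamp, path) pairs logged within `window` before `now`.

    Older entries are ignored because the list is a running log.  `now`
    must be in the same timezone as the timestamps gluster prints.
    """
    recent = []
    for stamp, path in FAILED_ENTRY.findall(heal_failed_output):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if timedelta(0) <= now - when <= window:
            recent.append((when, path.rstrip()))
    return recent
```

A check run would feed the command output into pending_heal_count() and
recent_heal_failures(), then go critical if pending_alert() is true or the
list of recent failures is non-empty.  The 4-minute grace period could
probably also be handled by the monitoring system's own retry settings
instead of a state file.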

What we have is better than nothing, but it's not ideal, IMO.  I wish there
were something more like NetApp's snapmirror status output, where you could
monitor how many seconds behind replication is, or the percentage complete
before it has caught up.  Given that gluster is file-based, it's not the same as
snapmirror's snapshot-based replication, so I understand the difficulties
here, but the idea is the same.  There's currently no way that I know of to
see how much gluster needs to replicate, how fast it's happening, how soon it
might be done, etc.  You just get a list of files that gluster thinks need to
get synced and that's about it.  Some of those files might need a full sync
while others might just need attributes changed or a block added.  Gluster
doesn't even really know that until it tries to actually heal the file, AFAIK.

Anyway, that's what we're currently doing.  It's at least something.  We get a
handful of false alerts if we're copying large files or lots of them for many
minutes and gluster hasn't had a chance to catch up entirely, but that's fine
for us.  For the most part, we do very few writes, so gluster replication is
always up-to-date.

I'd be curious to hear what others are doing to monitor gluster.

Todd

On Wed, Jul 31, 2013 at 12:35:05PM +0530, Sejal1 S wrote:
> I have a similar requirement in my setup.
> 
> Please suggest the correct way to verify that replication is working.
> 
> 
> .Sejal
> 
> 
> 
> From: Matthew Sacks <msacksdandb at gmail.com>
> To: gluster-users at gluster.org
> Date: 31-07-2013 03:21
> Subject: [Gluster-users] monitoring gluster replication status
> Sent by: gluster-users-bounces at gluster.org
> 
> 
> 
> Hello,
> From what I've seen, there is no way to monitor cluster status via munin
> or any other method, for that matter.
> 
> How can I ensure replication is working properly?
> 
> Thanks in advance,
> Matt
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
> 
> 
