[Gluster-users] Monitoring

Tue May 11 12:58:39 UTC 2010

I have a monitoring setup that has been working well for a while now. It uses Mon, which is very simple and has been around for ages. For others who want to use this (I can do a wiki article eventually), here's a brief howto. In my case, every box is both a server and client so I check for several things, but this can easily be changed. 

First, check that the glusterfsd init script reports "running". Second, that I can connect to the server port. Third, that every gluster mount listed in /etc/fstab is currently mounted. If any of these checks fails, Mon calls an "alert" which tries to fix things. The alert restarts glusterfsd, makes sure the fuse module is loaded (modprobe fuse), runs an ls on each mountpoint in /etc/fstab, and tries to unmount / remount any that respond with "Transport endpoint is not connected" or are not mounted to begin with.

To set this up, here is what you need:

Install Mon on every server that you want to check. In your mon.cf file, define a hostgroup that references localhost. The relevant part of my mon.cf file is as follows:

###################################

### global options
cfbasedir   = /cluster/mon ## note: yours will differ, this is a custom thing
pidfile     = /var/run/mon.pid
statedir    = /var/lib/mon/state.d
logdir      = /var/lib/mon/log.d
dtlogfile   = /var/lib/mon/log.d/downtime.log
alertdir    = /usr/lib/mon/alert.d
mondir      = /usr/lib/mon/mon.d
maxprocs    = 20
histlength  = 100
randstart   = 60s

hostgroup localhost localhost

watch localhost
    service gluster
        interval 30s
        randskew 5s
        monitor gluster.monitor
        period hr {12am-11pm}
        alert fix_gluster.alert
        alert mail.alert -S "GlusterFS monitor is reporting failures" person@yourdomain.com
        numalerts 3

############################

This defines a group of hosts that includes only localhost, a service called gluster that will run the monitor named gluster.monitor, and which will call the alert named fix_gluster.alert if the monitor finds a problem.

Now in the monitors directory (you may have to look for it, some distros keep it in a different place; look for all the "*.monitor" files), create this monitor script named gluster.monitor and make it executable:

############################################
#!/bin/bash
## gluster.monitor
## exits 0 if glusterfsd and all glusterfs mounts look healthy, 1 otherwise

MONITORS="/wherever/the/mon/monitors/are"

## Check the server
## reports status "running" ?
status=$(/etc/init.d/glusterfsd status)

## a plain grep for "running" would also match "not running", so drop that first
if ! echo "$status" | grep -v 'not running' | grep -q running; then
  exit 1
fi

## can connect to server port?
gluster_server_port=$(grep 'option transport.socket.listen-port' /etc/glusterfs/glusterfsd.vol | awk '{print $3}')
"$MONITORS/tcp.monitor" -p "$gluster_server_port" localhost

if [[ $? != 0 ]]; then
  exit 1
fi

## Check the client mounts (skip commented-out fstab lines)
for i in $(grep -v '^[[:space:]]*#' /etc/fstab | grep glusterfs | awk '{print $2}')
do
  ## match " $i " literally so that /mnt/a does not also match /mnt/ab
  if ! mount | grep '^glusterfs' | grep -qF " $i "; then
    # this mount isn't there
    exit 1
  fi

  ## a more detailed check could go here, like an ls or attempted write
  ## we are only checking for the mount at this point

done

exit 0
####################################
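
The "more detailed check" mentioned in the comments above could look something like the following. This is only a sketch, not something from my running setup: the function name check_mount_rw, the probe file name and the 10 second default are made up for illustration, and it assumes the coreutils "timeout" command is available so that a hung mount cannot block the monitor forever.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original monitor): returns 0 if the
# directory can be listed and written to within a time limit, 1 otherwise.
check_mount_rw() {
  local dir="$1"
  local limit="${2:-10}"             # seconds to wait before giving up (assumed default)
  local probe="$dir/.mon_probe.$$"   # hypothetical probe file name

  # a hung glusterfs mount can block indefinitely, so bound each step
  timeout "$limit" ls "$dir" > /dev/null 2>&1 || return 1
  timeout "$limit" touch "$probe" > /dev/null 2>&1 || return 1

  # clean up the probe file so nothing is left behind on the volume
  rm -f "$probe"
  return 0
}
```

Inside the for loop of gluster.monitor you would then call it as: check_mount_rw "$i" || exit 1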

Now in the alerts directory (look for all the "*.alert" files, usually adjacent to the monitors dir), create this alert script named fix_gluster.alert and make it executable:

####################################
#!/bin/bash
# fix_gluster.alert
# restarts glusterfsd and remounts any glusterfs mounts listed in /etc/fstab

logger -t fix_gluster.alert "Attempting glusterfs repair:"

/etc/init.d/glusterfsd stop > /dev/null 2>&1
/etc/init.d/glusterfsd start > /dev/null 2>&1

## make sure the fuse module is loaded, then mount anything that is missing
modprobe fuse
mount -a

## get all gluster mount points listed in /etc/fstab (skip commented-out lines)
for i in $(grep -v '^[[:space:]]*#' /etc/fstab | grep glusterfs | awk '{print $2}')
do
  logger -t fix_gluster.alert "Checking mountpoint $i..."

  if ! mount | grep '^glusterfs' | grep -qF " $i "; then
    logger -t fix_gluster.alert "$i not mounted, attempting remount"
    mount -a
  fi

  ## ls prints this error on stderr, so merge stderr into stdout for grep
  if ls -l "$i" 2>&1 | grep -q 'Transport endpoint is not connected'; then
    logger -t fix_gluster.alert "Mountpoint $i reports 'Transport endpoint is not connected'"
    umount "$i" > /dev/null 2>&1
    /etc/init.d/glusterfsd stop > /dev/null 2>&1
    /etc/init.d/glusterfsd start > /dev/null 2>&1
    mount -a
  fi

  ## final check: can we list the mountpoint now?
  if ls -l "$i" > /dev/null 2>&1; then
    logger -t fix_gluster.alert "Mountpoint $i repaired!"
  fi
done

exit 0
#####################################
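
If you want the alert to log which service and hosts triggered it, something like this could be added. Treat it as a sketch under assumptions: as far as I understand the mon documentation, alerts are invoked with options such as -s service, -g group and -h hosts (plus -t, -l, -u and a few others), but check the man page for your version. The function names parse_alert_args and alert_log_line are made up for this example.

```shell
#!/bin/bash
# Sketch only: parse the options mon is ASSUMED to pass to an alert script.
# The option list is modeled on the stock alerts; verify it on your install.
parse_alert_args() {
  local opt OPTIND=1
  service="" group="" hosts=""
  while getopts ":s:g:h:t:l:uOT" opt; do
    case "$opt" in
      s) service="$OPTARG" ;;   # service name, e.g. "gluster"
      g) group="$OPTARG" ;;     # hostgroup name
      h) hosts="$OPTARG" ;;     # hosts the monitor reported as failed
      *) ;;                     # ignore options this sketch does not use
    esac
  done
}

# Build a one-line description suitable for passing to logger.
alert_log_line() {
  parse_alert_args "$@"
  echo "service=${service:-unknown} group=${group:-unknown} hosts=${hosts:-unknown}"
}
```

Near the top of fix_gluster.alert you could then write: logger -t fix_gluster.alert "$(alert_log_line "$@")"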

Now start Mon, and every 30 seconds it will check all your glusterfs server processes and mount points and try to fix them if they error out / disappear. It will also email you when this happens (numalerts caps how many alerts fire per failure) and log the repair attempt to syslog. Since most distros have mon in their repositories, this should be easy to install and set up. And once you have created these config files, it is a simple matter to copy them to any server you install gluster on.

I personally use Mon for checking all kinds of things. The scripts are so easy to modify that you have a great deal of flexibility in what you can do. The basic idea in Mon is that you have a monitor that does some kind of check on some group of nodes, and it can either exit 0 or 1. If it exits 1, any specified alerts will run next. If it exits 0, Mon waits until the next interval and runs it again. Writing monitors and alerts is very easy, as you can see from the scripts above. Bottom line though, this has worked very well for me in terms of restarting / remounting glusterfs whenever the need arises.
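
To make the exit 0 / exit 1 idea concrete, here is a bare-bones skeleton for a monitor of your own. It is a sketch rather than anything I run: run_monitor is a made-up name, and it assumes Mon passes the members of the hostgroup as command line arguments and uses the first line of the monitor's output as the summary, which is how I understand the Mon docs.

```shell
#!/bin/bash
# Hypothetical skeleton: run a check command once per host argument,
# print the hosts that failed on one line, and return 1 if any failed.
run_monitor() {
  local check="$1"; shift
  local failed=()
  local host

  for host in "$@"; do
    # the check command gets the host name as its only argument
    "$check" "$host" > /dev/null 2>&1 || failed+=("$host")
  done

  if (( ${#failed[@]} > 0 )); then
    echo "${failed[*]}"   # summary line: space-separated failed hosts
    return 1
  fi
  return 0
}
```

A real monitor file would define its own check function (say my_check, also hypothetical) and end with: run_monitor my_check "$@"; exit $?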

Feel free to email / post any questions and I will try to help anyone who wants to implement this. 

Chris