[Gluster-users] RFC: need better monitoring/failure signaling
Joe Landman
landman at scalableinformatics.com
Mon Feb 13 23:04:46 UTC 2012
Hi folks
Just had a failure this morning which didn't leave much in the way of
useful logs ... a gluster process started running up CPU and ignoring
input. No log details, and a simple kill and restart fixed it.
A few days ago, some compute node clients connected via InfiniBand
could see 5 of 6 bricks, though all the rest of the systems could see
all 6. Restarting the mount (umount -l /path ; sleep 30 ; mount /path)
'fixed' it.
The problem was that no one knew there was a problem, and the logs
were (nearly) useless for problem determination. We had to look at the
overall system to find it.
What I'd like to request comments and thoughts on is whether or not
we can extract an external signal of some sort upon detection of an
issue. So in the event of a problem, an external program gets run with an
error number, some text, etc. Sort of like what mdadm does for MD RAID
arrays. Alternatively, a nice simple monitoring port of some sort, which
we could open and read until EOF, and which reports the current (error)
state, would be tremendously helpful.
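To make that concrete, here is roughly what the consumer side could look
like. This is only a sketch -- the health port number, the one-line-per-event
format, and the handler path are all invented here for illustration; nothing
like this exists in gluster today:

#!/usr/bin/env python
# Sketch only: gluster has no such health port today.  The port number, the
# "ERRNUM<tab>message" line format, and the handler path are all assumptions.
import socket
import subprocess

HEALTH_PORT = 24999                                 # hypothetical health port
HANDLER = "/usr/local/sbin/gluster-event-handler"   # mdadm PROGRAM-style hook

def poll_health(host):
    """Open the health port, read until EOF, return a list of (errnum, text)."""
    sock = socket.create_connection((host, HEALTH_PORT), timeout=10)
    data = sock.makefile().read()                   # server writes state, then closes
    sock.close()
    events = []
    for line in data.splitlines():
        if line.strip():
            errnum, text = line.split("\t", 1)
            events.append((int(errnum), text))
    return events

if __name__ == "__main__":
    for errnum, text in poll_health("brick01"):
        if errnum != 0:
            # Hand the event off to an external program, mdadm-style.
            subprocess.call([HANDLER, str(errnum), text])

A cron job or monitoring daemon could then hit each brick host every minute
or so and wire any non-zero events into whatever alerting is already in place.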
What we are looking for is basically a way to monitor the system.
Not performance monitoring, but health monitoring.
Yes, we can work on a hacked-up version of this ... I've done
something like this in the past. What we really want is to figure out how
to expose enough internal state to create a reasonable "health" monitor
for bricks.
I know there is a Nagios plugin of some sort, and other similar
tools. What I am looking for is to get a discussion going on what this
capability should minimally consist of. Given the layered
nature of gluster, it might be harder to pass errors up and down through
translator layers. But if we could connect something to the logging
system to specifically signal important events, to some place other than
the log, and do so in real time (again, the mdadm model is perfect),
then we would be in good shape. I don't know if this is showing up in 3.3,
though this type of monitoring capability seems an obvious fit
going forward.
Unfortunately, monitoring gluster health is something of a problem
right now, and I don't see easy answers short of building a log parser.
So something that lets us periodically inquire about volume/brick
health/availability would be (very) useful.
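For what it's worth, the log-parser fallback looks something like the
following. Again just a sketch: I'm assuming the usual log location under
/var/log/glusterfs and the "[timestamp] E [...]" severity convention, and the
brick log name and handler path are placeholders -- adjust to taste:

#!/usr/bin/env python
# Sketch of the log-parser fallback: follow a gluster log and fire an external
# handler on error/critical lines.  The log path and the severity convention
# are assumptions -- check them against what your install actually writes.
import re
import time
import subprocess

LOGFILE = "/var/log/glusterfs/bricks/data.log"      # placeholder brick log
HANDLER = "/usr/local/sbin/gluster-event-handler"   # same hypothetical hook
ERROR_RE = re.compile(r"^\[[^\]]+\]\s+[EC]\s")      # E = error, C = critical

def follow(path):
    """Yield lines appended to path, like 'tail -f'."""
    with open(path) as f:
        f.seek(0, 2)                                # start at end of file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line.rstrip("\n")

if __name__ == "__main__":
    for line in follow(LOGFILE):
        if ERROR_RE.match(line):
            subprocess.call([HANDLER, "1", line])

It works, but it is fragile -- log formats shift between releases, and it
tells you nothing when a process hangs without logging anything, which is
exactly what happened this morning.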
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615