[Gluster-infra] Testing watchdog on some nodes

Sun Sep 13 20:30:37 UTC 2015

On Sunday 13 September 2015 at 21:50 +0200, Niels de Vos wrote:
> On Sun, Sep 13, 2015 at 07:02:38PM +0200, Michael Scherer wrote:
> > On Tuesday 08 September 2015 at 14:56 +0200, Michael Scherer wrote:
> > > On Tuesday 08 September 2015 at 11:21 +0200, Michael Scherer wrote:
> > > > Hi,
> > > > 
> > > > So since some nodes are stuck and likely need a reboot, I will test
> > > > using a software watchdog on them to reboot when there is an issue, as a
> > > > stop gap until we find what breaks them (but breakage is likely to happen
> > > > anyway, that's why we have tests).
> > > > 
> > > > I will take slave21 and 22 as guinea pigs, so please do not reboot them
> > > > if they are stuck (or if you do, tell me). If anything weird happens on
> > > > them (like a reboot during a test, or this kind of stuff), please tell me
> > > > too.
> > > >
> > > > I guess 2 to 3 weeks of tests should be enough to see if we can push
> > > > that to other CentOS 6 slaves.
> > > 
> > > It seems we are missing a kernel module on CentOS for watchdog support, so
> > > it might be less effective.
> > 
> > And slave22 didn't reboot as planned, so the watchdog does not seem to work.
> 
> If a watchdog inside a VM does not work, it is most likely due to some
> kernel issues. You could also try to enable kdump or set some kernel
> options, for ideas see 'sysctl -a | grep panic'.

Yeah, the kernel-level one is not enabled. So I tried one in
userspace, but I guess it might have been killed as well by whatever
killed openssh and/or cron and/or the rest.
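
For reference, the kernel.* part of what 'sysctl -a | grep panic' lists can
also be read straight from /proc/sys/kernel. A minimal sketch in Python,
assuming a Linux host; the function name read_panic_settings and its base
parameter are made up for illustration, and which entries exist depends on
how the kernel was built:

```python
import os


def read_panic_settings(base="/proc/sys/kernel"):
    """Return {name: value} for every entry under `base` whose name contains 'panic'."""
    settings = {}
    if not os.path.isdir(base):
        return settings
    for name in sorted(os.listdir(base)):
        if "panic" not in name:
            continue
        path = os.path.join(base, name)
        try:
            with open(path) as handle:
                settings[name] = handle.read().strip()
        except OSError:
            # Some entries may be unreadable without privileges; skip them.
            continue
    return settings


if __name__ == "__main__":
    for name, value in read_panic_settings().items():
        print("kernel.%s = %s" % (name, value))
```

This only reports the current values; changing them (for example, so the
node reboots after a panic) would still have to be done with sysctl.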

> Maybe we can set up a watchdog in Jenkins and trigger the reboot-vm job
> (which issues a Rackspace API call, not from within the VM) when the watchdog
> does not respond? I've seen an option somewhere to configure Jenkins jobs
> according to a schedule... But maybe it is easier to have SaltStack
> monitor and reboot VMs?

I was looking at that. I have the first half of the script, now I just
need to find novaclient for EL7 (i.e., find which channel to enable in
RHEL and what impact it has...).
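
To make the idea of an external check concrete, here is a rough sketch, not
the actual script. It probes the SSH port of a slave from another machine
and calls a reboot hook after several consecutive failures. The names
port_is_open and ExternalWatchdog, the threshold of 3, and the reboot
callable are all assumptions for illustration; the real reboot would be
whatever the rax-reboot script or novaclient ends up providing.

```python
import socket


def port_is_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


class ExternalWatchdog:
    """Call `reboot(host)` once `max_failures` consecutive probes have failed."""

    def __init__(self, probe, reboot, max_failures=3):
        self.probe = probe            # callable(host) -> bool, hypothetical hook
        self.reboot = reboot          # callable(host), hypothetical hook
        self.max_failures = max_failures
        self.failures = {}            # host -> consecutive failed probes

    def check(self, host):
        """Probe one host; return True if a reboot was triggered."""
        if self.probe(host):
            self.failures[host] = 0
            return False
        self.failures[host] = self.failures.get(host, 0) + 1
        if self.failures[host] >= self.max_failures:
            self.failures[host] = 0   # start counting again after the reboot
            self.reboot(host)
            return True
        return False
```

Something like ExternalWatchdog(port_is_open, reboot=print) called in a loop
for slave21 and slave22 would be enough to try it out. The failure counter
lives in memory, so this assumes one long-running process rather than a new
process per check.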

> The Python script for rebooting VMs and a default configuration file can
> be found here:
> 
>   https://github.com/gluster/glusterfs-patch-acceptance-tests/tree/master/rax-reboot

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
