[Gluster-infra] Do we have a monitoring system on our builders?

Michael Scherer mscherer at redhat.com
Mon Apr 29 09:30:09 UTC 2019


On Monday 29 April 2019 at 10:17 +0200, Michael Scherer wrote:
> On Saturday 27 April 2019 at 22:18 +0300, Yaniv Kaul wrote:
> > I'd like to see what our status is.
> > Just had CI failures[1] because builder26.int.rht.gluster.org is not
> > available, apparently.
> 
> We have nagios too. The web interface is password protected, so I
> can't share it (I'd need to create a guest account, and so far no one
> has expressed interest in that).

So, if folks want to see nagios, it's on nagios.gluster.org: login is
"guest", password is "gluster" (cf.
https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/nagios/tasks/httpd.yml#L24
).

That's a guest account, so read-only (or so it should be).

> 
> This failure is weird, because the builder is pretty much up and
> running, but it seems the jenkins agent crashed. This exact process is
> not monitored by nagios, as I was under the impression that jenkins
> was smart enough to start it on demand (seems I was wrong), and/or to
> see that it crashed and put the server out of rotation (seems I was
> wrong on that one too).
> 
> 
> I suspect this was related to the openjdk upgrade on the 20th on
> builder26. Since jenkins does not support that on the main server, I
> guess it may also be unstable on the agent side :/
> 
> I disconnected/reconnected the builder, which should fix it for this
> one, but we definitely need to dig a bit more to see what happened and
> how to prevent that.
> 
> Adding supervision of the agent should be quick (*cough* famous last
> words *cough*), so let's do that as a first step.

OK, so after discussing with Deepshika, who already fixed a few
servers, it seems the issue is not something that can be seen
externally, but only from within jenkins. The agent is running, but
failing, which makes it tricky.
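
As a rough illustration of "seen from within jenkins": Jenkins exposes
its own view of each agent through the JSON API at /computer/api/json.
The sketch below lists the agents that Jenkins itself reports as
offline. It assumes that this kind of failure shows up in the "offline"
field, which is not confirmed here since the agent process stays up;
the function names are made up for the example.

```python
import json
from urllib.request import urlopen


def offline_agents(payload):
    """Return (name, reason) pairs for agents Jenkins reports as offline.

    `payload` is the decoded JSON from <jenkins>/computer/api/json.
    """
    result = []
    for computer in payload.get("computer", []):
        if computer.get("offline"):
            reason = computer.get("offlineCauseReason") or "unknown"
            result.append((computer.get("displayName", "?"), reason))
    return result


def fetch_computers(jenkins_url):
    """Fetch the agent list from a Jenkins server (may need credentials)."""
    with urlopen(jenkins_url.rstrip("/") + "/computer/api/json") as resp:
        return json.load(resp)
```

A nagios check could call offline_agents(fetch_computers(url)) with the
URL of the Jenkins server and go critical when the list is not empty.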


We have a few options, and I think here is one that could work (until
we move to on-demand for all builders).

1) do not automatically upgrade the jdk
   (easy, drop a file to skip that package)

2) every week, run a jenkins job that
   - does the upgrade of openjdk
   - reboots the builders
   (running it as a job makes sure we do not reboot, at random, a
   builder that is building something)

The tricky part is that I do not know how to write that in a way that
runs:
- on all builders
- serially


-- 
Michael Scherer
Sysadmin, Community Infrastructure


