[Gluster-infra] Postmortem for Jenkins Outage on 20/07/18

Fri Jul 20 14:33:17 UTC 2018

Hello folks,

I had to take down Jenkins for some time today. The server ran out of space
and was silently ignoring Gerrit requests for new jobs. If you think one of
your jobs needed a smoke or regression run and it wasn't triggered, this is
the root cause. Please retrigger your jobs.

## Summary of Impact
Jenkins jobs not triggered intermittently in the last couple of days. At
the moment, we do not have numbers on how many developers were affected by
this. This would be mitigated slightly every day due to the rotation rules
we have in place causing issues only around evening IST when we retrigger
our regular nightly jobs.

## Timeline of Events.
July 19 evening: I've noticed since yesterday that occasionally Jenkins
would not trigger a job for a push. This was on the build-jobs repo. I
chalked it to a signal getting lost in the noise and decided to debug
later. I could trigger it manually, so I put as a thing to do in the
morning. Today morning, I found that jobs are getting triggered as they
should and could not notice anything untoward.

July 20 6:41 pm: Kotresh pinged me asking if there was a problem. I could
see the problem I noticed yesterday in his job. This time a manual trigger
did not work. Around the same time Raghavendra Gowdappa also hit the same
problem. I logged into the server to notice that the Jenkins partition was
out of space.

July 20 7:40 pm: Jenkins is back online completely. A retrigger of the two
failing jobs have been successful.

## Root Cause
* Out of disk space on the Jenkins partition on build.gluster.org
* The bugzilla-post did not delete old jobs and we had about 7000 jobs in
there consuming about 20G of space.
* clang-scan job consumes about 1G per job and we were storing about 30
days worth of archives.

## Resolution
* All centos6-regression jobs are now deleted. We moved over to
centos7-regression a while ago.
* We now only store 7 days of archives for bugzilla-post and clang-scan jobs

## Future Recommendation
* Our monitoring did not alert us about the disk being filled up on the
Jenkins node. Ideally, we should have gotten a warning when we were at
least 90% full so we could plan for additional capacity or look for
mistakes in patterns.
* All jobs need to have a property that discards old runs with the maxmium
of 90 days being kept in case it's absolutely needed. This is currently not
enforced by CI but we will plan to enforce it in the future.

-- 
nigelb
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-infra/attachments/20180720/43bf81fa/attachment.html>