[Gluster-infra] Postmortem for gluster jenkins disk full outage on the 15th of August

Wed Aug 15 09:25:44 UTC 2018

On Wednesday, 15 August 2018 at 14:50 +0530, Sankarshan Mukhopadhyay
wrote:
> Thank you for (a) addressing the issue and (b) this write up
> 
> Does the -infra team have a way to monitor disk space usage?

Munin:
http://munin.gluster.org/munin/rht.gluster.org/jenkins-el7.rht.gluster.org/index.html#disk

It seems we did have notifications, but they were turned off (by me) in
May 2016 with a laconic "receiving too much of them for now". I guess it
was sending too many false positives and we didn't spend time fixing that.

I have wanted to move it out of Rackspace for a long time, since it
can't monitor the internal network, and also to move to Nagios for
alerting (since you can filter alerts).
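To illustrate the kind of threshold check that such alerting would rely on, here is a minimal sketch in Python. The 80%/90% thresholds and the "/" path are assumptions made for the example, not our actual Munin or Nagios configuration, and the Jenkins partition mount point would have to be substituted.

```python
import shutil


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios-style status string.

    The default thresholds are illustrative assumptions.
    """
    if used_percent >= crit:
        return "CRITICAL"
    if used_percent >= warn:
        return "WARNING"
    return "OK"


def check_path(path, warn=80.0, crit=90.0):
    """Return (status, used_percent) for the filesystem holding `path`.

    Uses used/total, which can differ slightly from what `df` reports
    on filesystems with blocks reserved for root.
    """
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    return classify(used_percent, warn, crit), used_percent


if __name__ == "__main__":
    # "/" is a placeholder path for the example.
    status, pct = check_path("/")
    print(f"{status}: / is {pct:.1f}% full")
```

Warning well below 100% is the point: it leaves time to clean up or grow the partition before jobs start failing.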

> On Wed, Aug 15, 2018 at 2:40 PM Michael Scherer <mscherer at redhat.com>
> wrote:
> > 
> > Hi folks,
> > 
> > So the Gluster Jenkins disk was full today (because outages do not
> > respect public holidays in India (Independence Day) and France
> > (Assumption)), so here is the post mortem for your reading pleasure.
> > 
> > Date: 15/08/2018
> > 
> > Service affected:
> >   Jenkins for Gluster (jenkins-el7.rht.gluster.org)
> > 
> > Impact:
> > 
> >   No jenkins job could be triggered.
> > 
> > Root cause:
> > 
> >   The disk filled up, mainly because we got new jobs and more
> >   patches, so regular growth.
> > 
> > Resolution:
> > 
> >   Increased the disk by 30G, and investigating if cleanup could be
> >   improved. This did require a reboot.
> > 
> > 
> > Involved people:
> > - misc
> > - nigel
> > 
> > Lessons learned
> > - What went well:
> >   - we had a documented process for that, and it was good enough to
> >     be used by a tired admin.
> > 
> > - What went badly:
> >   - we weren't proactive enough to see that before it caused an
> >     outage
> >   - the 15th of August is a holiday in both France and India.
> >     Technically, none of the infra team should have been up.
> > 
> > - Where we were lucky:
> >   - It was a day off in India, so few people were affected, except
> >     folks who continue to work on days off
> >   - Misc had decided to work while in Brno, in order to take days
> >     off later
> > 
> > 
> > Timeline (in UTC)
> > 
> > - 05:58 Amar posts a mail to say "smoke job fail" on gluster-infra:
> > https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html
> > 
> > - 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is
> > away from his laptop for Independence Day celebrations.
> > 
> > - 06:24 Misc does not hear the ding since he is asleep
> > 
> > - 06:55 Sankarshan opens a bug on it:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1616160
> > 
> > - 06:56 Misc does not see the email since he is still asleep
> > 
> > - 07:13 Misc wakes up, sees a blinking light on the phone and ponders
> > closing his eyes again. He looks at it, and starts to swear.
> > 
> > - 07:14 Investigation reveals that the Jenkins partition is full
> > (100%). A quick investigation does not yield any particular issue.
> > The Jenkins jobs are taking space and that's it.
> > 
> > - 07:19 After discussion with Nigel, it is decided to increase the
> > size of the partition. Misc takes a look at it and tries to increase
> > it, without any luck. The server is rebooted in case that is what was
> > needed. Still not enough.
> > 
> > - 07:25 Misc goes for a quick shower to wake himself up. The warm
> > embrace of water makes him remember that documentation on that
> > process does exist:
> > 
> > https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html
> > 
> > - 07:30 Following the documentation, we discover that the hypervisor
> > is now out of space for future increases. Looking at that will be
> > done after the post mortem.
> > 
> > - 07:37 Jenkins is being restarted, with more space, and seems to
> > work OK.
> > 
> > - 07:38 Misc rushes to his hotel breakfast, which closes at 10.
> > 
> > - 09:09 Post mortem is finished and being sent
> > 
> > 
> > Action items:
> > - (misc) see what can be done for myrmicinae (the hypervisor where
> > jenkins is running) since there is no more space.
> > 
> > Potential improvement to make:
> > - we still need to have monitoring in place
> > - we need to move Munin into the internal LAN to look at the graphs
> > for Jenkins
> > - documentation regarding resizing could be clearer, notably on the
> > volume resizing part
> > 
> > 
> > --
> > Michael Scherer
> > Sysadmin, Community Infrastructure and Platform, OSAS
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
