[Gluster-infra] Postmortem for gluster jenkins disk full outage on the 15th of August

Wed Aug 15 09:10:04 UTC 2018

Hi folks,

So the Gluster Jenkins disk was full today (because outages do not
respect public holidays in India (Independence Day) and France
(Assumption)). Here is the post mortem for your reading pleasure.

Date: 15/08/2018

Service affected:
  Jenkins for Gluster (jenkins-el7.rht.gluster.org)

Impact:

  No jenkins job could be triggered.

Root cause:

  The disk filled up, mainly because we got new jobs and more patches,
  so regular growth.

Resolution:

  Increased the disk by 30G, and we are investigating whether cleanup
  could be improved. This did require a reboot.

Involved people:
- misc
- nigel

Lessons learned
- What went well:
  - we had a documented process for that, good enough to be used by
    a tired admin.

- What went badly:
  - we weren't proactive enough to see that before it caused an outage
  - The 15th of August is a holiday in both France and India.
    Technically, none of the infra team should have been up.

- Where we were lucky:
  - It was a day off in India, so few people were affected, except
    folks who continue to work on days off
  - Misc had decided to work while in Brno, and to take his days off
    later

Timeline (in UTC)

- 05:58 Amar posts a mail saying "smoke job fail" on gluster-infra:
https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html

- 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is
away from his laptop for the Independence Day celebration.

- 06:24 Misc does not hear the ding since he is asleep.

- 06:55 Sankarshan opens a bug on it:
https://bugzilla.redhat.com/show_bug.cgi?id=1616160

- 06:56 Misc does not see the email since he is still asleep.

- 07:13 Misc wakes up, sees a blinking light on the phone and ponders
closing his eyes again. He looks at it, and starts to swear.

- 07:14 Investigation reveals that the Jenkins partition is full
(100%). A quick look does not turn up any particular issue. The Jenkins
jobs are taking space and that's it.

- 07:19 After discussion with Nigel, it is decided to increase the size
of the partition. Misc takes a look at it and tries to increase it,
without any luck. The server is rebooted in case that's what was
needed. Still not enough.

- 07:25 Misc goes for a quick shower to wake himself up. The warm
embrace of water makes him remember that documentation on that process
does exist:

https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html

- 07:30 Following the documentation, we discover that the hypervisor
is now out of space for future increases. Looking at that will be done
after the post mortem.

- 07:37 Jenkins is restarted, with more space, and seems to work ok.

- 07:38 Misc rushes to his hotel breakfast, which closes at 10.

- 09:09 Post mortem is finished and sent.

Action items:
- (misc) see what can be done for myrmicinae (the hypervisor where
jenkins is running) since there is no more space.

Potential improvements to make:
- we still need to have monitoring in place
- we need to move munin into the internal LAN so we can look at the
graphs for jenkins
- documentation regarding resizing could be clearer, notably on the
volume resizing part
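
As an illustration of the monitoring point above, here is a minimal
sketch of a disk usage check that would have warned before the
partition hit 100%. This is not what we run today: the 80%/90%
thresholds, the function names and the "/" path are all assumptions
picked for the example, and a real setup would more likely be a munin
or nagios check.

```python
import shutil

# Thresholds are illustrative assumptions, not values used by the infra team.
WARN_PERCENT = 80.0
CRIT_PERCENT = 90.0


def used_percent(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(percent, warn=WARN_PERCENT, crit=CRIT_PERCENT):
    """Map a usage percentage to a nagios-style status string."""
    if percent >= crit:
        return "CRITICAL"
    if percent >= warn:
        return "WARNING"
    return "OK"


def check_disk(path):
    """Return (status, percent) for the filesystem holding `path`."""
    percent = used_percent(path)
    return classify(percent), percent


if __name__ == "__main__":
    # "/" is a placeholder; point this at the partition holding the Jenkins data.
    status, percent = check_disk("/")
    print(f"{status}: {percent:.1f}% used")
```

Run from cron and wired to a mail or chat alert, something like this
would give some warning before the disk fills up, instead of a wake-up
call after the fact.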

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
