[Gluster-infra] Postmortem for gluster jenkins disk full outage on the 15th of August

Wed Aug 15 09:20:07 UTC 2018

Thank you for (a) addressing the issue and (b) this write-up.

Does the -infra team have a way to monitor disk space usage?
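
As a rough illustration of the kind of check I have in mind, here is a minimal sketch in Python. The function names, the path and the 80% threshold are all assumptions made for the example, not something the infra team runs today; a real setup would more likely hook into munin or a similar tool.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage.

    Note: this is used/total, so it can differ slightly from the
    percentage `df` prints, which accounts for reserved blocks.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_disk(path="/", warn_at=80.0):
    """Return (ok, percent); ok is False once usage reaches `warn_at`.

    `warn_at` is a hypothetical threshold chosen for illustration.
    """
    percent = disk_usage_percent(path)
    return percent < warn_at, percent


if __name__ == "__main__":
    # "/" is a placeholder; point this at the partition Jenkins writes to.
    ok, percent = check_disk("/", warn_at=80.0)
    status = "OK" if ok else "WARNING"
    print(f"{status}: / is {percent:.1f}% full")
```

Run from cron and wired to a mail or chat notification, something like this would warn before the partition reaches 100%.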

On Wed, Aug 15, 2018 at 2:40 PM Michael Scherer <mscherer at redhat.com> wrote:
>
> Hi folks,
>
> So the Gluster Jenkins disk was full today (because outages do not respect
> public holidays in India (Independence Day) and France (Assumption)).
> Here is the post mortem for your reading pleasure.
>
> Date: 15/08/2018
>
> Service affected:
>   Jenkins for Gluster (jenkins-el7.rht.gluster.org)
>
> Impact:
>
>   No jenkins job could be triggered.
>
> Root cause:
>
>   The disk filled up, mainly because we got new jobs and more patches, so
> regular growth.
>
> Resolution:
>
>   Increased the disk by 30G, and we are investigating whether cleanup
>   could be improved. This did require a reboot.
>
>
> Involved people:
> - misc
> - nigel
>
> Lessons learned
> - What went well:
>   - we had a documented process for that, and it was good enough to be
>     used by a tired admin.
>
> - What went bad:
>   - we weren't proactive enough to see that before it caused an outage
>   - the 15th of August is a holiday for both France and India. Technically,
>     none of the infra team should have been up.
>
> - Where we were lucky:
>   - It was a day off in India, so few people were affected, except
>     folks who continue to work on days off
>   - Misc decided to work while in Brno, in order to take days off
>     later
>
>
> Timeline (in UTC)
>
> - 05:58 Amar posts a mail to say "smoke job fail" on gluster-infra:
> https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html
>
> - 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is
> away from his laptop for Independence Day celebrations.
>
> - 06:24 Misc does not hear the ding since he is asleep
>
> - 06:55 Sankarshan opens a bug on it:
> https://bugzilla.redhat.com/show_bug.cgi?id=1616160
>
> - 06:56 Misc does not see the email since he is still asleep
>
> - 07:13 Misc wakes up, sees a blinking light on the phone and ponders
> closing his eyes again. He looks at it, and starts to swear.
>
> - 07:14 Investigation reveals that the Jenkins partition is full (100%). A
> quick look does not yield any particular issue. The Jenkins
> jobs are taking space and that's it.
>
> - 07:19 After discussion with Nigel, it is decided to increase the size
> of the partition. Misc takes a look at it and tries to increase it, without
> any luck. The server is rebooted in case that's what was needed. Still not
> enough.
>
> - 07:25 Misc goes for a quick shower to wake himself up. The warm embrace
> of water makes him remember that documentation on that process does exist:
>
https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html
>
> - 07:30 Following the documentation, we discover that the hypervisor
> is now out of space for future increases. Looking into that will be done
> after the post mortem.
>
> - 07:37 Jenkins is being restarted, with more space, and seems to work
> ok.
>
> - 07:38 Misc rushes to his hotel breakfast, which closes at 10.
>
> - 09:09 Post mortem is finished and being sent
>
>
> Action items:
> - (misc) see what can be done for myrmicinae (the hypervisor where
> jenkins is running) since there is no more space.
>
> Potential improvements to make:
> - we still need to have monitoring in place
> - we need to move munin into the internal LAN so we can look at the
> graphs for Jenkins
> - documentation regarding resizing could be clearer, notably on the volume
> resizing part
>
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS

-- 
sankarshan mukhopadhyay
<https://about.me/sankarshan.mukhopadhyay>