[Gluster-infra] [shadow-it] Postmortem for gluster jenkins disk full outage on the 15th of August

Thu Aug 16 14:46:14 UTC 2018

Not amusing that it's been closed because it was reported against a
particular version of Gluster. Could somebody re-open it, please?

On Wed, Aug 15, 2018 at 10:38 PM Nigel Babu <nigelb at redhat.com> wrote:

> On Wed, Aug 15, 2018 at 2:41 PM Michael Scherer <mscherer at redhat.com>
> wrote:
>
>> Hi folks,
>>
>> So the Gluster Jenkins disk was full today (because outages do not respect
>> public holidays in India (Independence Day) and France (Assumption)),
>> here is the post mortem for your reading pleasure.
>>
>> Date: 15/08/2018
>>
>> Service affected:
>>   Jenkins for Gluster (jenkins-el7.rht.gluster.org)
>>
>> Impact:
>>
>>   No jenkins job could be triggered.
>>
>> Root cause:
>>
>>   The disk filled up, mainly because we got new jobs and more patches, so
>> regular growth.
>>
>> Resolution:
>>
>>   Increased the disk by 30G, and we are investigating whether cleanup
>>   could be improved. This required a reboot.
>>
>>
>> Involved people:
>> - misc
>> - nigel
>>
>> Lessons learned
>> - What went well:
>>   - we had a documented process for that, good enough to be used by
>>     a tired admin.
>>
>> - What went bad:
>>   - we weren't proactive enough to see that before it caused an outage
>>   - the 15th of August is a holiday in both France and India. Technically,
>>     none of the infra team should have been up.
>>
>> - Where we were lucky:
>>   - It was a day off in India, so few people were affected, except
>>     folks who continue to work on days off
>>   - Misc decided to work while in Brno, in order to take days off
>>     later
>>
>>
>> Timeline (in UTC)
>>
>> - 05:58 Amar posts a mail to say "smoke job fail" on gluster-infra:
>> https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html
>>
>> - 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is
>> away from his laptop for Independence Day celebrations.
>>
>> - 06:24 Misc does not hear the ding since he is asleep
>>
>> - 06:55 Sankarshan opens a bug on it:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1616160
>>
>> - 06:56 Misc does not see the email since he is still asleep
>>
>> - 07:13 Misc wakes up, sees a blinking light on the phone and ponders
>> closing his eyes again. He looks at it, and starts to swear.
>>
>> - 07:14 Investigation reveals that the Jenkins partition is full (100%). A
>> quick investigation does not yield any particular issue. The Jenkins
>> jobs are taking space and that's it.
>>
>> - 07:19 After discussion with Nigel, it is decided to increase the size
>> of the partition. Misc takes a look at it and tries to increase it, without
>> any luck. The server is rebooted in case that's what was needed. Still not
>> enough.
>>
>> - 07:25 Misc goes for a quick shower to wake himself up. The warm embrace
>> of water makes him remember that documentation on that process does exist:
>>
>> https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html
>>
>> - 07:30 Following the documentation, we discover that the hypervisor
>> is now out of space for future increases. Looking at that will be done
>> after the post mortem.
>>
>> - 07:37 Jenkins is being restarted, with more space, and seems to work
>> ok.
>>
>> - 07:38 Misc rushes to his hotel breakfast, which closes at 10.
>>
>> - 09:09 Post mortem is finished and being sent
>>
>>
>> Action items:
>> - (misc) see what can be done for myrmicinae (the hypervisor where
>> jenkins is running) since there is no more space.
>>
>> Potential improvement to make:
>> - we still need to have monitoring in place
>> - we need to move munin into the internal LAN to look at the graphs
>> for Jenkins
>> - the documentation on resizing could be clearer, notably the volume
>> resizing part
>>
>
> This highlights that we need to solve
> https://bugzilla.redhat.com/show_bug.cgi?id=1564372 on priority. The lack
> of monitoring is affecting day-to-day work.
>
> --
> nigelb
>
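
To illustrate the kind of proactive check the postmortem says was missing, here is a minimal sketch of a disk usage check. It is not the infra team's tooling: the function names, the 80%/90% thresholds and the "/" path are all assumptions made for the example, and the real Jenkins partition path is not given in the thread.

```python
import shutil


def usage_percent(total_bytes, used_bytes):
    """Return used space as a percentage of the total size."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def classify(pct, warn_at=80.0, crit_at=90.0):
    """Map a usage percentage to a level (thresholds are hypothetical)."""
    if pct >= crit_at:
        return "CRITICAL"
    if pct >= warn_at:
        return "WARNING"
    return "OK"


def check_disk(path, warn_at=80.0, crit_at=90.0):
    """Check the filesystem holding `path` and return (level, percent used)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = usage_percent(usage.total, usage.used)
    return classify(pct, warn_at, crit_at), pct


if __name__ == "__main__":
    # "/" is a placeholder; point this at the partition to watch.
    level, pct = check_disk("/")
    print(f"{level}: {pct:.1f}% used")
```

Run from cron with a mail on anything other than OK, something like this would warn before the partition reaches 100%. It does not replace proper monitoring with graphs and history.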