<div dir="ltr">Not amusing that it&#39;s been closed because it was reported against a particular version of Gluster. Could somebody re-open it, please?</div><br><div class="gmail_quote"><div dir="ltr">On Wed, Aug 15, 2018 at 10:38 PM Nigel Babu &lt;<a href="mailto:nigelb@redhat.com">nigelb@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Wed, Aug 15, 2018 at 2:41 PM Michael Scherer &lt;<a href="mailto:mscherer@redhat.com" target="_blank">mscherer@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi folks,<br>

<br>

So the Gluster Jenkins disk was full today (because outages do not respect<br>

public holidays in India (Independence Day) and France (Assumption)),<br>

here is the post mortem for your reading pleasure.<br>

<br>

Date: 15/08/2018<br>

<br>

Service affected:<br>

  Jenkins for Gluster (<a href="http://jenkins-el7.rht.gluster.org" rel="noreferrer" target="_blank">jenkins-el7.rht.gluster.org</a>)<br>

<br>

Impact:<br>

<br>

  No Jenkins job could be triggered.<br>

<br>

Root cause:<br>

<br>

  The disk filled up mainly because we got new jobs and more patches, so<br>

regular growth.<br>

<br>

Resolution:<br>

<br>

  Increased the disk by 30G; we are investigating whether cleanup could be<br>

  improved. This required a reboot.<br>

<br>

<br>

Involved people:<br>

- misc<br>

- nigel<br>

<br>

Lessons learned<br>

- What went well:<br>

  - we had a documented process for that, and it was good enough to be used by<br>

    a tired admin.<br>

<br>

- What went badly:<br>

  - we weren&#39;t proactive enough to see that before it caused an outage<br>

  - 15 August is a holiday in both France and India. Technically,<br>

    none of the infra team should have been up.<br>

<br>

- Where we were lucky<br>

  - It was a day off in India, so few people were affected, except<br>

    folks who continue to work on days off<br>

  - Misc decided to work while in Brno, in order to take days off<br>

    later<br>

<br>

<br>

Timeline (in UTC)<br>

<br>

- 05:58 Amar posts a mail saying &quot;smoke job fail&quot; on gluster-infra:<br>

<a href="https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html" rel="noreferrer" target="_blank">https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html</a><br>

<br>

- 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is<br>

away from his laptop for Independence Day celebrations.<br>

<br>

- 06:24 Misc does not hear the ding since he is asleep<br>

<br>

- 06:55 Sankarshan opens a bug on it, <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1616160" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1616160</a><br>

<br>

- 06:56 Misc does not see the email since he is still asleep<br>

<br>

- 07:13 Misc wakes up, sees a blinking light on the phone and ponders<br>

closing his eyes again. He looks at it, and starts to swear.<br>

<br>

- 07:14 Investigation reveals that the Jenkins partition is full (100%). A<br>

quick investigation does not yield any particular issue. The Jenkins<br>

jobs are taking space and that&#39;s it.<br>

<br>

- 07:19 After discussion with Nigel, it is decided to increase the size<br>

of the partition. Misc takes a look at it and tries to increase it, without any<br>

luck. The server is rebooted in case that&#39;s what was needed. Still not<br>

enough.<br>

<br>

- 07:25 Misc goes for a quick shower to wake himself up. The warm embrace of<br>

water makes him remember that documentation on that process does exist:<br>

<br>

<a href="https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html" rel="noreferrer" target="_blank">https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html</a><br>

<br>

- 07:30 Following the documentation, we discover that the hypervisor<br>

is now out of space for future increases. Looking at that will be done<br>

after the post mortem.<br>

<br>

- 07:37 Jenkins is being restarted, with more space, and seems to work<br>

ok.<br>

<br>

- 07:38 Misc rushes to his hotel breakfast, which closes at 10.<br>

<br>

- 09:09 Post mortem is finished and being sent<br>

<br>

<br>

Action items:<br>

- (misc) see what can be done for myrmicinae (the hypervisor where<br>

jenkins is running) since there is no more space.<br>

<br>

Potential improvements to make:<br>

- we still need to have monitoring in place<br>

- we need to move munin into the internal LAN to look at the graphs<br>

for Jenkins<br>

- documentation regarding resizing could be clearer, notably on the volume<br>

resizing part<br></blockquote><div><br></div><div>This is highlighting that we need to solve <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1564372" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1564372</a> on priority. The lack of monitoring is affecting day to day work.<br></div><div> </div></div>-- <br><div dir="ltr" class="m_-4459340920725543440gmail_signature"><div dir="ltr">nigelb<br></div></div></div>

</blockquote></div>