[Gluster-infra] Post mortem of 2018-08-23 (2 for the price of one)
mscherer at redhat.com
Thu Aug 23 21:05:41 UTC 2018
So we had 3 incidents in the last 24h, and while all of them are
different, they are also linked.
We faced several issues, starting with Gerrit showing error 500
last night, around 23:00 Paris time.
That was https://bugzilla.redhat.com/show_bug.cgi?id=1620243 , which
resulted in a memory upgrade this morning.
Then we started to look at other issues that were uncovered while
investigating the first one, and I tried to look at the size of the
mail queue. Usually this is not a problem, but after adding swap, it
became an issue.
So I started to look for a way to blacklist mail sent to
jenkins at build.gluster.org, first by routing this mail domain to
supercolony, then by changing Postfix to drop the mail.
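One way to make Postfix drop that mail is a transport map entry routing the domain to the built-in "discard" transport; the domain below comes from the message, but the file path and the exact mechanism actually used are assumptions:

```shell
# Hypothetical setup: route everything addressed to *@build.gluster.org
# to Postfix's "discard" transport, which accepts and silently drops it.
cat >> /etc/postfix/transport <<'EOF'
build.gluster.org    discard:silently
EOF
postmap /etc/postfix/transport            # build the hash lookup table
postconf -e 'transport_maps = hash:/etc/postfix/transport'
systemctl reload postfix
```

The advantage over a plain reject is that senders (here, Jenkins jobs) get no bounce, so the queue does not fill up with bounce traffic either.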
And then we got 2 issues at once; timeline in UTC:
13:42 misc adds an MX record for build.gluster.org in the zone. To do
that, the DNS zone was changed and build.gluster.org could no longer be
a CNAME.
14:56 kaleb pings misc/nigel, saying "there is a message about disk
full on that job"
15:00 misc clicks on the link to build.gluster.org and is greeted by an
SSL certificate error. It seems the DNS now resolves build.gluster.org
to 2 IPs instead of 1.
15:04 misc reverts the DNS change, since there is no time to investigate.
15:05 misc figures out the server has a full disk because the logs are
stored on /
15:07 misc also starts to swear in 2 languages
15:18 a new partition with more space is created on
http.int.rht.gluster.org; data is copied, httpd is restarted, and the
situation is back to normal.
Impact:
- some build logs were lost (likely not much)
- for 1h, some people could have been randomly directed to the wrong
server when going to build.gluster.org
Root cause:
- for DNS, a wrong commit. The syntax looked correct (and was
verified), so I need to check why it did more than required.
- for the disk full, an increase in patches and an oversight on that
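The DNS part is consistent with how CNAMEs work: RFC 1034 forbids a CNAME from coexisting with any other record at the same name, so adding an MX for build.gluster.org meant the alias had to be replaced. A zone-file sketch of the change; the names and addresses are hypothetical, not the real zone content:

```
; Before: build.gluster.org is an alias only. A CNAME cannot coexist
; with any other record type at the same name (RFC 1034).
build   IN  CNAME  jenkins-host.example.org.

; After: to add the MX, the CNAME is replaced by an A record.
build   IN  A      203.0.113.10
build   IN  MX 10  supercolony.gluster.org.
```

If the replacement A record and a stale copy of the old target's address both end up served, a name can briefly resolve to 2 IPs, which would match what was observed at 15:00.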
Fix:
- the DNS change was reverted
- a new partition was added and the data was copied
What went well:
- we were quickly able to resolve the issue thanks to automation
When we were lucky:
- the issue was detected quickly by the same person who made the change
(DNS), and people (Kaleb) notified us as soon as something seemed weird
- none of us were in Vancouver facing a measles outbreak
What went bad:
- still no monitoring
Potential improvement to make:
- add monitoring
- revise resource usage
- prepare a template for post mortem
Sysadmin, Community Infrastructure and Platform, OSAS