[Gluster-infra] Post mortem for docs.gluster.org outage on 20 november 2018
mscherer at redhat.com
Tue Nov 20 14:39:07 UTC 2018
Date: 20 November 2018
Our automated certificate renewal system failed to renew the docs.gluster.org certificate, resulting in an expired certificate
for around 6h. Our monitoring system did not detect the problem.
Some people had to accept an insecure certificate to read the website.
On the monitoring side, it seems that "something" broke alerting. After a restart and some testing, it seems to be working
fine now. However, the default configuration does not seem to verify that the certificate is about to expire; it only verifies that
port 443 is open and that an SSL handshake can be negotiated.
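To illustrate the gap described above, here is a minimal sketch of a check that looks at the certificate expiry date rather than only at the handshake. It is not the check we run in nagios; the function names, the 5 day warning threshold and the 2 day critical threshold are assumptions picked for the example (the warning threshold sits below the 1 week renewal window, so it would only fire once renewal has already been failing for a couple of days).

```python
import socket
import ssl

SECONDS_PER_DAY = 86400


def days_left(not_after, now):
    """Days until expiry (negative once expired).

    not_after is the 'notAfter' string returned by SSLSocket.getpeercert(),
    now is a Unix timestamp in seconds.
    """
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def classify(not_after, now, warn_days=5, crit_days=2):
    """Map the remaining lifetime to a nagios-style status string.

    warn_days and crit_days are hypothetical thresholds, not our real config.
    """
    remaining = days_left(not_after, now)
    if remaining <= crit_days:
        return "CRITICAL"
    if remaining <= warn_days:
        return "WARNING"
    return "OK"


def fetch_not_after(host, port=443, timeout=10):
    """Fetch the 'notAfter' field of the certificate served by host.

    With default verification, an already expired certificate makes the
    handshake raise ssl.SSLCertVerificationError; a real check should
    catch that and report CRITICAL.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A check would then call something like classify(fetch_not_after("docs.gluster.org"), time.time()) and alert on anything other than "OK".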
On the certificate renewal side, everything is handled by ansible, and we do an automated run every night.
A manual run didn't show any error, so my analysis points toward a failure of the automation. Looking at ant-queen, our deploy server, it seems that an issue
on 2 internal builders (builder1 and builder31) created a deadlock when ansible tried to connect, and for some
reason the connection didn't time out. In turn, this resulted in several processes waiting on those 2 servers.
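One way to keep a hung run from blocking the following nights is to put a hard time limit around the nightly job. The sketch below is only an illustration of that idea, not how our automation is written: run_with_timeout is a hypothetical helper, and the command and limit passed to it would have to be whatever the real deploy script uses.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run command and give up after timeout_seconds.

    Returns (finished, returncode). When the limit is hit, subprocess.run
    kills the child process and we return (False, None).
    """
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        print(f"{command[0]} did not finish in {timeout_seconds}s", file=sys.stderr)
        return False, None
    return True, result.returncode
```

Note that only the direct child is killed here; anything it spawned itself may survive, so this complements a cleanup of stuck processes rather than replacing it.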
Looking at the graph, we can see the problem started around 1 week ago:
Since our system only triggers a renewal if the certificate is going to expire within 1 week, the process did not try to renew
for more than 1 week, and so the certificate expired.
A quick look at builder1 and builder31 shows that the issue is likely due to regression testing. The command 'df' is blocked on builder1, and that's usually
a sign of "something went wrong with the test suite". A look at the existing processes hints at the gd2 test suite, since etcd2 is still running,
and so are glusterfsd processes.
- misc ran the process manually, and the certificate got renewed
- misc restarted nagios and alerts started to work
- misc went on a process cleaning spree, unlocking an achievement on Steam by stopping 70 of them in 1 command
What went well:
- people contacted us
- only 1 certificate got impacted
Where we were lucky:
- this only impacted docs.gluster.org, and a user workaround existed
What went badly:
- supervision didn't page anyone
Timeline (in UTC):
05:00 the certificate expires
09:30 misc decides to go to the office
09:50 misc arrives at the train station and gets on the train, then connects to irc just in case
10:01 obnox pings misc on irc
10:02 misc says crap, takes a look, confirms the issue
10:05 misc connects to ant-queen, runs the deploy script after checking the 2 proxies are ok
10:07 misc sees that the certificate got renewed and inspects ant-queen, sees a bunch of processes blocked on 2 servers
10:08 entering a tunnel, misc declares the issue fixed and will look once in the office
Potential improvements to make:
- our supervision should check certificate validity. (should be easy)
- our supervision should also verify that we do not have something weird on ant-queen (less easy)
- whatever caused nagios to fail should be investigated, and mitigated
- whatever caused ansible to fail should be investigated, and mitigated
- our gd2 test suite should clean up after itself in a more reliable way
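For the "something weird on ant-queen" item, one possible starting point is to flag processes that have been running far longer than a nightly run should take. The sketch below assumes the output of `ps -eo pid,etimes,comm` (procps) is fed to it as text; the stale_processes helper and the age limit are hypothetical, and a real check would probably also filter on the command name.

```python
def stale_processes(ps_output, max_age_seconds):
    """Return (pid, age_seconds, command) for processes older than the limit.

    ps_output is expected to be the text printed by
    'ps -eo pid,etimes,comm', including its header line.
    """
    stale = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # ignore lines that do not have all three columns
        pid, age, command = int(fields[0]), int(fields[1]), fields[2]
        if age > max_age_seconds:
            stale.append((pid, age, command))
    return stale
```

With a limit of one day (86400 seconds), processes that had been stuck for about a week would have shown up in the result long before the certificate expired.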
Sysadmin, Community Infrastructure and Platform, OSAS