[Gluster-infra] Post mortem for docs.gluster.org outage on 20 November 2018

Tue Nov 20 14:39:07 UTC 2018

Date: 20 November 2018

Participating people:
- misc
- obnox

Summary:

Our automated certificate renewal system failed to renew the docs.gluster.org certificate, resulting in an expired certificate
for around 6h. Our monitoring system did not detect the problem.

Impact:

Some people had to accept an insecure certificate to read the website.

Root cause:
On the monitoring side, it seems that "something" broke alerting. However, upon restart and testing, it seems to be working
fine now. That said, the default configuration does not seem to verify that the certificate is about to expire; it only verifies that
port 443 is open and that an SSL session can be negotiated.
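
As an illustration of what an expiry-aware check could look like, here is a minimal sketch in Python. It is not our nagios configuration: the function names and the thresholds (warn at 5 days, critical at 2 days, picked to sit inside the 1 week renewal window) are assumptions made for the example. The return codes follow the usual nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import socket
import ssl
import time

OK, WARNING, CRITICAL = 0, 1, 2  # conventional nagios plugin exit codes


def classify(not_after, now, warn_days=5, crit_days=2):
    """Map a certificate 'notAfter' string to (status, days_left).

    The thresholds are hypothetical defaults, not values from our setup.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    days_left = (expires - now) / 86400
    if days_left <= crit_days:
        return CRITICAL, days_left
    if days_left <= warn_days:
        return WARNING, days_left
    return OK, days_left


def check_host(host, port=443, timeout=10):
    """Connect to host, read the certificate end date, and classify it."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except ssl.SSLCertVerificationError as err:
        # An already expired (or otherwise invalid) certificate fails here.
        return CRITICAL, "certificate rejected: %s" % err
    except OSError as err:
        return CRITICAL, "connection failed: %s" % err
    status, days_left = classify(not_after, time.time())
    return status, "%.1f days left (notAfter=%s)" % (days_left, not_after)
```

Calling check_host("docs.gluster.org") and exiting with the returned status would be enough to plug this into a nagios-style check. The point is that a check which only opens port 443 stays green right up to the expiry, while this one would turn yellow a few days before.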

On the certificate renewal side, everything is handled by ansible, and we do an automated run every night.
A manual run didn't show any error, so my analysis points toward a failure of the automation. Looking at ant-queen, our deploy server, it seems that an issue
on 2 internal builders (builder1 and builder31) created a deadlock when ansible tried to connect, and for some
reason, it didn't time out. In turn, this resulted in several processes waiting on those 2 servers.
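
One generic way to guard against a nightly run that never returns is to put a hard deadline around it. The sketch below is only an illustration: run_with_deadline is a hypothetical helper, and the command passed to it would be whatever the nightly job actually runs, which is not shown here.

```python
import subprocess


def run_with_deadline(cmd, deadline_seconds):
    """Run cmd; return its exit code, or None if the deadline was hit.

    subprocess.run() kills the direct child when the timeout expires.
    Processes spawned by that child (for example ssh connections) may
    survive and would need separate cleanup.
    """
    try:
        return subprocess.run(cmd, timeout=deadline_seconds).returncode
    except subprocess.TimeoutExpired:
        return None
```

A wrapper like this would not fix the blocked builders, but a None result every night is something supervision can alert on, instead of processes silently piling up for a week.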

Looking at the graph, we can see the problem started around 1 week ago:

https://munin.gluster.org/munin/int.rht.gluster.org/builder1.int.rht.gluster.org/users.html

Since our system only triggers renewal if the certificate is going to expire within 1 week, this resulted in the process not trying to renew
for more than 1 week, and so the certificate expired.

A quick look at builder1 and builder31 shows that the issue is likely due to regression testing. The command 'df' is blocked on builder1, and that's usually
a sign of "something went wrong with the test suite". A look at the existing processes hints at the gd2 test suite, since etcd2 is still running,
and glusterfsd processes too.

Resolution:
- misc ran the process manually, and the certificate got renewed
- misc restarted nagios and alerts started to work
- misc went on a process cleaning spree, unlocking an achievement on Steam by stopping 70 of them in 1 command

What went well:
- people contacted us
- only 1 certificate got impacted

Where we were lucky:
- this only impacted docs.gluster.org, and a user workaround did exist

What went badly:
- supervision didn't page anyone

Timeline (in UTC):

05:00 the certificate expires
09:30 misc decides to go to the office
09:50 misc arrives at the train station and gets on the train, then connects to irc just in case
10:01 obnox pings misc on irc
10:02 misc says crap, takes a look, confirms the issue
10:05 misc connects to ant-queen, runs the deploy script after checking the 2 proxies are ok
10:07 misc sees that the certificate got renewed and inspects ant-queen, sees a bunch of processes blocked on 2 servers
10:08 entering a tunnel, misc declares the issue fixed and will look further once in the office

Potential improvements to make:
- our supervision should check certificate validity (should be easy)
- our supervision should also verify that we do not have something weird on ant-queen (less easy)
- whatever caused nagios to fail should be investigated and mitigated
- whatever caused ansible to fail should be investigated and mitigated
- our gd2 test suite should clean up after itself in a more reliable way
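
For the "something weird on ant-queen" item, one possible starting point is to look for deploy processes that have been alive for far too long. The following is a rough sketch under several assumptions: the helper names are made up, the match on "ansible" and the 6 hour limit are arbitrary, and it relies on the etimes field (elapsed seconds) that recent procps-ng versions of ps provide.

```python
import subprocess


def parse_ps(output):
    """Parse header-less 'pid etimes stat args' lines into dicts."""
    procs = []
    for line in output.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # skip blank or truncated lines
        pid, etimes, stat, args = parts
        procs.append(
            {"pid": int(pid), "age": int(etimes), "stat": stat, "args": args}
        )
    return procs


def stale(procs, needle="ansible", max_age=6 * 3600):
    """Return processes whose command line contains needle and that are
    older than max_age seconds (both defaults are assumptions)."""
    return [p for p in procs if needle in p["args"] and p["age"] > max_age]


def snapshot():
    """Take a process snapshot; an empty header ('pid=') hides the title row."""
    cmd = ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "stat=", "-o", "args="]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_ps(result.stdout)
```

A check built on stale(snapshot()) could go critical as soon as the list is non-empty. The stat column is kept in the output because a D state there (uninterruptible sleep) is what a blocked 'df' typically looks like.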

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
