[Gluster-infra] Post mortem for docs.gluster.org outage on 20 november 2018

Tue Nov 20 15:03:02 UTC 2018

On Tuesday 20 November 2018 at 15:39 +0100, Michael Scherer wrote:
> Date: 20 November 2018
> 
> Participating people:
> - misc
> - obnox
> 
> Summary:
> 
> Our automated certificate renewal system failed to renew the
> docs.gluster.org certificate, resulting in an expired certificate
> for around 6h. Our monitoring system did not detect the problem.
> 
> Impact:
> 
> Some people had to accept an insecure certificate to read the
> website.
> 
> Root cause:
> On the monitoring side, it seems that "something" broke alerting.
> However, after a restart and testing, it seems to be working fine
> now. That said, the default configuration does not seem to verify
> whether the certificate is about to expire; it only verifies that
> port 443 is open and that an SSL session can be negotiated.
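
To make the gap concrete: a check that only negotiates TLS on port 443
says nothing about how long the certificate has left. Below is a minimal
sketch of an expiry check in Python, not what our nagios actually runs.
The function names and the 14/7 day thresholds are my assumptions; only
the standard library is used.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a certificate's notAfter date (negative once expired)."""
    # ssl.cert_time_to_seconds() parses the "Nov 20 05:00:00 2018 GMT" format
    # found in the dict returned by SSLSocket.getpeercert().
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def status_for(days_left: float, warn_days: float = 14, crit_days: float = 7) -> int:
    """Nagios-style status: 0 OK, 1 WARNING, 2 CRITICAL (thresholds are assumptions)."""
    if days_left <= crit_days:
        return 2
    if days_left <= warn_days:
        return 1
    return 0


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the notAfter field of the certificate presented by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host: str) -> int:
    """Connect to host and return a Nagios-style status code."""
    try:
        left = days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))
    except OSError as exc:
        # ssl.SSLError is an OSError: an already expired certificate fails
        # verification during the handshake and is reported here.
        print(f"CRITICAL: {host}: {exc}")
        return 2
    print(f"{host}: certificate expires in {left:.1f} days")
    return status_for(left)
```

Calling check("docs.gluster.org") from a cron job or a nagios plugin
wrapper would be one way to use it. The pure functions are kept separate
from the network call so the thresholds can be tested without a server.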
> 
> On the certificate renewal side, everything is handled by ansible,
> and we do an automated run every night.
> A manual run didn't show any error, so my analysis points toward a
> failure of the automation. Looking at ant-queen, our deploy server,
> it seems that an issue on 2 internal builders (builder1 and
> builder31) created a deadlock when ansible tried to connect, and for
> some reason the connection didn't time out. In turn, this resulted in
> several processes waiting on those 2 servers.
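
One generic mitigation for a nightly job that can hang forever is to
give the whole run a hard deadline. The sketch below is an illustration
only, not how ant-queen is set up: run_with_deadline is a hypothetical
helper, and the exit code 124 just mirrors the convention of GNU
timeout.

```python
import subprocess
from typing import Sequence


def run_with_deadline(cmd: Sequence[str], deadline_seconds: float) -> int:
    """Run cmd and return its exit code, or 124 if it outlives the deadline."""
    try:
        return subprocess.run(list(cmd), timeout=deadline_seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the direct child and waits for it before
        # re-raising TimeoutExpired. Grandchildren (for example ssh
        # connections started by the child) are not covered by this.
        return 124
```

A cron wrapper could call run_with_deadline(["ansible-playbook",
"site.yml"], 3600), where the playbook name and the one hour limit are
placeholders, and alert on a return value of 124 so that a stuck run
does not pass unnoticed for a week.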
> 
> Looking at the graph, we can see the problem started around 1 week
> ago:
> 
> https://munin.gluster.org/munin/int.rht.gluster.org/builder1.int.rht.gluster.org/users.html
> 
> Since our system only triggers a renewal when the certificate is
> going to expire within 1 week, the stalled runs meant that no renewal
> was attempted for more than 1 week, and so the certificate expired.
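
The arithmetic behind this failure mode can be sketched as follows. The
7 day window comes from the description above and the expiry time from
the timeline; the function names and the "last good run" date are made
up for the example.

```python
from datetime import datetime, timedelta, timezone

RENEWAL_WINDOW = timedelta(days=7)  # "expire in 1 week", as described above


def needs_renewal(not_after: datetime, now: datetime,
                  window: timedelta = RENEWAL_WINDOW) -> bool:
    """True once the certificate is inside the renewal window."""
    return not_after - now <= window


def expires_during_stall(not_after: datetime, last_good_run: datetime,
                         stall: timedelta,
                         window: timedelta = RENEWAL_WINDOW) -> bool:
    """True if the last good run saw no need to renew, and the certificate
    expires before the nightly runs resume."""
    runs_resume = last_good_run + stall
    return (not needs_renewal(not_after, last_good_run, window)
            and not_after <= runs_resume)


# Expiry time from the timeline; the last good run is a hypothetical date.
expiry = datetime(2018, 11, 20, 5, 0, tzinfo=timezone.utc)
last_good = datetime(2018, 11, 12, 2, 0, tzinfo=timezone.utc)
print(expires_during_stall(expiry, last_good, timedelta(days=9)))  # True
print(expires_during_stall(expiry, last_good, timedelta(days=3)))  # False
```

In other words, any stall longer than the renewal window can eat the
whole window. A wider window, or an alert when the nightly run has not
completed, would both shrink that risk.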
> 
> A quick look at builder1 and builder31 shows that the issue is likely
> due to regression testing. The 'df' command is blocked on builder1,
> and that's usually a sign that "something went wrong with the test
> suite". A look at the existing processes hints at the gd2 test suite,
> since etcd2 is still running, along with glusterfsd processes.
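
A blocked 'df' typically shows up as a process in state D
(uninterruptible sleep). As a rough sketch of how a builder could be
scanned for such processes, assuming Linux and the usual
/proc/<pid>/stat layout (the function names are mine, and this is not
something we run today):

```python
from pathlib import Path


def parse_stat(line: str):
    """Return (pid, comm, state) from the content of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left, _, rest = line.partition("(")
    comm, _, right = rest.rpartition(")")
    return int(left), comm, right.split()[0]


def uninterruptible_processes(proc: Path = Path("/proc")):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for stat in proc.glob("[0-9]*/stat"):
        try:
            pid, comm, state = parse_stat(stat.read_text())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

A process that stays in this list across several samples, like a stuck
'df', is a decent hint that a test run left a dead mount behind.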
> 
> 
> Resolution:
> - misc ran the process manually, and the certificate got renewed
> - misc restarted nagios and alerts started to work
> - misc went on a process cleaning spree, unlocking an achievement on
> Steam by stopping 70 of them in 1 command
> 
> What went well:
> - people contacted us
> - only 1 certificate got impacted
> 
> Where we were lucky:
> - this only impacted docs.gluster.org, and a user workaround did
> exist
> 
> What went badly:
> - supervision didn't page anyone
> 
> 
> Timeline (in UTC):
> 
> 05:00 the certificate expires
> 09:30 misc decides to go to the office
> 09:50 misc arrives at the train station and gets on the train, then
> connects to irc just in case
> 10:01 obnox pings misc on irc
> 10:02 misc says crap, takes a look, confirms the issue
> 10:05 misc connects to ant-queen, runs the deploy script after
> checking the 2 proxies are ok
> 10:07 misc sees that the certificate got renewed and inspects
> ant-queen, sees a bunch of processes blocked on 2 servers
> 10:08 entering a tunnel, misc declares the issue fixed and will
> look again once in the office
> 
> Potential improvements to make:
> - our supervision should check certificate validity (should be easy)
> - our supervision should also verify that we do not have
> something weird on ant-queen (less easy)

As I got paged because the load on ant-queen was too high, I think this
part is done.

> - whatever caused nagios to fail should be investigated, and
> mitigated

So nagios has been failing with out-of-memory errors since 6 November.
While this didn't result in a complete outage, it was likely broken
enough to create some issues.

A look at the graph shows that starting around the 15th, nagios didn't
check anything, since there was no traffic:

https://munin.gluster.org/munin/int.rht.gluster.org/nagios.int.rht.gluster.org/fw_packets.html

The memory graph shows what looks like a memory leak:
https://munin.gluster.org/munin/int.rht.gluster.org/nagios.int.rht.gluster.org/memory.html

It seems to have started around the start of week 43, so around 29
October. The only package change around that time is tzdata.

However, it could also be a side effect of this commit:
https://github.com/gluster/gluster.org_ansible_configuration/commit/01c7e7120ea1cac27aa6d0cbcdf3da726a59c67c

I am going to investigate whether it can be reverted, and see if that
helps.

> - whatever caused ansible to fail should be investigated, and
> mitigated 
> - our gd2 test suite should clean itself in a more reliable way
> 
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
