[Gluster-infra] June 2018 Docs.gluster.org certificate expiration postmortem

Michael Scherer mscherer at redhat.com
Mon Jun 11 13:27:55 UTC 2018


Hi,

So yesterday, the docs.gluster.org certificate expired.

Date: 2018-06-11

Participating people:
- misc
- nigel

Summary:

The docs.gluster.org certificate expired because an automation error
prevented its renewal. Since the certificate comes from Let's Encrypt,
it expires after 3 months and should have been renewed yesterday, but
wasn't.

Impact:
- people had to accept an expired certificate for docs.gluster.org

Root cause:

So, the investigation shows multiple root causes. docs.gluster.org was
using the new proxy system (see
http://lists.gluster.org/pipermail/gluster-infra/2018-February/004284.html ).

The automation was using an ansible module (openssl_certificate) that
does not seem to take certificate renewal into account, at least not by
default. As the module had already been fixed in the past (see
https://github.com/ansible/ansible/commits/devel/lib/ansible/modules/crypto/openssl_certificate.py ),
this was an oversight on my side after reading the code. However,
that's also one of the reasons not to migrate too much at once and to
see first how renewal works.

And since this requires an ACME server, this was quite hard to add to
ansible CI in the first place.

Short answer: it doesn't work, and I will submit a patch upstream for
that.

On top of that, a few bugs were found in ansible that caused problems:
- openssl_certificate uses acme-tiny. And by default, acme-tiny doesn't
download the intermediate certificate. But a quick read of the acme-tiny
help showed a --chain option, so a patch was sent to ansible for that:
https://github.com/ansible/ansible/pull/35144

It turns out that the --chain option was a downstream patch for the
package, which got removed (not deprecated) in the latest rpm version:
https://src.fedoraproject.org/rpms/acme-tiny/c/ecd867acdf5380ade6874c160e8a00ce14d3f8ba?branch=master

So, as an immediate consequence, the deployment of a new certificate
failed with an error.

- openssl_certificate does not properly handle acme-tiny failures, since
it seems to create a file nonetheless and consider it ok. I didn't
investigate that much, but that's also something to be fixed upstream.

- nginx does not detect the creation of a new certificate, so an
explicit restart/reload needs to be added to our playbooks when
certificates are renewed (once that's done in the module, of course).
This isn't needed for the initial creation, so it wasn't seen earlier.
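
The last two points (a failed acme-tiny run still leaving a file that
is considered ok, and nginx not being reloaded after a renewal) could
be handled along these lines. This is only a minimal sketch, not how
openssl_certificate works: the function name write_if_command_succeeds
and its arguments are made up for illustration, and it assumes the
command prints the certificate on stdout.

```python
import os
import subprocess
import tempfile


def write_if_command_succeeds(cmd, dest):
    """Run cmd and install its stdout at dest only if the command succeeded.

    Returns True when dest was created or its content changed, which is
    the signal a caller would use to reload the web server.
    (Hypothetical helper, not part of ansible or acme-tiny.)
    """
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        # Fail loudly and leave any existing file untouched.
        raise RuntimeError(
            "command failed with code %d: %s"
            % (result.returncode, result.stderr.decode(errors="replace").strip())
        )
    if not result.stdout.strip():
        raise RuntimeError("command succeeded but produced no output")

    if os.path.exists(dest):
        with open(dest, "rb") as current:
            if current.read() == result.stdout:
                return False  # same content, no reload needed

    # Write to a temporary file in the same directory, then rename it,
    # so a crash never leaves a truncated certificate in place.
    directory = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(result.stdout)
        os.replace(tmp_path, dest)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return True
```

A caller would reload nginx only when the function returns True, and
would see an exception (instead of a bogus file) when the certificate
request fails.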

Resolution:
- screaming to relieve the existential crisis upon realising that the
breakage waited for Monday to happen, on my day back to work

- renewed the certificate semi-manually in the meantime

- pushed
https://github.com/gluster/gluster.org_ansible_configuration/commit/fb8655c8d07948d2362d5e9213de001399bde06e
as a workaround for now

- opened https://github.com/ansible/ansible/issues/41396 to get stuff
fixed upstream


What went well:
- only the docs website was impacted and it was seen quite fast
- it could be worked around by users

Where we were lucky
- I was back in the office, awake enough and my phone wasn't out of
battery.


What went badly
- still no monitoring for that on the gluster side
- no one seems to have notified us
- we can't really count on Fedora policy being applied (either there or
on EPEL), which is kinda making me sad
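
Since nothing watches certificate expiration at the moment, a small
external probe would have caught this before users did. Here is a
minimal sketch using only the Python standard library. The function
names and the 14 day threshold are assumptions for illustration, not
something deployed on the gluster side.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 14  # assumed alert threshold, pick whatever fits the renewal schedule


def days_left(not_after, now=None):
    """Days until a notAfter string such as 'Jun 10 12:00:00 2018 GMT'."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400.0


def status(not_after, warn_days=WARN_DAYS, now=None):
    """Turn a notAfter string into a one-line monitoring status."""
    remaining = days_left(not_after, now)
    if remaining <= 0:
        return "CRITICAL: expired %.1f days ago" % -remaining
    if remaining < warn_days:
        return "WARNING: expires in %.1f days" % remaining
    return "OK: expires in %.1f days" % remaining


def check_host(host, port=443, timeout=10):
    """Connect to host and report on the certificate it serves."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except ssl.SSLCertVerificationError as exc:
        # An already expired certificate fails verification during the
        # handshake, so that case is reported as critical too.
        return "CRITICAL: %s" % exc
    return status(not_after)
```

Running check_host("docs.gluster.org") from cron or from an existing
monitoring system, and alerting on anything that does not start with
"OK", would be enough to get a warning some days before users do.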


Timeline (in UTC)
11 June 2018
certificate expires

11h39: nigelb pings misc on irc. Misc is out for lunch, so he does not get the message
11h44: nigelb pings misc on telegram. Misc rushes back to his desk, puts on proper music[1] and starts to look at it while sipping his coffee.
11h53: issue is diagnosed, acme-tiny does not renew the certificate for some reason

A first fix is tried: remove the certificate, and restart the ansible deployment (2 to 3 minutes each, kinda anticlimactic).
It fails, with an error related to --chain. Misc remembers this was something that had been fixed before

11h56: another attempt is made with the ansible devel branch, assuming some bugfixes weren't backported to the stable branch

11h59: still fails. A quick hack is done to push

12h02: acme does not deploy anything, because the files were already there. So the certificates are removed again, and the deployment is restarted

12h04: certificates are created. Nginx didn't get restarted, so a restart is done on proxy01 and proxy02. Stuff is back.


Potential improvements to make:
- fix all the stuff that needs fixing
- cancel Monday for the rest of the week


[1] in this case, this was Hacknet OST. Great game, I recommend it.
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
