[Gluster-infra] Post mortem for download.gluster.org outage on 23 October 2018

Tue Oct 23 13:36:19 UTC 2018

Date: 23 October 2018

Participants:
- misc
- nigelb
- kaleb

Summary:

A DNS change to move the download.gluster.org vhost from a server
in Rackspace to a reverse proxy resulted in a redirection loop,
breaking the service for some people for 30 minutes.

Impact: Some people using download.gluster.org

Root cause:

download.gluster.org, as a vhost, was served out of
download01.rax.gluster.org. To prepare for a move out of
Rackspace, the vhost was moved to the set of reverse proxies, as this
would permit a quick and painless switch from one server to another
without the delay and unreliability of a DNS update.

The move was scheduled for 10:00 UTC, but happened at 10:30, due to
various unrelated issues.

However, while the move was tested yesterday, the issue of SSL
redirection was missed. download01.rax.gluster.org was forcing SSL,
but the proxy was connecting to it over plain HTTP, so every request
was redirected back to HTTPS, resulting in a loop affecting Firefox
and curl.
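
The loop can be illustrated with a small simulation. This is only a
sketch of the mechanism described above, not the actual nginx or backend
configuration: the names `backend`, `proxy`, `follow` and `force_ssl`,
and the hop limit, are made up for illustration. It assumes the proxy
always talks to the backend over plain HTTP, whatever the client used.

```python
from urllib.parse import urlsplit


def backend(scheme, path, force_ssl=True):
    """Backend vhost: when forcing SSL, any plain-HTTP request is redirected."""
    if force_ssl and scheme == "http":
        return 301, "https://download.gluster.org" + path
    return 200, None


def proxy(client_url, force_ssl=True):
    """Reverse proxy (assumed behaviour): always reaches the backend over HTTP."""
    path = urlsplit(client_url).path or "/"
    return backend("http", path, force_ssl)


def follow(url, force_ssl=True, max_hops=10):
    """Follow redirects the way a client would; return (result, hops used)."""
    for hop in range(max_hops):
        status, location = proxy(url, force_ssl)
        if status != 301:
            return status, hop
        url = location  # the client retries on HTTPS, and lands on the proxy again
    return "redirect loop", max_hops
```

With `force_ssl=True` the client is sent back to the same HTTPS URL on
every hop until it gives up; with `force_ssl=False`, which mirrors the
resolution of disabling the redirection on the backend, the first
request succeeds.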

While looking at that issue, it was also found that the proxies,
despite being IPv4-only hosts, were still trying to connect to the
backend server over IPv6 from time to time, resulting in delays
and/or errors for clients. A quick fix was tried, but it did not solve
the problem, so the IPv6 record was removed from DNS.
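
As an illustration of why an IPv4-only host can still end up trying
IPv6: a resolver typically returns every address published for a name,
and it is up to the caller to restrict the address family. The sketch
below uses Python's standard `socket.getaddrinfo`; the helper names
`ipv4_only` and `resolve_ipv4` are hypothetical, and this is not the fix
that was applied on the proxies.

```python
import socket


def ipv4_only(addrinfos):
    """Keep only the IPv4 entries of a getaddrinfo()-style result list."""
    return [info for info in addrinfos if info[0] == socket.AF_INET]


def resolve_ipv4(host, port=80):
    """Resolve a host, asking the resolver for IPv4 addresses only."""
    return socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
```

Removing the IPv6 record from DNS, as was done here, has the same effect
for every client at once, at the cost of losing IPv6 for the backend.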

Resolution:
- disable the SSL redirection on the backend server
- disable IPv6 on the proxy, and remove the IPv6 address from DNS

What went well:
- nginx was able to cope with the redirection loop quite well, cf. the
graph:
   https://munin.gluster.org/munin/rht.gluster.org/proxy02.rht.gluster.org/fw_conntrack.html
- the problem was quickly identified

Where we were lucky:
- no one was in the office, and I didn't take part in the party in
the other office, so my lunch break

What went badly:
- supervision didn't detect anything, so I didn't get paged
- our logs are full of hits on deprecated URLs, making it harder
to notice a real issue

Timeline (in UTC):
11:52 nigelb and kkeithley ping misc on IRC, and also send him an email
12:08 misc sees he was contacted and looks at the issue
12:14 Issue is identified and a quick fix is tested
12:18 Fix is properly deployed and committed in the repo
12:25 Watching the log to see if all is well, an IPv6 issue is identified
12:26 A quick fix is tested for IPv6, which reduces errors, but not by much
12:29 IPv6 record is removed for the backend server

Potential improvement to make:
- nginx logs could be improved, work is under way for that
- supervision should be able to detect this kind of incident
- misc should have been clearer on how to contact him in case of an issue
- lots of people are using deprecated URLs and locations, which is not great
- get IPv6 working
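
As a sketch of how supervision could catch this kind of incident, a
check can follow redirects by hand and raise an alert when a URL comes
back a second time. The code below is an illustration only:
`check_redirects`, `fetch_once` and the hop limit are hypothetical names
and values, and it is not tied to any particular monitoring system.
Network errors are not handled here and would need to be mapped to an
alert as well.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_once(url, timeout=10):
    """Return (status, Location header or None) for a single request."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, None
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


def check_redirects(url, fetch=fetch_once, max_hops=10):
    """Return "ok", "loop" or "error" for the given URL."""
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            return "loop"  # we were sent back to a URL we already tried
        seen.add(url)
        status, location = fetch(url)
        if status in (301, 302, 303, 307, 308) and location:
            url = urljoin(url, location)
            continue
        return "ok" if status == 200 else "error"
    return "loop"  # too many hops, treated the same as a loop
```

The `fetch` parameter is there so the check can be exercised without
network access, by passing a function that returns canned responses.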

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS