[Gluster-infra] Postmortem for yesterday's outage

Fri Sep 15 09:39:16 UTC 2017

Hello folks,

We had a brief outage yesterday that misc and I worked on fixing. We're
committed to doing a formal post-mortem of every outage, whether or not it
affects everyone, as a habit. Here's a post-mortem of yesterday's event.

## Affected Servers
* salt-master.rax.gluster.org
* syslog01.rax.gluster.org

## Total Duration
~4 hours

## What Happened
A few Rackspace servers depend on DHCP (the default Rackspace setup). As part
of the CentOS 7.4 upgrade, we rebooted some servers, since the kernel and other
packages were upgraded. After the reboot, they did not come back on the
network. At this point, we're unsure if this is a DHCP bug, an upgrade gone
wrong, or if the Rackspace DHCP servers are at fault. We will be looking into
this in the coming days.

Michael had issues with the Rackspace console, so Nigel stepped in to help
with the outage.

Once we accessed the machine via the Emergency Console, we spent some time
trying to get a DHCP lease. When that didn't work, we started working on
setting up a static IP and gateway. This took a few tries since the Rackspace
documentation for doing this was wrong. The slight differences between "ip"
and "ifconfig" added to the confusion.

This is what we eventually did on one of the servers:
ip address add 162.209.109.18/24 dev eth0
route add default gw 162.209.109.1
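
For reference, here is a minimal sketch of the same temporary setup written
entirely with iproute2, since mixing "ip" with the older "route"/"ifconfig"
tools was part of the confusion. It assumes the address, gateway, and interface
quoted above; the helper name static_net_cmds is made up for illustration. It
only prints the commands so they can be reviewed first, and settings applied
this way do not survive a reboot.

```shell
#!/bin/sh
# Sketch only: print (do not run) the iproute2 commands for a temporary
# static network setup. static_net_cmds is a hypothetical helper name.
static_net_cmds() {
    addr="$1"   # address with prefix length, e.g. 162.209.109.18/24
    gw="$2"     # default gateway, e.g. 162.209.109.1
    dev="$3"    # interface name, e.g. eth0
    printf 'ip link set dev %s up\n' "$dev"
    printf 'ip address add %s dev %s\n' "$addr" "$dev"
    # iproute2 equivalent of "route add default gw <gateway>"
    printf 'ip route add default via %s dev %s\n' "$gw" "$dev"
}

# Values taken from the commands quoted above.
static_net_cmds 162.209.109.18/24 162.209.109.1 eth0
```

After checking the printed commands, they can be run as root from the console
one at a time.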

This incident did not affect any of our critical services. Gerrit, Jenkins, and
download.gluster.org remained unaffected during this period.

We were limited in our ability to roll out any changes via ansible to these
servers during this ~4h window. We have a second server for deploying
infrastructure in progress, but the setup is not ready yet. Manual roll-out
from a sysadmin's laptop was always possible in case of trouble.

## Timeline of Events
Note: All times are in CEDT
* 09:00 am: Nigel and Michael are planning a new http server inside the cage for
           logs, packages, and Coverity scans.
* 10:00 am: Michael starts the ansible process to install the new server
* 12:10 pm: The topic of the CentOS 7.4 upgrade comes up during discussion and
            Michael does an upgrade and reboot on salt-master.rax.gluster.org.
* 12:15 pm: Michael notices that the salt-master server is not coming back.
            Nigel confirms.
* 12:15 pm: Nigel logs into Rackspace and does a hard restart on the
            salt-master machine. No luck.
* 12:34 pm: Nigel logs a ticket with Rackspace about the server outage.
* 12:44 pm: Nigel starts chat conversation with Rackspace support for
            escalation. Customer support engineer informs us that the server is
            up and can be accessed via Emergency Console.
* 12:57 pm: Nigel gains access via Emergency Console. Michael's initial RCA of
            the issue is a network problem caused by the upgrade. Nigel
            confirms the RCA by verifying that eth0 does not have a public IP.
            Nigel tries to get the IP address to stick with the right gateway.
* 12:35 pm: Nigel manages to get salt-master online briefly.
* 01:34 pm: Nigel brings the salt-master back online.
* 01:40 pm: Michael upgrades and reboots the syslog server; it does not come
            back up either.
* 01:55 pm: Nigel brings syslog back online as well.

## Pending Actions
* Michael to figure out if there is a bug in the new DHCP daemon, or if things
  changed on the Rackspace side.
* Michael to finish the move of salt-master into the cage
  (ant-queen.int.rht.gluster.org) to prevent further issues.
* Nigel to send a note to Rackspace support to fix their documentation.

--
Nigel and Michael
Gluster Infra Team