[Gluster-infra] Postmortem for yesterday's outage
nigelb at redhat.com
Fri Sep 15 09:39:16 UTC 2017
We had a brief outage yesterday that misc and I worked on fixing. As a habit,
we're committed to doing a formal post-mortem of every outage, whether or not it
affects everyone. Here's the post-mortem of yesterday's event.
## Affected Servers
* salt-master.rax.gluster.org
* syslog server
## Total Duration
Approximately 4 hours (see timeline below).
## What Happened
A few Rackspace servers depend on DHCP (the default Rackspace setup). As part of
the CentOS 7.4 upgrade, we rebooted some servers, since the kernel and other
packages were upgraded. After the reboot, these servers did not come back with a
working network. At this point, we're unsure whether this is a DHCP client bug,
an upgrade gone wrong, or a fault on the Rackspace DHCP servers. We will be
looking into this in the coming days.
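When we dig into this, the first checks on an affected server would be along
these lines (a sketch; whether the interface is managed by NetworkManager or by
dhclient directly depends on how the image is configured):

    # See whether the DHCP client requested or was refused a lease
    journalctl -u NetworkManager --since "today" | grep -i dhcp

    # Or run the DHCP client verbosely against the interface by hand
    dhclient -v eth0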
Michael had issues with the Rackspace console, so Nigel stepped in to help. Once
we accessed the machine via the Emergency Console, we spent some time trying to
get a DHCP lease. When that didn't work, we started setting up a static IP and
gateway. This took a few tries, since the Rackspace documentation for doing this
was wrong. The slight differences between "ip" and "ifconfig" created further
confusion.
This is what we eventually did on one of the servers:
    ip address add 220.127.116.11/24 dev eth0
    route add default gw 18.104.22.168
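Note that the second line uses the legacy net-tools `route` command; the
pure-iproute2 equivalent would be:

    ip route add default via 18.104.22.168 dev eth0

Addresses set this way don't survive a reboot. A sketch of making the same
configuration persistent on CentOS 7, via
/etc/sysconfig/network-scripts/ifcfg-eth0 (the values here simply mirror the
commands above; the exact settings depend on the Rackspace network):

    DEVICE=eth0
    BOOTPROTO=static
    ONBOOT=yes
    IPADDR=220.127.116.11
    PREFIX=24
    GATEWAY=18.104.22.168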
This incident did not affect any of our critical services: Gerrit, Jenkins, and
download.gluster.org remained available throughout.
We were limited in our ability to roll out changes via ansible to these servers
during this ~4h window. A second server for deploying infrastructure is in
progress, but its setup is not ready yet. Manual roll-out from a sysadmin's
laptop remained possible in case of trouble.
## Timeline of Events
Note: All times are in CEDT.
* 09:00 am: Nigel and Michael are planning a new http server inside the cage for
logs, packages, and Coverity scans.
* 10:00 am: Michael starts the ansible process to install the new server
* 12:10 pm: The topic of the CentOS 7.4 upgrade comes up during discussion, and
            Michael upgrades and reboots salt-master.rax.gluster.org.
* 12:15 pm: Michael notices that the salt-master server is not coming back.
* 12:15 pm: Nigel logs into Rackspace and does a hard restart on the
salt-master machine. No luck.
* 12:34 pm: Nigel logs a ticket with Rackspace about the server outage.
* 12:44 pm: Nigel starts chat conversation with Rackspace support for
escalation. Customer support engineer informs us that the server is
up and can be accessed via Emergency Console.
* 12:57 pm: Nigel gains access via Emergency Console. Michael's initial RCA of
            the issue is a network problem caused by the upgrade. Nigel confirms
            the RCA by verifying that eth0 does not have a public IP. Nigel
            tries to get the IP address to stick with the right gateway.
* 12:35 pm: Nigel manages to get salt-master online briefly.
* 13:34 pm: Nigel brings the salt-master back online.
* 13:40 pm: Michael tries to upgrade the syslog server and reboots it; it does
            not come back up.
* 13:55 pm: Nigel brings the syslog server back online as well.
## Pending Actions
* Michael to figure out whether there is a bug in the new DHCP daemon, or
  whether something changed on the Rackspace side.
* Michael to finish move of salt-master into the cage
(ant-queen.int.rht.gluster.org) to prevent further issues.
* Nigel to send a note to Rackspace support to fix their documentation.
Nigel and Michael
Gluster Infra Team