<div dir="ltr"><br><div class="gmail_quote">Hello folks,<br>
<br>
We had a brief outage yesterday that misc and I worked on fixing. We're<br>
committed to doing a formal post-mortem of every outage, whether or not it<br>
affects everyone, as a habit. Here's the post-mortem of yesterday's event.<br>
<br>
## Affected Servers<br>
* <a href="http://salt-master.rax.gluster.org" rel="noreferrer" target="_blank">salt-master.rax.gluster.org</a><br>
* <a href="http://syslog01.rax.gluster.org" rel="noreferrer" target="_blank">syslog01.rax.gluster.org</a><br>
<br>
## Total Duration<br>
~4 hours<br>
<br>
## What Happened<br>
A few Rackspace servers depend on DHCP (the default Rackspace setup). As part of<br>
the CentOS 7.4 upgrade, we rebooted some servers, since the kernel and other<br>
packages were upgraded. At this point, we're unsure whether this is a DHCP bug,<br>
an upgrade gone wrong, or a problem on the Rackspace DHCP servers' side. We will<br>
be looking into this in the coming days.<br>
<br>
Michael had issues with the Rackspace console, so Nigel stepped in to help with<br>
the outage.<br>
<br>
Once we accessed the machine via the Emergency Console, we spent some time<br>
trying to get a DHCP lease. When that didn't work, we started setting up a<br>
static IP and gateway. This took a few tries, since the Rackspace documentation<br>
for doing this was wrong. The slight differences between "ip" and "ifconfig"<br>
created further confusion.<br>
<br>
This is what we eventually did on one of the servers:<br>
ip address add <a href="http://162.209.109.18/24" rel="noreferrer" target="_blank">162.209.109.18/24</a> dev eth0<br>
route add default gw 162.209.109.1<br>
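<br>
(The net-tools "route" command above has the iproute2 equivalent "ip route add<br>
default via 162.209.109.1 dev eth0".) These commands only configure the running<br>
system and are lost on reboot. A persistent fallback, assuming the stock CentOS 7<br>
initscripts layout (the file below is an illustrative sketch, not our actual<br>
config), would be a static ifcfg file:<br>
<br>
# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch)<br>
DEVICE=eth0<br>
BOOTPROTO=none<br>
ONBOOT=yes<br>
IPADDR=162.209.109.18<br>
PREFIX=24<br>
GATEWAY=162.209.109.1<br>
<br>
followed by "systemctl restart network" to apply it.<br>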
<br>
This incident did not affect any of our critical services. Gerrit, Jenkins, and<br>
<a href="http://download.gluster.org" rel="noreferrer" target="_blank">download.gluster.org</a> remained unaffected during this period.<br>
<br>
We were limited in our ability to roll out changes via Ansible to these<br>
servers during this ~4h window. We have a second server in progress for<br>
deploying infrastructure, but its setup is not ready yet. Manual roll-out from<br>
a sysadmin's laptop remained possible in case of trouble.<br>
<br>
## Timeline of Events<br>
Note: all times are CEDT.<br>
* 09:00 am: Nigel and Michael are planning a new HTTP server inside the cage for<br>
logs, packages, and Coverity scans.<br>
* 10:00 am: Michael starts the Ansible process to install the new server.<br>
* 12:10 pm: The topic of the CentOS 7.4 upgrade comes up during discussion, and<br>
Michael does an upgrade and reboot on <a href="http://salt-master.rax.gluster.org" rel="noreferrer" target="_blank">salt-master.rax.gluster.org</a>.<br>
* 12:15 pm: Michael notices that the salt-master server is not coming back.<br>
Nigel confirms.<br>
* 12:15 pm: Nigel logs into Rackspace and does a hard restart on the<br>
salt-master machine. No luck.<br>
* 12:34 pm: Nigel logs a ticket with Rackspace about the server outage.<br>
* 12:44 pm: Nigel starts chat conversation with Rackspace support for<br>
escalation. Customer support engineer informs us that the server is<br>
up and can be accessed via Emergency Console.<br>
* 12:57 pm: Nigel gains access via Emergency Console. Michael's initial RCA of<br>
the issue is a network problem caused by the upgrade. Nigel confirms<br>
the RCA by verifying that eth0 does not have a public IP. Nigel<br>
tries to get the IP address to stick with the right gateway.<br>
* 12:35 pm: Nigel manages to get salt-master online briefly.<br>
* 13:34 pm: Nigel brings the salt-master back online.<br>
* 13:40 pm: Michael tries to upgrade the syslog server and reboots it; it does<br>
not come back up either.<br>
* 13:55 pm: Nigel brings syslog back online as well.<br>
<br>
## Pending Actions<br>
* Michael to figure out if there is a bug in the new DHCP daemon, or if things<br>
changed on the Rackspace side.<br>
* Michael to finish the move of salt-master into the cage<br>
(<a href="http://ant-queen.int.rht.gluster.org" rel="noreferrer" target="_blank">ant-queen.int.rht.gluster.org</a><wbr>) to prevent further issues.<br>
* Nigel to send a note to Rackspace support to fix their documentation.<br>
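<br>
A starting point for the DHCP investigation (commands assume CentOS 7 with<br>
journald and the default initscripts layout; exact file names vary per setup)<br>
might be:<br>
<br>
# Check how the interface is configured<br>
grep BOOTPROTO /etc/sysconfig/network-scripts/ifcfg-eth0<br>
# Look for DHCP client activity since boot<br>
journalctl -b | grep -i dhclient<br>
# Record the DHCP client version for comparison against the pre-upgrade one<br>
rpm -q dhclient<br>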
<br>
--<br>
Nigel and Michael<br>
Gluster Infra Team<br>
</div></div>