[Gluster-infra] Post mortem for review.gluster.org DNS outage

Fri Jun 14 09:55:10 UTC 2019

Date: 14 June 2019

Participating:
- misc
- deepshika

Summary:
--------
People started to report HTTP issues on review.gluster.org while our
monitoring was silent (monitoring kept spamming me during the night
about the download server being almost full following 1719388, so I know
it was working). A quick investigation showed this was due to the DNS
record returning 2 entries, which resulted in round robin between the
wrong server and the right one.

Timeline (in UTC):
-------------------

- 2019-05-08: misc goes on vacation

- 2019-05-24: RH IT contacts misc (and others), saying that mails with a
return address of "review at review.gluster.org" clog the queues of their
SMTP servers. Folks receive the mails, the server says "this person no
longer works here" and tries to send a bounce back, this doesn't work,
and it fills the queue of the MX. As a few people left RH in the last 6
months, some of them likely receiving all notifications, this did create
problems for them.

Postfix is heavily I/O bound (all communication between the dozens of
daemons is done using queues on disk, synced for reliability), so filling
the queues heavily impacts the operation of an MX, slowing it down,
resulting in bigger queues, etc.


- 2019-05-27: Deepshika and Duck try to fix this, not understanding why
the email is not working, or how it was supposed to work (spoiler, it
was never supposed to work). They conclude with "too weird, we need to
wait".

- 2019-06-12: misc is back from vacation, sees his mails, prioritizes
them and explains that the review at review.gluster.org address wasn't
supposed to be working, which is why people found nothing, and that this
was just the default setting of gerrit.

- 2019-06-13: misc decides to set up an MX for the review.gluster.org
domain to drop all incoming emails, solving the bounce issue for IT.
See the ansible repo commits for how that's done.

  - 16:57 misc adds an MX record for review.gluster.org pointing to
supercolony's IP address, after adding the code to route the whole domain
to /dev/null, then waits a bit to see that nothing broke and goes home
(assuming monitoring would scream during the evening if anything
happened). The diff for the DNS change is shown later[1].
  - 23:00 seeing that monitoring didn't scream, misc decides to go to
bed and sleep.

- 2019-06-14: folks start to report an outage as people in India start
their day
  - 04:17: bug 1720453 is opened
  - 05:31: Deepshika correctly diagnoses the DNS issue, sees that it was
related to the last change, and tries to contact misc on telegram
  - 07:50: misc wakes up, sees his phone blinking, answers the messages
  - 08:10: misc checks various things, reaches the same conclusion as
deepshika, proposes a workaround
  - 08:13: after squinting hard at the diff, misc finally finds
something that could be the cause
  - 08:14: a commit is pushed (again, see the end)
  - 08:15: DNS record is verified, and seems to be fixed
  - 08:50: coffee is poured in a mug in misc's flat, and this post
mortem is written


Impact:
-------
review.gluster.org was only intermittently reachable for some people for
a few hours. I suspect the cage wasn't affected thanks to DNS caching,
but some jobs might have been affected.

The gluster.org top domain might have been impacted too, but I am not
sure how (MX was in place, DNS too, and we do not use gluster.org
directly anywhere, plus, I think there is some fallback and cache),
and nobody reported anything (and the monitoring also didn't scream).

Root cause:
-----------

The DNS entry was wrong: it returned 2 IP addresses while it should
have returned a single one. But the exact behavior was (IMHO) quite
subtle, as people will see now.

The initial DNS diff was this:

--- a/prod/external-default/gluster.org
+++ b/prod/external-default/gluster.org
@@ -1,6 +1,6 @@
 $TTL 300
 @       IN      SOA     ns1.redhat.com. noc.redhat.com.  (
-                                      2019040301 ; Serial
+                                      2019061301 ; Serial
                                       3600       ; Refresh
                                       1800       ; Retry
                                       604800     ; Expire
@@ -12,6 +12,7 @@ $TTL 300
                 IN       NS        ns3.redhat.com.
 ;
                 IN       MX    10  mx2.gluster.org.
+review          IN       MX    10  mx2.gluster.org.
 ;build           IN       MX    10  mx1.gluster.org.
 
 
@@ -34,7 +35,6 @@ lists           IN        CNAME    supercolony.rht
 
 git             IN        CNAME    gerrit.rht
 patches         IN        CNAME    gerrit.rht
-review          IN        CNAME    gerrit.rht
 gerrit          IN        CNAME    gerrit.rht
 gerrit-new.rht  IN        CNAME    gerrit.rht
 
@@ -60,6 +60,8 @@ _kerberos-master._udp   SRV     0 0 88 freeipa.gluster.org.
 _kerberos-master._tcp   SRV     0 0 88 freeipa.gluster.org.
 
 postgresql.rht  IN        A       8.43.85.170
+review          IN        A       8.43.85.171
 gerrit.rht      IN        A       8.43.85.171
 ; testVM for the switch to nftable
 chrono.rht      IN        A       8.43.85.172


At first look, any sysadmin will likely say this seems correct:
converting review to an A record (because MX and CNAME can't coexist; I
couldn't push that due to the zone syntax check on commit) and adding an
MX record.

I assume that the reader does not see what is wrong with this one (no
more than I did when I wrote it yesterday and checked my change), and
to be fair, what is wrong is not visible in the diff.

The fix was this (edited for readability):


--- a/prod/external-default/gluster.org
+++ b/prod/external-default/gluster.org
@@ -1,6 +1,6 @@
 $TTL 300
 @       IN      SOA     ns1.redhat.com. noc.redhat.com.  (
-                                      2019061301 ; Serial
+                                      2019061401 ; Serial
                                       3600       ; Refresh
                                       1800       ; Retry
                                       604800     ; Expire
@@ -10,18 +10,19 @@ $TTL 300
                 IN       NS        ns1.redhat.com.
                 IN       NS        ns2.redhat.com.
                 IN       NS        ns3.redhat.com.
+                IN       A         8.43.85.176
 ;
                 IN       MX    10  mx2.gluster.org.
 review          IN       MX    10  mx2.gluster.org.
 
 
 
-                IN        A        8.43.85.176
 ; RH DC
 mx2             IN        A        8.43.85.176
 


Turns out that, contrary to what I believed, the zone file format is
not a format where each line is fully separate and where order does not
matter (there is $ORIGIN, etc).

When you add an entry and give no name in a record (the first word on
the line), it doesn't use the domain name (that's the role of "@" or
$ORIGIN), but inherits the name of the previous record (see
https://en.wikipedia.org/wiki/Zone_file). So far, this resulted in
the same effect for the gluster.org zone file, because every record
without an explicit name (the first field) was at the start, and the
first record is the domain name.

But it all changed once I added the MX. 

Because this went from (edited to remove spaces and comments, and to
make the issue obvious and visible)

                 IN       NS        ns3.redhat.com.
                 IN       MX    10  mx2.gluster.org.
                 IN       A         8.43.85.176
 mx2             IN       A         8.43.85.176

to: 

                 IN       NS        ns3.redhat.com.
                 IN       MX    10  mx2.gluster.org.
 review          IN       MX    10  mx2.gluster.org.
                 IN       A         8.43.85.176
 mx2             IN       A         8.43.85.176

Which, using that presentation and indentation, kinda hints that there
is a problem. I always thought that the indentation was mostly
cosmetic, and that the format (unlike python) does not require it. Turns
out there is more to it.

The first commit placed the MX record in the wrong place, which changed
the meaning of the following line (the one that was outside the diff).
This resulted in review.gluster.org having a 2nd A record (for
8.43.85.176, supercolony), stealing the one of the apex domain (or top
or naked domain).
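
To make the inheritance rule concrete, here is a minimal Python sketch of
how a parser assigns owner names. It is not how BIND or our commit check
works, just an illustration under loud assumptions: one record per line,
no $ORIGIN/$TTL directives, no parenthesized records, class always "IN".
The function names and the simplified zone fragments are made up for the
example (the real file starts with the SOA record, not an NS one).

```python
def resolve_owners(zone_lines, origin="gluster.org."):
    """Return (owner, type, rdata) tuples for a simplified zone fragment."""
    records = []
    last_owner = origin  # assumption: nothing before the first record
    for line in zone_lines:
        text = line.split(";", 1)[0].rstrip()  # drop comments
        if not text.strip():
            continue
        fields = text.split()
        if text[0].isspace():
            # No name in the first column: inherit the previous owner.
            owner = last_owner
        else:
            name = fields.pop(0)
            if name == "@":
                owner = origin
            elif name.endswith("."):
                owner = name
            else:
                owner = f"{name}.{origin}"
        last_owner = owner
        # fields is now: IN, TYPE, rdata...
        records.append((owner, fields[1], " ".join(fields[2:])))
    return records


def a_records(records, owner):
    """List the A record addresses attached to one owner name."""
    return [rdata for o, rtype, rdata in records if o == owner and rtype == "A"]


BEFORE = [
    "@        IN NS    ns3.redhat.com.",
    "         IN MX 10 mx2.gluster.org.",
    "         IN A     8.43.85.176",
    "mx2      IN A     8.43.85.176",
    "review   IN CNAME gerrit.rht",
]

AFTER = [
    "@        IN NS    ns3.redhat.com.",
    "         IN MX 10 mx2.gluster.org.",
    "review   IN MX 10 mx2.gluster.org.",
    "         IN A     8.43.85.176",  # now inherits "review", not the apex
    "mx2      IN A     8.43.85.176",
    "review   IN A     8.43.85.171",
]

if __name__ == "__main__":
    for label, zone in (("before", BEFORE), ("after", AFTER)):
        parsed = resolve_owners(zone)
        print(label, "apex:", a_records(parsed, "gluster.org."))
        print(label, "review:", a_records(parsed, "review.gluster.org."))
```

With the "after" fragment, the apex ends up with no A record and review
ends up with two, which matches what was observed.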

That is the exact same issue as
https://lists.gluster.org/pipermail/gluster-infra/2018-August/004905.html
(the DNS one). Except that back then, I never found the problem.


Resolution:
------------
- DNS got fixed

What went well:
---------------
- not much, I was just lucky to find the issue. It was the 2nd time I
looked, and last time, I didn't find it. I guess what went well is that
it didn't go worse.

Where we were lucky:
-------------------

- I didn't oversleep too much, and wasn't more jetlagged from
vacation[2]
- the issue was found quickly, which is close to a miracle given I had
just woken up, and I didn't find it back in August 2018.

What went bad:
--------------
- monitoring didn't alert on anything. Given DNS propagation, it should
have alerted me during the evening if something happened, or so I
thought.

- DNS automated verification didn't pick that up, because the zone was
valid.
- manual verification didn't yield an error. Not sure why this worked
from my side of the world every time.

To do:
------

- contact the Holy See to get that certified as a "miracle". I am not a
morning person.

- try to understand why monitoring failed to see that something failed.

- now that we fixed the issue, go back to the change in August that
caused the issue the first time and apply it again (routing
build.gluster.org to /dev/null). That work was already ongoing
yesterday, but not pushed because it was late.
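
For the monitoring item, one thing that could be tried is a check that
compares the addresses a name resolves to against the addresses we
expect. The Python sketch below is only that, a sketch: the function
names and the EXPECTED table are hypothetical, it uses the local resolver
through socket.getaddrinfo (so it sees whatever the resolver and its
cache return), and it does not explain why the existing monitoring
stayed silent.

```python
import socket

# Hypothetical table of expected addresses, taken from the zone diff above.
EXPECTED = {"review.gluster.org": ["8.43.85.171"]}


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def address_drift(resolved, expected):
    """Compare resolved addresses with the expected ones."""
    resolved, expected = set(resolved), set(expected)
    return {
        "unexpected": sorted(resolved - expected),
        "missing": sorted(expected - resolved),
    }


if __name__ == "__main__":
    for host, expected in EXPECTED.items():
        drift = address_drift(resolve_ipv4(host), expected)
        if drift["unexpected"] or drift["missing"]:
            print(f"WARNING {host}: {drift}")
        else:
            print(f"OK {host}")
```

Keeping the comparison separate from the lookup means the comparison can
be tested without touching the network.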

Notes
-----
[1] yes, this post mortem follows the Chekhov's gun principle.
[2] yes, that's not much from a luck perspective. But I did manage to
sleep around 16h after taking the plane last week, and it took me a
while to adjust.

-- 
Michael Scherer
Sysadmin, Community Infrastructure


