[Gluster-infra] Post mortem for review.gluster.org DNS outage
Michael Scherer
mscherer at redhat.com
Fri Jun 14 09:55:10 UTC 2019
Date: 14 June 2019
Participating:
- misc
- deepshika
Summary:
--------
People started to report HTTP issues on review.gluster.org, while our
monitoring was silent (monitoring kept spamming me during the night
about the download server being almost full following 1719388, so I
know it was working). A quick investigation showed this was due to the
DNS record returning 2 entries, which resulted in round robin between
the wrong server and the right one.
Timeline (in UTC):
-------------------
- 2019-05-08: misc goes on vacation
- 2019-05-24: RH IT contacts misc (and others), saying that mails with a
return address of "review at review.gluster.org" clog their SMTP servers'
queues. Mails are sent to folks, the server says "this is no longer
someone working here", tries to send a bounce back, this doesn't work,
and it fills the queue of the MX. As a few people left RH in the last 6
months, and some were likely getting all notifications, this did create
problems for them. Postfix is heavily I/O bound (all communication
between the dozens of daemons is done using queues on disk, synced for
reliability), so filling the queue heavily impacts the operation of an
MX, slowing it down, resulting in bigger queues, etc.
- 2019-05-27: Deepshika and Duck try to fix this, not understanding why
the email is not working, or how it was supposed to work (spoiler: it
was never supposed to work). They conclude with "too weird, we need to
wait".
- 2019-06-12: misc is back from vacation, sees his mails, prioritizes
them and explains that review at review.gluster.org was never supposed
to be working, which is why people found nothing, and that this was just
the default setting of Gerrit.
- 2019-06-13: misc decides to set up an MX for the review.gluster.org
domain to drop all incoming emails, solving the bounce issue for IT.
See the ansible repo commits for how that's done.
  - 16:57: misc adds an MX record for review.gluster.org pointing to the
    supercolony IP address, after adding the code to route the whole
    domain to /dev/null, then waits a bit to see that nothing broke and
    goes home (assuming monitoring would scream during the evening if
    anything happened). The diff for the DNS change is shown later[1].
  - 23:00: seeing that monitoring didn't scream, misc decides to go to
    bed and sleep.
- 2019-06-14: folks start to report an outage as India folks start their
day
  - 04:17: bug 1720453 is opened
  - 05:31: Deepshika correctly diagnoses the DNS issue, sees that it was
    related to the last change, and tries to contact misc on Telegram
  - 07:50: misc wakes up, sees his phone blinking, answers the messages
  - 08:10: misc checks various things, reaches the same conclusion as
    Deepshika, proposes a workaround
  - 08:13: after squinting hard at the diff, misc finally finds
    something that could be the cause
  - 08:14: a commit is pushed (again, see the end)
  - 08:15: the DNS record is verified, and seems to be fixed
  - 08:50: coffee is poured in a mug in misc's flat, and this post
    mortem is written
Impact:
-------
review.gluster.org was only randomly reachable for some people for a few
hours. I suspect the cage wasn't affected thanks to DNS caching, but
some jobs might have been affected.
The gluster.org top domain might have been impacted too, but I am not
sure how (the MX was in place, DNS too, and we do not use bare
gluster.org directly anywhere, plus, I think there is some fallback and
cache), and nobody reported anything (and the monitoring also didn't
scream).
Root cause:
-----------
The DNS entry was wrong: it returned 2 IP addresses while it should
have returned a single one. But the exact behavior was (IMHO) quite
subtle, as people will see now.
The initial DNS diff was this:
--- a/prod/external-default/gluster.org
+++ b/prod/external-default/gluster.org
@@ -1,6 +1,6 @@
$TTL 300
@ IN SOA ns1.redhat.com. noc.redhat.com. (
- 2019040301 ; Serial
+ 2019061301 ; Serial
3600 ; Refresh
1800 ; Retry
604800 ; Expire
@@ -12,6 +12,7 @@ $TTL 300
IN NS ns3.redhat.com.
;
IN MX 10 mx2.gluster.org.
+review IN MX 10 mx2.gluster.org.
;build IN MX 10 mx1.gluster.org.
@@ -34,7 +35,6 @@ lists IN CNAME supercolony.rht
git IN CNAME gerrit.rht
patches IN CNAME gerrit.rht
-review IN CNAME gerrit.rht
gerrit IN CNAME gerrit.rht
gerrit-new.rht IN CNAME gerrit.rht
@@ -60,6 +60,8 @@ _kerberos-master._udp SRV 0 0 88 freeipa.gluster.org.
_kerberos-master._tcp SRV 0 0 88 freeipa.gluster.org.
postgresql.rht IN A 8.43.85.170
+review IN A 8.43.85.171
gerrit.rht IN A 8.43.85.171
; testVM for the switch to nftable
chrono.rht IN A 8.43.85.172
At first look, any sysadmin will likely say this seems correct:
converting review to an A record (because MX and CNAME can't coexist; I
couldn't push that due to the zone syntax check on commit) and adding an
MX record.
I assume that the reader does not see what is wrong with this one (no
more than I did when I wrote it yesterday and checked my change), and
to be fair, what is wrong is not visible in the diff.
The fix was this (edited for readability):
--- a/prod/external-default/gluster.org
+++ b/prod/external-default/gluster.org
@@ -1,6 +1,6 @@
$TTL 300
@ IN SOA ns1.redhat.com. noc.redhat.com. (
- 2019061301 ; Serial
+ 2019061401 ; Serial
3600 ; Refresh
1800 ; Retry
604800 ; Expire
@@ -10,18 +10,19 @@ $TTL 300
IN NS ns1.redhat.com.
IN NS ns2.redhat.com.
IN NS ns3.redhat.com.
+ IN A 8.43.85.176
;
IN MX 10 mx2.gluster.org.
review IN MX 10 mx2.gluster.org.
- IN A 8.43.85.176
; RH DC
mx2 IN A 8.43.85.176
Turns out that, contrary to what I believed, the zone file format is
not a format where each line is fully separate and where order does not
matter (there is $ORIGIN, etc).
When you add an entry and give no name in a record (the first word on
the line), it doesn't use the domain name (that's the role of "@" or
$ORIGIN), but it inherits the previous one (see
https://en.wikipedia.org/wiki/Zone_file). So far, this resulted in
the same effect for the gluster.org zone file, because every record
without an explicit name (the first field) was at the start, and the
first record is the domain name.
But it all changed once I added the MX.
Because this went from (edited to remove spaces and comments, and to
make the issue obvious and visible)
        IN NS ns3.redhat.com.
        IN MX 10 mx2.gluster.org.
        IN A 8.43.85.176
mx2     IN A 8.43.85.176
to:
        IN NS ns3.redhat.com.
        IN MX 10 mx2.gluster.org.
review  IN MX 10 mx2.gluster.org.
        IN A 8.43.85.176
mx2     IN A 8.43.85.176
Which, using that presentation and indentation, kinda hints that there
is a problem. I always thought that the indentation was mostly
cosmetic, and that the format (unlike Python) does not require it.
Turns out there is more to it.
The first commit placed the MX record at the wrong place, which changed
the meaning of the following line (the one that was outside the diff).
This resulted in review.gluster.org having a 2nd A record (for
8.43.85.176, supercolony), stealing the one of the apex domain (or top
or naked domain).
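
To make the inheritance rule concrete, here is a minimal Python sketch
(not part of the actual tooling) that fills in the name of each record
for this simple case. The function name effective_owners and the
simplified parsing (one record per line, no multi-line parentheses, no
$ORIGIN changes, ";" always starting a comment) are assumptions for
illustration. The records are the ones from the excerpt above, and the
first unnamed record is assumed to inherit the apex from the SOA record
that precedes it in the real file.

```python
def effective_owners(zone_text, origin="gluster.org."):
    """Return (owner, record) pairs with unnamed records filled in.

    Simplified on purpose: one record per line, no parentheses,
    no $ORIGIN handling, ';' always starts a comment.
    """
    records = []
    last_owner = origin  # assumption: the preceding SOA record is at the apex
    for raw in zone_text.splitlines():
        line = raw.split(";", 1)[0].rstrip()  # drop comments
        if not line.strip() or line.lstrip().startswith("$"):
            continue  # skip blank lines and directives such as $TTL
        if line[0] in " \t":
            # No name given: the record inherits the previous owner.
            owner = last_owner
            rest = line.split()
        else:
            fields = line.split()
            owner, rest = fields[0], fields[1:]
            if owner == "@":
                owner = origin
            elif not owner.endswith("."):
                owner = owner + "." + origin  # relative name
        last_owner = owner
        records.append((owner, " ".join(rest)))
    return records


BEFORE = """\
        IN NS ns3.redhat.com.
        IN MX 10 mx2.gluster.org.
        IN A 8.43.85.176
mx2     IN A 8.43.85.176
"""

AFTER = """\
        IN NS ns3.redhat.com.
        IN MX 10 mx2.gluster.org.
review  IN MX 10 mx2.gluster.org.
        IN A 8.43.85.176
mx2     IN A 8.43.85.176
"""
```

With BEFORE, the record "IN A 8.43.85.176" is attributed to
gluster.org.; with AFTER, the very same untouched line is attributed to
review.gluster.org. instead.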
That is the exact same issue as
https://lists.gluster.org/pipermail/gluster-infra/2018-August/004905.html
(the DNS one). Except that back then, I never found the problem.
Resolution:
------------
- DNS got fixed
What went well:
---------------
- not much, I was just lucky to find the issue. It was the 2nd time I
looked, and last time, I didn't find it. I guess what went well is that
it didn't go worse.
Where we got lucky:
-------------------
- I didn't oversleep too late, and wasn't more jetlagged from
vacation[2]
- the issue was found quickly, which is close to a miracle given I had
just woken up, and I didn't find it back in August 2018.
What went bad:
--------------
- monitoring didn't alert on anything. Given DNS propagation, it should
have alerted me during the evening if something happened, or so I
thought.
- DNS automated verification didn't pick that up, because the zone was
valid.
- manual verification didn't yield an error. Not sure why this worked
from my side of the world every time.
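
Since the broken zone was syntactically valid, a syntax check alone
could not catch it. One possible extra check, sketched here in Python
under the assumption of a simple one-record-per-line zone file, is to
warn about every record that has no explicit name and inherits from
something other than the apex. The function name
inherited_owner_warnings is hypothetical, and such inheritance is
legitimate zone file style, so this would be a hint for the reviewer
rather than a hard error.

```python
def inherited_owner_warnings(zone_text):
    """List (line number, inherited name, record) for unnamed records
    that inherit from a name other than the apex ("@").

    Simplified: one record per line, ';' always starts a comment.
    """
    warnings = []
    last_owner = "@"  # assumption: the file starts with apex records
    for lineno, raw in enumerate(zone_text.splitlines(), start=1):
        line = raw.split(";", 1)[0].rstrip()
        if not line.strip() or line.lstrip().startswith("$"):
            continue  # blank line or directive
        if line[0] in " \t":
            if last_owner != "@":
                warnings.append((lineno, last_owner, line.strip()))
        else:
            last_owner = line.split()[0]
    return warnings


SAMPLE = """\
@       IN NS ns3.redhat.com.
        IN MX 10 mx2.gluster.org.
review  IN MX 10 mx2.gluster.org.
        IN A 8.43.85.176
mx2     IN A 8.43.85.176
"""

for lineno, owner, record in inherited_owner_warnings(SAMPLE):
    print(f"line {lineno}: '{record}' applies to '{owner}', not the apex")
```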
To do:
------
- contact the Holy See to get that certified as a "miracle". I am not a
morning person.
- try to understand why monitoring failed to see that something failed.
- now that we fixed the issue, go back to the change in August that
caused the issue the first time and apply it again (routing
build.gluster.org to /dev/null). That work was ongoing yesterday
already, not pushed because it was late.
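
Whatever the reason monitoring stayed silent, one check that does not
depend on which address a client happens to pick is to compare the full
set of addresses returned for a name with the expected set. The sketch
below is only an illustration in Python: the function names are
hypothetical, it is not how the current monitoring works, and it asks
the local resolver, so a cached answer can still hide a recent change.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the local resolver gives."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return {info[4][0] for info in infos}


def check_addresses(hostname, expected, resolver=resolve_ipv4):
    """Return (unexpected, missing) address sets for hostname.

    The resolver is injectable so the check can be tried without
    network access.
    """
    actual = set(resolver(hostname))
    expected = set(expected)
    return actual - expected, expected - actual


if __name__ == "__main__":
    # 8.43.85.171 is the address of review in the zone diff.
    unexpected, missing = check_addresses(
        "review.gluster.org", {"8.43.85.171"}
    )
    if unexpected or missing:
        print("DNS mismatch:", "unexpected", sorted(unexpected),
              "missing", sorted(missing))
```

During the outage such a check would have reported 8.43.85.176 as
unexpected on every run, instead of succeeding whenever the right
address came first.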
Notes
-----
[1] yes, this post mortem follows the Chekhov's gun principle.
[2] yes, that's not much from a luck perspective. But I did manage to
sleep around 16h after taking the plane last week, and it took me a
while to adjust.
--
Michael Scherer
Sysadmin, Community Infrastructure