[Gluster-infra] Build.gluster.org not sending mail 16 May 2016 postmortem
mscherer at redhat.com
Tue May 17 06:32:02 UTC 2016
Here is yet another postmortem ('cause I feel like it).
Build.gluster.org not sending mail
The IP address used by the Gluster Jenkins server (build.gluster.org)
ended up on a DNS blacklist, which prevented it from sending mail to the
mailing list server (supercolony.gluster.org), among others.
- maintainers and Amye were not notified of new releases
- mail notifications might not have been received
The root cause has not been found at the moment. Investigation showed that
the IP address was present in SBL, which pulled the IP from CBL (another
blacklist). However, looking at the mail sent by Jenkins, none of it seems
to have triggered that.
On further investigation, it was found that the IP address assigned to
build.g.o (22.214.171.124) is different from the one used for outgoing
connections (126.96.36.199). This is caused by an asymmetric setup of the
firewall NAT in the DC.
So infosec was notified of the problem, since this could have been
caused by malware on any server behind that IP address.
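For reference, the kind of blacklist lookup described above can be sketched in Python. This is only a minimal sketch, not the check that was actually run: the function names are hypothetical, and 192.0.2.1 is a documentation address used as a placeholder. A DNS blacklist is queried by reversing the octets of the IPv4 address and resolving the result under the blacklist zone; getting an answer means "listed", while a resolution failure is treated here as "not listed".

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name: reversed IPv4 octets followed by the zone."""
    # IPv4Address raises a ValueError subclass on malformed input.
    addr = ipaddress.IPv4Address(ip)
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return reversed_octets + "." + zone


def is_listed(ip, zone):
    """Return True if the resolver returns an address for the query name."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No answer (typically NXDOMAIN): treated as "not listed".
        return False
    return True
```

With a shared NAT address, `is_listed()` has to be called with the outgoing address rather than the one assigned to the server, for example `is_listed("192.0.2.1", "sbl-xbl.spamhaus.org")`. Some blacklist operators answer differently depending on the resolver used, so a negative result should be read with care.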
The immediate fix was to remove SBL from the list of blacklists used by
supercolony, which was done by a commit to this file
(https://github.com/gluster/gluster.org_salt_pillar/blob/master/smtp_blacklist.sls ). Jenkins should thus be able to send mail again.
- what went well:
  - someone noticed that mail was not being received and notified the admins
- where we were lucky:
  - not much critical mail traffic comes from Jenkins
  - it failed during EMEA business hours, with misc being "idle" and
    looking at irc while on PTO
- what went bad:
  - we do not have proper monitoring for that kind of issue
  - there are no details on how the server was added to the list
Timeline (in UTC)
14 May 2016
- 09:53 first message in the log about being in the blacklist
15 May 2016
- 12:36 ndevos pings misc on irc (#gluster-dev) about the problem
- 12:38 misc finds that 188.8.131.52 is in sbl-xbl.spamhaus.org
- 12:43 misc finds that the blacklist actually doing the blocking is CBL,
with "It was last detected at 2016-05-15 00:00 GMT"
- 12:51 misc remembers that the IP is shared, so it is normal to not
find anything on the Jenkins server
- 13:00 infosec is notified (INC0401121)
- 13:02 commit d552601 removes the DNS blacklist from supercolony
- 20:00 infosec investigates and reports that there is no sensor on that link
16 May 2016
- 06:30 postmortem is sent
Potential improvements to make:
- add monitoring for that
- check logs for errors
- add it to monitoring (either gluster side when we have it, or IT
- whitelist gluster server ip in postfix
- get a separate ip for the server
- proper infosec monitoring like the others
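The "check logs for errors" item could start from something like the sketch below. It assumes the mail server logs blacklist rejections in the usual Postfix form ("Client host [ip] blocked using zone; ..."). The sample log lines, host names and function name are made up for illustration, and reading the real log file and raising an alert are left out.

```python
import re
from collections import Counter

# Matches the "blocked using <zone>" part of a Postfix DNSBL rejection line.
BLOCKED_RE = re.compile(
    r"Client host \[(?P<ip>[^\]]+)\] blocked using (?P<zone>[^;\s]+)"
)


def count_dnsbl_rejections(lines):
    """Count rejections per (client ip, blacklist zone) pair."""
    hits = Counter()
    for line in lines:
        match = BLOCKED_RE.search(line)
        if match:
            hits[(match.group("ip"), match.group("zone"))] += 1
    return hits


# Illustrative log lines, not taken from the real servers.
SAMPLE_LOG = [
    "May 14 09:53:01 mx postfix/smtpd[1234]: NOQUEUE: reject: RCPT from "
    "unknown[192.0.2.1]: 554 5.7.1 Service unavailable; Client host "
    "[192.0.2.1] blocked using sbl-xbl.spamhaus.org; from=<a@example.org> "
    "to=<b@example.org> proto=ESMTP helo=<build.example.org>",
    "May 14 09:54:10 mx postfix/smtpd[1234]: connect from unknown[192.0.2.7]",
]

for (ip, zone), count in count_dnsbl_rejections(SAMPLE_LOG).items():
    print("%s rejected %d time(s) by %s" % (ip, count, zone))
```

A monitoring check could run this over the recent part of the mail log and alert as soon as any of our own addresses shows up in the result.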
Sysadmin, Community Infrastructure and Platform, OSAS