[Gluster-infra] Build.gluster.org not sending mail 16 May 2016 postmortem

Michael Scherer mscherer at redhat.com
Tue May 17 06:32:02 UTC 2016


Here is yet another postmortem (because I feel like it).

Build.gluster.org not sending mail 

Date: 2016-04-27
Participating people:
 - misc

The IP address used by Gluster's Jenkins server (build.gluster.org)
ended up on a DNS blacklist, which prevented it from sending mail to the
mailing list server (supercolony.gluster.org), among others.

- new releases were not announced to maintainers and Amye
- mail notifications might not have been received

Root cause:
Not found at the moment. Investigation showed that the IP address was
listed in SBL, which pulled the entry from CBL (another blacklist).
However, none of the mail sent by Jenkins seems to have triggered the
listing. Looking further, it turned out that the IP address assigned to
build.g.o is different from the one used for outgoing connections, due
to an asymmetric NAT setup on the firewall in the DC.
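For reference, a DNS blacklist such as SBL is queried over plain DNS: the IPv4 octets are reversed and prepended to the blacklist zone, and any A record in the answer means the address is listed. The Python sketch below shows that lookup; the function names are hypothetical. Because of the asymmetric NAT, the address to check is the outgoing one, not the one assigned to build.g.o. As a simplification, the sketch treats every resolver error as "not listed" and ignores the returned 127.0.0.x code, which a real check should inspect.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL query name: IPv4 octets reversed, then the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str = "sbl-xbl.spamhaus.org") -> bool:
    """Return True if the zone answers with an A record for this IP.

    Simplification: any resolver failure (socket.gaierror, which covers
    NXDOMAIN but also temporary errors) is reported as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

For example, dnsbl_query_name("192.0.2.10", "sbl-xbl.spamhaus.org") gives "10.2.0.192.sbl-xbl.spamhaus.org", and is_listed("192.0.2.10") performs the actual DNS query (192.0.2.10 is a documentation placeholder, not the real address).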

Infosec was therefore notified of the problem, since the listing could
have been caused by malware on any server behind that IP address.


The immediate fix was to remove SBL from the list of blacklists used by
supercolony, which was done with a commit to this file
(https://github.com/gluster/gluster.org_salt_pillar/blob/master/smtp_blacklist.sls ). Mail from Jenkins should therefore be delivered again.
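Why dropping SBL (or, as proposed below, whitelisting the Gluster servers) lets the mail through can be sketched with a small model of a single Postfix restriction list, where restrictions are evaluated in order and the first one that permits or rejects wins. This is a hypothetical Python model, not the actual supercolony configuration, which is not reproduced here; in real Postfix the whitelist would be an entry such as permit_mynetworks or check_client_access placed before reject_rbl_client. All names and the IP address are placeholders.

```python
from typing import Callable, List, Optional, Set

# Hypothetical model: a restriction returns "PERMIT", "REJECT", or None
# (no decision, so evaluation falls through to the next restriction).
Restriction = Callable[[str], Optional[str]]


def permit_ips(allowed: Set[str]) -> Restriction:
    """Whitelist: permit known client IPs before any DNSBL is consulted."""
    return lambda ip: "PERMIT" if ip in allowed else None


def reject_rbl(zone: str, listed: Callable[[str, str], bool]) -> Restriction:
    """Reject client IPs that the given DNSBL zone lists."""
    return lambda ip: "REJECT" if listed(ip, zone) else None


def evaluate(restrictions: List[Restriction], ip: str) -> str:
    """The first restriction that decides wins; if none decides, accept."""
    for restriction in restrictions:
        decision = restriction(ip)
        if decision is not None:
            return decision
    return "PERMIT"


JENKINS_IP = "192.0.2.10"  # placeholder documentation address


def fake_listed(ip: str, zone: str) -> bool:
    """Stand-in for a real DNS lookup: only SBL lists the Jenkins IP."""
    return zone == "sbl-xbl.spamhaus.org" and ip == JENKINS_IP


sbl = reject_rbl("sbl-xbl.spamhaus.org", fake_listed)
before: List[Restriction] = [sbl]
after_fix: List[Restriction] = []  # SBL dropped, as in the immediate fix
whitelisted: List[Restriction] = [permit_ips({JENKINS_IP}), sbl]

print(evaluate(before, JENKINS_IP))       # REJECT
print(evaluate(after_fix, JENKINS_IP))    # PERMIT
print(evaluate(whitelisted, JENKINS_IP))  # PERMIT
```

The whitelist keeps SBL protection for everyone else while exempting known Gluster hosts, which is why it appears below as a longer-term improvement over simply removing the blacklist.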

Lessons learned:
- what went well:
  - someone noticed that mail was not being received and notified the admins.

- where we were lucky
  - not much critical mail traffic comes from Jenkins
  - it failed during EMEA business hours, while misc, although on PTO,
was "idle" and watching IRC.

- what went wrong
  - we do not have proper monitoring for that kind of issue
  - there are no details on how the server was added to the list

Timeline (in UTC)
14 May 2016
- 09:53 first message in the log about the IP being on the blacklist

15 May 2016
- 12:36 ndevos pinged misc on IRC (#gluster-dev) about the problem
- 12:38 misc found that the IP is listed in sbl-xbl.spamhaus.org
- 12:43 misc found that the listing actually comes from CBL, with "It was
last detected at 2016-05-15 00:00 GMT"
- 12:51 misc remembered that the IP is shared, so it is normal not to
find anything on the Jenkins server
- 13:00 infosec was notified (INC0401121)
- 13:02 commit d552601 removed SBL from the DNS blacklists used by
supercolony
- 20:00 infosec investigated and reported that there is no sensor on
that link

16 May 2016
- 06:30 postmortem was sent

Potential improvements:
- add monitoring for this kind of issue
  - check logs for errors
  - add it to monitoring (either on the Gluster side once we have it, or
on the IT side)
- whitelist Gluster server IPs in Postfix
- get a separate IP address for the server
- get proper infosec monitoring, like the others
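The "check logs for errors" item could start as a small log scan in the style of a Nagios plugin (exit code 0 for OK, 2 for CRITICAL). This Python sketch assumes that rejections carry Postfix's usual "blocked using <zone>" wording; the script name, log path and sample lines are made up for illustration and do not come from the incident.

```python
import re
import sys
from collections import Counter
from typing import Iterable

# Assumption: rejections carry Postfix's usual "blocked using <zone>" text.
BLOCKED_RE = re.compile(r"blocked using ([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")


def count_blacklist_rejections(lines: Iterable[str]) -> Counter:
    """Count log lines that report a DNSBL rejection, keyed by zone."""
    counts: Counter = Counter()
    for line in lines:
        match = BLOCKED_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def check(lines: Iterable[str]) -> int:
    """Nagios-style result: 0 (OK) if no rejection, 2 (CRITICAL) otherwise."""
    counts = count_blacklist_rejections(lines)
    if not counts:
        print("OK: no DNSBL rejections in log")
        return 0
    summary = ", ".join(f"{zone}={n}" for zone, n in sorted(counts.items()))
    print(f"CRITICAL: DNSBL rejections found: {summary}")
    return 2


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_dnsbl_log.py /var/log/maillog
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        sys.exit(check(log))
```

For instance, a made-up bounce line such as "status=bounced (host mx.example.org said: 554 5.7.1 Service unavailable; Client host [192.0.2.10] blocked using sbl-xbl.spamhaus.org; ...)" would be counted under sbl-xbl.spamhaus.org and make the check return 2.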

Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
