[Gluster-infra] Gluster mailman outage postmortem

Mon Mar 12 13:01:19 UTC 2018

lists.gluster.org outage

Date: 2018-03-11

Participating people:
 - misc
 - amye

Summary:
The disk on supercolony.gluster.org filled up, which took mailman down.
https://bugzilla.redhat.com/show_bug.cgi?id=1554176

Impact:
- users of the Gluster mailing lists

Root cause:

The single partition of the server was full. 

This was most likely caused by our WAF (mod_security) logs taking a lot
of space (around 1.5G per week), due to an increase in bot scanning, in
particular of the old wordpress blog. The old wordpress blog wasn't
supposed to be exposed anymore, but its configuration was still there
since ansible never removes files. Requests to the bare IP are served by
the first vhost encountered in the configuration, which was
"blog.gluster.org". As a result, requests to the /xmlrpc.php url were
triggering alerts on the WAF, and mod_sec is quite verbose.
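
To confirm this kind of diagnosis, a minimal sketch like the following
lists the largest entries directly under a directory. The function name
largest_entries and the default of 10 results are illustrative choices,
not something used during the incident; it assumes a find that supports
-mindepth/-maxdepth (GNU or BSD).

```shell
# Hypothetical helper: print the COUNT largest entries directly under DIR,
# biggest first, with sizes in kilobytes.
# usage: largest_entries DIR [COUNT]
largest_entries() {
    dir=$1
    count=${2:-10}
    # du flags: -x stay on one filesystem, -s one total per argument,
    # -k report sizes in kilobytes. find hands each direct child to du.
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -xsk {} + 2>/dev/null \
        | sort -rn \
        | head -n "$count"
}
```

For example, largest_entries /var/log 5 would show the five biggest
files or directories under /var/log.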

Resolution:

- yum cache was cleaned to get back 600M as an emergency measure

  # yum clean all

- logs from mod_sec were compressed using gzip, going from ~1.5G each
to 40M.

  # for i in /var/log/httpd/modsec_audit.log-2* ; do gzip "$i" ; done

- blog.gluster.org vhost config was removed

  # rm -f /etc/httpd/conf/blog.gluster.org* ; service httpd restart 
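
The compression step above can be sketched as a small reusable function.
This is an illustration only, not what was run during the incident: the
name compress_rotated_logs is hypothetical, and it assumes rotated files
are named modsec_audit.log-<suffix> as in the command above.

```shell
# Hypothetical helper: gzip every rotated mod_security audit log in DIR,
# skipping files that are already compressed.
# usage: compress_rotated_logs DIR
compress_rotated_logs() {
    dir=$1
    for f in "$dir"/modsec_audit.log-*; do
        # Skip the literal pattern when nothing matches, and non-regular files
        [ -f "$f" ] || continue
        # Skip anything that already carries a .gz suffix
        case $f in
            *.gz) continue ;;
        esac
        gzip -- "$f"
    done
}
```

The live modsec_audit.log is left alone since it does not match the
pattern, so running this repeatedly is harmless.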

Lessons learned:

- what went well:
  - a bug was filed
  - the root cause was quickly identified and fixed

- where we were lucky
  - misc was awake and connected to internal irc on a weekend night

- what went badly
  - no monitoring
  - bad partition setup
  - bad cleanup of httpd configuration

Timeline (in UTC)

- 2018-03-12  01:11  amye pings misc on internal irc and internal
channel with https://bugzilla.redhat.com/show_bug.cgi?id=1554176
- 2018-03-12  01:13  misc diagnoses the issue as "disk full"
- 2018-03-12  01:17  misc frees 600M while waiting for du -sh to finish
- 2018-03-12  01:22  misc pinpoints the issue to the WAF and compresses
the logs for further examination
- 2018-03-12  01:24  misc notices the wordpress exposure issue and removes
the vhost from the config
- 2018-03-12 

Potential improvement to make:

- we need to install better monitoring

- the pattern of having 1 big server for everything should be changed,
as this leads to problems on cleanup, and the lack of separation means
we have a single domain of failure (so an issue on a legacy system
impacts the prod system).
  - split the duties of supercolony across separate VMs
  - move it to the cage

- httpd logs should be rotated _and_ compressed.

- people shouldn't work on weekends

- reconsider mod_security usage on that server
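
As an illustration of the monitoring item above, here is a minimal sketch
of a disk usage check. The function names disk_usage_percent and
check_disk and the 90% default threshold are assumptions made for this
example, not an existing check on supercolony. It also assumes the
filesystem name printed by df contains no spaces.

```shell
# Hypothetical check: print how full the filesystem holding PATH is, as an
# integer percentage.
disk_usage_percent() {
    # df -P prints one line per filesystem in POSIX format;
    # column 5 is the capacity, e.g. "87%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical check: fail when usage reaches LIMIT percent (default 90).
# usage: check_disk PATH [LIMIT]
check_disk() {
    path=$1
    limit=${2:-90}
    used=$(disk_usage_percent "$path")
    if [ "$used" -ge "$limit" ]; then
        echo "CRITICAL: $path is ${used}% full (limit ${limit}%)"
        return 2
    fi
    echo "OK: $path is ${used}% full"
    return 0
}
```

The return code 2 mirrors the Nagios plugin convention for CRITICAL, so
a call such as check_disk / 90 could be run from cron or wrapped by a
monitoring agent.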

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
