[Gluster-infra] Download.gluster.org 27 April 2016 postmortem

Wed Apr 27 11:39:45 UTC 2016

Excellent post-mortem!

Do you think it's worth adding mirrors to the gluster repos, like oVirt is
doing? [1]

[1] http://ovirt-infra-docs.readthedocs.org/en/latest/General/Mirror.html

On Wed, Apr 27, 2016 at 1:56 PM, Michael Scherer <mscherer at redhat.com>
wrote:

> Hi,
>
> As promised, here is the post-mortem of the incident. If you would like
> to see more information, or have any remarks, please do not hesitate to
> say so, since this is our first attempt at one.
>
> I modelled it on the example in
> http://shop.oreilly.com/product/0636920041528.do, as that is the book I
> am reading at the moment (Appendix D). We will formalize that later.
>
>
>
> Download.gluster.org was not serving files
> Date: 2016-04-27
> Participating people:
>  - misc
>
> Summary:
>
> The Download.gluster.org http server was returning error 403 for all
> URLs, which impacted oVirt Jenkins jobs and users of the repository,
> among others. The server is used to distribute gluster rpms.
>
> Impact:
> - oVirt CI jobs got blocked
> - users couldn't install gluster
>
> Root cause:
> The underlying block device on rackspace was down for an undiagnosed
> reason, triggering XFS errors on the server and thus 403 errors at the
> http level.
>
> The root cause of the block device error is still unknown; no errors
> were seen on the rackspace status page for this DC. A ticket was
> opened with rackspace to see what was going on (160427-iad-0000814). A
> follow-up to this post-mortem will be done if the ticket says something
> more than "shit happens".
>
> Resolution:
>
> The whole server was rebooted, and upon reboot, the block device came
> back.
>
> Lessons learned:
> - what went well:
>   - people notified the admin quickly on irc and on gluster-infra
>
> - where we were lucky
>   - the server and block device came back immediately
>   - it failed during business hours of EMEA with misc being on irc (just
> arrived at the office)
>
>
> - what went badly
>   - we do not have proper HA for the service
>   - we do not have automated monitoring for it
>   - the setup is using two block devices of 120G in lvm, thus making it
> roughly twice as likely to fail
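
The "twice as risky" point can be checked with a quick calculation. This is only a minimal sketch, assuming the two devices fail independently with the same probability p over some period, and that the lvm volume is lost as soon as either one fails. The value of p is made up for illustration; it is not a measured rackspace figure.

```python
def volume_failure_probability(p: float, devices: int) -> float:
    """Probability that at least one of `devices` independent devices fails."""
    return 1.0 - (1.0 - p) ** devices


p = 0.01  # hypothetical per-device failure probability, for illustration only

one = volume_failure_probability(p, 1)
two = volume_failure_probability(p, 2)

# With two devices the result is 2p - p^2, i.e. just under twice the
# single-device risk when p is small.
print(f"1 device: {one:.4f}, 2 devices: {two:.4f}, ratio: {two / one:.2f}")
```

For small p the ratio stays just under 2, which matches the estimate in the list above.
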
>
> Timeline (in UTC)
> - 05:39 first error message in the log about XFS error
> - 08:41 misc is pinged on irc
> - 08:56 misc acks and diagnoses the issue
> - 09:00 the server and service are back to normal
> - 09:00 first mail about the problem hits gluster-infra
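
To make the timeline easier to reason about, the intervals can be computed from the timestamps above. This is a rough sketch: it assumes the outage started at the first XFS error in the log (the real start of user impact is not stated), and the function and variable names are only illustrative.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Parse a UTC time from the timeline on the day of the incident."""
    return datetime.strptime("2016-04-27 " + hhmm, "%Y-%m-%d %H:%M")


def minutes(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timestamps copied from the timeline (UTC).
first_error = at("05:39")
pinged = at("08:41")
acked = at("08:56")
restored = at("09:00")

time_to_detect = minutes(first_error, pinged)  # until a human raised it on irc
time_to_ack = minutes(pinged, acked)
time_to_fix = minutes(acked, restored)
total_outage = minutes(first_error, restored)

print(f"detect: {time_to_detect} min, ack: {time_to_ack} min, "
      f"fix: {time_to_fix} min, total: {total_outage} min")
```

Under that assumption, about three hours of the outage passed before anyone was alerted, and the fix itself took only a few minutes.
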
>
>
> Potential improvements to make:
> - add monitoring on gluster side
> - use the centos sig repo on ovirt side
> - add more sysadmins for gluster
> - add a redundant service for that
>   - a 2nd download server with a shared gluster backend
> - migrate the storage to a proper setup with a single block device,
> rather than 2.
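
As a starting point for the monitoring item, a probe along these lines would have caught the 403. It is a hedged sketch using only the Python standard library, not a description of any monitoring that exists today; the function name `check_http`, the `fetch` parameter (there so the probe can be tested without network access) and the choice of URL to probe are all assumptions.

```python
import urllib.error
import urllib.request


def check_http(url, fetch=urllib.request.urlopen, timeout=10):
    """Probe `url` once and return (ok, detail).

    ok is True only when the server answers with HTTP 200.
    """
    try:
        with fetch(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers, such as the 403 seen during the incident.
        return False, f"HTTP {err.code}"
    except urllib.error.URLError as err:
        # DNS failure, refused connection, timeout and similar.
        return False, f"unreachable: {err.reason}"
    return status == 200, f"HTTP {status}"


# Example call (needs network access, so it is left commented out):
# ok, detail = check_http("http://download.gluster.org/")
```

Run from cron or hooked into an existing alerting system, a non-ok result could page an admin rather than waiting for someone to notice on irc.
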
>
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
>
>
>

-- 
Eyal Edri
Associate Manager
RHEV DevOps
EMEA ENG Virtualization R&D
Red Hat Israel

phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)