[Gluster-infra] Outage Announcement: Mar 14 2017

Sun Feb 19 15:40:42 UTC 2017

Hello folks,

Michael and I brought up the community cage outage at the last meeting.
The details are now confirmed. There will be a major outage window for the
community cage on March 14-16 2017. The migration will begin around 1000 EST /
1600 CET / 2030 IST. During this time, servers will move from one DC (RAL2) to
a newer space in another building (RAL3). This is to accommodate our growth
(notably in power and space). The downtime is unavoidable because the
networking gear is also moving. The extra two days are for the biggest tenant,
Ceph, which has a larger footprint in the cage.

The Gluster servers have high priority because our setup is fairly simple and
the outage affects the team. The cage team is confident that they can get us
back online within US East Coast working hours. We expect services to be ready
by the end of the first day and to be back up and running by the morning of the
15th in India.

We have people who can assist in case of problems with the move. Both Michael
and I will be ready to restart the system as soon as possible. As there will be
no hardware change, just a physical move, we are confident that things will
restart quickly.

## Impact in Europe and Asia
Gerrit and Jenkins will be down for the second half of your working day.
We will bring down Jenkins at 1300 UTC / 1400 CET / 1830 IST. Any jobs still
running at that point will be aborted. Other services will follow shortly.

## Impact in the Americas
Jenkins and Gerrit will be down throughout your working day. Before the
migration starts, we'll power down servers to prevent data corruption.

## Services Affected
* Jenkins
* Gerrit (stage and prod)
* Fstat
* PostgreSQL
* CentOS CI (Not owned by us but also part of the move)

## FAQ
Q: Have you thought of moving things to Rackspace to avoid an outage?
A: Yes, we did. Our Jenkins/Gluster deployment is not automated yet, so moving
   to Rackspace would have required a rather long downtime, as we saw when we
   moved from iWeb to the cage. Moving back would then have required a second
   outage, making this approach impractical and riskier.

Q: What is going to happen during the outage window?
A: The exact schedule is still being worked on. The rough outline is to allow
   about 4 hours to shut down the network gear, move it, plug it back in, and
   configure it. Then the Gluster rack will be moved and brought back up, which
   shouldn't take more than a few hours.

Q: What should we do without Gerrit and Jenkins?
A: Excellent question! Here is a list of tasks that do not require them:
   * Bug triage: there are always bugs to reproduce.
   * Documentation would benefit from a cleanup and some restructuring.
   * Coverity scan reports and the results of the other static analysis we run
     are also in dire need of a set of eyes.
   * User support: Stack Overflow users and gluster-users posters would love to
     get help.
   * We're sure that our release managers would appreciate a thorough round of
     testing for 3.10.
   * Sysadmins love chocolate cake, so baking a cake for them would make them
     happy. Congratulations, you've actually read this email!

Q: Will the website, mailman, or downloads.gluster.org be affected?
A: No. They're currently hosted on Rackspace. We have no plans to move them in
   the next year (see Infrastructure Plans for 2017 on gluster-infra).

--
nigelb