[Gluster-infra] [Gluster-devel] Reboot policy for the infra

Thu Aug 23 08:49:59 UTC 2018

On Thursday 23 August 2018 at 11:37 +0300, Yaniv Kaul wrote:
> On Thu, Aug 23, 2018 at 10:49 AM, Michael Scherer <mscherer at redhat.com>
> wrote:
> 
> > On Thursday 23 August 2018 at 11:21 +0530, Nigel Babu wrote:
> > > One more piece that's missing is when we'll restart the physical
> > > servers. The rest looks good to me, and I'm happy to add an item
> > > to the next sprint to automate the node rebooting.
> > 
> > That's covered by the "as critical as the services that depend on
> > them" rule.
> > 
> > Now, the problem I have is that some servers (myrmicinae, to name
> > one) take 30 minutes to reboot, and I can't diagnose or fix that
> > without spending hours on it. This is the one running
> > gerrit/jenkins, so it's not possible to spend time on this kind of
> > test.
> > 
> 
> You'd imagine people would move to kexec reboots for VMs by now.
> Not sure why it's not catching on.
> (BTW, is it taking time to shut down or to bring up?)
> Y.

To bring up, according to my notes.

And I am not sure how kexec would work with a microcode update. We also
need to upgrade the BIOS sometimes :/
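
For reference, the rule from the proposal quoted below ("all dependencies
are as critical as the most critical service that depend on them") can be
sketched in a few lines of Python. This is only an illustration under
assumptions: the two criticality levels, the host names hypervisor-a and
hypervisor-b, the dependency map and the function name are all
hypothetical and not taken from our actual inventory.

```python
# Hypothetical sketch of the "dependencies inherit criticality" rule.
# Levels, host names and the dependency map are illustrative assumptions.

NORMAL, CRITICAL = 1, 2


def effective_criticality(own, depends_on):
    """Raise each dependency to the level of its most critical dependent.

    own:        mapping of node name -> its own criticality level
    depends_on: mapping of node name -> list of nodes it depends on
    """
    effective = dict(own)
    for service, level in own.items():
        # Walk every direct and indirect dependency of this service.
        stack = list(depends_on.get(service, []))
        seen = set()
        while stack:
            dep = stack.pop()
            if dep in seen:  # also protects against dependency cycles
                continue
            seen.add(dep)
            effective[dep] = max(effective.get(dep, NORMAL), level)
            stack.extend(depends_on.get(dep, []))
    return effective


own = {
    "gerrit": CRITICAL,
    "jenkins": CRITICAL,
    "builder": NORMAL,
    "hypervisor-a": NORMAL,  # hypothetical host for gerrit/jenkins
    "hypervisor-b": NORMAL,  # hypothetical host for builders
}
depends_on = {
    "gerrit": ["hypervisor-a"],
    "jenkins": ["hypervisor-a"],
    "builder": ["hypervisor-b"],
}

levels = effective_criticality(own, depends_on)
```

With this made-up inventory, the hypervisor under gerrit/jenkins ends up
critical (so its reboot would be announced in advance), while the one
hosting only builders stays non-critical.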

> 
> > 
> > 
> > 
> > > > On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer <mscherer at redhat.com>
> > > > wrote:
> > > 
> > > > Hi,
> > > > 
> > > > > so it's kernel reboot time again, this time courtesy of Intel
> > > > > (again). I do not consider the issue to be "OMG the sky is
> > > > > falling", but it is serious enough to take the time to
> > > > > streamline our reboot process.
> > > > 
> > > > 
> > > > 
> > > > > Currently, we do not have a policy or anything, and I think the
> > > > > negotiation around each reboot is cumbersome:
> > > > > - we need to reach people, which takes time and adds latency
> > > > > (that would be bad for an urgent issue, and likely adds undue
> > > > > stress while waiting)
> > > > >
> > > > > - we need to keep track of what was supposed to be done, which
> > > > > is also cumbersome
> > > > 
> > > > > While that would not be a problem if I had only gluster to deal
> > > > > with, my team of 3 has to deal with more than one project, and
> > > > > orchestrating choices for a dozen groups is time consuming
> > > > > (just think of the last time you had to go to a restaurant
> > > > > after a conference to see how hard it is to reach agreement).
> > > > 
> > > > So I would propose that we simplify that with the following
> > > > policy:
> > > > 
> > > > > - Jenkins builders would be rebooted by jenkins on a regular
> > > > > basis. I do not know how we can do that, but given that we have
> > > > > enough nodes to sustain builds, it shouldn't impact developers
> > > > > in a big way. The only exception is the freebsd builder, since
> > > > > we only have 1 functional at the moment. But once the 2nd is
> > > > > working, it should be treated like the others.
> > > > 
> > > > > - services in HA (firewall, reverse proxy, internal squid/DNS)
> > > > > would be rebooted during the day without notice. Since HA is
> > > > > working, that has no user impact. In fact, that's already what
> > > > > I do.
> > > > 
> > > > > - services not in HA should be pushed towards HA (gerrit might
> > > > > get there one day, no way for jenkins :/, need to see for
> > > > > postgres and thus fstat/softserve, and maybe try to get
> > > > > something for download.gluster.org)
> > > > 
> > > > > - services that are critical and not in HA should have their
> > > > > reboots announced in advance. Critical means the services
> > > > > listed here:
> > > > > https://gluster-infra-docs.readthedocs.io/emergency.html
> > > > 
> > > > > - services not visible to end users (backup servers, ansible
> > > > > deployment, etc.) can be rebooted at will
> > > > 
> > > > > Then the only question is what to do about things not in the
> > > > > previous categories, like softserve and fstat.
> > > > 
> > > > > Also, all dependencies are as critical as the most critical
> > > > > service that depends on them. So hypervisors hosting
> > > > > gerrit/jenkins are critical (until we find a way to avoid
> > > > > outages), the ones for builders are not.
> > > > 
> > > > 
> > > > 
> > > > > Thoughts, ideas?
> > > > 
> > > > 
> > > > --
> > > > Michael Scherer
> > > > Sysadmin, Community Infrastructure and Platform, OSAS
> > > > 
> > > 
> > > 
> > > 
> > 
> > --
> > Michael Scherer
> > Sysadmin, Community Infrastructure and Platform, OSAS
> > 
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
