[Gluster-devel] [Gluster-infra] Reboot policy for the infra

Yaniv Kaul ykaul at redhat.com
Thu Aug 23 08:37:10 UTC 2018


On Thu, Aug 23, 2018 at 10:49 AM, Michael Scherer <mscherer at redhat.com>
wrote:

> On Thursday, 23 August 2018 at 11:21 +0530, Nigel Babu wrote:
> > One more piece that's missing is when we'll restart the physical
> > servers.
> > That seems to be entirely missing. The rest looks good to me and I'm
> > happy
> > to add an item to next sprint to automate the node rebooting.
>
> That's covered as "as critical as the services that depend on them".
>
> Now, the problem I do have is that one server (myrmicinae, to name it)
> takes 30 minutes to reboot, and I can't diagnose or fix that without
> taking hours. This is the one running gerrit/jenkins, so it's not
> possible to spend time on this kind of test.
>

You'd imagine people would have moved to kexec reboots for VMs by now.
Not sure why it's not catching on.
(BTW, is it taking time to shut down or to come up?)
Y.


>
>
>
> > On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer <mscherer at redhat.com>
> > wrote:
> >
> > > Hi,
> > >
> > > So it's kernel reboot time again, this time courtesy of Intel
> > > (again). I do not consider the issue to be "OMG the sky is
> > > falling", but it is serious enough to take the time to streamline
> > > our reboot process.
> > >
> > >
> > >
> > > Currently, we do not have a policy or anything, and I think the
> > > time spent negotiating each reboot is cumbersome:
> > > - we need to reach people, which takes time and adds latency (that
> > > would be bad for an urgent issue, and likely adds undue stress
> > > while waiting)
> > >
> > > - we need to keep track of what was supposed to be done, which is
> > > also cumbersome
> > >
> > > While that wouldn't be a problem if I had only gluster to deal
> > > with, my team of 3 has to deal with more than one project, and
> > > orchestrating choices for a dozen groups is time consuming (just
> > > think of the last time you had to go to a restaurant after a
> > > conference to see how hard it is to reach an agreement).
> > >
> > > So I would propose that we simplify that with the following policy:
> > >
> > > - Jenkins builders would be rebooted by Jenkins on a regular basis.
> > > I do not know how we can do that, but given that we have enough
> > > nodes to sustain builds, it shouldn't impact developers in a big
> > > way. The only exception is the FreeBSD builder, since we only have
> > > 1 functional at the moment. But once the 2nd is working, it should
> > > be treated like the others.
> > >
> > > - services in HA (firewall, reverse proxy, internal squid/DNS)
> > > would be rebooted during the day without notice. Thanks to working
> > > HA, that has no user impact. In fact, that's already what I do.
> > >
> > > - services not in HA should be pushed towards HA (gerrit might get
> > > there one day, no way for jenkins :/, need to look into postgres
> > > and so fstat/softserve, and maybe try to get something for
> > > download.gluster.org)
> > >
> > > - reboots of critical services not in HA should be announced in
> > > advance. Critical means the services listed here:
> > > https://gluster-infra-docs.readthedocs.io/emergency.html
> > >
> > > - services not visible to end users (backup servers, ansible
> > > deployment, etc.) can be rebooted at will
> > >
> > > Then the only question is what to do about things not in the
> > > previous categories, like softserve and fstat.
> > >
> > > Also, all dependencies are as critical as the most critical service
> > > that depends on them. So hypervisors hosting gerrit/jenkins are
> > > critical (until we find a way to avoid an outage), the ones for
> > > builders are not.
> > >
> > >
> > >
> > > Thoughts, ideas?
> > >
> > >
> > > --
> > > Michael Scherer
> > > Sysadmin, Community Infrastructure and Platform, OSAS
> > >
> > > _______________________________________________
> > > Gluster-infra mailing list
> > > Gluster-infra at gluster.org
> > > https://lists.gluster.org/mailman/listinfo/gluster-infra
> >
> >
> >
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>