<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Aug 23, 2018 at 10:49 AM, Michael Scherer <span dir="ltr"><<a href="mailto:mscherer@redhat.com" target="_blank">mscherer@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Thursday 23 August 2018 at 11:21 +0530, Nigel Babu wrote:<br>
> One more piece that's missing is when we'll restart the physical<br>
> servers.<br>
> That seems to be entirely missing. The rest looks good to me and I'm<br>
> happy<br>
> to add an item to next sprint to automate the node rebooting.<br>
<br>
That's covered by "as critical as the services that depend on them".<br>
<br>
Now, the problem I do have is that one server (myrmicinae, to name it)<br>
takes 30 minutes to reboot, and I can't diagnose or fix that without<br>
taking hours. This is the one running gerrit/jenkins, so it's not<br>
possible to spend time on this kind of test.<br></blockquote><div><br></div><div>You'd imagine people would move to kexec reboots for VMs by now.</div><div>Not sure why it's not catching on.</div><div>(BTW, is it taking time to shut down or to bring up?)</div><div>Y.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
<br>
<br>
> On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer <<a href="mailto:mscherer@redhat.com">mscherer@redhat.com</a>><br>
> wrote:<br>
> <br>
> > Hi,<br>
> > <br>
> > so it's kernel reboot time again, this time courtesy of Intel<br>
> > (again). I do not consider the issue to be "OMG the sky is<br>
> > falling",<br>
> > but it is enough to justify taking time to streamline our reboot process.<br>
> > <br>
> > <br>
> > <br>
> > Currently, we do not have a policy or anything, and I think the<br>
> > negotiation around that is cumbersome:<br>
> > - we need to reach people, which takes time and adds latency (would<br>
> > be<br>
> > bad if that was an urgent issue, and likely adds unneeded stress while<br>
> > waiting)<br>
> > <br>
> > - we need to keep track of what was supposed to be done, which is<br>
> > also<br>
> > cumbersome<br>
> > <br>
> > While that would not be a problem if I had only gluster to deal with, my<br>
> > team<br>
> > of 3 has to deal with a few more projects than 1, and<br>
> > orchestrating<br>
> > choices for a dozen groups is time consuming (just think of the last time<br>
> > you<br>
> > had to go to a restaurant after a conference to see how hard it is<br>
> > to<br>
> > reach agreement).<br>
> > <br>
> > So I would propose that we simplify that with the following policy:<br>
> > <br>
> > - Jenkins builders would be rebooted by jenkins on a regular basis.<br>
> > I do not know how we can do that, but given that we have enough<br>
> > nodes to<br>
> > sustain builds, it shouldn't impact developers in a big way. The<br>
> > only<br>
> > exception is the freebsd builder, since we only have 1 functional<br>
> > at<br>
> > the moment. But once the 2nd is working, it should be treated like<br>
> > the<br>
> > others.<br>
> > <br>
> > - services in HA (firewall, reverse proxy, internal squid/DNS) would<br>
> > be<br>
> > rebooted during the day without notice. Thanks to working HA, that has no<br>
> > user impact. In fact, that's already what I do.<br>
> > <br>
> > - services not in HA should be pushed toward HA (gerrit might get there<br>
> > one<br>
> > day, no way for jenkins :/, need to look into postgres and so<br>
> > fstat/softserve, and maybe try to get something for<br>
> > <a href="http://download.gluster.org" rel="noreferrer" target="_blank">download.gluster.org</a>)<br>
> > <br>
> > - reboots of critical services not in HA should be announced in advance.<br>
> > Critical means the services listed here:<br>
> > <a href="https://gluster-infra-docs.readthedocs.io/emergency.html" rel="noreferrer" target="_blank">https://gluster-infra-docs.readthedocs.io/emergency.html</a><br>
> > <br>
> > - services not visible to end users (backup servers, ansible<br>
> > deployment<br>
> > etc) can be rebooted at will<br>
> > <br>
> > Then the only question is what to do about stuff not in the previous<br>
> > categories, like softserve and fstat.<br>
> > <br>
> > Also, all dependencies are as critical as the most critical service<br>
> > that depends on them. So hypervisors hosting gerrit/jenkins are<br>
> > critical<br>
> > (until we find a way to avoid an outage), the ones for builders are<br>
> > not.<br>
> > <br>
> > <br>
> > <br>
> > Thoughts, ideas ?<br>
<span class="">> > <br>
> > <br>
> > --<br>
> > Michael Scherer<br>
> > Sysadmin, Community Infrastructure and Platform, OSAS<br>
> > <br>
</span>> > ______________________________<wbr>_________________<br>
> > Gluster-infra mailing list<br>
> > <a href="mailto:Gluster-infra@gluster.org">Gluster-infra@gluster.org</a><br>
> > <a href="https://lists.gluster.org/mailman/listinfo/gluster-infra" rel="noreferrer" target="_blank">https://lists.gluster.org/<wbr>mailman/listinfo/gluster-infra</a><br>
<div class="HOEnZb"><div class="h5">> <br>
> <br>
> <br>
-- <br>
Michael Scherer<br>
Sysadmin, Community Infrastructure and Platform, OSAS<br>
<br>
</div></div><br>______________________________<wbr>_________________<br>
Gluster-devel mailing list<br>
<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a><br>
<a href="https://lists.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">https://lists.gluster.org/<wbr>mailman/listinfo/gluster-devel</a><br></blockquote></div><br></div></div>