<div dir="ltr">One piece that&#39;s missing is when we&#39;ll restart the physical servers; the proposal doesn&#39;t cover that at all. The rest looks good to me and I&#39;m happy to add an item to the next sprint to automate the node rebooting.<br></div><br><div class="gmail_quote"><div dir="ltr">On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer &lt;<a href="mailto:mscherer@redhat.com">mscherer@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

so that&#39;s kernel reboot time again, this time courtesy of Intel<br>

(again). I do not consider the issue to be &quot;OMG the sky is falling&quot;,<br>

but it is reason enough to take the time to streamline our reboot process.<br>

<br>

<br>

<br>

Currently, we do not have a policy or anything, and I think the<br>

negotiation around each reboot is cumbersome:<br>

- we need to reach people, which takes time and adds latency (that would be<br>

bad if it were an urgent issue, and it likely adds undue stress while<br>

waiting)<br>

<br>

- we need to keep track of what was supposed to be done, which is also<br>

cumbersome<br>

<br>

While that would not be a problem if I had only Gluster to deal with, my team<br>

of 3 has to deal with more than one project, and orchestrating<br>

choices for a dozen groups is time consuming (just think of the last time you<br>

had to go to a restaurant after a conference to see how hard it is to<br>

reach agreement).<br>

<br>

So I would propose that we simplify that with the following policy:<br>

<br>

- Jenkins builders would be rebooted by Jenkins on a regular basis. <br>

I do not know how we can do that, but given that we have enough nodes to<br>

sustain builds, it shouldn&#39;t impact developers in a big way. The only<br>

exception is the FreeBSD builder, since we only have 1 functional one at<br>

the moment. But once the 2nd is working, it should be treated like the<br>

others.<br>

<br>

- services in HA (firewall, reverse proxy, internal squid/DNS) would be<br>

rebooted during the day without notice. Since HA is working, that has no<br>

user impact. In fact, that&#39;s already what I do.<br>

<br>

- services not in HA should be pushed towards HA (gerrit might get there one<br>

day, no way for jenkins :/, we need to look into postgres and hence<br>

fstat/softserve, and maybe try to get something for<br>

<a href="http://download.gluster.org" rel="noreferrer" target="_blank">download.gluster.org</a>)<br>

<br>

- reboots of critical services not in HA should be announced in advance.<br>

Critical means the services listed here: <a href="https://gluster-infra-docs.readthedocs.io/emergency.html" rel="noreferrer" target="_blank">https://gluster-infra-docs.readthedocs.io/emergency.html</a><br>

<br>

- services not visible to end users (backup servers, ansible deployment<br>

etc) can be rebooted at will<br>
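
<br>

As a rough sketch only, the policy above could be written as a small decision function. The class names, the boolean flags and the order of the checks are assumptions made for illustration, and the FreeBSD builder exception is not modelled.<br>

<pre>
```python
from dataclasses import dataclass
from enum import Enum


class RebootAction(Enum):
    REGULAR_AUTOMATED = "rebooted by Jenkins on a regular basis"
    DAYTIME_NO_NOTICE = "rebooted during the day without notice"
    ANNOUNCE_IN_ADVANCE = "reboot announced in advance"
    AT_WILL = "rebooted at will"
    UNDECIDED = "not covered by the proposed policy"


@dataclass(frozen=True)
class Service:
    # Hypothetical description of a service; the flags are illustrative only.
    name: str
    is_builder: bool = False
    in_ha: bool = False
    critical: bool = False
    user_visible: bool = True


def reboot_action(service: Service) -> RebootAction:
    """Map a service to a reboot rule, checking the policy bullets in order."""
    if service.is_builder:
        return RebootAction.REGULAR_AUTOMATED
    if service.in_ha:
        return RebootAction.DAYTIME_NO_NOTICE
    if service.critical:
        return RebootAction.ANNOUNCE_IN_ADVANCE
    if not service.user_visible:
        return RebootAction.AT_WILL
    # Visible, not critical, not in HA: the policy does not say yet.
    return RebootAction.UNDECIDED


if __name__ == "__main__":
    for svc in (
        Service("firewall", in_ha=True),
        Service("gerrit", critical=True),
        Service("backup", user_visible=False),
        Service("softserve"),
    ):
        print(svc.name, "->", reboot_action(svc).value)
```
</pre>

Anything that falls through to UNDECIDED is a service for which a rule still has to be agreed.<br>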

<br>

Then the only question is what to do about services not in the previous<br>

categories, like softserve and fstat.<br>

<br>

Also, all dependencies are as critical as the most critical service<br>

that depends on them. So hypervisors hosting gerrit/jenkins are critical<br>

(until we find a way to avoid outages), the ones for builders are not.<br>
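
<br>

The dependency rule above can be sketched as a propagation over a dependency graph. This is a minimal illustration: the integer criticality levels, the host names and the function name are hypothetical, and the graph is assumed to have no cycles.<br>

<pre>
```python
from typing import Dict, List


def effective_criticality(
    own: Dict[str, int], depends_on: Dict[str, List[str]]
) -> Dict[str, int]:
    """Raise each item to the highest criticality of anything depending on it.

    own: criticality of every item taken alone (higher means more critical).
    depends_on: for each service, the items it directly depends on.
    Every name used in depends_on must also appear in own.
    """
    # Invert the graph: dependents[x] lists what depends directly on x.
    dependents: Dict[str, List[str]] = {name: [] for name in own}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    resolved: Dict[str, int] = {}

    def resolve(name: str) -> int:
        # Memoized walk up the chain of dependents (assumes no cycles).
        if name not in resolved:
            levels = [own[name]] + [resolve(d) for d in dependents[name]]
            resolved[name] = max(levels)
        return resolved[name]

    return {name: resolve(name) for name in own}


if __name__ == "__main__":
    # Hypothetical inventory: 1 = critical, 0 = not critical.
    own = {"gerrit": 1, "builder": 0, "hypervisor-a": 0, "hypervisor-b": 0}
    depends_on = {"gerrit": ["hypervisor-a"], "builder": ["hypervisor-b"]}
    print(effective_criticality(own, depends_on))
```
</pre>

In this made-up inventory the hypervisor hosting gerrit inherits its criticality, while the one hosting only a builder does not.<br>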

<br>

<br>

<br>

Thoughts, ideas?<br>

<br>

<br>

-- <br>

Michael Scherer<br>

Sysadmin, Community Infrastructure and Platform, OSAS<br>

<br>

_______________________________________________<br>

Gluster-infra mailing list<br>

<a href="mailto:Gluster-infra@gluster.org" target="_blank">Gluster-infra@gluster.org</a><br>

<a href="https://lists.gluster.org/mailman/listinfo/gluster-infra" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-infra</a></blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">nigelb<br></div></div>