[Gluster-infra] [Gluster-devel] Infra-related Regression Failures and What We're Doing

Tue Jan 23 09:00:36 UTC 2018

On Mon, Jan 22, 2018 at 5:13 PM, Nigel Babu <nigelb at redhat.com> wrote:

> Update: All the nodes that had problems with geo-rep are now fixed.
> Waiting on the patch to be merged before we switch over to CentOS 7. If
> things go well, we'll replace nodes one by one as soon as we have one green
> run on CentOS 7.
>

I just noticed we failed again on the geo-rep tests at
https://build.gluster.org/job/centos6-regression/8604/console . Nigel
reconfirmed that we have all the machines cleaned up. What else could be
going wrong here?
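
One way to double-check a node is to look for Gluster binaries that live
outside the build prefix. The following is only a minimal sketch, assuming
that /build/install is the expected prefix (as described in the quoted mail
below) and that a stray 'make install' would leave executables such as
glusterd, glusterfs or gluster in a directory on PATH. The function name and
the list of binaries are my own illustration, not part of our automation.

```python
import os
from pathlib import Path

# Assumption: illustrative list of executables to look for; adjust as needed.
BINARIES = ("glusterd", "glusterfs", "gluster")
# Assumption: the prefix regression builds are expected to install into.
EXPECTED_PREFIX = Path("/build/install")


def find_stray_installs(search_dirs, names=BINARIES,
                        expected_prefix=EXPECTED_PREFIX):
    """Return executables named in `names` that live outside expected_prefix."""
    prefix = Path(expected_prefix).resolve()
    stray = []
    for directory in search_dirs:
        directory = Path(directory)
        if not directory.is_dir():
            continue
        for name in names:
            candidate = directory / name
            if not (candidate.is_file() and os.access(candidate, os.X_OK)):
                continue
            try:
                # Succeeds only when the real path is inside the prefix.
                candidate.resolve().relative_to(prefix)
            except ValueError:
                stray.append(candidate)
    return stray


if __name__ == "__main__":
    # Skip empty PATH entries so the current directory is not scanned.
    path_dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    for path in find_stray_installs(path_dirs):
        print(path)
```

Run on a node, this prints any matching executable found on PATH that does
not resolve to a location under /build/install. An empty result does not
prove the node is clean, since libraries and Python modules can be left
behind as well.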

> On Mon, Jan 22, 2018 at 12:21 PM, Nigel Babu <nigelb at redhat.com> wrote:
>
>> Hello folks,
>>
>> As you may have noticed, we've had a lot of centos6-regression failures
>> lately. The geo-replication failures are the new ones which particularly
>> concern me. These failures have nothing to do with the tests themselves. The
>> tests are exposing a problem in our infrastructure that we've carried around
>> for a long time. Our machines are not clean machines that we automated. We
>> set up automation on machines that were already created. At some point, we
>> loaned machines for debugging. During this time, developers inadvertently
>> ran 'make install' on them, installing onto system paths rather than
>> into /build/install. This is what is causing the geo-replication tests
>> to fail. I've tried cleaning the machines up several times with little to
>> no success.
>>
>> Last week, we decided to take an aggressive path to fix this problem. We
>> planned to replace all our problematic nodes with new CentOS 7 nodes. This
>> exposed more problems. We expected a specific type of machine from
>> Rackspace. These are no longer offered. Thus, our automation fails on some
>> steps. I've spent this weekend tweaking our automation so that it works
>> on the new Rackspace machines and I'm down to just one test failure[1].
>> I have a patch up to fix this failure[2]. As soon as that patch is merged,
>> we can push forward with CentOS 7 nodes. In 4.0, we're dropping support for
>> CentOS 6, so it makes sense to make this switch sooner rather than later.
>>
>> We'll not be lending machines from production anymore. We'll be creating
>> new nodes which are snapshots of an existing production node. Each such
>> machine will be destroyed after use. This helps prevent this particular
>> problem in the future. This also means that our machine capacity at all
>> times is at 100 with very minimal wastage.
>>
>> [1]: https://build.gluster.org/job/cage-test/184/consoleText
>> [2]: https://review.gluster.org/#/c/19262/
>>
>> --
>> nigelb
>>
>
>
>
> --
> nigelb
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>