[Gluster-infra] [Gluster-devel] Infra-related Regression Failures and What We're Doing

Atin Mukherjee amukherj at redhat.com
Wed Jan 24 02:13:30 UTC 2018


Both tests are now marked as bad, since there has been more than one
instance where they failed even after the infra problem was fixed. I
request the geo-rep team to take a look and revive the tests soon.
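
For whoever picks this up on the geo-rep side: reviving a test generally
means fixing the underlying issue, removing the bad-test marker from the .t
file, and verifying it locally before sending the patch. A rough sketch is
below; the test path and the marker convention are from memory and should be
treated as assumptions rather than the exact current setup:

    # Run an individual regression test against a source build
    # (geo-rep test path is illustrative):
    prove -vf tests/geo-rep/georep-basic-dr-rsync.t

    # Bad tests are flagged in the .t file itself with a status comment
    # along these lines (exact tag is an assumption):
    #   #G_TESTDEF_TEST_STATUS_CENTOS6=BAD_TEST,BUG=nnnnnn
    # Reviving the test means dropping that line once it passes reliably.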

On Tue, Jan 23, 2018 at 2:30 PM, Atin Mukherjee <amukherj at redhat.com> wrote:

>
>
> On Mon, Jan 22, 2018 at 5:13 PM, Nigel Babu <nigelb at redhat.com> wrote:
>
>> Update: all the nodes that had problems with geo-rep are now fixed. We're
>> waiting on the patch to be merged before we switch over to Centos 7. If
>> things go well, we'll replace the nodes one by one as soon as we have one
>> green run on Centos 7.
>>
>
> I just noticed we failed again on the geo-rep tests @
> https://build.gluster.org/job/centos6-regression/8604/console . Nigel
> reconfirmed that we have all the machines cleaned up. What else could be
> going wrong here?
>
>
>> On Mon, Jan 22, 2018 at 12:21 PM, Nigel Babu <nigelb at redhat.com> wrote:
>>
>>> Hello folks,
>>>
>>> As you may have noticed, we've had a lot of centos6-regression failures
>>> lately. The geo-replication failures are the new ones that particularly
>>> concern me. These failures have nothing to do with the tests themselves;
>>> they are exposing a problem in our infrastructure that we've carried
>>> around for a long time. Our machines are not clean machines that we
>>> provisioned through automation; we set up automation on machines that had
>>> already been created. At some point, we loaned machines out for debugging,
>>> and during that time developers have inadvertently run 'make install' onto
>>> system paths rather than into /build/install. This is what is causing the
>>> geo-replication tests to fail. I've tried cleaning the machines up several
>>> times with little to no success.
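>>>
>>> For anyone working on a loaned machine, the intended flow is roughly the
>>> following; the exact configure invocation and the spot-check paths are
>>> illustrative assumptions, not a dump of the regression job's setup:
>>>
>>>     # Keep the build self-contained under /build/install so that
>>>     # 'make install' never touches system paths.
>>>     ./autogen.sh
>>>     ./configure --prefix=/build/install
>>>     make -j"$(nproc)"
>>>     make install
>>>
>>>     # Spot-check for leftovers from an accidental system-wide install
>>>     # (locations assumed):
>>>     ls -l /usr/local/sbin/gluster* /usr/local/lib/libglusterfs* 2>/dev/null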
>>>
>>> Last week, we decided to take an aggressive path to fix this problem: we
>>> planned to replace all our problematic nodes with new Centos 7 nodes. This
>>> exposed more problems. Our automation expected a specific type of machine
>>> from Rackspace, which is no longer offered, so it failed on some steps. I
>>> spent this weekend tweaking the automation so that it works on the new
>>> Rackspace machines, and I'm down to just one test failure[1]. I have a
>>> patch up to fix that failure[2]. As soon as it is merged, we can push
>>> forward with the Centos 7 nodes. We're dropping support for Centos 6 in
>>> 4.0, so it makes sense to make this move sooner rather than later.
>>>
>>> We will no longer lend out machines from production. Instead, we'll create
>>> a new node from a snapshot of an existing production node, and that
>>> machine will be destroyed after use. This prevents this particular problem
>>> from recurring, and it also means our full machine capacity is available
>>> at all times, with very minimal wastage.
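>>>
>>> Roughly, the loaner workflow would look like the sketch below. The
>>> OpenStack CLI is shown only for illustration, and the server, image, and
>>> flavor names are placeholders; the actual tooling is still to be decided:
>>>
>>>     # Snapshot an existing production builder and boot a throwaway
>>>     # copy of it for the developer:
>>>     openstack server image create --name loaner-snap builder-nn
>>>     openstack server create --image loaner-snap --flavor standard-4 loaner-debug
>>>
>>>     # Once debugging is done, destroy the loaner outright; any stray
>>>     # 'make install' disappears with it:
>>>     openstack server delete loaner-debug
>>>     openstack image delete loaner-snap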
>>>
>>> [1]: https://build.gluster.org/job/cage-test/184/consoleText
>>> [2]: https://review.gluster.org/#/c/19262/
>>>
>>> --
>>> nigelb
>>>
>>
>>
>>
>> --
>> nigelb
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>