[Gluster-devel] POC- Distributed regression testing framework
nigelb at redhat.com
Tue Jun 26 03:32:01 UTC 2018
On Mon, Jun 25, 2018 at 7:28 PM Amar Tumballi <atumball at redhat.com> wrote:
> There are currently a few known issues:
>> * Not collecting the entire logs (/var/log/glusterfs) from servers.
> If I look at the activities involved with regression failures, this can
Well, we can't debug the current failures without having the logs. So this
has to be fixed first.
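
For what it's worth, the collection step could be a small post-run hook on the job side. A minimal sketch (the host names, destination directory, and `collect_cmd` helper are assumptions for illustration, not the actual job config):

```shell
#!/bin/sh
# Sketch: build the rsync command used to pull /var/log/glusterfs from
# each test server after a run. Hosts and destination are placeholders;
# a real job would get them from the provisioning step.

collect_cmd() {
    # $1 = server hostname, $2 = local destination directory
    echo "rsync -az $1:/var/log/glusterfs/ $2/$1/"
}

for host in server1 server2 server3; do
    collect_cmd "$host" collected-logs
done
```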
>> * A few tests fail due to infra-related issues like geo-rep tests.
> Please open bugs for this, so we can track them, and take it to closure.
These are failing due to infra reasons. Most likely subtle differences in
the setup of these nodes vs our normal nodes. We'll only be able to debug
them once we get the logs. I know the geo-rep ones are easy to fix. The
playbook for setting up geo-rep correctly just didn't make it over to the
playbook used for these images.
>> * Takes ~80 minutes with 7 distributed servers (targeting 60 minutes)
> Time can change as more tests are added, and also please plan to have the
> number of servers configurable from 1 to n.
While n is configurable, it will be fixed to a single-digit number for
now. We need to place *some* limit somewhere, or we'll end up unable to
control our cloud bills.
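
Concretely, the clamp could be as simple as the sketch below (the
`MAX_SERVERS=9` ceiling and the `clamp_servers` helper are assumptions,
not the real parameter handling):

```shell
#!/bin/sh
# Sketch: n is configurable but clamped to a single-digit ceiling so a
# bad parameter can't spin up an unbounded number of cloud machines.
MAX_SERVERS=9

clamp_servers() {
    # $1 = requested server count; print the count we actually provision
    n="$1"
    if [ "$n" -lt 1 ]; then n=1; fi
    if [ "$n" -gt "$MAX_SERVERS" ]; then n="$MAX_SERVERS"; fi
    echo "$n"
}
```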
>> * We've only tested plain regressions. ASAN and Valgrind are currently
> Great to have it running not 'per patch', but as nightly, or weekly to
> start with.
This is not targeted until we phase out the current regressions.
>> Before bringing it into production, we'll run this job nightly and
>> watch it for a month to debug the other failures.
> I would say, bring it to production sooner, say 2 weeks, and also plan to
> have the current regression as is with a special command like 'run
> regression in-one-machine' in gerrit (or something similar) with voting
> rights, so we can fall back to this method if something is broken in
> parallel testing.
> I have seen that regardless of the amount of time we put scripts through
> testing, the day we move to production, something breaks. So, let that
> happen earlier rather than later, so it helps the next release branch
> out. We don't want branching to be stuck due to infra failures.
Having two regression jobs that can vote is going to cause more confusion
than it's worth. There are a couple of intermittent memory issues with the
test script that we need to debug and fix before I'm comfortable making
this job a voting job. We've worked around these problems for now, but
they still pop up now and again. The fact that something always breaks on
the move to production is not an excuse to skip preventing the avoidable
failures. The one-month timeline was chosen with all of these factors in
mind; the 2-week timeline is a no-go at this point.
When we are ready to make the switch, we won't be switching 100% of the
job. We'll start with a sliding scale so that we can monitor failures and
machine creation adequately.
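
One way to implement that sliding scale is deterministic bucketing, so a
given change always lands on the same job while we ramp the percentage
up. A sketch, where the rollout percentage, the job names, and the
`pick_job` helper are all assumptions for illustration:

```shell
#!/bin/sh
# Sketch: route a fixed fraction of changes to the new distributed job
# and the rest to the existing one, keyed on the change number so the
# routing is stable per change.
ROLLOUT_PERCENT=10

pick_job() {
    # $1 = gerrit change number; map it into a 0-99 bucket and compare
    # against the rollout percentage
    bucket=$(( $1 % 100 ))
    if [ "$bucket" -lt "$ROLLOUT_PERCENT" ]; then
        echo "distributed-regression"
    else
        echo "centos-regression"
    fi
}
```

Raising ROLLOUT_PERCENT week over week would shift load gradually while
we watch failure rates and machine creation.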