[Gluster-devel] POC- Distributed regression testing framework

Amar Tumballi atumball at redhat.com
Thu Oct 4 07:45:53 UTC 2018


On Thu, Oct 4, 2018 at 12:54 PM Xavi Hernandez <jahernan at redhat.com> wrote:

> On Wed, Oct 3, 2018 at 11:57 AM Deepshikha Khandelwal <dkhandel at redhat.com>
> wrote:
>
>> Hello folks,
>>
>> Distributed-regression job[1] is now a part of Gluster's
>> nightly-master build pipeline. The following are the issues we have
>> resolved since we started working on this:
>>
>> 1) Collecting gluster logs from servers (a simplified sketch of this
>> step follows below).
>> 2) Tests that failed due to infra-related issues have been fixed.
>> 3) The time taken to run regression testing has been reduced to ~50-60
>> minutes.
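>>
>> The collection step is roughly of this shape (a simplified sketch; the
>> server names are illustrative, and the real job drives this from its
>> Jenkins scripts):
>>
>>     #!/bin/bash
>>     # Pull /var/log/glusterfs from every test server into the
>>     # workspace so the logs get archived with the build.
>>     SERVERS="server1.example.com server2.example.com"  # hypothetical
>>     for server in ${SERVERS}; do
>>         mkdir -p "logs/${server}"
>>         scp -r "jenkins@${server}:/var/log/glusterfs" "logs/${server}/"
>>     done
>>     tar -czf gluster-logs.tar.gz logs/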
>>
>> Getting the time down to 40 minutes needs your help!
>>
>> Currently, there is a test that is failing:
>>
>> tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t
>>
>> This needs fixing first.
>>
>> There's a test that takes 14 minutes to complete -
>> `tests/bugs/index/bug-1559004-EMLINK-handling.t`. A single test taking
>> 14 minutes is not something we can distribute. Can we look at how we
>> can speed this up[2]? When this test fails, it is re-attempted,
>> further increasing the time. This happens in the regular
>> centos7-regression job as well.
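>>
>> For reference, the re-attempt behaviour is roughly of this shape (a
>> sketch only, not the actual run-tests.sh logic):
>>
>>     #!/bin/bash
>>     # Run a .t test, retrying once on failure, as our harness does.
>>     # A failing 14-minute test therefore costs up to ~28 minutes.
>>     run_test() {
>>         local t="$1"
>>         if ! prove -vf "$t"; then
>>             echo "Retrying ${t} once..."
>>             prove -vf "$t"
>>         fi
>>     }
>>     run_test tests/bugs/index/bug-1559004-EMLINK-handling.t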
>>
>
> I made a change [1] to reduce the amount of time this test needs. With
> this change the test completes in about 90 seconds. It would need some
> reviews from maintainers though.
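>
> The idea, roughly, is to lower the hardlink limit first so that EMLINK
> triggers after a handful of links instead of tens of thousands. A
> sketch of that shape, assuming the storage.max-hardlinks volume option
> (the actual change is in [1]):
>
>     #!/bin/bash
>     . $(dirname $0)/../../include.rc
>     . $(dirname $0)/../../volume.rc
>
>     cleanup;
>     TEST glusterd
>     TEST $CLI volume create $V0 $H0:$B0/brick0
>     # Lower the limit so only a few links are needed to hit it.
>     TEST $CLI volume set $V0 storage.max-hardlinks 10
>     TEST $CLI volume start $V0
>     TEST glusterfs -s $H0 --volfile-id $V0 $M0
>
>     TEST touch $M0/file
>     for i in $(seq 1 9); do
>         ln $M0/file $M0/link$i
>     done
>     # With the limit at 10, one more link should fail with EMLINK.
>     TEST ! ln $M0/file $M0/link10
>
>     cleanup;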
>
> Do you want me to send a patch with this change alone?
>
> Xavi
>
> [1]
> https://review.gluster.org/#/c/glusterfs/+/19254/22/tests/bugs/index/bug-1559004-EMLINK-handling.t
>
>

Yes, please! That would be useful, and we can merge it sooner that way!

-Amar


>
>> If you see any other issues, please file a bug[3].
>>
>> [1]: https://build.gluster.org/job/distributed-regression
>> [2]: https://build.gluster.org/job/distributed-regression/264/console
>> [3]:
>> https://bugzilla.redhat.com/enter_bug.cgi?product=glusterfs&component=project-infrastructure
>>
>> Thanks,
>> Deepshikha Khandelwal
>> On Tue, Jun 26, 2018 at 9:02 AM Nigel Babu <nigelb at redhat.com> wrote:
>> >
>> >
>> >
>> > On Mon, Jun 25, 2018 at 7:28 PM Amar Tumballi <atumball at redhat.com> wrote:
>> >>
>> >>
>> >>
>> >>> There are currently a few known issues:
>> >>> * Not collecting the entire logs (/var/log/glusterfs) from servers.
>> >>
>> >>
>> >> If I look at the activities involved with regression failures, this
>> >> can wait.
>> >
>> >
>> > Well, we can't debug the current failures without having the logs, so
>> > this has to be fixed first.
>> >
>> >>
>> >>
>> >>>
>> >>> * A few tests, like the geo-rep ones, fail due to infra-related issues.
>> >>
>> >>
>> >> Please open bugs for these, so we can track them and take them to
>> >> closure.
>> >
>> >
>> > These are failing due to infra reasons, most likely subtle differences
>> > in the setup of these nodes vs our normal nodes. We'll only be able to
>> > debug them once we get the logs. I know the geo-rep ones are easy to fix;
>> > the playbook for setting up geo-rep correctly just didn't make it over to
>> > the playbook used for these images.
>> >
>> >>
>> >>
>> >>>
>> >>> * Takes ~80 minutes with 7 distributed servers (targeting 60 minutes)
>> >>
>> >>
>> >> The time can change as more tests are added. Also, please plan to
>> >> support any number of servers, from 1 to n.
>> >
>> >
>> > While n is configurable, it will be fixed to a single-digit number for
>> > now. We need to place *some* limitation somewhere, or else we'll end up
>> > not being able to control our cloud bills.
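>> >
>> > Conceptually, the split is just chunking the test list across the n
>> > workers. A minimal sketch of that idea (the real scheduling lives in
>> > the job's own scripts):
>> >
>> >     #!/bin/bash
>> >     # Distribute .t files round-robin across N workers.
>> >     N=7  # number of servers; configurable, but capped
>> >     mapfile -t TESTS < <(find tests -name '*.t' | sort)
>> >     for i in "${!TESTS[@]}"; do
>> >         echo "${TESTS[$i]}" >> "worker-$(( i % N )).list"
>> >     done
>> >     # Each server then runs only its own list, e.g.:
>> >     #   prove -v $(cat worker-0.list)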
>> >
>> >>
>> >>
>> >>>
>> >>> * We've only tested plain regressions. ASAN and Valgrind are
>> >>> currently untested.
>> >>
>> >>
>> >> It would be great to have it running not 'per patch', but nightly, or
>> >> weekly to start with.
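>> >>
>> >> For reference, an ASAN run is essentially a rebuild with the
>> >> sanitizer flags before running the same tests. A generic sketch, not
>> >> necessarily how the job will wire it up:
>> >>
>> >>     #!/bin/bash
>> >>     # Build glusterfs with AddressSanitizer and run the suite
>> >>     # against the instrumented binaries.
>> >>     export CFLAGS="-g -O1 -fsanitize=address -fno-omit-frame-pointer"
>> >>     export LDFLAGS="-fsanitize=address"
>> >>     ./autogen.sh && ./configure --enable-debug
>> >>     make -j"$(nproc)" && make install  # install may need root
>> >>     # Fail fast on the first sanitizer report.
>> >>     export ASAN_OPTIONS="abort_on_error=1"
>> >>     ./run-tests.sh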
>> >
>> >
>> > This is currently not targeted until we phase out current regressions.
>> >
>> >>>
>> >>>
>> >>> Before bringing it into production, we'll run this job nightly and
>> >>> watch it for a month to debug the other failures.
>> >>>
>> >>
>> >> I would say, bring it to production sooner, say in 2 weeks, and also
>> >> plan to keep the current regression available as-is via a special
>> >> Gerrit command like 'run regression in-one-machine' (or something
>> >> similar) with voting rights, so we can fall back to that method if
>> >> something is broken in parallel testing.
>> >>
>> >> I have seen that regardless of how much time we spend testing scripts,
>> >> something breaks the day we move to production. So let that happen
>> >> earlier rather than later, so that it helps the next release branch
>> >> out. I don't want branching to be stuck due to infra failures.
>> >
>> >
>> > Having two regression jobs that can vote is going to cause more
>> > confusion than it's worth. There are a couple of intermittent memory
>> > issues with the test script that we need to debug and fix before I'm
>> > comfortable making this job a voting job. We've worked around these
>> > problems for now, but they still pop up now and again. The fact that
>> > things often break is not an excuse for avoidable failures. The
>> > one-month timeline was chosen with all these factors in consideration.
>> > The 2-week timeline is a no-go at this point.
>> >
>> > When we are ready to make the switch, we won't be switching 100% of
>> > the job. We'll start with a sliding scale so that we can monitor
>> > failures and machine creation adequately.
>> >
>> > --
>> > nigelb



-- 
Amar Tumballi (amarts)