[Gluster-devel] POC- Distributed regression testing framework

Sanju Rakonde srakonde at redhat.com
Thu Oct 4 17:19:40 UTC 2018


Deepshikha,

I see that the tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t
test failed in today's run #273, but I couldn't get the logs from
https://ci-logs.gluster.org/distributed-regression-logs-273.tgz . I get a 404
Not Found error with the message "The requested URL
/distributed-regression-logs-273.tgz was not found on this server."
Please help me get the logs.

On Thu, Oct 4, 2018 at 10:31 PM Atin Mukherjee <amukherj at redhat.com> wrote:

> Deepshikha,
>
> Please keep us posted if you see the particular glusterd test failing
> again. It'll be great to see this nightly job green sooner rather than
> later :-).
>
> On Thu, 4 Oct 2018 at 15:07, Deepshikha Khandelwal <dkhandel at redhat.com>
> wrote:
>
>> On Thu, Oct 4, 2018 at 6:10 AM Sanju Rakonde <srakonde at redhat.com> wrote:
>> >
>> >
>> >
>> > On Wed, Oct 3, 2018 at 3:26 PM Deepshikha Khandelwal <dkhandel at redhat.com> wrote:
>> >>
>> >> Hello folks,
>> >>
>> >> The distributed-regression job[1] is now part of Gluster's
>> >> nightly-master build pipeline. The following issues have been resolved
>> >> since we started working on this:
>> >>
>> >> 1) Gluster logs are now collected from the servers.
>> >> 2) Tests that failed due to infra-related issues have been fixed.
>> >> 3) The time taken to run regression testing has been reduced to ~50-60 minutes.
>> >>
>> >> Getting the time down to 40 minutes needs your help!
>> >>
>> >> Currently, there is a test that is failing:
>> >>
>> >> tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t
>> >>
>> >> This needs fixing first.
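>> >>
>> >> If you want to reproduce the failure outside the CI first, a minimal
>> >> sketch (assuming a built glusterfs source tree, root access, and that
>> >> run-tests.sh accepts an individual .t path) is:
>> >>
>> >>     # From the root of the source checkout, run only the failing test;
>> >>     # its exit status plus /var/log/glusterfs should show where it breaks.
>> >>     cd /path/to/glusterfs
>> >>     ./run-tests.sh tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t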
>> >
>> >
>> > Where can I get the logs of this test case? In
>> > https://build.gluster.org/job/distributed-regression/264/console I see
>> > that this test case failed and was re-attempted, but I couldn't find the
>> > logs.
>>
>> There's a link at the end of the console output where you can look for
>> the logs of failed tests.
>> We had a bug in the setup and the logs were not getting saved. We've
>> fixed this, and future jobs should show the logs at the log collector's
>> link in the console output.
>>
>> >>
>> >>
>> >> There's a test that takes 14 minutes to complete -
>> >> `tests/bugs/index/bug-1559004-EMLINK-handling.t`. A single test taking
>> >> 14 minutes is not something we can distribute. Can we look at how we
>> >> can speed this up[2]? When this test fails, it is re-attempted,
>> >> further increasing the time. This happens in the regular
>> >> centos7-regression job as well.
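>> >>
>> >> To see where those 14 minutes go, one rough approach (a sketch, not the
>> >> official tooling; it assumes the .t is plain bash, run as root from the
>> >> source root on a machine with glusterfs installed) is to run it under
>> >> xtrace with timestamps:
>> >>
>> >>     # Prefix every traced command with a wall-clock timestamp so the
>> >>     # slow steps stand out in the trace.
>> >>     PS4='+ $(date "+%H:%M:%S") ' bash -x \
>> >>         tests/bugs/index/bug-1559004-EMLINK-handling.t 2>&1 \
>> >>         | tee /tmp/emlink-trace.log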
>> >>
>> >> If you see any other issues, please file a bug[3].
>> >>
>> >> [1]: https://build.gluster.org/job/distributed-regression
>> >> [2]: https://build.gluster.org/job/distributed-regression/264/console
>> >> [3]: https://bugzilla.redhat.com/enter_bug.cgi?product=glusterfs&component=project-infrastructure
>> >>
>> >> Thanks,
>> >> Deepshikha Khandelwal
>> >> On Tue, Jun 26, 2018 at 9:02 AM Nigel Babu <nigelb at redhat.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Mon, Jun 25, 2018 at 7:28 PM Amar Tumballi <atumball at redhat.com> wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >>> There are currently a few known issues:
>> >> >>> * Not collecting the full logs (/var/log/glusterfs) from the servers.
>> >> >>
>> >> >>
>> >> >> If I look at the activities involved with regression failures, this
>> >> >> can wait.
>> >> >
>> >> >
>> >> > Well, we can't debug the current failures without having the logs.
>> >> > So this has to be fixed first.
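>> >> >
>> >> > The collection step itself doesn't need to be fancy; a rough sketch of
>> >> > what we need (the hostnames below are placeholders, not the real
>> >> > machine list) is:
>> >> >
>> >> >     # Tar up /var/log/glusterfs on each test server and pull it back
>> >> >     # to the node that archives the run's artifacts.
>> >> >     for host in server1.example.com server2.example.com; do
>> >> >         ssh "root@${host}" 'tar czf - /var/log/glusterfs' \
>> >> >             > "logs-${host}.tar.gz"
>> >> >     done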
>> >> >
>> >> >>
>> >> >>
>> >> >>>
>> >> >>> * A few tests, like the geo-rep tests, fail due to infra-related issues.
>> >> >>
>> >> >>
>> >> >> Please open bugs for these so we can track them and take them to
>> >> >> closure.
>> >> >
>> >> >
>> >> > These are failing due to infra reasons, most likely subtle differences
>> >> > in the setup of these nodes vs. our normal nodes. We'll only be able to
>> >> > debug them once we get the logs. I know the geo-rep ones are easy to
>> >> > fix: the playbook for setting up geo-rep correctly just didn't make it
>> >> > over to the playbook used for these images.
>> >> >
>> >> >>
>> >> >>
>> >> >>>
>> >> >>> * Takes ~80 minutes with 7 distributed servers (targeting 60 minutes)
>> >> >>
>> >> >>
>> >> >> The time can change as more tests are added; also, please plan for the
>> >> >> number of servers to be configurable from 1 to n.
>> >> >
>> >> >
>> >> > While n is configurable, it will be fixed to a single-digit number for
>> >> > now. We need to place *some* limit somewhere, or else we won't be able
>> >> > to control our cloud bills.
>> >> >
>> >> >>
>> >> >>
>> >> >>>
>> >> >>> * We've only tested plain regressions. ASAN and Valgrind are
>> >> >>>   currently untested.
>> >> >>
>> >> >>
>> >> >> It would be great to have it running not 'per patch', but nightly, or
>> >> >> weekly to start with.
>> >> >
>> >> >
>> >> > This is currently not targeted until we phase out the current
>> >> > regressions.
>> >> >
>> >> >>>
>> >> >>>
>> >> >>> Before bringing it into production, we'll run this job nightly and
>> >> >>> watch it for a month to debug the other failures.
>> >> >>>
>> >> >>
>> >> >> I would say bring it to production sooner, say in 2 weeks, and also
>> >> >> plan to keep the current regression as-is behind a special command like
>> >> >> 'run regression in-one-machine' in Gerrit (or something similar) with
>> >> >> voting rights, so we can fall back to that method if something is
>> >> >> broken in parallel testing.
>> >> >>
>> >> >> I have seen that regardless of how much time we spend testing scripts,
>> >> >> the day we move to production something will break. So let that happen
>> >> >> earlier rather than later, so it helps with branching out the next
>> >> >> release. We don't want branching to be stuck due to infra failures.
>> >> >
>> >> >
>> >> > Having two regression jobs that can vote is going to cause more
>> >> > confusion than it's worth. There are a couple of intermittent memory
>> >> > issues with the test script that we need to debug and fix before I'm
>> >> > comfortable making this job a voting job. We've worked around these
>> >> > problems for now, but they still pop up now and again. The fact that
>> >> > things break often is not an excuse for not preventing avoidable
>> >> > failures. The one-month timeline was chosen with all these factors taken
>> >> > into consideration. The 2-week timeline is a no-go at this point.
>> >> >
>> >> > When we are ready to make the switch, we won't be switching 100% of the
>> >> > job. We'll start with a sliding scale so that we can monitor failures
>> >> > and machine creation adequately.
>> >> >
>> >> > --
>> >> > nigelb
>> >> _______________________________________________
>> >> Gluster-devel mailing list
>> >> Gluster-devel at gluster.org
>> >> https://lists.gluster.org/mailman/listinfo/gluster-devel
>> >
>> >
>> >
>> > --
>> > Thanks,
>> > Sanju
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
> --
> - Atin (atinm)
>

-- 
Thanks,
Sanju