[Gluster-infra] [Gluster-devel] Reduce regression runs wait time - New gerrit/review work flow

Kaushal M kshlmster at gmail.com
Thu Jun 18 06:44:40 UTC 2015

On 16-Jun-2015 19:22, "Aravinda" <avishwan at redhat.com> wrote:

> +1 for running regressions on need basis.
> Also, we need to make the tests more intelligent, so that the same test
> runs differently when triggered nightly versus on a regular run.
> function test_some_functionality
> {
>     if [ "${NIGHTLY:-0}" = "1" ]; then
>         # nightly-only: time-consuming tests with more data
>         run_nightly_tests
>     fi
>     # Regular basic tests
>     run_basic_tests
> }
> TEST test_some_functionality
> This way we can maintain all tests in a single place, and set an
> env/config variable such as NIGHTLY whenever we want the nightly tests to
> run.
I've discussed this idea with different people before. The major
concern was how to identify a minimal set of basic tests that would
do a good job of catching most regressions. Considering that the
regression suite will keep growing and take longer to complete in the
future, nightly full regression runs would be nice to have.
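
Aravinda's sketch could look like this as a runnable bash fragment (the
NIGHTLY variable and function names come from the sketch above; the echo
bodies are placeholders for the real checks):

```shell
#!/bin/bash
# One test entry point whose behaviour is switched by an env flag.
# The function bodies are placeholders for the real checks.

run_basic_tests()   { echo "basic"; }    # quick per-change checks
run_nightly_tests() { echo "nightly"; }  # slow, data-heavy checks

test_some_functionality() {
    if [ "${NIGHTLY:-0}" = "1" ]; then
        run_nightly_tests
    fi
    run_basic_tests
}

test_some_functionality            # regular run: basic tests only
NIGHTLY=1 test_some_functionality  # nightly run: nightly + basic
```

A nightly Jenkins job would then differ from the regular one only in
exporting NIGHTLY=1 before invoking the test suite.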

> regards
> Aravinda
> On 06/15/2015 06:49 AM, Kaushal M wrote:
>> Hi all,
>> The recent rush of reviews sent for the release of 3.7 was a
>> cause of frustration for many of us because of the regression tests
>> (the gerrit troubles themselves are another matter).
>> W.r.t. regression, the 3 main sources of frustration were:
>> 1. Spurious test failures
>> 2. Long wait times
>> 3. Regression slave troubles
>> We've already tackled the spurious failure issue and are quite stable
>> now. The trouble with the slave VMs is related to the gerrit issues,
>> and is mainly due to the network problems we are having between the
>> data-centers hosting the slaves and gerrit/jenkins. People have been
>> looking into this, but we haven't had much success. This leaves the
>> issue of the long wait times.
>> The long wait times are because of the long queues of pending jobs,
>> some of which take days to get scheduled. Two things cause the long
>> queues:
>> 1. Automatic regression job triggering for all submissions to gerrit
>> 2. Long run time for regression (~2h)
>> The long queues coupled with the spurious failure and network
>> problems, meant that jobs would fail for no reason after a long wait,
>> and would have to be added to the back of the queue to be re-run. This
>> meant that developers would have to wait days for their changes to get
>> merged, and was one of the causes for the delay in the release of 3.7.
>> The solution is to reduce the wait times for regression runs. To
>> reduce wait times we should:
>> 1. Trigger runs only when required
>> 2. Reduce regression run time
>> Raghavendra Talur (rtalur/RaSTar) will soon send out a mail with his
>> findings on the regression run times, and we can continue discussion
>> on it on that thread.
>> Earlier, the regression runs used to be manually triggered by the
>> maintainers once they had determined that a change was ready for
>> submission. But as there were only two maintainers before (Vijay and
>> Avati) auto triggering was brought in to reduce their load. Auto
>> triggering worked fine when we had a lower volume of changes being
>> submitted to gerrit. But now, with the large volumes we see during the
>> release freeze dates, auto triggering just adds to problems.
>> I propose that we move back to the old model of starting regression
>> runs only once the maintainers are ready to merge. But instead of the
>> maintainers manually triggering the runs, we could automate it.
>> We can model our new workflow on those of OpenStack[1] and
>> Wikimedia[2]. The existing Gerrit plugin for Jenkins doesn't provide
>> the features necessary to enable selective triggering based on Gerrit
>> flags. Both OpenStack and Wikimedia use a project gating tool called
>> Zuul[3], which provides a much better integration with Jenkins and
>> Gerrit and more features on top.
>> I propose the following work flow,
>> - Developer pushes change to Gerrit.
>>    - Zuul is notified by Gerrit of new change
>> - Zuul runs pre-review checks on Jenkins. These will be the current
>> smoke tests.
>>    - Zuul reports back status of the checks to Gerrit.
>>      - If checks fail, developer will need to resend the change after
>> the required fixes. The process starts once more.
>>      - If the checks pass, the change is now ready for review
>> - The change is now reviewed by other developers and maintainers.
>> Non-maintainers will be able to give only a +1 review.
>>    - On a negative review, the developer will need to rework the change
>> and resend it. The process starts once more.
>> - The maintainer gives a +2 review once he/she is satisfied. The
>> maintainer's work is done here.
>>    - Zuul is notified of the +2 review
>> - Zuul runs the regression runs and reports back the status.
>>    - If the regression runs fail, the process starts over again.
>>    - If the runs pass, the change is ready for acceptance.
>> - Zuul will pick the change into the repository.
>>    - If the pick fails, Zuul will report back the failure, and the
>> process starts once again.
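>> The flow above maps fairly directly onto a Zuul pipeline. A rough
>> sketch in the Zuul v1 layout.yaml format that OpenStack uses (the
>> pipeline and job names here are illustrative assumptions, not a
>> tested configuration):
>>
>> ```yaml
>> pipelines:
>>   - name: gate
>>     manager: DependentPipelineManager
>>     trigger:
>>       gerrit:
>>         # start gating when a change receives a +2 code review
>>         - event: comment-added
>>           approval:
>>             - code-review: 2
>>     success:
>>       gerrit:
>>         verified: 2
>>         submit: true   # merge the change automatically on success
>>     failure:
>>       gerrit:
>>         verified: -2
>>
>> projects:
>>   - name: glusterfs
>>     gate:
>>       - glusterfs-regression
>> ```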
>> Following this flow should:
>> 1. Reduce regression wait time
>> 2. Improve change acceptance time
>> 3. Reduce unnecessary wastage of infra resources
>> 4. Improve infra stability
>> It also brings in the drawback that we would need to maintain one
>> more piece of infra (Zuul). This would be an additional maintenance
>> overhead on top of Gerrit, Jenkins and the current slaves. But I feel
>> the reduction in the upkeep effort for the slaves would be enough to
>> offset this.
>> tl;dr
>> The current auto-triggering of regression runs is stupid and a waste
>> of time and resources. Bring in a project gating system, Zuul, which
>> can do much more intelligent job triggering; use it to automatically
>> trigger regression only for changes with Reviewed+2 and automatically
>> merge the ones that pass.
>> What does the community think of this?
>> ~kaushal
>> [1]:
>> http://docs.openstack.org/infra/manual/developers.html#automated-testing
>> [2]: https://www.mediawiki.org/wiki/Continuous_integration/Workflow
>> [3]: http://docs.openstack.org/infra/zuul/
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
