[Gluster-infra] [Gluster-devel] Reduce regression runs wait time - New gerrit/review work flow

Aravinda avishwan at redhat.com
Tue Jun 16 13:52:25 UTC 2015


+1 for running regressions on need basis.

Also, we need to make the tests more intelligent so that the same tests 
run differently when triggered nightly versus on regular runs.

function test_some_functionality
{
     if [ "$NIGHTLY" = "yes" ]; then
         # Nightly-only tests: time consuming, run with more data
         run_nightly_tests
     fi
     # Regular basic tests, run on every trigger
     run_basic_tests
}

TEST test_some_functionality;

This way we can maintain all the tests in a single place. Set an 
env/config variable such as NIGHTLY whenever we need to run the nightly 
tests.
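
For example, a nightly Jenkins job could simply export the variable 
before invoking the test runner. A minimal sketch, assuming the 
run-tests.sh wrapper passes its environment through to the individual 
.t tests:

    # Hypothetical nightly job step: enables the extra, time-consuming tests
    NIGHTLY=yes ./run-tests.sh

    # Regular per-patch regression run: nightly-only code paths are skipped
    ./run-tests.sh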

--
regards
Aravinda



On 06/15/2015 06:49 AM, Kaushal M wrote:
> Hi all,
>
> The recent rush of reviews being sent due to the release of 3.7 was a
> cause of frustration for many of us because of the regression tests
> (gerrit troubles themselves are another thing).
>
> W.r.t. regression, the 3 main sources of frustration were,
> 1. Spurious test failures
> 2. Long wait times
> 3. Regression slave troubles
>
> We've already tackled the spurious failure issue and are quite stable
> now. The trouble with the slave vms is related to the gerrit issues,
> and is mainly due to the network issues we are having between the
> data-centers hosting the slaves and gerrit/jenkins. People have been
> looking into this, but we haven't had much success. This leaves the
> issue of the long wait times.
>
> The long wait times are because of the long queues of pending jobs,
> some of which take days to get scheduled. Two things cause the long
> queues,
> 1. Automatic regression job triggering for all submissions to gerrit
> 2. Long run time for regression (~2h)
>
> The long queues coupled with the spurious failure and network
> problems, meant that jobs would fail for no reason after a long wait,
> and would have to be added to the back of the queue to be re-run. This
> meant that developers would have to wait days for their changes to get
> merged, and was one of the causes for the delay in the release of 3.7.
>
> The solution is to reduce the wait times for regression runs. To do
> that we should,
> 1. Trigger runs only when required
> 2. Reduce regression run time.
>
> Raghavendra Talur (rtalur/RaSTar) will soon send out a mail with his
> findings on the regression run times, and we can continue discussion
> on it on that thread.
>
> Earlier, the regression runs used to be manually triggered by the
> maintainers once they had determined that a change was ready for
> submission. But as there were only two maintainers before (Vijay and
> Avati), auto-triggering was brought in to reduce their load.
> Auto-triggering worked fine when we had a lower volume of changes being
> submitted to gerrit. But now, with the large volumes we see during the
> release freeze dates, auto triggering just adds to problems.
>
> I propose that we move back to the old model of starting regression
> runs only once the maintainers are ready to merge. But instead of the
> maintainers manually triggering the runs, we could automate it.
>
> We can model our new workflow on those of OpenStack[1] and
> Wikimedia[2]. The existing Gerrit plugin for Jenkins doesn't provide
> the features necessary to enable selective triggering based on Gerrit
> flags. Both OpenStack and Wikimedia use a project gating tool called
> Zuul[3], which provides a much better integration with Jenkins and
> Gerrit and more features on top.
>
> I propose the following work flow (a rough configuration sketch
> follows the list),
>
> - Developer pushes change to Gerrit.
>    - Zuul is notified by Gerrit of new change
> - Zuul runs pre-review checks on Jenkins. These will be the current smoke tests.
>    - Zuul reports back status of the checks to Gerrit.
>      - If checks fail, developer will need to resend the change after
> the required fixes. The process starts once more.
>      - If the checks pass, the change is now ready for review
> - The change is now reviewed by other developers and maintainers.
> Non-maintainers will be able to give only a +1 review.
>    - On a negative review, the developer will need to rework the change
> and resend it. The process starts once more.
> - The maintainer gives a +2 review once he/she is satisfied. The
> maintainer's work is done here.
>    - Zuul is notified of the +2 review
> - Zuul runs the regression runs and reports back the status.
>    - If the regression runs fail, the process starts over again.
>    - If the runs pass, the change is ready for acceptance.
> - Zuul will pick the change into the repository.
>    - If the pick fails, Zuul will report back the failure, and the
> process starts once again.
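>
> Roughly, the Zuul configuration for such a flow could look like the
> sketch below (based on the Zuul v2 layout.yaml format used by
> OpenStack; the pipeline and job names are only illustrative, not a
> tested configuration):
>
>     pipelines:
>       - name: check
>         manager: IndependentPipelineManager
>         trigger:
>           gerrit:
>             - event: patchset-created
>         success:
>           gerrit:
>             verified: 1
>         failure:
>           gerrit:
>             verified: -1
>
>       - name: gate
>         manager: DependentPipelineManager
>         trigger:
>           gerrit:
>             - event: comment-added
>               approval:
>                 - code-review: 2
>         success:
>           gerrit:
>             verified: 2
>             submit: true
>         failure:
>           gerrit:
>             verified: -2
>
>     projects:
>       - name: glusterfs
>         check:
>           - glusterfs-smoke       # current smoke job
>         gate:
>           - glusterfs-regression  # full regression run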
>
> Following this flow should,
> 1. Reduce regression wait time
> 2. Improve change acceptance time
> 3. Reduce unnecessary wastage of infra resources
> 4. Improve infra stability.
>
> It also brings in a drawback: we need to maintain one more piece of
> infra (Zuul). This would be additional maintenance overhead on top of
> Gerrit, Jenkins and the current slaves. But I feel the reduction in
> the upkeep effort for the slaves would be enough to offset this.
>
> tl;dr
> Current auto-triggering of regression runs is stupid and a waste of
> time and resources. Bring in a project gating system, Zuul, which can
> do much more intelligent job triggering, and use it to
> automatically trigger regression only for changes with Reviewed+2 and
> automatically merge ones that pass.
>
> What does the community think of this?
>
> ~kaushal
>
> [1]: http://docs.openstack.org/infra/manual/developers.html#automated-testing
> [2]: https://www.mediawiki.org/wiki/Continuous_integration/Workflow
> [3]: http://docs.openstack.org/infra/zuul/
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel


