[Gluster-devel] Reduce regression runs wait time - New gerrit/review work flow

Mon Jun 15 13:22:25 UTC 2015


On 06/15/2015 04:19 PM, Kaushal M wrote:
> Hi all,
> 
> The recent rush of reviews being sent due to the release of 3.7 was
> a cause of frustration for many of us because of the regression
> tests (gerrit troubles themselves are another thing).
> 
> With respect to regression, the 3 main sources of frustration
> were:
> 1. Spurious test failures
> 2. Long wait times
> 3. Regression slave troubles
> 
> We've already tackled the spurious failure issue and are quite
> stable now. The trouble with the slave vms is related to the gerrit
> issues, and is mainly due to the network issues we are having
> between the data-centers hosting the slaves and gerrit/jenkins.
> People have been looking into this, but we haven't had much
> success. This leaves the issue of the long wait times.
> 
> The long wait times are because of the long queues of pending
> jobs, some of which take days to get scheduled. Two things cause
> the long queues:
> 1. Automatic regression job triggering for all submissions to
>    gerrit
> 2. Long run time for regression (~2h)
> 
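
To put rough numbers on the two causes above, here is a minimal
back-of-the-envelope sketch. It assumes every run takes the ~2h
mentioned above and that slaves pick jobs strictly first-in,
first-out. The function name, the slave count and the queue depths
are hypothetical, not measurements from our infra.

```python
def estimated_wait_hours(jobs_ahead, slaves, run_hours=2.0):
    """Rough wait before a newly queued job starts.

    Assumption: all jobs take run_hours and start in FIFO "waves" of
    `slaves` jobs at a time, so a new job waits for every full wave
    that is ahead of it.
    """
    if slaves <= 0:
        raise ValueError("need at least one slave")
    full_waves_ahead = jobs_ahead // slaves
    return full_waves_ahead * run_hours
```

With a made-up backlog of 60 jobs and 5 slaves this gives 24 hours
before a new job even starts. If only 10 of those jobs had needed a
regression run, the same call gives 4 hours, which is why cutting
down on triggers matters as much as cutting down the run time.
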
> The long queues, coupled with the spurious failures and network
> problems, meant that jobs would fail for no reason after a long
> wait, and would have to be added to the back of the queue to be
> re-run. This meant that developers would have to wait days for
> their changes to get merged, and was one of the causes for the
> delay in the release of 3.7.
> 
> The solution is to reduce wait times for regression runs. To
> reduce wait times we should:
> 1. Trigger runs only when required
> 2. Reduce regression run time
> 
> Raghavendra Talur (rtalur/RaSTar) will soon send out a mail with
> his findings on the regression run times, and we can continue
> discussion on it on that thread.
> 
> Earlier, the regression runs used to be manually triggered by the 
> maintainers once they had determined that a change was ready for 
> submission. But as there were only two maintainers before (Vijay
> and Avati) auto triggering was brought in to reduce their load.
> Auto triggering worked fine when we had a lower volume of changes
> being submitted to gerrit. But now, with the large volumes we see
> during the release freeze dates, auto triggering just adds to
> problems.
> 
> I propose that we move back to the old model of starting
> regression runs only once the maintainers are ready to merge. But
> instead of the maintainers manually triggering the runs, we could
> automate it.
> 
> We can model our new workflow on those of OpenStack[1] and 
> Wikimedia[2]. The existing Gerrit plugin for Jenkins doesn't
> provide the features necessary to enable selective triggering based
> on Gerrit flags. Both OpenStack and Wikimedia use a project gating
> tool called Zuul[3], which provides a much better integration with
> Jenkins and Gerrit and more features on top.
> 
> I propose the following work flow,
> 
> - Developer pushes change to Gerrit.
> - Zuul is notified by Gerrit of new change.
> - Zuul runs pre-review checks on Jenkins. This will be the current
>   smoke tests.
> - Zuul reports back status of the checks to Gerrit.
> - If checks fail, developer will need to resend the change after
>   the required fixes. The process starts once more.
> - If the checks pass, the change is now ready for review.
> - The change is now reviewed by other developers and maintainers.
>   Non-maintainers will be able to give only a +1 review.
> - On a negative review, the developer will need to rework the
>   change and resend it. The process starts once more.
> - The maintainer gives a +2 review once he/she is satisfied. The
>   maintainer's work is done here.
> - Zuul is notified of the +2 review.
> - Zuul runs the regression runs and reports back the status.
> - If the regression runs fail, the process starts over again.
> - If the runs pass, the change is ready for acceptance.
> - Zuul will pick the change into the repository.
> - If the pick fails, Zuul will report back the failure, and the
>   process starts once again.
> 

+1, Good approach.
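
To check that I read the flow correctly, here is a minimal sketch
of the gating decision in Python. It is only an illustration of the
steps listed above, not Zuul's actual API or configuration: the
function name, the callables and the returned strings are all
hypothetical.

```python
from typing import Callable, Sequence


def gate_change(
    smoke_ok: bool,
    review_scores: Sequence[int],
    run_regression: Callable[[], bool],
    merge: Callable[[], bool],
) -> str:
    """Walk one change through the proposed flow and report where it ends up.

    run_regression and merge are hypothetical stand-ins for the Jenkins
    regression job and the pick into the repository.
    """
    # Pre-review checks (the current smoke tests) come first.
    if not smoke_ok:
        return "rework: smoke tests failed"
    # Any negative review sends the change back to the developer.
    if any(score < 0 for score in review_scores):
        return "rework: negative review"
    # Regression is NOT started until a maintainer has given +2.
    if 2 not in review_scores:
        return "waiting: no +2 review yet"
    if not run_regression():
        return "rework: regression failed"
    if not merge():
        return "rework: pick failed"
    return "merged"
```

The point of the ordering is that run_regression is never called
for a change that has only +1 reviews, so the expensive ~2h run is
spent only on changes a maintainer already wants to merge.
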

> Following this flow should:
> 1. Reduce regression wait time
> 2. Improve change acceptance time
> 3. Reduce unnecessary wastage of infra resources
> 4. Improve infra stability
> 
> It also brings a drawback: we need to maintain one more piece of
> infra (Zuul). This would be an additional maintenance
> overhead on top of Gerrit, Jenkins and the current slaves. But I
> feel the reduction in the upkeep efforts of the slaves would be
> enough to offset this.
> 
> tl;dr Current auto-triggering of regression runs is stupid and a
> waste of time and resources. Bring in a project gating system,
> Zuul, which can do much more intelligent job triggering, and use
> it to automatically trigger regression only for changes with
> Reviewed+2 and automatically merge ones that pass.
> 
> What does the community think of this?
> 
> ~kaushal
> 
> [1]: http://docs.openstack.org/infra/manual/developers.html#automated-testing
> [2]: https://www.mediawiki.org/wiki/Continuous_integration/Workflow
> [3]: http://docs.openstack.org/infra/zuul/