[Gluster-devel] Reduce regression runs wait time - New gerrit/review work flow

Thu Aug 20 16:16:06 UTC 2015

Infra team,

Given that the regression infra is always overloaded and spurious failures
keep haunting us, can we try this out? We should also try to have a
separate smoke-verified flag, to give reviewers enough confidence, as Jeff
pointed out in this thread.

-Atin
On Jun 15, 2015 4:19 PM, "Kaushal M" <kshlmster at gmail.com> wrote:

> Hi all,
>
> The recent rush of reviews sent because of the 3.7 release was a
> cause of frustration for many of us because of the regression tests
> (the Gerrit troubles themselves are another matter).
>
> With respect to regression, the three main sources of frustration were:
> 1. Spurious test failures
> 2. Long wait times
> 3. Regression slave troubles
>
> We've already tackled the spurious failure issue and are quite stable
> now. The trouble with the slave VMs is related to the Gerrit issues,
> and is mainly due to the network issues we are having between the
> data centers hosting the slaves and Gerrit/Jenkins. People have been
> looking into this, but we haven't had much success. This leaves the
> issue of the long wait times.
>
> The long wait times are caused by the long queues of pending jobs,
> some of which take days to get scheduled. Two things cause the long
> queues:
> 1. Automatic regression job triggering for all submissions to Gerrit
> 2. Long run time for regression (~2h)
>
> The long queues, coupled with the spurious failures and network
> problems, meant that jobs would fail for no reason after a long wait
> and would have to be added to the back of the queue to be re-run. This
> meant that developers would have to wait days for their changes to get
> merged, and was one of the causes of the delay in the release of 3.7.
>
> The solution is to reduce wait times for regression runs. To reduce
> wait times we should:
> 1. Trigger runs only when required
> 2. Reduce regression run time
>
> Raghavendra Talur (rtalur/RaSTar) will soon send out a mail with his
> findings on the regression run times, and we can continue discussion
> on it on that thread.
>
> Earlier, the regression runs used to be manually triggered by the
> maintainers once they had determined that a change was ready for
> submission. But as there were only two maintainers before (Vijay and
> Avati), auto triggering was brought in to reduce their load. Auto
> triggering worked fine when we had a lower volume of changes being
> submitted to Gerrit. But now, with the large volumes we see around the
> release freeze dates, auto triggering just adds to the problems.
>
> I propose that we move back to the old model of starting regression
> runs only once the maintainers are ready to merge. But instead of the
> maintainers manually triggering the runs, we could automate it.
>
> We can model our new workflow on those of OpenStack[1] and
> Wikimedia[2]. The existing Gerrit plugin for Jenkins doesn't provide
> the features necessary to enable selective triggering based on Gerrit
> flags. Both OpenStack and Wikimedia use a project gating tool called
> Zuul[3], which provides a much better integration with Jenkins and
> Gerrit and more features on top.
>
> I propose the following work flow:
>
> - Developer pushes change to Gerrit.
>   - Zuul is notified by Gerrit of the new change.
> - Zuul runs pre-review checks on Jenkins. These will be the current
>   smoke tests.
>   - Zuul reports the status of the checks back to Gerrit.
>     - If the checks fail, the developer will need to resend the change
>       after the required fixes. The process starts once more.
>     - If the checks pass, the change is now ready for review.
> - The change is now reviewed by other developers and maintainers.
>   Non-maintainers will be able to give only a +1 review.
>   - On a negative review, the developer will need to rework the change
>     and resend it. The process starts once more.
> - The maintainer gives a +2 review once he/she is satisfied. The
>   maintainer's work is done here.
>   - Zuul is notified of the +2 review.
> - Zuul runs the regression runs and reports back the status.
>   - If the regression runs fail, the process starts over again.
>   - If the runs pass, the change is ready for acceptance.
> - Zuul will pick the change into the repository.
>   - If the pick fails, Zuul will report back the failure, and the
>     process starts once again.
>
> Following this flow should:
> 1. Reduce regression wait time
> 2. Improve change acceptance time
> 3. Reduce unnecessary wastage of infra resources
> 4. Improve infra stability
>
> It also brings a drawback: we need to maintain one more piece of
> infra (Zuul). This would be an additional maintenance overhead on
> top of Gerrit, Jenkins and the current slaves. But I feel the
> reduction in the upkeep effort for the slaves would be enough to
> offset this.
>
> tl;dr
> Current auto-triggering of regression runs is stupid and a waste of
> time and resources. Bring in a project gating system, Zuul, which can
> do much more intelligent job triggering, and use it to
> automatically trigger regression only for changes with Reviewed+2 and
> automatically merge the ones that pass.
>
> What does the community think of this?
>
> ~kaushal
>
> [1]:
> http://docs.openstack.org/infra/manual/developers.html#automated-testing
> [2]: https://www.mediawiki.org/wiki/Continuous_integration/Workflow
> [3]: http://docs.openstack.org/infra/zuul/
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
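
For illustration only, the gating work flow proposed in the quoted mail can
be sketched as a small decision function. This is not Zuul configuration or
Zuul's API. The names Change, Outcome and gate are hypothetical, and the
boolean fields simply stand in for the results that Jenkins and Gerrit would
report. The point it shows is that the regression job is scheduled only after
the smoke checks pass and a +2 review is present.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Tuple


class Outcome(Enum):
    NEEDS_REWORK = auto()     # developer must fix and resend the change
    AWAITING_REVIEW = auto()  # smoke passed, no +2 review yet
    MERGED = auto()           # change was picked into the repository


@dataclass
class Change:
    """Hypothetical stand-in for a Gerrit change and its job results."""
    smoke_passed: bool = True
    review_votes: List[int] = field(default_factory=list)  # e.g. [+1, +2]
    regression_passed: bool = True
    pick_succeeded: bool = True


def gate(change: Change) -> Tuple[Outcome, List[str]]:
    """Walk one change through the proposed flow.

    Returns the outcome and the list of jobs that were actually run, so
    it is easy to see when the expensive regression job is skipped.
    """
    jobs_run = ["smoke"]  # pre-review checks run on every push
    if not change.smoke_passed:
        return Outcome.NEEDS_REWORK, jobs_run

    # Any negative review sends the change back to the developer.
    if any(vote < 0 for vote in change.review_votes):
        return Outcome.NEEDS_REWORK, jobs_run

    # Without a maintainer's +2, nothing more is scheduled.
    if 2 not in change.review_votes:
        return Outcome.AWAITING_REVIEW, jobs_run

    jobs_run.append("regression")  # only triggered by the +2 review
    if not change.regression_passed:
        return Outcome.NEEDS_REWORK, jobs_run

    if not change.pick_succeeded:
        return Outcome.NEEDS_REWORK, jobs_run
    return Outcome.MERGED, jobs_run


if __name__ == "__main__":
    print(gate(Change(review_votes=[1])))
    print(gate(Change(review_votes=[1, 2])))
```

With auto triggering, every push would add a ~2h regression job to the queue.
In this sketch a change that has only +1 reviews, or that fails smoke, never
reaches that job.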