[Gluster-infra] Reduce regression runs wait time - New gerrit/review work flow

Wed Jun 17 15:41:56 UTC 2015

On Monday, 15 June 2015 at 16:19 +0530, Kaushal M wrote:
> Hi all,
> 
> The recent rush of reviews being sent due to the release of 3.7 was a
> cause of frustration for many of us because of the regression tests
> (gerrit troubles themselves are another thing).
> 
> W.R.T. regression, the 3 main sources of frustration were,
> 1. Spurious test failures
> 2. Long wait times
> 3. Regression slave troubles
> 
> We've already tackled the spurious failure issue and are quite stable
> now. The trouble with the slave vms is related to the gerrit issues,
> and is mainly due to the network issues we are having between the
> data-centers hosting the slaves and gerrit/jenkins. People have been
> looking into this, but we haven't had much success. This leaves the
> issue of the long wait times.
> 
> The long wait times are because of the long queues of pending jobs,
> some of which take days to get scheduled. Two things cause the long
> queues,
> 1. Automatic regression job triggering for all submissions to gerrit
> 2. Long run time for regression (~2h)
> 
> The long queues, coupled with the spurious failures and network
> problems, meant that jobs would fail for no reason after a long wait,
> and would have to be added to the back of the queue to be re-run. This
> meant that developers would have to wait days for their changes to get
> merged, and was one of the causes for the delay in the release of 3.7.
> 
> The solution is to reduce wait times for regression runs. To reduce
> wait times we should,
> 1. Trigger runs only when required
> 2. Reduce regression run time.
> 
> Raghavendra Talur (rtalur/RaSTar) will soon send out a mail with his
> findings on the regression run times, and we can continue discussion
> on it on that thread.
> 
> Earlier, the regression runs used to be manually triggered by the
> maintainers once they had determined that a change was ready for
> submission. But as there were only two maintainers before (Vijay and
> Avati) auto triggering was brought in to reduce their load. Auto
> triggering worked fine when we had a lower volume of changes being
> submitted to gerrit. But now, with the large volumes we see during the
> release freeze dates, auto triggering just adds to problems.
> 
> I propose that we move back to the old model of starting regression
> runs only once the maintainers are ready to merge. But instead of the
> maintainers manually triggering the runs, we could automate it.
> 
> We can model our new workflow on those of OpenStack[1] and
> Wikimedia[2]. The existing Gerrit plugin for Jenkins doesn't provide
> the features necessary to enable selective triggering based on Gerrit
> flags. Both OpenStack and Wikimedia use a project gating tool called
> Zuul[3], which provides a much better integration with Jenkins and
> Gerrit and more features on top.
> 
> I propose the following work flow,
> 
> - Developer pushes change to Gerrit.
>   - Zuul is notified by Gerrit of new change
> - Zuul runs pre-review checks on Jenkins. This will be the current smoke tests.
>   - Zuul reports back status of the checks to Gerrit.
>     - If checks fail, developer will need to resend the change after
>       the required fixes. The process starts once more.
>     - If the checks pass, the change is now ready for review
> - The change is now reviewed by other developers and maintainers.
>   Non-maintainers will be able to give only a +1 review.
>   - On a negative review, the developer will need to rework the change
>     and resend it. The process starts once more.
> - The maintainer gives a +2 review once he/she is satisfied. The
>   maintainer's work is done here.
>   - Zuul is notified of the +2 review
> - Zuul runs the regression runs and reports back the status.
>   - If the regression runs fail, the process starts over again.
>   - If the runs pass, the change is ready for acceptance.
> - Zuul will pick the change into the repository.
>   - If the pick fails, Zuul will report back the failure, and the
>     process starts once again.
> 
> Following this flow should,
> 1. Reduce regression wait time
> 2. Improve change acceptance time
> 3. Reduce unnecessary wastage of infra resources
> 4. Improve infra stability.
> 
> It also brings in a drawback: we need to maintain one more piece
> of infra (Zuul). This would be an additional maintenance overhead on
> top of Gerrit, Jenkins and the current slaves. But I feel the
> reduction in the upkeep efforts of the slaves would be enough to
> offset this.
> 
> tl;dr
> Current auto-triggering of regression runs is stupid and a waste of
> time and resources. Bring in a project gating system, Zuul, which can
> do much more intelligent job triggering, and use it to
> automatically trigger regression only for changes with Reviewed+2 and
> automatically merge ones that pass.
> 
> What does the community think of this?
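
As a rough sketch of the work flow proposed above, the life cycle of a
change can be written as a small state machine. This is only an
illustration in Python, not Zuul configuration or Zuul's API: the state
and event names are made up here, and a +1 review is modelled as simply
leaving the change in review.

```python
from enum import Enum


class State(Enum):
    """Hypothetical states of a change in the proposed work flow."""
    PUSHED = "pushed"                      # developer pushed to Gerrit
    READY_FOR_REVIEW = "ready_for_review"  # smoke tests passed
    APPROVED = "approved"                  # a maintainer gave +2
    READY_TO_MERGE = "ready_to_merge"      # regression runs passed
    MERGED = "merged"                      # picked into the repository
    NEEDS_REWORK = "needs_rework"          # developer must resend


# (current state, event) -> next state. Event names are assumptions.
TRANSITIONS = {
    (State.PUSHED, "smoke_passed"): State.READY_FOR_REVIEW,
    (State.PUSHED, "smoke_failed"): State.NEEDS_REWORK,
    (State.READY_FOR_REVIEW, "plus_one"): State.READY_FOR_REVIEW,
    (State.READY_FOR_REVIEW, "negative_review"): State.NEEDS_REWORK,
    (State.READY_FOR_REVIEW, "maintainer_plus_two"): State.APPROVED,
    (State.APPROVED, "regression_passed"): State.READY_TO_MERGE,
    (State.APPROVED, "regression_failed"): State.NEEDS_REWORK,
    (State.READY_TO_MERGE, "pick_succeeded"): State.MERGED,
    (State.READY_TO_MERGE, "pick_failed"): State.NEEDS_REWORK,
    (State.NEEDS_REWORK, "resent"): State.PUSHED,
}


def next_state(state, event):
    """Return the state reached from `state` on `event`."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state.name}"
        ) from None


def run(events, state=State.PUSHED):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state


def regression_runs(events):
    """Count how many regression runs a sequence of events used."""
    return sum(
        event in ("regression_passed", "regression_failed")
        for event in events
    )
```

For example, run(["smoke_passed", "maintainer_plus_two",
"regression_passed", "pick_succeeded"]) ends in State.MERGED, and
regression_runs() on the same list counts a single run, since the
regression job can only start after the +2.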

Zuul is being packaged for Fedora/EPEL, so it would greatly help to have
it packaged rather than a non-sustainable self-installation like we had
in the past.
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
