[Gluster-infra] Reduce regression runs wait time - New gerrit/review work flow

Wed Jun 17 20:20:34 UTC 2015

On Mon, Jun 15, 2015 at 04:19:14PM +0530, Kaushal M wrote:
> Hi all,
...
> I propose that we move back to the old model of starting regression
> runs only once the maintainers are ready to merge. But instead of the
> maintainers manually triggering the runs, we could automate it.

I think auto-triggering regression tests is good. We should ask the
developers to run regression tests themselves before posting complex
changes. If the parallelisation of regression tests is done, the wait
time should go down too.

As a maintainer that spends quite some time reviewing patches, I prefer
to see a +1 verified before I start to review something. With that, I at
least have some confidence that there are no obvious mistakes I need to
point out. If developers have to wait on me before regression testing
gets started, I feel more like a roadblock than a help to them. There
really are *many* patches that get a FAILED result because there is a
real problem in the code. Developers should get a response about that as
soon as possible, and waiting for a maintainer to start the regression
tests does not help.

I also had to ask maintainers to trigger regression tests for my first
patches, and it was not a nice experience. Anything we can do to improve
the experience for (new) developers should be done; delaying (automated)
feedback isn't a step in the right direction.

> We can model our new workflow on those of OpenStack[1] and
> Wikimedia[2]. The existing Gerrit plugin for Jenkins doesn't provide
> the features necessary to enable selective triggering based on Gerrit
> flags. Both OpenStack and Wikimedia use a project gating tool called
> Zuul[3], which provides a much better integration with Jenkins and
> Gerrit and more features on top.

More intelligent triggering would be helpful. Unfortunately we have a
stack of xlators and it is difficult to say if there are unintended
side-effects in different, untouched pieces of the code.

> I propose the following work flow,
> 
> - Developer pushes change to Gerrit.
>   - Zuul is notified by Gerrit of new change
> - Zuul runs pre-review checks on Jenkins. This will be the current smoke tests.
>   - Zuul reports back status of the checks to Gerrit.
>     - If checks fail, developer will need to resend the change after
>       the required fixes. The process starts once more.
>     - If the checks pass, the change is now ready for review
> - The change is now reviewed by other developers and maintainers.
>   Non-maintainers will be able to give only a +1 review.
>   - On a negative review, the developer will need to rework the change
>     and resend it. The process starts once more.
> - The maintainer gives a +2 review once he/she is satisfied. The
>   maintainer's work is done here.
>   - Zuul is notified of the +2 review
> - Zuul runs the regression runs and reports back the status.
>   - If the regression runs fail, the process starts over again.
>   - If the runs pass, the change is ready for acceptance.
> - Zuul will pick the change into the repository.
>   - If the pick fails, Zuul will report back the failure, and the
>     process starts once again.
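
To make the proposed flow concrete, the quoted steps can be sketched as
a small decision function. This is only an illustration, not Zuul
configuration: `gate_change`, `Outcome` and the callables passed in are
hypothetical names, and the real checks would be Jenkins jobs.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    RESEND = "developer reworks and resends the change"
    AWAITING_PLUS_2 = "smoke passed; waiting for a maintainer +2"
    MERGED = "change picked into the repository"


def gate_change(
    run_smoke: Callable[[], bool],
    review_score: int,
    run_regression: Callable[[], bool],
    pick: Callable[[], bool],
) -> Outcome:
    """Walk one change through the proposed flow (hypothetical sketch)."""
    # Pre-review checks (the current smoke tests) run on every push.
    if not run_smoke():
        return Outcome.RESEND
    # A negative review sends the change back to the developer.
    if review_score < 0:
        return Outcome.RESEND
    # Regression runs are not started before a maintainer gives +2.
    if review_score < 2:
        return Outcome.AWAITING_PLUS_2
    # Only now are the expensive regression runs triggered.
    if not run_regression():
        return Outcome.RESEND
    # Finally the change is picked into the repository.
    return Outcome.MERGED if pick() else Outcome.RESEND
```

With only a +1 review, `gate_change(lambda: True, 1, run_regression, pick)`
returns `Outcome.AWAITING_PLUS_2` without ever calling `run_regression`,
which is where the proposal expects to save infra time.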

It would be nice if Zuul, in its last step, could pick the change on top
of the latest HEAD, run the build/smoke test again, and only push the
change when all is OK. We have seen patch/merge races where one patch
changed a function/define and another patch used that function/define.
These caused many problems when the branch then failed to compile. Being
able to prevent that would be very good.
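
The kind of race meant here can be shown with a toy model. This is a
sketch under stated assumptions, not how Zuul or Jenkins work
internally: `Tree`, `Patch`, `apply_patch`, `builds` and
`merge_if_green` are made-up names, and "building" is reduced to
checking that every used symbol is still defined.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Tree:
    defines: frozenset = frozenset()  # functions/defines that exist
    uses: frozenset = frozenset()     # functions/defines that are called


@dataclass(frozen=True)
class Patch:
    adds: frozenset = frozenset()
    removes: frozenset = frozenset()
    uses: frozenset = frozenset()


def apply_patch(tree: Tree, patch: Patch) -> Tree:
    return Tree(
        defines=(tree.defines - patch.removes) | patch.adds,
        uses=tree.uses | patch.uses,
    )


def builds(tree: Tree) -> bool:
    """Toy build check: nothing may use a symbol that is not defined."""
    return tree.uses <= tree.defines


def merge_if_green(head: Tree, patch: Patch) -> Tuple[Tree, bool]:
    """Pick the patch on top of the latest HEAD and re-check before pushing."""
    candidate = apply_patch(head, patch)
    if builds(candidate):
        return candidate, True
    return head, False  # HEAD stays untouched; the failure is reported back


base = Tree(defines=frozenset({"old_helper"}))
rename = Patch(adds=frozenset({"new_helper"}), removes=frozenset({"old_helper"}))
caller = Patch(uses=frozenset({"old_helper"}))

# Both patches build when tested against the base they were posted on.
assert builds(apply_patch(base, rename)) and builds(apply_patch(base, caller))

head, merged = merge_if_green(base, rename)         # first patch goes in
head, merged_caller = merge_if_green(head, caller)  # second one is caught
```

Each patch passes on its own, but the second is rejected once it is
re-checked against the HEAD that already contains the first.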

> Following this flow should,
> 1. Reduce regression wait time

"wait time" for what or who? The merging of the patch would still only
happen after all tests are done. If something fails in that last test
run, more people (reviewers and the maintainer) need to spend additional
time.

> 2. Improve change acceptance time
> 3. Reduce unnecessary wastage of infra resources

We could, and should, optimize that with our parallel testing and by
educating developers to only re-run regressions when needed. Splitting
up the regression tests also makes it possible to re-run only a small
part of the tests.
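
As a rough illustration of that last point, assume the regression suite
is a flat list of test files dealt into fixed chunks. The file names and
the helpers `split_tests` and `chunks_to_rerun` are invented for this
sketch. Only the chunks containing a failed test then need to run again:

```python
from typing import Iterable, List


def split_tests(tests: List[str], n_chunks: int) -> List[List[str]]:
    """Deal the tests round-robin into n_chunks lists."""
    return [tests[i::n_chunks] for i in range(n_chunks)]


def chunks_to_rerun(chunks: List[List[str]], failed: Iterable[str]) -> List[int]:
    """Return the indexes of the chunks that contain a failed test."""
    failed_set = set(failed)
    return [i for i, chunk in enumerate(chunks) if failed_set.intersection(chunk)]


# Made-up test names, for illustration only.
tests = [f"tests/example-{n}.t" for n in range(6)]
chunks = split_tests(tests, 3)
rerun = chunks_to_rerun(chunks, ["tests/example-4.t"])
```

Here a single failure means re-running one chunk of two tests instead of
all six.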

> 4. Improve infra stability.

Not sure if adding another component and (complex?) configuration helps
to "Improve infra stability". It would be nice to have a very minimal
set of tools, and many people who understand them. With the current
Gerrit and Jenkins configuration, we already seem to be very limited in
people who can investigate issues.

> It also brings in the drawback that we need to maintain one more piece
> of infra (Zuul). This would be an additional maintenance overhead on
> top of Gerrit, Jenkins and the current slaves. But I feel the
> reduction in the upkeep efforts of the slaves would be enough to
> offset this.
> 
> tl;dr
> Current auto-triggering of regression runs is stupid and a waste of
> time and resources. Bring in a project gating system, Zuul, which can
> do much more intelligent job triggering, and use it to
> automatically trigger regression only for changes with Reviewed+2 and
> automatically merge ones that pass.
> 
> What does the community think of this?

It is a good suggestion, but I would hold off on spending time on this
until we see some results from the parallel testing. Also, as Michael
mentioned, we should aim for tools that are packaged for EPEL-7, so that
we can keep our infrastructure well managed on RHEL or CentOS systems.

Thanks,
Niels

> ~kaushal
> 
> [1]: http://docs.openstack.org/infra/manual/developers.html#automated-testing
> [2]: https://www.mediawiki.org/wiki/Continuous_integration/Workflow
> [3]: http://docs.openstack.org/infra/zuul/