[Gluster-devel] good job on fixing heavy hitters in spurious regressions

Fri May 8 12:16:55 UTC 2015

> The deluge of regression failures is a direct consequence of last minute
> merges during (extended) feature freeze. We did well to contain this. Great
> stuff!
> If we want to avoid this we should not accept (large) feature merges just
> before feature freeze.

I would add that we shouldn't accept *many* feature merges just before
freeze either.  Even if each one is small, their interactions tend to
multiply exponentially.  For example, many of the failures recently have
been timing-related.  The cause has not been a single patch making
everything significantly slower.  Rather, it has been an accumulation
of smaller slow-downs in multiple patches.  If we had staged those
changes better, debugging and fixing those problems in smaller batches
would have been much easier.
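
To make the accumulation point concrete, here is a rough sketch with
made-up numbers (the timeout, baseline, and per-patch slow-downs are all
hypothetical, and compounding the slow-downs multiplicatively is itself
an assumption).  No single patch comes close to the timeout on its own,
but merged together the last one tips the test over.

```python
# Hypothetical numbers, for illustration only; none come from real runs.
TIMEOUT = 30.0    # seconds a test waits before declaring failure
BASELINE = 20.0   # seconds the operation took before the merges
SLOWDOWNS = [0.08, 0.10, 0.07, 0.12, 0.09]  # fractional slow-down per patch


def first_failing_patch(baseline, slowdowns, timeout):
    """Return (patch_number, elapsed) for the first patch that pushes the
    elapsed time past the timeout, or None if none does.

    Assumes slow-downs compound multiplicatively.
    """
    elapsed = baseline
    for number, slowdown in enumerate(slowdowns, start=1):
        elapsed *= 1.0 + slowdown
        if elapsed > timeout:
            return number, elapsed
    return None


if __name__ == "__main__":
    # All patches merged together: the test eventually times out.
    print(first_failing_patch(BASELINE, SLOWDOWNS, TIMEOUT))
    # Each patch on its own: none of them triggers the timeout.
    for s in SLOWDOWNS:
        print(s, first_failing_patch(BASELINE, [s], TIMEOUT))
```

With these numbers only the fifth merge crosses the timeout, so the
failure gets pinned on a patch that is no worse than the other four.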

> > Here are some of the things that I can think of: 0) Maintainers
> > should also maintain tests that are in their component.
>
> It is not possible for me as glusterd co-maintainer to 'maintain'
> tests that are added under tests/bugs/glusterd. Most of them don't
> test core glusterd functionality.  They are almost always tied to a
> particular feature whose implementation had bugs in its glusterd code.
> I would expect the test authors (esp. the more recent ones) to chip
> in.

Good point.  Nobody should be penalized for having code that everyone
else touches (or rewarded for having code that nobody dares to touch).
First responsibility for debugging a regression-test failure lies
with the owner of the patch that failed.  If they determine that the
failure is spurious - which is easy if it's already on a list - then
responsibility falls to the owner of the test.  Either should be able
to draw on the expertise of others in the group, but that doesn't
shift *responsibility*.  Only when a problem has been tracked down to
a particular piece of production code should responsibility move
again - either to the person whose earlier patch caused the breakage,
or to the subsystem maintainer.
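
The hand-off rule above can be written down as a small decision
function.  This is only a sketch of the rule as I've described it, not
an existing tool; the names (Owner, responsible_party, and the two
flags) are invented for illustration.

```python
from enum import Enum


class Owner(Enum):
    # Hypothetical labels for the three parties named in the text.
    PATCH_OWNER = "owner of the patch that failed"
    TEST_OWNER = "owner of the test"
    CULPRIT_OR_MAINTAINER = "author of the breaking patch, or subsystem maintainer"


def responsible_party(is_spurious, traced_to_production_code):
    """Return who is responsible for a regression-test failure.

    is_spurious: the patch owner has determined the failure is spurious
        (for example, it is already on a known list).
    traced_to_production_code: the problem has been tracked down to a
        particular piece of production code.
    """
    if traced_to_production_code:
        return Owner.CULPRIT_OR_MAINTAINER
    if is_spurious:
        return Owner.TEST_OWNER
    # Default: first responsibility lies with the failing patch's owner.
    return Owner.PATCH_OWNER
```

Asking others for help changes neither flag, which is the point:
expertise can be borrowed without moving responsibility.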

Mostly this is just common sense.  Perhaps the change that's needed
is to make the fixing of likely-spurious test failures a higher
priority than adding new features.  That has to be reflected not
only in Bugzilla, but also in how we schedule individual developers'
time and evaluate their progress toward goals.