[Gluster-devel] good job on fixing heavy hitters in spurious regressions

Fri May 8 18:09:17 UTC 2015

On 05/08/2015 01:27 PM, Pranith Kumar Karampuri wrote:
>
> On 05/08/2015 06:45 PM, Shyam wrote:
>> On 05/08/2015 08:16 AM, Jeff Darcy wrote:
>>>>> Here are some of the things that I can think of: 0) Maintainers
>>>>> should also maintain tests that are in their component.
>>>>
>>>> It is not possible for me as glusterd co-maintainer to 'maintain'
>>>> tests that are added under tests/bugs/glusterd. Most of them don't
>>>> test core glusterd functionality.  They are almost always tied to a
>>>> particular feature whose implementation had bugs in its glusterd code.
>>>> I would expect the test authors (esp. the more recent ones) to chip
>>>> in.
>>>
>>> Good point.  Nobody should be penalized for having code that everyone
>>> else touches (or rewarded for having code that nobody dares to).
>>> First responsibility for debugging a regression-test failure lies
>>> with the owner of the patch that failed.  If they determine that the
>>> failure is spurious - which is easy if it's already on a list - then
>>> responsibility falls to the owner of the test.  Either should be able
>>> to draw on the expertise of others in the group, but that doesn't
>>> shift *responsibility*.  Only when a problem has been tracked down to
>>> a particular piece of production code should responsibility move
>>> again - either to the person whose earlier patch caused the breakage,
>>> or to the subsystem maintainer.
>>
>> +1, could not have said it better, but chipping in my vote for the
>> order, owner of patch first -> owner of test -> maintainer of module,
>> with the said responsibility in place.
> I don't know man, this order failed me so many times in the past with
> spurious failures for both 3.6.0 (Sent around 15 patches at that time to
> fix spurious failures) and now for 3.7.0. also had to spend reasonable
> amount of time including the long weekend keeping away from ec
> deliverables for this release. I am very happy that lot more people
> helped solve the problems this time around because I feel they
> experienced the pain of screwed up regressions and realize the
> importance of fixing them.
> I'll tell you what experiences lead me to suggest that the maintainer
> take this responsibility.
>
> I submit a patch for new-component/changing log-level of one of the logs
> for which there is not a single caller after you moved it from INFO ->
> DEBUG. So the code is not at all going to be executed. Yet the
> regressions will fail. I am 100% sure it has nothing to do with my
> patch. I neither have time nor expertise to debug the test that I have
> no clue about, so the least I can do is to intimate people who may do
> something about it i.e. owner of test or maintainer of module. You feel
> lets ask the owner of the test about what the problem is, owner of the
> test moves on to different component and is busy with their own work. So
> you are left with going to the maintainer who tells you so and so is the
> problem and so and so is the reason as soon as you show the test number,
> so you end up feeling why didn't I ask him/her first.

I agree on the experience and have no questions/comments on that.

The deal is though, we at least had people (including myself) ignoring 
spurious failures, re-triggering jobs, to get that +1 V and move on. 
Which causes issues, as the failures could have been at least flagged 
for others to be aware of and take forward. Which is the reason the 
patch submitter is the first to take a stab at correcting the problem.

We can always mail the owner next and then move on to the maintainer 
(which is what you did).

I would say we stop any merge if its history shows no analysis of 
regression failures (assuming regression failures occurred). This is a 
maintainer role that does this. (changing tracks: Let's please be more 
vocal in Gerrit, we can mark issues addressed as "Done" and also write a 
few lines on what is going on in terms of tests done etc.)

On people moving on and having no time for older areas/code, I guess we 
have to be stronger maintainers, as they are the only ones left to deal 
with the problem then. In which case, we should deal with this earlier 
than later, mandating _quality_ test cases for changes, than accepting 
things on the face of the code. Of course this is not correctable in 
retrospection now, so something for the future.

>
>>>
>>> Mostly this is just common sense.  Perhaps the change that's needed
>>> is to make the fixing of likely-spurious test failures a higher
>>> priority than adding new features.  That has to be reflected not
>>> only in Bugzilla, but also in how we schedule individual developers'
>>> time and evaluate their progress toward goals.
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>