[Gluster-devel] Difference in bad_tests count in mainline vs 3.7 branch

Fri Sep 4 07:26:07 UTC 2015

> Maintainers - can you please take stock of this and ensure sanity of your
> components before merging patches that do not fix a failing test?
>
>
Here is my proposal to get this fixed.

This weekend, 5th September 0400 UTC, I will start a jenkins run on master
and 3.7 branches.

   - It will be re-based with code just before it is run, so all patches
   merged by 4th September would be tested.
   - It will run each test 10 times in succession. Why 10?
      - To find tests that fail only occasionally.
      - If a test fails only on the first run, it could very well be a
      cleanup issue with the previously run test.
      - Failures that follow a pattern within the 10 runs are again
      indicative of some cleanup/timeout error.
   - It will run all tests and not stop at the first failure.
   - I will modify the scripts to collect as much data as possible from the
   logs. (Logging will still be at INFO level.)
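
The reasoning behind the 10 runs can be sketched as a small classifier over the pass/fail outcomes of one test. This is only an illustrative sketch in Python, not part of the Jenkins scripts: the function name, the labels, and the "evenly spaced failures" heuristic for spotting a pattern are all assumptions made for the example.

```python
from typing import List


def classify_runs(results: List[bool]) -> str:
    """Classify repeated runs of one test (True = pass, False = fail).

    The labels and the pattern heuristic are hypothetical, chosen only
    to illustrate the reasoning in the plan above.
    """
    failures = [i for i, ok in enumerate(results) if not ok]
    if not failures:
        return "stable"
    if len(failures) == len(results):
        return "always-fails"
    if failures == [0]:
        # Only the first run failed: suspect leftover state from the
        # test that ran just before this one.
        return "first-run-only"
    # Assumption: three or more evenly spaced failures count as a
    # pattern, hinting at a recurring cleanup/timeout problem.
    gaps = {b - a for a, b in zip(failures, failures[1:])}
    if len(failures) >= 3 and len(gaps) == 1:
        return "patterned"
    return "intermittent"
```

A test classified as "first-run-only" or "patterned" would be a candidate for a cleanup or timeout investigation before anyone looks at Gluster code.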

After the run completes, I will file a bug against the component of each .t
test that fails in this run and immediately add that test to the bad tests
list.

What should the maintainers do after that?

   - If a bug is filed against your component, please spend some time on
   Monday and root cause the issue by Monday EOD.
   - If the root cause proves that the bug is in the .t file:
      - It would mostly be because
         - The timeouts are not always long enough. Change the EXPECT_WITHIN
         values and check.
         - The test is not deterministic enough; some of the assumptions
         that the test makes might not always be true. For example, a
         SIGTERM followed by a TEST which assumes that the process is
         definitely killed is a wrong assumption. Use SIGKILL in such
         cases. (I know SIGKILL may not work either if the process is in
         D state, but it's a good enough example.)
      - It is easier to fix bugs in a .t file once the root cause is found.
      Please fix the issue and remove the test from the bad tests list.
      Use the bug filed against this .t file.
   - If the root cause proves that the bug is in Gluster code:
      - If the bug is in the same component as the .t file:
         - In this case you are the component owner; change the
         description and summary of the filed bug to indicate the actual
         issue.
         - If the time required to fix the issue in Gluster code is
         non-trivial:
            - Put a workaround in the .t file, with a comment clearly
            stating the bug number which would later fix it, and remove
            the test from the bad tests list.
            - If a workaround is not possible, let the test remain in the
            bad tests list.
      - If the bug is not in the same component as the .t file:
         - Update the bug with details which prove that the bug is not in
         the same component and change the component accordingly.
         - It is the new owner's responsibility to provide a workaround for
         all .t files hit by the issue and to fix the code.
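
To illustrate the SIGTERM point above: a test should not assume the process is gone as soon as the signal is sent, but should poll for the exit within a timeout and only then escalate. The sketch below shows the idea in Python rather than in the .t framework itself; `wait_for_exit` and `stop_process` are hypothetical helper names, the grace period is an arbitrary assumption, and a POSIX system is assumed.

```python
import signal
import subprocess
import sys
import time


def wait_for_exit(proc: subprocess.Popen, timeout_s: float,
                  interval_s: float = 0.1) -> bool:
    """Poll until proc has exited or timeout_s elapses; True if it exited."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True
        time.sleep(interval_s)
    return proc.poll() is not None


def stop_process(proc: subprocess.Popen, grace_s: float = 5.0) -> str:
    """Send SIGTERM, verify the exit, and escalate to SIGKILL if needed."""
    proc.send_signal(signal.SIGTERM)
    if wait_for_exit(proc, grace_s):
        return "terminated"
    # SIGTERM was ignored or handled too slowly: escalate.
    proc.kill()
    # Even SIGKILL is not guaranteed to take effect immediately (for
    # example, a process in D state), so check again instead of assuming.
    return "killed" if wait_for_exit(proc, grace_s) else "stuck"
```

The important part is that the caller learns which of the three outcomes happened, so a later check never runs against a process that is merely assumed to be dead.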

Note to all maintainers:

   - I would request everyone to resist merging patches this weekend unless
   critically required. It would help us in debugging on Monday.

Let's hope that when we do a similar Jenkins run next weekend, September
12th, we don't find any failures.

Suggestions welcome for any changes in the above plan.

Thanks,
Raghavendra Talur