[Gluster-devel] Regression tests time

Xavi Hernandez jahernan at redhat.com
Thu Jan 25 14:37:27 UTC 2018


On Thu, Jan 25, 2018 at 3:03 PM, Jeff Darcy <jeff at pl.atyp.us> wrote:

>
>
>
> On Wed, Jan 24, 2018, at 9:37 AM, Xavi Hernandez wrote:
>
> That happens when we use arbitrary delays. If we use an explicit check, it
> will work on all systems.
>
>
> You're arguing against a position not taken. I'm not expressing opposition
> to explicit checks. I'm just saying they don't come for free. If you don't
> believe me, try adding explicit checks in some of the harder cases where
> we're waiting for something that's subject to OS scheduling delays, or for
> large numbers of operations to complete. Geo-replication or multiplexing
> tests should provide some good examples. Adding explicit conditions is the
> right thing to do in the abstract, but as a practical matter the returns
> must justify the cost.
>
> BTW, some of our longest-running tests are in EC. Do we need all of those,
> and do they all need to run as long, or could some be eliminated/shortened?
>

Some tests were already removed some time ago. Anyway, with the changes
introduced, it now takes between 10 and 15 minutes to execute all EC-related
tests from basic/ec and bugs/ec (an average of 16 to 25 seconds per test).
Before the changes, the same tests took between 30 and 60 minutes.

AFR tests have also improved from almost 60 minutes to around 30.


> I agree that parallelizing tests is the way to go, but if we reduce the
> total time by 50%, the parallelized tests will also take 50% less
> time.
>
>
> Taking 50% less time but failing spuriously 1% of the time, or all of the
> time in some environments, is not a good thing. If you want to add explicit
> checks that's great, but you also mentioned shortening timeouts and that's
> much more risky.
>

If we have a single test that takes 45 minutes (as we currently see in
some executions: bugs/nfs/bug-1053579.t), parallelization won't help much.
We need to make this test run faster.
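
To give an idea of the kind of change that helps here (this is only a
generic sketch of the pattern, not the actual contents of that test), the
idea is to replace arbitrary sleeps with an explicit, bounded check using
the EXPECT_WITHIN helper from include.rc. The helper and variable names
used below (is_nfs_export_available, mount_nfs, $V0, $H0, $N0,
$NFS_EXPORT_TIMEOUT) follow the usual test framework conventions:

    # Before: wait a fixed time and hope the NFS export is ready.
    sleep 20
    TEST mount_nfs $H0:/$V0 $N0 nolock

    # After: poll for the real condition, bounded by a timeout, so the
    # test continues as soon as the export appears instead of always
    # paying the worst-case delay.
    EXPECT_WITHIN $NFS_EXPORT_TIMEOUT "1" is_nfs_export_available
    TEST mount_nfs $H0:/$V0 $N0 nolock

The same pattern applies to waiting for bricks to come up or for self-heal
and rebalance to finish: check the actual state with EXPECT_WITHIN instead
of sleeping.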

Some tests that were failing after the changes have revealed errors in the
tests themselves, or even in the code, so I think it's a good thing.
Currently I'm investigating what seems to be a race in the RPC layer during
connections that causes some tests to fail. This is a real problem that
high delays or slow machines were hiding. It seems to cause some Gluster
requests to fail spuriously after reconnecting to a brick or to glusterd.
I'm not 100% sure about this yet, but the initial analysis seems to
indicate that.

Xavi