<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Jan 25, 2018 at 3:03 PM, Jeff Darcy <span dir="ltr">&lt;<a href="mailto:jeff@pl.atyp.us" target="_blank">jeff@pl.atyp.us</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>


<div><span class="gmail-"><div style="font-family:Arial"><br></div>

<div><br></div>

<div><br></div>

<div>On Wed, Jan 24, 2018, at 9:37 AM, Xavi Hernandez wrote:<br></div>

<blockquote type="cite"><div dir="ltr"><div><div><div>That happens when we use arbitrary delays. If we use an explicit check, it will work on all systems.<br></div>

</div>

</div>

</div>

</blockquote><div style="font-family:Arial"><br></div>

</span><div style="font-family:Arial">You&#39;re arguing against a position not taken. I&#39;m not expressing opposition to explicit checks. I&#39;m just saying they don&#39;t come for free. If you don&#39;t believe me, try adding explicit checks in some of the harder cases where we&#39;re waiting for something that&#39;s subject to OS scheduling delays, or for large numbers of operations to complete. Geo-replication or multiplexing tests should provide some good examples. Adding explicit conditions is the right thing to do in the abstract, but as a practical matter the returns must justify the cost.<br></div>

<div style="font-family:Arial"><br></div>

<div style="font-family:Arial">BTW, some of our longest-running tests are in EC. Do we need all of those, and do they all need to run as long, or could some be eliminated/shortened?</div></div></blockquote><div><br></div><div>Some tests were already removed some time ago. In any case, with the changes introduced, it takes between 10 and 15 minutes to execute all EC-related tests from basic/ec and bugs/ec (an average of 16 to 25 seconds per test). Before the changes, the same tests took between 30 and 60 minutes.</div><div><br></div><div>AFR tests have also improved, from almost 60 minutes to around 30.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><span class="gmail-">

<div style="font-family:Arial"><br></div>

<blockquote type="cite"><div dir="ltr"><div><div><div>I agree that parallelizing tests is the way to go, but if we reduce the total time to 50%, the parallelized tests will also take 50% less time.<br></div>

</div>

</div>

</div>

</blockquote><div style="font-family:Arial"><br></div>

</span><div style="font-family:Arial">Taking 50% less time but failing spuriously 1% of the time, or all of the time in some environments, is not a good thing. If you want to add explicit checks that&#39;s great, but you also mentioned shortening timeouts and that&#39;s much more risky.<br></div>

</div>


</blockquote></div><br></div><div class="gmail_extra">If we have a single test that takes 45 minutes (as we currently see in some executions of bugs/nfs/bug-1053579.t), parallelization won&#39;t help much. We need to make this test run faster.</div><div class="gmail_extra"><br></div><div class="gmail_extra">Some tests that started failing after the changes have revealed errors in the tests themselves or even in the code, so I think that&#39;s a good thing. I&#39;m currently investigating what seems to be a race in the RPC layer during connections that causes some tests to fail. This is a real problem that long delays or slow machines were hiding. It seems to cause some gluster requests to fail spuriously after reconnecting to a brick or glusterd. I&#39;m not 100% sure about this yet, but the initial analysis points in that direction.</div><div class="gmail_extra"><br></div><div class="gmail_extra">Xavi</div></div>
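<div dir="ltr"><div class="gmail_extra"><br></div><div class="gmail_extra">P.S. To make the "explicit check instead of arbitrary delay" idea concrete, here is a minimal, generic sketch in Python. It is not taken from the Gluster test framework: wait_until and start_fake_brick are hypothetical names used only for illustration, and the "brick" is just a timer that becomes ready after a short delay. The point is that the wait ends as soon as the condition holds, while the timeout only bounds the worst case.</div>
<pre>
```python
import threading
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float, interval: float = 0.05) -> bool:
    """Poll check() until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def start_fake_brick(delay: float) -> threading.Event:
    """Hypothetical stand-in for a service that becomes ready after `delay` seconds."""
    ready = threading.Event()
    threading.Timer(delay, ready.set).start()
    return ready


if __name__ == "__main__":
    ready = start_fake_brick(0.2)
    start = time.monotonic()
    # Explicit check: returns shortly after the service is ready,
    # instead of always sleeping for a fixed worst-case delay.
    ok = wait_until(ready.is_set, timeout=5.0)
    print("ready:", ok, "elapsed:", round(time.monotonic() - start, 2))
```
</pre>
<div class="gmail_extra">With this pattern a generous timeout costs nothing on fast machines, because it is only consumed when the condition is slow to become true. That is why an explicit check is safer than simply shortening a fixed delay.</div></div>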