[Gluster-devel] Regression tests time

Tue Jan 23 20:58:12 UTC 2018

Hi,

I've run some experiments [1] on the time the CentOS regression takes
to complete. After some changes, the time taken to run a full regression has
dropped to between 2.5 and 3.5 hours (depending on the run time of 2 tests,
see below).

Basically the changes are related to delays manually introduced in some
places (sleeps in test files or even in the code, or delays in timer
events). I've replaced some sleeps with better ways to detect the relevant
condition, and I've left the delays in other places but with reduced time.
The values used are probably not the best ones in all cases, but this
highlights that we should seriously consider how we detect things instead
of simply waiting for some amount of time (and hoping it's enough). The total
test time is more than 2 hours shorter with these changes, which means that
more than 2 hours of the whole regression time was spent waiting
unnecessarily.
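
As an illustration of detecting a condition instead of sleeping for a fixed
time, here is a minimal Python sketch. The name wait_until and its
parameters are hypothetical (they are not taken from the test framework or
from the patch); the idea is only to poll until the condition holds or a
timeout expires, so the wait ends as soon as possible.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.1):
    """Poll condition() until it returns a true value or timeout expires.

    Hypothetical helper for illustration only.
    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

With something like this, a fixed "sleep 30" becomes a wait that returns
after a fraction of a second in the common case and only uses the full
timeout when something is really wrong.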

There are still some issues that I've been unable to solve. Probably the
most critical is the time taken by a couple of tests:

   - tests/bugs/nfs/bug-1053579.t
   - tests/bugs/fuse/many-groups-for-acl.t

These tests take around a minute when they work fine (~60 and ~45 seconds),
but sometimes they take much longer (~45 and ~30 minutes) without
failing. The difference is in the time that it takes to create some system
groups and users.

For example, one of the things the first test does is create 200 groups.
This is done in ~25 seconds in fast cases and in ~15 minutes in slow cases.
This means that sometimes creating each group takes more than 4 seconds,
while other times it takes around 100 milliseconds. This is a more than
30x difference.
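
The arithmetic behind those per-group figures can be checked with a short
sketch. It uses only the approximate numbers quoted above (200 groups, ~25
seconds, ~15 minutes); the variable names are just for illustration.

```python
GROUPS = 200

fast_total_s = 25        # ~25 seconds in fast cases
slow_total_s = 15 * 60   # ~15 minutes in slow cases

fast_per_group_s = fast_total_s / GROUPS      # seconds per group, fast case
slow_per_group_s = slow_total_s / GROUPS      # seconds per group, slow case
ratio = slow_per_group_s / fast_per_group_s   # slowdown factor

print(f"fast: {fast_per_group_s * 1000:.0f} ms per group")
print(f"slow: {slow_per_group_s:.1f} s per group")
print(f"ratio: {ratio:.0f}x")
```

This gives about 125 ms per group in the fast case, 4.5 seconds in the slow
case, and a factor of 36 between them.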

I'm not sure what the cause of this is. If the slaves are connected to
some external Kerberos or LDAP source, maybe there are network issues
(or service unavailability) at times that cause timeouts or delays. On
my local system (Fedora 27) I see high CPU usage by the sssd_be process
during group creation. I'm not sure why, or whether it also happens on the
slaves, but it seems a good candidate. However, on my system it always seems
to take about 25 seconds to complete.

Even after the changes, the tests are still full of sleeps. There's one of
180 seconds (bugs/shard/parallel-truncate-read.t). Not sure if it's really
necessary, but there are many more with smaller delays between 1 and 60
seconds. Assuming that each sleep is only executed once, the total time
spent in sleeps is still 15 minutes.
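
A rough way to get this kind of estimate is to add up the literal sleep
values found in the test files. The sketch below is one possible approach,
not necessarily how the figure above was obtained. The function names are
hypothetical, and it makes the same assumption (each sleep runs once). It
also ignores sleeps whose duration comes from a variable.

```python
import re
from pathlib import Path

# Matches "sleep 5" or "sleep 0.5"; "sleep $DELAY" and "usleep 5" do not match.
SLEEP_RE = re.compile(r"\bsleep\s+(\d+(?:\.\d+)?)")


def sleep_seconds(script_text):
    """Sum the literal sleep durations in one script, skipping comment lines."""
    total = 0.0
    for line in script_text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        for value in SLEEP_RE.findall(line):
            total += float(value)
    return total


def total_sleep_seconds(root):
    """Sum sleep_seconds() over every *.t file found under root."""
    return sum(
        sleep_seconds(path.read_text(errors="replace"))
        for path in Path(root).rglob("*.t")
    )
```

Calling total_sleep_seconds("tests") from the top of a source tree would
give a lower bound in seconds, since any sleep inside a loop is counted
only once.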

I still need to fix some tests that seem to be failing often after the
changes.

Xavi

[1] https://review.gluster.org/19254