[Gluster-devel] Need help diagnosing regression-test crashes

Fri Apr 8 20:35:27 UTC 2016

Upon further investigation, I've been able to determine that the problem
lies in this line of our generic cleanup routine.

        type cleanup_lvm &>/dev/null && cleanup_lvm || true;

This works great if we're at the end of a test that included
snapshot.rc (which defines cleanup_lvm), but we've generally been moving
away from end-of-test cleanup in favor of calling cleanup only at the
beginning.  Thus, when we go from a snapshot test to a non-snapshot
test, the cleanup at the beginning of the latter does *not* clean up any
LVM stuff that's left over.  What might have been a simple and correctly
attributed failure in the snapshot test can instead show up later.  In
this case, the sequence of events is as follows:

 1) bug-1322772 (snapshot) test starts glusterd

 2) bug-1322772 exits while the new glusterd is still initializing

 3) run-tests.sh looks for new core files and finds none

 4) run-tests.sh starts bug-1002207 (stripe) test

 5) glusterd from bug-1322772 dumps core

 6) bug-1002207 test completes

 7) run-tests.sh sees new core and misattributes it to bug-1002207
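
The root of the sequence above is that the guard silently does nothing
when cleanup_lvm is undefined.  A minimal sketch that reproduces just
that behavior follows.  Only the guard line is taken from the real
cleanup routine; generic_cleanup, the lvm_cleaned flag and the stub
cleanup_lvm are hypothetical stand-ins for illustration.

```shell
#!/bin/bash
# Hypothetical stand-in for the generic cleanup routine; only the guard
# line itself comes from the real code.
generic_cleanup() {
    type cleanup_lvm &>/dev/null && cleanup_lvm || true;
}

# Case 1: a non-snapshot test.  snapshot.rc was never sourced, so
# cleanup_lvm is undefined and the guard skips LVM cleanup without
# reporting anything.
lvm_cleaned=no
generic_cleanup
without_rc=$lvm_cleaned

# Case 2: a snapshot test.  This stub stands in for the cleanup_lvm that
# snapshot.rc would define; it only records that it was called.
cleanup_lvm() { lvm_cleaned=yes; }
generic_cleanup
with_rc=$lvm_cleaned

echo "without snapshot.rc: lvm_cleaned=$without_rc"
echo "with snapshot.rc:    lvm_cleaned=$with_rc"
```

In the first case nothing fails and nothing is logged, which is why the
leftover state only becomes visible during a later test.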

The question is what to do about this.  Unconditionally calling
cleanup_lvm from generic cleanup is simple, but might make regression
tests noticeably slower.  Another possibility would be to change all
snapshot tests to call cleanup (or at least cleanup_lvm) at the end, or
use bash's "trap" mechanism to ensure the same.  I'm not wild about any
of those, but lean toward the "trap" approach.  Anyone else have any
opinions?
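
For concreteness, here is a rough sketch of what the "trap" approach
could look like.  It is not taken from the test framework:
run_snapshot_test and the stub cleanup_lvm are hypothetical, and a real
snapshot test would presumably install the trap right after sourcing
snapshot.rc rather than wrapping its body in a function.

```shell
#!/bin/bash
# Stub standing in for the cleanup_lvm that snapshot.rc defines.
cleanup_lvm() { echo "cleanup_lvm called"; }

# Hypothetical snapshot test.  The body runs in a subshell, so the EXIT
# trap fires when the test finishes, whether it runs to completion or
# bails out early.
run_snapshot_test() (
    trap 'cleanup_lvm' EXIT
    echo "test body running"
    exit 1    # simulate the test exiting early
)

status=0
output=$(run_snapshot_test) || status=$?

echo "$output"
echo "exit status: $status"
```

The trap runs cleanup_lvm even though the test exits early, and the
test's own exit status is still what the caller sees.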