[Gluster-devel] Need help diagnosing regression-test crashes

Jeff Darcy jdarcy at redhat.com
Fri Apr 8 17:03:55 UTC 2016


I've been trying to figure out what's causing CentOS regression tests to
fail with core dumps that are quite clearly unrelated to the patches whose
runs are affected.  I've managed to find a couple of clues, and I'm posting
them here in the hope that someone else will recognize something and zero
in on the problem faster than I can with my own digging.

The first clue is that the cores are being found at the conclusion of
the following test.

        bugs/stripe/bug-1002207.t

However, I think that's a bit of a red herring.  When I look in the logs
from that test, I find the following messages that seem related to the
crash.

> [2016-04-08 07:11:23.319610] E [MSGID: 101019]
> [xlator.c:430:xlator_init] 0-patchy-posix: Initialization of volume
> 'patchy-posix' failed, review your volfile again
> [2016-04-08 07:11:23.319628] E [MSGID: 101066]
> [graph.c:324:glusterfs_graph_init] 0-patchy-posix: initializing
> translator failed
> [2016-04-08 07:11:23.319643] E [MSGID: 101176]
> [graph.c:670:glusterfs_graph_activate] 0-graph: init failed
> [2016-04-08 07:11:23.320773] I [MSGID: 101190]
> [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread
> with index 2
> pending frames:
> frame : type(0) op(0)
> patchset: git://git.gluster.com/glusterfs.git
> signal received: 11

This is consistent with some other things I've seen in these failures:
the crashes are either in graph-teardown code or in socket code, but
either way they seem to occur immediately after we've failed to
initialize a new graph.
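If anyone wants to check this on their own machine, a backtrace from one of the collected cores should confirm where we crash.  Here's a minimal sketch; the core and binary paths in the usage comment are placeholders for wherever the regression harness actually leaves them, not known locations:

```shell
# Print a full backtrace from a core file, guarding against a missing core.
# Both arguments are assumptions about the harness's on-disk layout.
bt_from_core() {
    core=$1
    binary=$2
    if [ ! -f "$core" ]; then
        echo "no core file at $core" >&2
        return 1
    fi
    gdb -batch -ex 'thread apply all bt full' "$binary" "$core"
}

# Example (hypothetical paths):
# bt_from_core /archived-builds/core.1234 /build/install/sbin/glusterfsd
```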

Here's the interesting part.  Those lines came from this log file:

        bricks/d-backends-1-patchy_snap_mnt.log

This is a stripe test.  It doesn't do anything with snapshots.  However,
here's the test that runs immediately before it.

        bugs/snapshot/bug-1322772-real-path-fix-for-snapshot.t

That test clearly does have something to do with snapshots, and even
uses a name consistent with the name of the log file associated with the
failure.  Thus, in addition to whatever bug is actually causing the
process to crash, we seem to have a problem with snapshot processes from
one test persisting into the next.
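If that's what's happening, the leftover processes should be visible in the process table between the two tests.  A minimal sketch of a check, assuming the lingering brick's command line still mentions the snapshot backend path we see in the log-file name above (the grep pattern is an assumption):

```shell
# Filter a ps-style listing for entries that mention the previous test's
# snapshot backend.  Reads the listing on stdin, so it can be fed from a
# live `ps -ef` or from a saved listing.
find_leftover_snap_procs() {
    grep 'patchy_snap' || true
}

# Example, run in the gap between the two tests:
# ps -ef | find_leftover_snap_procs
```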

That's where you, the reader, come in.  I have three questions.

 (a) Where should we look for the original bug that causes the crashes?

 (b) Is there another bug that's allowing snapshot processes to persist
     beyond their proper lifetime?

 (c) What should our test-infrastructure code do to protect against the
     possibility of (b)?
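For (c), one obvious option is a guard the harness runs between tests that kills anything left over from the previous one.  A minimal sketch; the daemon names passed in the example are the standard gluster process names, and where exactly the harness would call this is left open:

```shell
# Kill any lingering daemons before the next test starts.  pkill returns
# non-zero when nothing matched, which is the normal (clean) case here.
cleanup_leftover_daemons() {
    for proc in "$@"; do
        pkill -x "$proc" 2>/dev/null || true
    done
    # Give sockets and mounts a moment to be released.
    sleep 1
}

# Example, run between tests:
# cleanup_leftover_daemons glusterd glusterfsd glusterfs
```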
