[Gluster-devel] Brick-Mux tests failing for over 11+ weeks

Sun May 13 22:56:05 UTC 2018

Hi,

Nigel pointed out that the nightly brick-mux tests have now been failing
for about 11 weeks, and we have not had a clean run in that time.

I spent some time on Friday collecting which tests failed and, to an
extent, why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672

Asks: Whoever has cycles, please look into these failures ASAP. These
failing tests are blockers for the 4.1 release, and overall the state of
master (and hence of the 4.1 release branch) is not clean while these
tests have been failing for over 11 weeks.

Most of the tests also fail when run on a local setup, so debugging them
should be easier than it would be if the mux or regression setup were
required. Just ensure that mux is turned on, either by default in the code
base you are testing or by adding the line `TEST $CLI volume set all
cluster.brick-multiplex on` to the test case, after any cleanup and after
starting glusterd.
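For anyone who would rather not hand-edit each test, here is a minimal sketch of a helper that prints a copy of a test with the mux line added. The function name `enable_mux_in_test` is hypothetical (it is not part of the test framework), and it assumes the test starts the daemon with a line beginning `TEST glusterd`. Tests that start glusterd differently would need the line added by hand.

```shell
#!/bin/bash
# Hypothetical helper, not part of the Gluster test framework.
# Prints the test file given as $1 with the brick-multiplex line inserted
# right after the first line that begins with "TEST glusterd".
# Assumption: the test starts the daemon with such a line.
enable_mux_in_test() {
    awk '
        { print }
        !inserted && /^TEST glusterd/ {
            # "$CLI" is literal text here; awk does not expand it.
            print "TEST $CLI volume set all cluster.brick-multiplex on"
            inserted = 1
        }
    ' "$1"
}
```

The output goes to stdout, so the original test stays untouched, e.g. `enable_mux_in_test tests/bitrot/br-state-check.t > /tmp/br-state-check-mux.t`. Note that the extra TEST line shifts the test numbers reported by the harness by one.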

1) A lot of test cases time out. Of these, the following 2 have the most
failures and hence may help find the root cause faster. Request the
Glusterd and bitrot teams to look at these, as the failures do not seem to
be in the replicate or client-side layers (at present).

(the number in brackets is the number of times the test failed in the
last 13 instances of mux testing)
./tests/basic/afr/entry-self-heal.t (4)
./tests/bitrot/br-state-check.t (8)

2)
./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)

The above test consistently fails at this point:
------------
16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
of a volume
16:46:28 not ok 25 , LINENUM:47
16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
------------

From the logs, the failure originates here:
------------
[2018-05-03 16:47:12.728893] E [MSGID: 106053]
[glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
0-management: Failed to set extended attribute trusted.add-brick :
Transport endpoint is not connected [Transport endpoint is not connected]
[2018-05-03 16:47:12.741438] E [MSGID: 106073]
[glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
add bricks
------------

It seems that the added brick is not accepting connections.
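To check whether a local run hits the same failure, a small sketch that pulls the two error entries quoted above out of a glusterd log may help. The function name `find_add_brick_errors` is hypothetical, and the log path is left to the caller because it depends on the setup.

```shell
#!/bin/bash
# Hypothetical helper: print the error-level log entries that carry the two
# message IDs quoted above (106053 and 106073).
# $1 is the path to a glusterd log; where that log lives depends on your setup.
# Exit status is grep's: 0 if at least one entry matched, 1 if none did.
find_add_brick_errors() {
    grep -E 'E \[MSGID: (106053|106073)\]' "$1"
}
```

A non-zero exit status means neither message was logged, which would suggest the local failure is a different one from the regression run.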

3) The following tests also show similar behaviour to (2), where the AFR
checks for brick up fail after the timeout, as the brick is not accepting
connections.

./tests/bugs/replicate/bug-1363721.t (4)
./tests/basic/afr/lk-quorum.t (5)

I would suggest that someone familiar with the mux process and brick
muxing look at these from the initialization/RPC/socket front, as these
seem to be bricks that do not show errors in the logs but are failing
connections.
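As a first check from the socket side, a plain TCP connect attempt against the brick port can tell whether anything is accepting connections at all. This is only a sketch: the function name `brick_port_accepts` is hypothetical, it relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command, and the port has to be looked up separately (for example from the output of `gluster volume status`). A successful connect only shows that the socket is listening, not that the brick has finished initializing.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if a TCP connect to host:port
# completes within two seconds, fail otherwise.
# Needs bash for the /dev/tcp redirection and coreutils for timeout.
brick_port_accepts() {
    local host=$1 port=$2
    # Inside the child shell, $0 is the host and $1 is the port.
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

Usage would be along the lines of `brick_port_accepts <host> <port> && echo accepting || echo refused`, with the host and port of the brick process under test filled in.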

As we find different root causes, we may want separate bugs from the
one filed; please file them and post patches in an effort to move this
forward.

Thanks,
Shyam