[Gluster-Maintainers] [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

Tue May 15 00:31:34 UTC 2018

*** Calling out to Glusterd folks to take a look at this ASAP and
provide a fix. ***

Further to the mail sent yesterday, work done during my day with Johnny
(RaghuB) points to stale entries in the glusterd RPC port map for
certain bricks as the cause of connection failures when running in
multiplex mode.

It seems like this problem has been partly addressed in this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1545048

What is occurring now is that glusterd retains older ports in its
mapping table against bricks that have recently terminated. When a
volume is stopped and restarted, this leads to connection failures from
clients, as there are no listeners on the now stale port.
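
To make the failure mode concrete, here is a toy model of it as a shell
sketch. This is not glusterd's pmap code: the table layout, the function
names (pmap_register, pmap_remove, pmap_search), the brick path and the
port 49152 are all made up for illustration, and the "first match wins"
lookup is an assumption about how a stale entry ends up being returned.

```shell
# Toy port map: one "port brick" entry per line.
# Hypothetical model only, NOT glusterd's real data structure.
PMAP=""

# pmap_register <port> <brick>: record that <brick> listens on <port>.
pmap_register() {
    PMAP="${PMAP}$1 $2
"
}

# pmap_remove <port> <brick>: drop the entry (what should happen on brick exit).
pmap_remove() {
    PMAP=$(printf '%s' "$PMAP" | grep -v -x -F -- "$1 $2")
    # Command substitution strips the trailing newline, so put it back.
    [ -n "$PMAP" ] && PMAP="${PMAP}
"
    return 0
}

# pmap_search <brick>: print the first port recorded for <brick>.
pmap_search() {
    printf '%s' "$PMAP" | awk -v b="$1" '$2 == b { print $1; exit }'
}

# First start: the brick registers on 49156.
pmap_register 49156 /d/backends/patchy2
# Volume stop: the brick exits, but its entry is never removed (the bug).
# Volume start: the brick comes up again on a different (hypothetical) port.
pmap_register 49152 /d/backends/patchy2

# A client asking for the brick is handed the stale port.
pmap_search /d/backends/patchy2    # prints 49156
```

In this model, calling `pmap_remove 49156 /d/backends/patchy2` at volume
stop makes the same lookup return 49152, the port that actually has a
listener.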

The test case in [1], when run on my F27 machine, fails 1 in 5 times
with the said error.

This narrows down the failures in the following tests:
- lk-quorum.t
- br-state-check.t
- entry-self-heal.t
- bug-1363721.t (possibly)

The failure can be seen in the client mount logs, where messages like
"[rpc-clnt.c:2069:rpc_clnt_reconfig] 6-patchy-client-2: changing port to
49156 (from 0)" show the client using the wrong port number. When there
are failures, the real port for the brick-mux process is different.
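
For anyone wanting to check their own client logs for this, a small
sketch that pulls the client name and port out of rpc_clnt_reconfig
messages like the one quoted above. The function name reconfig_ports is
made up, and the pattern assumes each message sits on a single line in
the log.

```shell
# Read client mount log text on stdin and print "<client> <port>" for
# every "changing port to" message logged by rpc_clnt_reconfig.
# reconfig_ports is a hypothetical helper name, not part of gluster.
reconfig_ports() {
    sed -n 's/.*rpc_clnt_reconfig[]] *\([^:]*\): changing port to \([0-9][0-9]*\).*/\1 \2/p'
}

# Example with the message from this mail:
printf '%s\n' \
  '[rpc-clnt.c:2069:rpc_clnt_reconfig] 6-patchy-client-2: changing port to 49156 (from 0)' \
  | reconfig_ports
# prints: 6-patchy-client-2 49156
```

The ports printed can then be compared with the ports that `gluster
volume status` reports for the same bricks; a mismatch would be
consistent with the stale port map entries described here.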

We also used gdb to inspect the glusterd pmap registry and found that
older, stale port map data is present (in function pmap_registry_search,
as clients invoke a connection).

Thanks,
Shyam

On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
> Hi,
> 
> Nigel pointed out that the nightly brick-mux tests have now been failing
> for about 11 weeks, and we have not had a clean run in that time.
> 
> Spent some time on Friday collecting what tests failed and to an extent
> why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672
> 
> Asks: Whoever has cycles, please look into these failures ASAP, as these
> failing tests are blockers for the 4.1 release, and overall the state of
> master (and hence the 4.1 release branch) is not clean when these tests
> have been failing for over 11 weeks.
> 
> Most of the tests fail if run on a local setup as well, so debugging the
> same should be easier than requiring the mux or regression setup. Just
> ensure that mux is turned on (either by default in the code base you are
> testing, or by adding the line `TEST $CLI volume set all
> cluster.brick-multiplex on` to the test case after any cleanup and after
> starting glusterd).
> 
> 1) A lot of test cases time out, of which the following 2 have the most
> failures, and hence can possibly help with debugging the root cause
> faster. Request the Glusterd and bitrot teams to look at this, as the
> failures do not seem to be in the replicate or client-side layers (at
> present).
> 
> (number in brackets is # times this failed in the last 13 instances of
> mux testing)
> ./tests/basic/afr/entry-self-heal.t (4)
> ./tests/bitrot/br-state-check.t (8)
> 
> 2)
> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
> 
> The above test constantly fails at this point:
> ------------
> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
> of a volume
> 16:46:28 not ok 25 , LINENUM:47
> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
> ------------
> 
> From the logs the failure is occurring from here:
> ------------
> [2018-05-03 16:47:12.728893] E [MSGID: 106053]
> [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
> 0-management: Failed to set extended attribute trusted.add-brick :
> Transport endpoint is not connected [Transport endpoint is not connected]
> [2018-05-03 16:47:12.741438] E [MSGID: 106073]
> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
> add bricks
> ------------
> 
> This seems like the added brick is not accepting connections.
> 
> 3) The following tests also show similar behaviour to (2), where the AFR
> check for brick up fails after a timeout, as the brick is not accepting
> connections.
> 
> ./tests/bugs/replicate/bug-1363721.t (4)
> ./tests/basic/afr/lk-quorum.t (5)
> 
> I would suggest someone familiar with mux process and also brick muxing
> look at these from the initialization/RPC/socket front, as these seem to
> be bricks that do not show errors in the logs but are failing connections.
> 
> As we find different root causes, we may want separate bugs from the
> one filed; please file them and post patches in an effort to move this
> forward.
> 
> Thanks,
> Shyam
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mux-failure.t
Type: application/x-perl
Size: 1829 bytes
Desc: not available
URL: <http://lists.gluster.org/pipermail/maintainers/attachments/20180514/5f02bc44/attachment.pl>