[Gluster-Maintainers] [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

Shyam Ranganathan srangana at redhat.com
Wed May 16 01:10:07 UTC 2018


Hi,

After the fix provided by Atin in [1] for the issue reported below, we
ran 7-8 brick-mux regression runs against it, and only about 1 in 3 runs
has been successful (even those had some tests retried). The run links
are in the review at [1].

The failures are as below, sorted in descending order of frequency.
Requesting the respective component owners/peers to take a stab at
root-causing these, as the current pass rate is not sufficient to
qualify the release (or master) as stable.

1) ./tests/bitrot/br-state-check.t (bitrot folks, please take a look;
this has the most failure instances, including a core in the run [2])

2) ./tests/bugs/replicate/bug-1363721.t (Replicate component owners,
please note that there are some failures in the GFID comparison that
seem to occur outside of the mux cases as well)

3) ./tests/bugs/distribute/bug-1543279.t (Distribute)

./tests/bugs/index/bug-1559004-EMLINK-handling.t (I think we need to
increase the SCRIPT timeout on this one; if someone can confirm by
looking at the runs and failures, that would help settle it. See the
sketch after this list.)

------ We can possibly wait to analyze things below this line as the
instance count is 2 or less ------

4)  ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t

./tests/bugs/snapshot/bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t
    ./tests/bugs/quota/bug-1293601.t

5)  ./tests/bugs/distribute/bug-1161311.t
    ./tests/bitrot/bug-1373520.t
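
Following up on the SCRIPT timeout note for the EMLINK test above: if my
memory of run-tests.sh serves, it honors a per-test override via a
SCRIPT_TIMEOUT=<seconds> line in the .t file. If that holds, the bump in
tests/bugs/index/bug-1559004-EMLINK-handling.t would look roughly like
the below (the value 400 is only illustrative; please confirm the
mechanism against the current run-tests.sh before relying on it):
------------
#!/bin/bash
SCRIPT_TIMEOUT=400   # raise the per-test timeout; pick a value based on observed run times

. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc

cleanup;
# ... rest of the test unchanged
------------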

Thanks,
Shyam

[1] Review containing the fix and the regression run links for logs:
https://review.gluster.org/#/c/20022/3

[2] Test with core:
https://build.gluster.org/job/regression-on-demand-multiplex/20/

On 05/14/2018 08:31 PM, Shyam Ranganathan wrote:
> *** Calling out to Glusterd folks to take a look at this ASAP and
> provide a fix. ***
> 
> Further to the mail sent yesterday, the work Johnny (RaghuB) and I did
> during my day points to stale entries in the glusterd RPC port map for
> certain bricks as the cause of the connection failures when running in
> multiplex mode.
> 
> It seems like this problem has been partly addressed in this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1545048
> 
> What is occurring now is that glusterd retains older ports in its
> mapping table for bricks that have recently terminated. When a volume
> is stopped and restarted, this leads to connection failures from
> clients, as there is no listener on the now-stale port.
> 
> The test case in [1], when run on my F27 machine, fails 1 in 5 times
> with the said error.
> 
> The above narrows down the failures in these tests:
> - lk-quorum.t
> - br-state-check.t
> - entry-self-heal.t
> - bug-1363721.t (possibly)
> 
> In the client mount logs, the failure shows up as the client using the
> wrong port number, in messages like "[rpc-clnt.c:2069:rpc_clnt_reconfig]
> 6-patchy-client-2: changing port to 49156 (from 0)"; when this happens,
> the real port for the brick-mux process is different.
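> 
> A quick way to see the mismatch on a failing run (the mount log path
> below is an assumption based on the test framework's default mount
> point, so adjust to your setup):
> ------------
> # port the client was told to use, from the mount log
> grep "changing port to" /var/log/glusterfs/mnt-glusterfs-0.log | tail -n 3
> 
> # port glusterd reports for the brick, and what is actually listening
> gluster volume status patchy | grep "/d/backends"
> ss -lntp | grep glusterfsd
> ------------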
> 
> We also used gdb to inspect the glusterd pmap registry and found that
> older, stale port map data is present (observed in pmap_registry_search
> as clients invoke a connection).
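> 
> For anyone who wants to repeat that inspection, a rough sketch of such
> a gdb session (assumes glusterd has debug symbols; the function, field
> and parameter names are from glusterd-pmap.c as I recall them, so
> please verify against your tree; 49156 is just the stale port seen in
> the log above):
> ------------
> gdb -p $(pidof glusterd) \
>     -ex 'break pmap_registry_search' \
>     -ex 'continue'
> # trigger a client mount so the breakpoint is hit, then:
> # (gdb) print brickname
> # (gdb) print pmap_registry_get(this)->ports[49156]
> ------------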
> 
> Thanks,
> Shyam
> 
> On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
>> Hi,
>>
>> Nigel pointed out that the nightly brick-mux tests have now been
>> failing for about 11 weeks, and we do not have a clean run in that period.
>>
>> I spent some time on Friday collecting which tests failed and, to an
>> extent, why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672
>>
>> Asks: whoever has cycles, please look into these failures ASAP. These
>> failing tests are blockers for the 4.1 release, and the overall state
>> of master (and hence the 4.1 release branch) is not clean while these
>> tests have been failing for over 11 weeks.
>>
>> Most of the tests fail when run on a local setup as well, so debugging
>> them should be easier than requiring the mux or regression setup; just
>> ensure that mux is turned on, either by default in the code base you
>> are testing, or by adding the line `TEST $CLI volume set all
>> cluster.brick-multiplex on` to the test case after cleanup and after
>> glusterd is started.
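>>
>> As a minimal sketch, the top of a local repro .t would look something
>> like the below (standard test framework constructs from include.rc and
>> volume.rc; the replica-3 volume is only an example, adjust it to the
>> test you are chasing):
>> ------------
>> #!/bin/bash
>> . $(dirname $0)/../../include.rc
>> . $(dirname $0)/../../volume.rc
>>
>> cleanup;
>>
>> TEST glusterd
>> TEST pidof glusterd
>> # turn brick multiplexing on before any volume is created/started
>> TEST $CLI volume set all cluster.brick-multiplex on
>>
>> TEST $CLI volume create $V0 replica 3 $H0:$B0/${V0}{0,1,2}
>> TEST $CLI volume start $V0
>> ------------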
>>
>> 1) A lot of test cases time out; of these, the following two have the
>> most failures and hence can possibly help in debugging the root cause
>> faster. Requesting the Glusterd and bitrot teams to look at this, as
>> the failures do not seem to be in the replicate or client-side layers
>> (at present).
>>
>> (the number in brackets is the number of times the test failed in the
>> last 13 instances of mux testing)
>> ./tests/basic/afr/entry-self-heal.t (4)
>> ./tests/bitrot/br-state-check.t (8)
>>
>> 2)
>> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
>>
>> The above test consistently fails at this point:
>> ------------
>> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
>> of a volume
>> 16:46:28 not ok 25 , LINENUM:47
>> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
>> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
>> ------------
>>
>> From the logs, the failure originates here:
>> ------------
>> [2018-05-03 16:47:12.728893] E [MSGID: 106053]
>> [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
>> 0-management: Failed to set extended attribute trusted.add-brick :
>> Transport endpoint is not connected [Transport endpoint is not connected]
>> [2018-05-03 16:47:12.741438] E [MSGID: 106073]
>> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
>> add bricks
>> ------------
>>
>> It seems like the added brick is not accepting connections.
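>>
>> A quick check on the node when this reproduces would be something like
>> the below (volume name as in the test; the brick log file name is an
>> assumption, derived from the brick path):
>> ------------
>> # does glusterd consider the new brick online, and on which port?
>> gluster volume status patchy
>>
>> # anything in the brick (mux) log around the add-brick time?
>> tail -n 50 /var/log/glusterfs/bricks/d-backends-patchy3.log
>> ------------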
>>
>> 3) The following tests show behaviour similar to (2), where the AFR
>> checks for the brick being up fail after a timeout, as the brick is not
>> accepting connections (see the sketch after the test list below).
>>
>> ./tests/bugs/replicate/bug-1363721.t (4)
>> ./tests/basic/afr/lk-quorum.t (5)
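>>
>> The checks that time out in these tests are typically of the shape
>> below (helper names are from tests/volume.rc as I recall them, so
>> please double-check against the tree):
>> ------------
>> # wait until glusterd reports the brick as up
>> EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}0
>>
>> # wait until the client (AFR) sees that brick as connected
>> EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 0
>> ------------
>> If it is the second check that never passes while the brick-mux process
>> is alive, that would again point at the client failing to connect,
>> matching (2).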
>>
>> I would suggest that someone familiar with the brick process and brick
>> multiplexing look at these from the initialization/RPC/socket front, as
>> these seem to be bricks that show no errors in the logs but still fail
>> connections.
>>
>> As we find different root causes, we may want separate bugs from the
>> one already filed; please file them and post patches to move this forward.
>>
>> Thanks,
>> Shyam
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>