[Gluster-devel] Brick-Mux tests failing for over 11+ weeks

Wed May 16 20:11:55 UTC 2018

On 05/16/2018 10:34 AM, Shyam Ranganathan wrote:
> Some further analysis based on what Mohit commented on the patch:
> 
> 1) gf_attach used to kill a brick is taking more time, causing timeouts
> in tests, mainly br-state-check.t. Usually when there are back to back
> kill_bricks in the test.

Invalid root cause. The failure seems to be the clients (in this test
case the bitrot daemon and scrubber) actually losing connection to other
bricks due to a ping timeout, when one of the bricks (the first brick) is
terminated (or detached in mux parlance).

The above should not happen, and hence points back to glusterd and brick
daemon interaction causing some mayhem in this case.

> 
> 2) Problem in ./tests/bugs/replicate/bug-1363721.t seems to be that
> kill_brick has not completed before an attach request, causing it to be
> a duplicate attach and hence dropped/ignored? (speculation)
> 
> Writing a test case to see if this is reproducible in that short case!

The modified version of the test case did not help.

There was a core that I encountered which looks like the detach and
subsequent attach of a brick may have races, and hence cause some
disruption to the test case. Will send a follow-up to this mail with the
details.

> 
> The above replicate test seems to also have a different issue when it
> compares the md5sums towards the end of the tests (can be seen in the
> console logs), which seems to be unrelated to brick-mux, (see:
> https://build.gluster.org/job/centos7-regression/853/console for
> example). Would be nice if someone from the replicate team took a look
> at this one.

Ran the test case as is on a local setup using the latest patch on
master. There is a failure in comparing md5sums across the bricks,
towards the end of the test, and this happens quite regularly (I would
estimate 1 in 4 tries). The replicate team has been made aware of this,
so that they can look into the problem further.
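
To put a number on an intermittent failure like this one, a rerun loop
is usually enough. The following is a minimal sketch, not part of the
test framework: `failure_rate` is a hypothetical helper, and it assumes
the test command exits non-zero on failure.

```python
import subprocess


def failure_rate(cmd, runs):
    """Run cmd `runs` times and return (failures, runs).

    A run counts as failed when the command exits non-zero.
    Output is discarded; only the exit status is inspected.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures, runs
```

For example, assuming the test is launched with prove from the top of
the source tree (an assumption, substitute your own invocation):
failure_rate(["prove", "-vf", "tests/bugs/replicate/bug-1363721.t"], 20)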

> 
> 3) ./tests/bugs/index/bug-1559004-EMLINK-handling.t seems to be a
> timeout in most (if not all cases), stuck in the last iteration.

The added timeout seems to have helped in this case.

> 
> I will be modifying the patch (discussed in this thread) to add more
> time for 1 and 3 failures, and fire off a few more regressions, as I
> try to reproduce 2.
> 
> Shyam
> P.S: If work is happening on these issues, request that the
> data/analysis be posted to the lists, reduces rework!
> 
> On 05/15/2018 09:10 PM, Shyam Ranganathan wrote:
>> Hi,
>>
>> After the fix provided by Atin here [1] for the issue reported below, we
>> ran 7-8 runs of brick mux regressions against this fix, and we have had
>> 1/3 runs successful (even those have some tests retried). The run links
>> are in the review at [1].
>>
>> The failures are as below, sorted in descending order of frequency.
>> Requesting respective component owners/peers to take a stab at root
>> causing these, as the current pass rate is not sufficient to qualify the
>> release (or master) as stable.
>>
>> 1) ./tests/bitrot/br-state-check.t (bitrot folks please take a look,
>> this has the maximum instances of failures, including a core in the run [2])
>>
>> 2) ./tests/bugs/replicate/bug-1363721.t (Replicate component owners
>> please note, there are some failures in GFID comparison that seem to be
>> outside of mux cases as well)
>>
>> 3) ./tests/bugs/distribute/bug-1543279.t (Distribute)
>>
>> ./tests/bugs/index/bug-1559004-EMLINK-handling.t (I think we need to up
>> the SCRIPT timeout on this, if someone can confirm looking at the runs
>> and failures, it would help determining the same)
>>
>> ------ We can possibly wait to analyze things below this line as the
>> instance count is 2 or less ------
>>
>> 4)  ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>>
>> ./tests/bugs/snapshot/bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t
>>     ./tests/bugs/quota/bug-1293601.t
>>
>> 5)  ./tests/bugs/distribute/bug-1161311.t
>>     ./tests/bitrot/bug-1373520.t
>>
>> Thanks,
>> Shyam
>>
>> [1] Review containing the fix and the regression run links for logs:
>> https://review.gluster.org/#/c/20022/3
>>
>> [2] Test with core:
>> https://build.gluster.org/job/regression-on-demand-multiplex/20/
>> On 05/14/2018 08:31 PM, Shyam Ranganathan wrote:
>>> *** Calling out to Glusterd folks to take a look at this ASAP and
>>> provide a fix. ***
>>>
>>> Further to the mail sent yesterday, work done during my day with Johnny
>>> (RaghuB) points to a problem in the glusterd rpc port map having stale
>>> entries for certain bricks as the cause for connection failures when
>>> running in the multiplex mode.
>>>
>>> It seems like this problem has been partly addressed in this bug:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1545048
>>>
>>> What is occurring now is that glusterd retains older ports in its
>>> mapping table against bricks that have recently terminated, when a
>>> volume is stopped and restarted. This leads to connection failures from
>>> clients, as there are no listeners on the now stale port.
>>>
>>> Test case as in [1], when run on my F27 machine fails 1 in 5 times with
>>> the said error.
>>>
>>> The above does narrow down failures in tests:
>>> - lk-quorum.t
>>> - br-state-check.t
>>> - entry-self-heal.t
>>> - bug-1363721.t (possibly)
>>>
>>> Failures can be seen in the client mount logs as the wrong port number
>>> being used, in messages like "[rpc-clnt.c:2069:rpc_clnt_reconfig]
>>> 6-patchy-client-2: changing port to 49156 (from 0)". When there are
>>> failures, the real port for the brick-mux process is different.
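
The port a client was told to use can be pulled out of the mount log
mechanically. Below is a minimal sketch, assuming each log entry sits on
a single line and has the shape of the rpc_clnt_reconfig message quoted
above; `port_changes` is a hypothetical helper. Comparing the result
against the port the brick-mux process actually listens on is left to
the reader.

```python
import re

# Matches the rpc_clnt_reconfig message, e.g.
# "... 6-patchy-client-2: changing port to 49156 (from 0)"
PORT_CHANGE = re.compile(
    r"rpc_clnt_reconfig\]\s+(?P<client>\S+):\s+"
    r"changing port to (?P<new>\d+) \(from (?P<old>\d+)\)"
)


def port_changes(lines):
    """Yield (client, old_port, new_port) for each reconfig message."""
    for line in lines:
        match = PORT_CHANGE.search(line)
        if match:
            yield (
                match.group("client"),
                int(match.group("old")),
                int(match.group("new")),
            )
```

Feeding it an open log file, as in list(port_changes(open(path))),
gives one tuple per port change, which makes a stale port easy to spot
next to the port of the running brick process.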
>>>
>>> We also used gdb to inspect glusterd pmap registry and found that older
>>> stale port map data is present (in function pmap_registry_search as
>>> clients invoke a connection).
>>>
>>> Thanks,
>>> Shyam
>>>
>>> On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
>>>> Hi,
>>>>
>>>> Nigel pointed out that the nightly brick-mux tests have now been failing
>>>> for about 11 weeks and we do not have a clear run of the same.
>>>>
>>>> Spent some time on Friday collecting what tests failed and to an extent
>>>> why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672
>>>>
>>>> Asks: Whoever has cycles please look into these failures ASAP as these
>>>> failing tests are blockers for the 4.1 release, and overall the state of
>>>> master (and hence the 4.1 release branch) is not clean when these tests
>>>> have been failing for over 11 weeks.
>>>>
>>>> Most of the tests fail if run on a local setup as well, so debugging the
>>>> same should be easier than requiring the mux or regression setup, just
>>>> ensure that mux is turned on (either by default in the code base you are
>>>> testing, or in the test case by adding the line `TEST $CLI volume set all
>>>> cluster.brick-multiplex on` after any cleanup and post starting glusterd).
>>>>
>>>> 1) A lot of test cases time out, of which, the following 2 have the most
>>>> failures, and hence possibly can help with the debugging of the root
>>>> cause faster. Request Glusterd and bitrot teams to look at this, as the
>>>> failures do not seem to be in replicate or client side layers (at present).
>>>>
>>>> (number in brackets is # times this failed in the last 13 instances of
>>>> mux testing)
>>>> ./tests/basic/afr/entry-self-heal.t (4)
>>>> ./tests/bitrot/br-state-check.t (8)
>>>>
>>>> 2)
>>>> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
>>>>
>>>> The above test constantly fails at this point:
>>>> ------------
>>>> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
>>>> of a volume
>>>> 16:46:28 not ok 25 , LINENUM:47
>>>> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
>>>> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
>>>> ------------
>>>>
>>>> From the logs the failure is occurring from here:
>>>> ------------
>>>> [2018-05-03 16:47:12.728893] E [MSGID: 106053]
>>>> [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
>>>> 0-management: Failed to set extended attribute trusted.add-brick :
>>>> Transport endpoint is not connected [Transport endpoint is not connected]
>>>> [2018-05-03 16:47:12.741438] E [MSGID: 106073]
>>>> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
>>>> add bricks
>>>> ------------
>>>>
>>>> This seems like the added brick is not accepting connections.
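
When scanning a glusterd log for entries like the two above, a small
filter for error-level lines can save time. This is a minimal sketch,
assuming each entry sits on one line in the layout shown in the excerpt
(timestamp, level, optional MSGID, source location, domain, message);
`error_entries` is a hypothetical helper.

```python
import re

# Layout assumed from the excerpt above:
# [timestamp] LEVEL [MSGID: n] [file:line:function] domain: message
LOG_ENTRY = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<domain>\S+):\s+(?P<msg>.*)$"
)


def error_entries(lines):
    """Yield a dict of fields for every entry logged at level E."""
    for line in lines:
        match = LOG_ENTRY.match(line.strip())
        if match and match.group("level") == "E":
            yield match.groupdict()
```

Each yielded dict carries the MSGID and the function name, so entries
can be grouped by where they were raised.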
>>>>
>>>> 3) The following tests also show similar behaviour to (2), where the AFR
>>>> checks for brick up fail after timeout, as the brick is not accepting
>>>> connections.
>>>>
>>>> ./tests/bugs/replicate/bug-1363721.t (4)
>>>> ./tests/basic/afr/lk-quorum.t (5)
>>>>
>>>> I would suggest someone familiar with mux process and also brick muxing
>>>> look at these from the initialization/RPC/socket front, as these seem to
>>>> be bricks that do not show errors in the logs but are failing connections.
>>>>
>>>> As we find different root causes, we may want different bugs than the
>>>> one filed; please file them and post patches in an effort to move this
>>>> forward.
>>>>
>>>> Thanks,
>>>> Shyam
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at gluster.org
>>>> http://lists.gluster.org/mailman/listinfo/gluster-devel