[Gluster-Maintainers] [Gluster-devel] Release 5: Master branch health report (Week of 30th July)

Nithya Balachandran nbalacha at redhat.com
Fri Aug 3 14:50:02 UTC 2018


On 31 July 2018 at 22:11, Atin Mukherjee <amukherj at redhat.com> wrote:

> I just went through the nightly regression report of brick mux runs and
> here's what I can summarize.
>
> ============================================================================
> Fails only with brick-mux
> ============================================================================
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
> 400 secs. Refer https://fstat.gluster.org/failure/209?state=2&start_
> date=2018-06-30&end_date=2018-07-31&branch=all, specifically the latest
> report https://build.gluster.org/job/regression-test-burn-in/4051/
> consoleText . It wasn't timing out as frequently as it was until 12 July,
> but since 27 July it has timed out twice. Beginning to believe commit
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and that 400
> secs is no longer sufficient (Mohit?)
>

One of the failed regression-test-burn-in runs was an actual failure, not a
timeout:
https://build.gluster.org/job/regression-test-burn-in/4049

The brick disconnects from glusterd:

[2018-07-27 16:28:42.882668] I [MSGID: 106005]
[glusterd-handler.c:6129:__glusterd_brick_rpc_notify] 0-management: Brick
builder103.cloud.gluster.org:/d/backends/vol01/brick0 has disconnected from
glusterd.
[2018-07-27 16:28:42.891031] I [MSGID: 106143]
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick
/d/backends/vol01/brick0 on port 49152
[2018-07-27 16:28:42.892379] I [MSGID: 106143]
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick (null) on
port 49152
[2018-07-27 16:29:02.636027]:++++++++++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 56 _GFS
--attribute-timeout=0 --entry-timeout=0 -s builder103.cloud.gluster.org
--volfile-id=patchy-vol20 /mnt/glusterfs/vol20 ++++++++++


So the client cannot connect to the bricks after this as it never gets the
port info from glusterd. From mnt-glusterfs-vol20.log:

[2018-07-27 16:29:02.769947] I [MSGID: 114020] [client.c:2329:notify]
0-patchy-vol20-client-1: parent translators are ready, attempting connect
on transport
[2018-07-27 16:29:02.770677] E [MSGID: 114058]
[client-handshake.c:1518:client_query_portmap_cbk]
0-patchy-vol20-client-0: failed
to get the port number for remote subvolume. Please run 'gluster volume
status' on server to see if brick process is running.
[2018-07-27 16:29:02.770767] I [MSGID: 114018]
[client.c:2255:client_rpc_notify] 0-patchy-vol20-client-0: disconnected
from patchy-vol20-client-0. Client process will keep trying to connect to
glusterd until brick's port is available
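
The two log streams can be correlated mechanically during triage. A minimal
sketch, using the excerpts above as sample input (the /tmp/*.sample paths are
made up for illustration; on a builder you would grep the real glusterd log
and mount log instead):

```shell
# Hypothetical triage helper: correlate the pmap removal in the glusterd log
# with the portmap-query failure in the client (mount) log. The sample lines
# below are taken verbatim from the excerpts above, rejoined onto one line.
cat > /tmp/glusterd.log.sample <<'EOF'
[2018-07-27 16:28:42.891031] I [MSGID: 106143] [glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick /d/backends/vol01/brick0 on port 49152
EOF
cat > /tmp/mnt-glusterfs-vol20.log.sample <<'EOF'
[2018-07-27 16:29:02.770677] E [MSGID: 114058] [client-handshake.c:1518:client_query_portmap_cbk] 0-patchy-vol20-client-0: failed to get the port number for remote subvolume.
EOF
# MSGID 106143 = pmap registry removal, MSGID 114058 = portmap query failure;
# a removal timestamp preceding the query failure matches the theory above.
grep 'MSGID: 106143' /tmp/glusterd.log.sample
grep 'MSGID: 114058' /tmp/mnt-glusterfs-vol20.log.sample
```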


From the brick logs:
[2018-07-27 16:28:34.729241] I [login.c:111:gf_auth] 0-auth/login: allowed
user names: 2b65c380-392e-459f-b722-c130aac29377
[2018-07-27 16:28:34.945474] I [MSGID: 115029]
[server-handshake.c:786:server_setvolume] 0-patchy-vol01-server: accepted
client from
CTX_ID:72dcd65e-2125-4a79-8331-48c0fe9abce7-GRAPH_ID:0-PID:8483-HOST:builder103.cloud.gluster.org-PC_NAME:patchy-vol06-client-2-RECON_NO:-0
(version: 4.2dev)
[2018-07-27 16:28:35.946588] I [MSGID: 101016]
[glusterfs3.h:739:dict_to_xdr] 0-dict: key 'glusterfs.xattrop_index_gfid'
is would not be sent on wire in future [Invalid argument]  <--- Last
brick log entry. It looks like the brick went down at this point.
[2018-07-27 16:29:02.636027]:++++++++++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 56 _GFS
--attribute-timeout=0 --entry-timeout=0 -s builder103.cloud.gluster.org
--volfile-id=patchy-vol20 /mnt/glusterfs/vol20 ++++++++++
[2018-07-27 16:29:12.021827]:++++++++++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 83 dd
if=/dev/zero of=/mnt/glusterfs/vol20/a_file bs=4k count=1 ++++++++++
[2018-07-27 16:29:12.039248]:++++++++++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 87 killall
-9 glusterd ++++++++++
[2018-07-27 16:29:17.073995]:++++++++++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 89 killall
-9 glusterfsd ++++++++++
[2018-07-27 16:29:22.096385]:++++++++++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 95 glusterd
++++++++++
[2018-07-27 16:29:24.481555] I [MSGID: 100030] [glusterfsd.c:2728:main]
0-/build/install/sbin/glusterfsd: Started running
/build/install/sbin/glusterfsd version 4.2dev (args:
/build/install/sbin/glusterfsd -s builder103.cloud.gluster.org --volfile-id
patchy-vol01.builder103.cloud.gluster.org.d-backends-vol01-brick0 -p
/var/run/gluster/vols/patchy-vol01/builder103.cloud.gluster.org-d-backends-vol01-brick0.pid
-S /var/run/gluster/f4d6c8f7c3f85b18.socket --brick-name
/d/backends/vol01/brick0 -l
/var/log/glusterfs/bricks/d-backends-vol01-brick0.log --xlator-option
*-posix.glusterd-uuid=0db25f79-8880-4f2d-b1e8-584e751ff0b9 --process-name
brick --brick-port 49153 --xlator-option
patchy-vol01-server.listen-port=49153)


From /var/log/messages:
Jul 27 16:28:42 builder103 kernel: [ 2902]     0  2902  3777638   200036
  2322        0             0 glusterfsd
...
Jul 27 16:28:42 builder103 kernel: Out of memory: Kill process 2902
(glusterfsd) score 418 or sacrifice child
Jul 27 16:28:42 builder103 kernel: Killed process 2902 (glusterfsd)
total-vm:15110552kB, anon-rss:800144kB, file-rss:0kB, shmem-rss:0kB
Jul 27 16:30:01 builder103 systemd: Created slice User Slice of root.

Possible OOM kill?
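
The kernel lines above can be checked for mechanically. A minimal sketch,
using the /var/log/messages excerpt above as sample input (the
/tmp/messages.sample path is made up for illustration; on the builder you
would grep the real /var/log/messages or `dmesg` output):

```shell
# Hypothetical check: confirm the oom-killer took out glusterfsd, and pull
# out how much anonymous memory the process held when it was killed.
cat > /tmp/messages.sample <<'EOF'
Jul 27 16:28:42 builder103 kernel: Out of memory: Kill process 2902 (glusterfsd) score 418 or sacrifice child
Jul 27 16:28:42 builder103 kernel: Killed process 2902 (glusterfsd) total-vm:15110552kB, anon-rss:800144kB, file-rss:0kB, shmem-rss:0kB
EOF
# Both lines matching confirms an oom-killer event against glusterfsd
grep -E 'Out of memory|Killed process' /tmp/messages.sample
# Extract the anon-rss figure (in kB) from the "Killed process" line
rss_kb=$(grep -o 'anon-rss:[0-9]*' /tmp/messages.sample | cut -d: -f2)
echo "glusterfsd anon-rss at kill time: $((rss_kb / 1024)) MB"
# prints: glusterfsd anon-rss at kill time: 781 MB
```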

Regards,
Nithya


> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref - https://build.gluster.org/job/regression-test-with-
> multiplex/814/console) -  Test fails only in brick-mux mode, AI on Atin
> to look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
> - Seems like failed just twice in last 30 days as per
> https://fstat.gluster.org/failure/251?state=2&start_
> date=2018-06-30&end_date=2018-07-31&branch=all. Need help from AFR team.
>
> tests/bugs/quota/bug-1293601.t (https://build.gluster.org/
> job/regression-test-with-multiplex/812/console) - Hasn't failed after 26
> July and earlier it was failing regularly. Did we fix this test through any
> patch (Mohit?)
>
> tests/bitrot/bug-1373520.t - (https://build.gluster.org/
> job/regression-test-with-multiplex/811/console)  - Hasn't failed after 27
> July and earlier it was failing regularly. Did we fix this test through any
> patch (Mohit?)
>
> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core;
> not sure whether brick mux is the culprit here. Ref - https://build.gluster.org/job/
> regression-test-with-multiplex/806/console . Seems to be a glustershd
> crash. Need help from AFR folks.
>
> ============================================================================
> Fails for non-brick mux case too
> ============================================================================
> tests/bugs/distribute/bug-1122443.t - Seems to be failing at my setup
> very often, without brick mux as well. Refer
> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
> There's an email in gluster-devel and a BZ 1610240 for the same.
>
> tests/bugs/bug-1368312.t - Seems to be a new failure (
> https://build.gluster.org/job/regression-test-with-multiplex/815/console),
> however it has been seen for a non-brick-mux case too
> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
> . Need some eyes from AFR folks.
>
> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
> mux, have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/392?state=2&start_
> date=2018-06-30&end_date=2018-07-31&branch=all . We need help from the
> geo-rep devs to root-cause this sooner rather than later
>
> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
> mux, have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/393?state=2&start_
> date=2018-06-30&end_date=2018-07-31&branch=all . We need help from the
> geo-rep devs to root-cause this sooner rather than later
>
> tests/bugs/glusterd/validating-server-quorum.t (https://build.gluster.org/
> job/regression-test-with-multiplex/810/console) - Fails for non-brick-mux
> cases too, https://fstat.gluster.org/failure/580?state=2&start_
> date=2018-06-30&end_date=2018-07-31&branch=all . Atin has a patch
> https://review.gluster.org/20584 which resolves it, but the patch is
> failing regression for a different, unrelated test.
>
> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
> (Ref - https://build.gluster.org/job/regression-test-with-
> multiplex/809/console) - fails for non brick mux case too -
> https://build.gluster.org/job/regression-test-burn-in/4049/consoleText -
> Need some eyes from AFR folks.
>
> _______________________________________________
> maintainers mailing list
> maintainers at gluster.org
> https://lists.gluster.org/mailman/listinfo/maintainers
>
>