[Bugs] [Bug 1709959] Gluster causing Kubernetes containers to enter crash loop with 'mkdir ... file exists' error message

bugzilla at redhat.com bugzilla at redhat.com
Tue May 14 19:09:21 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1709959



--- Comment #5 from Jeff Bischoff <jeff.bischoff at turbonomic.com> ---
Looking at glusterd.log, it seems everything had been running for over a day
with no log messages when we suddenly hit this:

    got disconnect from stale rpc on /var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick

Here's the context for that snippet. The lines from 05/06 were logged during
brick startup, while the lines from 05/07 are from when the problem started.

====
    [2019-05-06 02:18:00.292652] I [glusterd-utils.c:6090:glusterd_brick_start] 0-management: starting a fresh brick process for brick /var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick
    The message "W [MSGID: 101095] [xlator.c:181:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/4.1.7/xlator/nfs/server.so: cannot open shared object file: No such file or directory" repeated 12 times between [2019-05-06 02:17:49.214270] and [2019-05-06 02:17:59.537241]
    [2019-05-06 02:18:00.474120] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick on port 49169
    [2019-05-06 02:18:00.477708] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
    [2019-05-06 02:18:00.507596] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
    [2019-05-06 02:18:00.507662] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped
    [2019-05-06 02:18:00.507682] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
    [2019-05-06 02:18:00.511313] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
    [2019-05-06 02:18:00.511386] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped
    [2019-05-06 02:18:00.513396] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
    [2019-05-06 02:18:00.513503] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped
    [2019-05-06 02:18:00.534304] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2c9a) [0x7f795f17fc9a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2765) [0x7f795f17f765] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f79643180f5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_d0a0dcf9903e236f68a3933c3060ec5a --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
    [2019-05-06 02:18:00.582971] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2c9a) [0x7f795f17fc9a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe26c3) [0x7f795f17f6c3] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f79643180f5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_d0a0dcf9903e236f68a3933c3060ec5a --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
    The message "W [MSGID: 101095] [xlator.c:452:xlator_dynload] 0-xlator: /usr/lib64/glusterfs/4.1.7/xlator/nfs/server.so: cannot open shared object file: No such file or directory" repeated 76 times between [2019-05-06 02:16:52.212662] and [2019-05-06 02:17:58.606533]
    [2019-05-07 11:53:38.663362] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x3a7a5) [0x7f795f0d77a5] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2765) [0x7f795f17f765] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f79643180f5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/stop/pre/S29CTDB-teardown.sh --volname=vol_d0a0dcf9903e236f68a3933c3060ec5a --last=no
    [2019-05-07 11:53:38.905338] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x3a7a5) [0x7f795f0d77a5] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe26c3) [0x7f795f17f6c3] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f79643180f5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/stop/pre/S30samba-stop.sh --volname=vol_d0a0dcf9903e236f68a3933c3060ec5a --last=no
    [2019-05-07 11:53:38.982785] I [MSGID: 106542] [glusterd-utils.c:8253:glusterd_brick_signal] 0-glusterd: sending signal 15 to brick with pid 8951
    [2019-05-07 11:53:39.983244] I [MSGID: 106143] [glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick /var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick on port 49169
    [2019-05-07 11:53:39.984656] W [glusterd-handler.c:6124:__glusterd_brick_rpc_notify] 0-management: got disconnect from stale rpc on /var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick
    [2019-05-07 11:53:40.316466] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
    [2019-05-07 11:53:40.316601] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped
    [2019-05-07 11:53:40.316644] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
    [2019-05-07 11:53:40.319650] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
    [2019-05-07 11:53:40.319708] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped
    [2019-05-07 11:53:40.321091] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
    [2019-05-07 11:53:40.321132] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped
====
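
For what it's worth, the excerpt shows the stop/pre hooks running and glusterd
sending signal 15 to the brick (pid 8951) right before the stale-rpc
disconnect, so it looks as though something issued a brick/volume stop. A
minimal check I can think of, assuming the stock log layout inside the gluster
pod (the grep pattern is just illustrative):

    # glusterd records CLI/volume operations in cmd_history.log; look for
    # anything issued around the time the brick was stopped on 05/07
    grep "2019-05-07 11:5" /var/log/glusterfs/cmd_history.log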

What would cause it to go stale? What is actually going stale here? Where
should I look next? I am using whatever is built into the gluster-centos:latest
image from Docker Hub.
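
In case it helps, here is a minimal sketch of the other places I can check
from inside the gluster pod (paths assume the stock gluster-centos layout and
reuse the volume name and brick hash from the log above; adjust as needed):

    # confirm the GlusterFS build shipped in the image (the log paths suggest 4.1.7)
    gluster --version

    # management-side view of the volume and its brick processes
    gluster volume status vol_d0a0dcf9903e236f68a3933c3060ec5a

    # brick-side log for the brick that got the stale-rpc disconnect
    # (brick logs are named after the brick path, with '/' replaced by '-')
    less /var/log/glusterfs/bricks/*brick_d0456279568a623a16a5508daa89b4d5*.log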
