[Bugs] [Bug 1709959] Gluster causing Kubernetes containers to enter crash loop with 'mkdir ... file exists' error message

bugzilla at redhat.com
Thu May 16 19:21:02 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1709959



--- Comment #7 from Jeff Bischoff <jeff.bischoff at turbonomic.com> ---
Update: Looking at the logs chronologically, I first see a failure in the brick
and then a few seconds later the volume shuts down (we have only one brick per
volume):

From the Brick log
------------------
[2019-05-08 13:48:33.642605] W [MSGID: 113075]
[posix-helpers.c:1895:posix_fs_health_check] 0-heketidbstorage-posix:
aio_write() on
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_a16f9f0374fe5db948a60a017a3f5e60/brick/.glusterfs/health_check
returned [Resource temporarily unavailable]
[2019-05-08 13:48:33.749246] M [MSGID: 113075]
[posix-helpers.c:1962:posix_health_check_thread_proc] 0-heketidbstorage-posix:
health-check failed, going down
[2019-05-08 13:48:34.000428] M [MSGID: 113075]
[posix-helpers.c:1981:posix_health_check_thread_proc] 0-heketidbstorage-posix:
still alive! -> SIGTERM
[2019-05-08 13:49:04.597061] W [glusterfsd.c:1514:cleanup_and_exit]
(-->/lib64/libpthread.so.0(+0x7dd5) [0x7f16fdd94dd5]
-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x556e53da2d65]
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x556e53da2b8b] ) 0-: received
signum (15), shutting down
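
For context, what I see in the brick log is Gluster's posix health-check thread
(posix_health_check_thread_proc) failing its periodic aio_write() probe on the
brick's .glusterfs/health_check file and then deliberately SIGTERM-ing the brick
process. A minimal sketch of that probe-and-terminate pattern, purely
illustrative (this is not Gluster source; the path and interval below are
placeholders):

    # Simplified model of the behaviour in the brick log above; not Gluster code.
    # Periodically write to a health_check file; if the write fails (e.g. EAGAIN,
    # "Resource temporarily unavailable"), log "going down" and terminate the
    # process with SIGTERM -- which is what takes the brick offline.
    import os
    import signal
    import time

    HEALTH_CHECK_FILE = "/path/to/brick/.glusterfs/health_check"  # placeholder
    PROBE_INTERVAL = 30  # seconds; placeholder, not necessarily Gluster's default

    def health_check_loop():
        while True:
            try:
                with open(HEALTH_CHECK_FILE, "w") as probe:
                    probe.write(str(time.time()))  # stands in for aio_write()
                    probe.flush()
                    os.fsync(probe.fileno())
            except OSError as err:
                print("health-check failed (%s), going down" % err)
                os.kill(os.getpid(), signal.SIGTERM)  # brick shuts itself down
                return
            time.sleep(PROBE_INTERVAL)

    if __name__ == "__main__":
        health_check_loop()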


From the GlusterD log
---------------------
[2019-05-08 13:49:04.673536] I [MSGID: 106143]
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_a16f9f0374fe5db948a60a017a3f5e60/brick
on port 49152
[2019-05-08 13:49:05.003848] W [socket.c:599:__socket_rwv] 0-management: readv
on /var/run/gluster/fe4ac75011a4de0e.socket failed (No data available)


This same pattern repeats for all the bricks/volumes. Most of them go offline
within a second of the first one. The stragglers go offline within the next 30
minutes.

My interpretation of these logs is that the socket Gluster is using times out.
Do I need to increase 'network.ping-timeout' or 'client.grace-timeout' to
address this?
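
If so, my understanding is that both are volume options that can be changed
with "gluster volume set". A hedged sketch of what I would try for the affected
volume -- the volume name and the values below are placeholders, not
recommendations:

    # Sketch only: raise the two timeouts mentioned above via the gluster CLI.
    # The volume name and the values are placeholders.
    import subprocess

    VOLUME = "heketidbstorage"  # volume named in the brick log

    for option, value in [("network.ping-timeout", "60"),
                          ("client.grace-timeout", "60")]:
        subprocess.run(["gluster", "volume", "set", VOLUME, option, value],
                       check=True)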

What really puzzles me is why the brick stays offline after the timeout. After
all, the resource was only "temporarily" unavailable. Shouldn't Gluster be able
to recover from this error without user intervention?
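
As far as I know, an offline brick can be restarted manually with
"gluster volume start <VOLNAME> force". A sketch of scripting that across the
affected volumes, with the volume list as a placeholder:

    # Sketch only: force-start volumes whose bricks have gone offline, using the
    # standard gluster CLI. The volume list is a placeholder.
    import subprocess

    for volume in ["heketidbstorage"]:  # plus the other affected volumes
        subprocess.run(["gluster", "volume", "start", volume, "force"],
                       check=True)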

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

