[Gluster-users] Gluster Periodic Brick Process Deaths

Ben Tasker btasker at swiftserve.com
Tue Dec 10 18:52:58 UTC 2019


Hi,

A little while ago we had an issue with Gluster 6. As it was urgent, we
downgraded to Gluster 5.9 and the issue went away.

Some boxes are now running 5.10 and the issue has come back.

From the operator's point of view, the first you know about this is getting
reports that the transport endpoint is not connected:

OSError: [Errno 107] Transport endpoint is not connected:
'/shared/lfd/benfusetestlfd'


If we check, we can see that the brick process has died

# gluster volume status
Status of volume: shared
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick fa01.gl:/data1/gluster                N/A       N/A        N       N/A
Brick fa02.gl:/data1/gluster                N/A       N/A        N       N/A
Brick fa01.gl:/data2/gluster                49153     0          Y       14136
Brick fa02.gl:/data2/gluster                49153     0          Y       14154
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       186193
NFS Server on fa01.gl                       N/A       N/A        N       N/A
Self-heal Daemon on fa01.gl                 N/A       N/A        Y       6723


Looking in the brick logs, we can see that the process crashed, and we get
a backtrace (of sorts)

>gen=110, slot->fd=17
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-07-04 09:42:43
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.1
/lib64/libglusterfs.so.0(+0x26db0)[0x7f79984eadb0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f79984f57b4]
/lib64/libc.so.6(+0x36280)[0x7f7996b2a280]
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f798c8af4cc]
/lib64/libglusterfs.so.0(+0x8c286)[0x7f7998550286]
/lib64/libpthread.so.0(+0x7dd5)[0x7f799732add5]
/lib64/libc.so.6(clone+0x6d)[0x7f7996bf1ead]
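
In case it helps anyone dig into this: all we have is the offsets above, but
they can in principle be turned into source lines with addr2line against the
matching debuginfo, or with gdb against a core file if one gets captured. A
rough sketch (assumes the glusterfs-debuginfo package for this exact build is
installed; the core path is just a placeholder):

# Resolve the crashing frame inside socket.so (offset 0xa4cc taken from the trace above)
addr2line -f -e /usr/lib64/glusterfs/6.1/rpc-transport/socket.so 0xa4cc

# Or, with a core file, pull a full backtrace out of the dead brick process
gdb /usr/sbin/glusterfsd /path/to/core -ex 'bt full' -ex quit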


Other than that, there's not a lot in the logs. In syslog we can see the
client (Gluster's FS is mounted on the boxes) complaining that the brick's
gone away.
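
For the next occurrence, something like the below should get a proper core to
look at rather than this limited trace (a sketch for an EL7/systemd setup;
glusterd spawns the brick processes, so they inherit the limit):

# Allow glusterd (and the brick processes it spawns) to dump core
mkdir -p /etc/systemd/system/glusterd.service.d
cat > /etc/systemd/system/glusterd.service.d/core.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
systemctl daemon-reload

# Put cores somewhere predictable, tagged with executable name, pid and timestamp
mkdir -p /var/crash
sysctl -w kernel.core_pattern=/var/crash/core.%e.%p.%t

glusterd (and therefore the bricks) would need restarting at a quiet moment
for the new limit to take effect.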

Software versions (for when this was happening with 6):

# rpm -qa | grep glus
glusterfs-libs-6.1-1.el7.x86_64
glusterfs-cli-6.1-1.el7.x86_64
centos-release-gluster6-1.0-1.el7.centos.noarch
glusterfs-6.1-1.el7.x86_64
glusterfs-api-6.1-1.el7.x86_64
glusterfs-server-6.1-1.el7.x86_64
glusterfs-client-xlators-6.1-1.el7.x86_64
glusterfs-fuse-6.1-1.el7.x86_64


This was happening pretty regularly (uncomfortably so) on boxes running
Gluster 6. Grepping through the brick logs, it's always a segfault or SIGABRT
that leads to brick death:

# grep "signal received:" data*
data1-gluster.log:signal received: 11
data1-gluster.log:signal received: 6
data1-gluster.log:signal received: 6
data1-gluster.log:signal received: 11
data2-gluster.log:signal received: 6

There's no apparent correlation in times or usage levels that we could see.
The issue was occurring on a wide array of hardware, spread across the
globe (but always talking to local - i.e. LAN - peers). All the same, disks
were checked, RAM was checked, etc.

Digging through the logs, we were able to find the lines logged just as the
crash occurred:

[2019-07-07 06:37:00.213490] I [MSGID: 108031]
[afr-common.c:2547:afr_local_discovery_cbk] 0-shared-replicate-1:
selecting local read_child shared-client-2
[2019-07-07 06:37:03.544248] E [MSGID: 108008]
[afr-transaction.c:2877:afr_write_txn_refresh_done]
0-shared-replicate-1: Failing SETATTR on gfid
a9565e4b-9148-4969-91e8-ba816aea8f6a: split-brain observed.
[Input/output error]
[2019-07-07 06:37:03.544312] W [MSGID: 0]
[dht-inode-write.c:1156:dht_non_mds_setattr_cbk] 0-shared-dht:
subvolume shared-replicate-1 returned -1
[2019-07-07 06:37:03.545317] E [MSGID: 108008]
[afr-transaction.c:2877:afr_write_txn_refresh_done]
0-shared-replicate-1: Failing SETATTR on gfid
a8dd2910-ff64-4ced-81ef-01852b7094ae: split-brain observed.
[Input/output error]
[2019-07-07 06:37:03.545382] W [fuse-bridge.c:1583:fuse_setattr_cbk]
0-glusterfs-fuse: 2241437: SETATTR() /lfd/benfusetestlfd/_logs => -1
(Input/output error)

But it's not the first time that had occurred, so it may be completely
unrelated.
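
If the split-brain errors are related, they should be visible independently
of the crash via the heal info (using the volume name from the status output
above):

# List any entries the volume currently considers to be in split-brain
gluster volume heal shared info split-brain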

When this happens, restarting Gluster buys some time. It may just be
coincidental, but our searches through the logs showed *only* the first
brick process dying; processes for other bricks (some of the boxes have 4)
don't appear to be affected by this.
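
(For clarity, "restarting Gluster" here means something along these lines -
respawning the dead brick process rather than fixing anything:)

# Respawn any brick processes that have died, leaving the healthy ones alone
gluster volume start shared force

# Or, more heavy-handed: restart the management daemon on the affected box
systemctl restart glusterd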

As we had lots and lots of Gluster machines failing across the network, at
this point we stopped investigating and I came up with a downgrade
procedure so that we could get production back into a usable state.
Machines running Gluster 6 were downgraded to Gluster 5.9 and the issue
just went away. Unfortunately other demands came up, so no-one was able to
follow up on it.

Tonight, though, there's been a brick process failure on a 5.10 machine with
an all too familiar looking BT:

[2019-12-10 17:20:01.708601] I [MSGID: 115029]
[server-handshake.c:537:server_setvolume] 0-shared-server: accepted
client from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
(version: 5.10)
[2019-12-10 17:20:01.745940] I [MSGID: 115036]
[server.c:469:server_rpc_notify] 0-shared-server: disconnecting
connection from
CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
[2019-12-10 17:20:01.746090] I [MSGID: 101055]
[client_t.c:435:gf_client_unref] 0-shared-server: Shutting down
connection CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-12-10 17:21:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.10
/lib64/libglusterfs.so.0(+0x26650)[0x7f6a1c6f3650]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f6a1c6fdc04]
/lib64/libc.so.6(+0x363b0)[0x7f6a1ad543b0]
/usr/lib64/glusterfs/5.10/rpc-transport/socket.so(+0x9e3b)[0x7f6a112dae3b]
/lib64/libglusterfs.so.0(+0x8aab9)[0x7f6a1c757ab9]
/lib64/libpthread.so.0(+0x7e65)[0x7f6a1b556e65]
/lib64/libc.so.6(clone+0x6d)[0x7f6a1ae1c88d]
---------


Versions this time are

# rpm -qa | grep glus
glusterfs-server-5.10-1.el7.x86_64
centos-release-gluster5-1.0-1.el7.centos.noarch
glusterfs-fuse-5.10-1.el7.x86_64
glusterfs-libs-5.10-1.el7.x86_64
glusterfs-client-xlators-5.10-1.el7.x86_64
glusterfs-api-5.10-1.el7.x86_64
glusterfs-5.10-1.el7.x86_64
glusterfs-cli-5.10-1.el7.x86_64


These boxes have been running 5.10 for less than 48 hours

Has anyone else run into this? Assuming the root cause is the same (it's a
fairly limited BT, so it's hard to say for sure), was something from 6
backported into 5.10?

Thanks

Ben