<div dir="ltr">Hi,<div><br></div><div>A little while ago we had an issue with Gluster 6. As it was urgent we downgraded to Gluster 5.9 and it went away.</div><div><br>Some boxes are now running 5.10 and the issue has come back.</div><div><br></div><div>From the operator&#39;s point of view, the first sign of trouble is a report that the transport endpoint is not connected:</div><div><br></div><div><pre style="border-radius:3px;margin-top:9px;margin-bottom:9px;border:1px solid rgb(204,204,204);background:rgb(245,245,245) none repeat scroll 0% 0%;font-size:12px;line-height:1.33333;padding:9px 12px;word-break:normal;max-height:30em;overflow:auto;color:rgb(0,0,0)">OSError: [Errno 107] Transport endpoint is not connected: &#39;/shared/lfd/benfusetestlfd&#39;</pre></div><div><br></div><div>If we check, we can see that the brick process has died:</div><div><pre style="border-radius:3px;margin-top:9px;margin-bottom:9px;border:1px solid rgb(204,204,204);background:rgb(245,245,245) none repeat scroll 0% 0%;font-size:12px;line-height:1.33333;padding:9px 12px;word-break:normal;max-height:30em;overflow:auto;color:rgb(0,0,0)"># gluster volume status

Status of volume: shared
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick fa01.gl:/data1/gluster                N/A       N/A        N       N/A  
Brick fa02.gl:/data1/gluster                N/A       N/A        N       N/A  
Brick fa01.gl:/data2/gluster                49153     0          Y       14136
Brick fa02.gl:/data2/gluster                49153     0          Y       14154
NFS Server on localhost                     N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       186193
NFS Server on <a href="http://fa01.gl">fa01.gl</a>                       N/A       N/A        N       N/A  
Self-heal Daemon on <a href="http://fa01.gl">fa01.gl</a>                 N/A       N/A        Y       6723  </pre></div><div><br></div><div>Looking in the brick logs, we can see that the process crashed, and we get a backtrace (of sorts)</div><div><pre style="border-radius:3px;margin-top:9px;margin-bottom:9px;border:1px solid rgb(204,204,204);background:rgb(245,245,245) none repeat scroll 0% 0%;font-size:12px;line-height:1.33333;padding:9px 12px;word-break:normal;max-height:30em;overflow:auto;color:rgb(0,0,0)">&gt;gen=110, slot-&gt;fd=17

pending frames:
patchset: git://<a href="http://git.gluster.org/glusterfs.git">git.gluster.org/glusterfs.git</a>
signal received: 11
time of crash: 
2019-07-04 09:42:43
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.1
/lib64/libglusterfs.so.0(+0x26db0)[0x7f79984eadb0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f79984f57b4]
/lib64/libc.so.6(+0x36280)[0x7f7996b2a280]
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f798c8af4cc]
/lib64/libglusterfs.so.0(+0x8c286)[0x7f7998550286]
/lib64/libpthread.so.0(+0x7dd5)[0x7f799732add5]
/lib64/libc.so.6(clone+0x6d)[0x7f7996bf1ead]
</pre><br style="color:rgb(0,0,0);font-family:monospace;font-size:medium"></div><div>Other than that, there&#39;s not a lot in the logs. In syslog we can see the client (the Gluster filesystem is mounted on the boxes) complaining that the brick has gone away.</div><div><br></div><div>Software versions (from when this was happening with Gluster 6):</div><div><pre style="border-radius:3px;margin-top:9px;margin-bottom:9px;border:1px solid rgb(204,204,204);background:rgb(245,245,245) none repeat scroll 0% 0%;font-size:12px;line-height:1.33333;padding:9px 12px;word-break:normal;max-height:30em;overflow:auto;color:rgb(0,0,0)"># rpm -qa | grep glus

glusterfs-libs-6.1-1.el7.x86_64
glusterfs-cli-6.1-1.el7.x86_64
centos-release-gluster6-1.0-1.el7.centos.noarch
glusterfs-6.1-1.el7.x86_64
glusterfs-api-6.1-1.el7.x86_64
glusterfs-server-6.1-1.el7.x86_64
glusterfs-client-xlators-6.1-1.el7.x86_64
glusterfs-fuse-6.1-1.el7.x86_64</pre></div><div><br></div><div>This was happening pretty regularly (uncomfortably so) on boxes running Gluster 6. Grepping through the brick logs, it&#39;s always a SIGSEGV (signal 11) or SIGABRT (signal 6) that leads to brick death:</div><div><pre style="border-radius:3px;margin-top:9px;margin-bottom:9px;border:1px solid rgb(204,204,204);background:rgb(245,245,245) none repeat scroll 0% 0%;font-size:12px;line-height:1.33333;padding:9px 12px;word-break:normal;max-height:30em;overflow:auto;color:rgb(0,0,0)"># grep &quot;signal received:&quot; data*

data1-gluster.log:signal received: 11
data1-gluster.log:signal received: 6
data1-gluster.log:signal received: 6
data1-gluster.log:signal received: 11
data2-gluster.log:signal received: 6</pre></div><div>We could see no apparent correlation with time of day or usage levels. The issue was occurring on a wide array of hardware, spread across the globe (but always talking to local, i.e. LAN, peers). All the same, disks and RAM were checked.<br></div><div><br></div><div>Digging through the logs, we were able to find the lines logged just as the crash occurs:</div><div><pre style="border-radius:3px;margin-top:9px;margin-bottom:9px;border:1px solid rgb(204,204,204);background:rgb(245,245,245) none repeat scroll 0% 0%;font-size:12px;line-height:1.33333;padding:9px 12px;word-break:normal;max-height:30em;overflow:auto;color:rgb(0,0,0)">[2019-07-07 06:37:00.213490] I [MSGID: 108031] [afr-common.c:2547:afr_local_discovery_cbk] 0-shared-replicate-1: selecting local read_child shared-client-2

[2019-07-07 06:37:03.544248] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a9565e4b-9148-4969-91e8-ba816aea8f6a: split-brain observed. [Input/output error]

[2019-07-07 06:37:03.544312] W [MSGID: 0] [dht-inode-write.c:1156:dht_non_mds_setattr_cbk] 0-shared-dht: subvolume shared-replicate-1 returned -1

[2019-07-07 06:37:03.545317] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a8dd2910-ff64-4ced-81ef-01852b7094ae: split-brain observed. [Input/output error]

[2019-07-07 06:37:03.545382] W [fuse-bridge.c:1583:fuse_setattr_cbk] 0-glusterfs-fuse: 2241437: SETATTR() /lfd/benfusetestlfd/_logs =&gt; -1 (Input/output error)</pre></div><div>However, that wasn&#39;t the first time those errors had occurred, so they may be completely unrelated.</div><div><br></div><div>When this happens, restarting Gluster buys some time. It may just be coincidental, but our searches through the logs showed <i>only</i> the first brick process dying; processes for other bricks (some of the boxes have 4) don&#39;t appear to be affected.</div><div><br></div><div>As we had lots and lots of Gluster machines failing across the network, at this point we stopped investigating and I came up with a downgrade procedure so that we could get production back into a usable state. Machines running Gluster 6 were downgraded to Gluster 5.9 and the issue just went away. Unfortunately other demands came up, so no-one was able to follow up on it.</div><div><br></div><div>Tonight, though, a brick process failed on a 5.10 machine with an all-too-familiar-looking backtrace:</div><div><br></div><div><pre style="border-radius:3px;margin-top:9px;margin-bottom:9px;border:1px solid rgb(204,204,204);background:rgb(245,245,245) none repeat scroll 0% 0%;font-size:12px;line-height:1.33333;padding:9px 12px;word-break:normal;max-height:30em;overflow:auto;color:rgb(0,0,0)">[2019-12-10 17:20:01.708601] I [MSGID: 115029] [server-handshake.c:537:server_setvolume] 0-shared-server: accepted client from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0 (version: 5.10)

[2019-12-10 17:20:01.745940] I [MSGID: 115036] [server.c:469:server_rpc_notify] 0-shared-server: disconnecting connection from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0

[2019-12-10 17:20:01.746090] I [MSGID: 101055] [client_t.c:435:gf_client_unref] 0-shared-server: Shutting down connection CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0

pending frames:
patchset: git://<a href="http://git.gluster.org/glusterfs.git">git.gluster.org/glusterfs.git</a>
signal received: 11
time of crash: 
2019-12-10 17:21:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.10
/lib64/libglusterfs.so.0(+0x26650)[0x7f6a1c6f3650]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f6a1c6fdc04]
/lib64/libc.so.6(+0x363b0)[0x7f6a1ad543b0]
/usr/lib64/glusterfs/5.10/rpc-transport/socket.so(+0x9e3b)[0x7f6a112dae3b]
/lib64/libglusterfs.so.0(+0x8aab9)[0x7f6a1c757ab9]
/lib64/libpthread.so.0(+0x7e65)[0x7f6a1b556e65]
/lib64/libc.so.6(clone+0x6d)[0x7f6a1ae1c88d]
---------</pre></div><div><br></div><div>Versions this time are</div><div><pre style="border-radius:3px;margin-top:9px;margin-bottom:9px;border:1px solid rgb(204,204,204);background:rgb(245,245,245) none repeat scroll 0% 0%;font-size:12px;line-height:1.33333;padding:9px 12px;word-break:normal;max-height:30em;overflow:auto;color:rgb(0,0,0)"># rpm -qa | grep glus

glusterfs-server-5.10-1.el7.x86_64
centos-release-gluster5-1.0-1.el7.centos.noarch
glusterfs-fuse-5.10-1.el7.x86_64
glusterfs-libs-5.10-1.el7.x86_64
glusterfs-client-xlators-5.10-1.el7.x86_64
glusterfs-api-5.10-1.el7.x86_64
glusterfs-5.10-1.el7.x86_64
glusterfs-cli-5.10-1.el7.x86_64</pre><br></div><div>These boxes have been running 5.10 for less than 48 hours.</div><div><br></div><div>Has anyone else run into this? Assuming the root cause is the same (it&#39;s a fairly limited backtrace, so it&#39;s hard to say for sure), was something from 6 backported into 5.10?</div><div><br></div><div>Thanks</div><div><br></div><div>Ben</div></div>
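<div><br></div><div>P.S. In case it helps anyone checking their own boxes for the same thing, here is a rough sketch of the signal tally done with grep above. It is an illustration only: the names tally_crash_signals, SIGNAL_NAMES and SAMPLE_LINES are hypothetical (nothing here ships with Gluster), and the signal numbers assume Linux on x86_64. The sample lines are the grep matches quoted earlier in this mail.</div>

```python
import re
from collections import Counter

# Assumption: Linux x86_64 signal numbering (11 = SIGSEGV, 6 = SIGABRT).
SIGNAL_NAMES = {6: "SIGABRT", 11: "SIGSEGV"}

# Matches the crash-dump line whether it comes straight from a brick log
# or carries a "filename:" prefix added by grep.
SIGNAL_LINE = re.compile(r"signal received: (\d+)\s*$")


def tally_crash_signals(lines):
    """Count crash signals found in an iterable of log or grep-output lines."""
    counts = Counter()
    for line in lines:
        match = SIGNAL_LINE.search(line)
        if match is None:
            continue  # not a crash-dump signal line
        number = int(match.group(1))
        # Fall back to a generic label for signals not in the table.
        counts[SIGNAL_NAMES.get(number, f"signal {number}")] += 1
    return counts


# The grep matches quoted earlier in this mail.
SAMPLE_LINES = [
    "data1-gluster.log:signal received: 11",
    "data1-gluster.log:signal received: 6",
    "data1-gluster.log:signal received: 6",
    "data1-gluster.log:signal received: 11",
    "data2-gluster.log:signal received: 6",
]

if __name__ == "__main__":
    for name, count in sorted(tally_crash_signals(SAMPLE_LINES).items()):
        print(name, count)
```

<div>To run it against real logs, pass an open brick log file (or several chained together) instead of SAMPLE_LINES. It only counts the crash-dump lines; it says nothing about what caused them.</div>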