[Gluster-users] Gluster Periodic Brick Process Deaths

Ben Tasker btasker at swiftserve.com
Mon Jan 13 10:57:23 UTC 2020


Hi,

Just an update on this - we made our ACLs much, much stricter around
gluster ports and to my knowledge haven't seen a brick death since.
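
For anyone wanting to do something similar, the rules boil down to roughly the
following - the 10.0.0.0/24 subnet and the 49152:49251 brick range below are
placeholders for your own values (glusterd itself listens on 24007, plus 24008
if you use RDMA, and brick ports are allocated from 49152 upwards by default):

# allow Gluster's management and brick ports from trusted hosts only
iptables -A INPUT -p tcp -s 10.0.0.0/24 -m multiport --dports 24007,24008,49152:49251 -j ACCEPT
# ...and drop the same ports from everyone else
iptables -A INPUT -p tcp -m multiport --dports 24007,24008,49152:49251 -j DROP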

Ben

On Wed, Dec 11, 2019 at 12:43 PM Ben Tasker <btasker at swiftserve.com> wrote:

> Hi Xavi,
>
> Not that I'm explicitly aware of, *but* I can't rule it out, as it's
> possible some of our partners do (some/most certainly have scans run as
> part of pentests fairly regularly).
>
> But, that does at least give me an avenue to pursue in the meantime,
> thanks!
>
> Ben
>
> On Wed, Dec 11, 2019 at 12:16 PM Xavi Hernandez <jahernan at redhat.com>
> wrote:
>
>> Hi Ben,
>>
>> I've recently seen some issues that seem similar to yours (based on the
>> stack trace in the logs). Right now it seems that in these cases the
>> problem is caused by some port scanning tool that triggers an unhandled
>> condition. We are still investigating the exact cause so that we can fix it
>> as soon as possible.
>>
>> Do you have one of these tools on your network?
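>>
>> If you want to check whether that's what's triggering it, hitting one of the
>> brick ports from a test host with a scanner should be reasonably
>> representative of what those tools do - something along these lines (the
>> range here is just the default brick port range; take the real ports from
>> 'gluster volume status'):
>>
>> # SYN scan plus service/version probes against the default brick port range
>> nmap -sS -sV -p 49152-49251 <brick-host>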
>>
>> Regards,
>>
>> Xavi
>>
>> On Tue, Dec 10, 2019 at 7:53 PM Ben Tasker <btasker at swiftserve.com>
>> wrote:
>>
>>> Hi,
>>>
>>> A little while ago we had an issue with Gluster 6. As it was urgent we
>>> downgraded to Gluster 5.9 and it went away.
>>>
>>> Some boxes are now running 5.10 and the issue has come back.
>>>
>>> From the operator's point of view, the first you know about this is
>>> getting reports that the transport endpoint is not connected:
>>>
>>> OSError: [Errno 107] Transport endpoint is not connected: '/shared/lfd/benfusetestlfd'
>>>
>>>
>>> If we check, we can see that the brick process has died
>>>
>>> # gluster volume status
>>> Status of volume: shared
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick fa01.gl:/data1/gluster                N/A       N/A        N       N/A
>>> Brick fa02.gl:/data1/gluster                N/A       N/A        N       N/A
>>> Brick fa01.gl:/data2/gluster                49153     0          Y       14136
>>> Brick fa02.gl:/data2/gluster                49153     0          Y       14154
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> Self-heal Daemon on localhost               N/A       N/A        Y       186193
>>> NFS Server on fa01.gl                       N/A       N/A        N       N/A
>>> Self-heal Daemon on fa01.gl                 N/A       N/A        Y       6723
>>>
>>>
>>> Looking in the brick logs, we can see that the process crashed, and we
>>> get a backtrace (of sorts)
>>>
>>> >gen=110, slot->fd=17
>>> pending frames:
>>> patchset: git://git.gluster.org/glusterfs.git
>>> signal received: 11
>>> time of crash:
>>> 2019-07-04 09:42:43
>>> configuration details:
>>> argp 1
>>> backtrace 1
>>> dlfcn 1
>>> libpthread 1
>>> llistxattr 1
>>> setfsid 1
>>> spinlock 1
>>> epoll.h 1
>>> xattr.h 1
>>> st_atim.tv_nsec 1
>>> package-string: glusterfs 6.1
>>> /lib64/libglusterfs.so.0(+0x26db0)[0x7f79984eadb0]
>>> /lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f79984f57b4]
>>> /lib64/libc.so.6(+0x36280)[0x7f7996b2a280]
>>> /usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f798c8af4cc]
>>> /lib64/libglusterfs.so.0(+0x8c286)[0x7f7998550286]
>>> /lib64/libpthread.so.0(+0x7dd5)[0x7f799732add5]
>>> /lib64/libc.so.6(clone+0x6d)[0x7f7996bf1ead]
>>>
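>>> For what it's worth, the offsets in the trace can presumably be resolved to
>>> function names if the matching debuginfo packages are installed - e.g. for
>>> the socket.so frame above:
>>>
>>> # map the socket.so offset from the backtrace to a function / source line
>>> addr2line -f -e /usr/lib64/glusterfs/6.1/rpc-transport/socket.so 0xa4cc
>>>
>>> # or, where a core file gets written, pull a full backtrace out of it
>>> # (/path/to/core being wherever core_pattern puts them on your boxes)
>>> gdb -batch -ex 'bt full' /usr/sbin/glusterfsd /path/to/core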
>>>
>>> Other than that, there's not a lot in the logs. In syslog we can see the
>>> client (Gluster's FS is mounted on the boxes) complaining that the brick's
>>> gone away.
>>>
>>> Software versions (for when this was happening with 6):
>>>
>>> # rpm -qa | grep glus
>>> glusterfs-libs-6.1-1.el7.x86_64
>>> glusterfs-cli-6.1-1.el7.x86_64
>>> centos-release-gluster6-1.0-1.el7.centos.noarch
>>> glusterfs-6.1-1.el7.x86_64
>>> glusterfs-api-6.1-1.el7.x86_64
>>> glusterfs-server-6.1-1.el7.x86_64
>>> glusterfs-client-xlators-6.1-1.el7.x86_64
>>> glusterfs-fuse-6.1-1.el7.x86_64
>>>
>>>
>>> This was happening pretty regularly (uncomfortably so) on boxes running
>>> Gluster 6. Grepping through the brick logs, it's always a segfault or
>>> SIGABRT that leads to the brick death:
>>>
>>> # grep "signal received:" data*
>>> data1-gluster.log:signal received: 11
>>> data1-gluster.log:signal received: 6
>>> data1-gluster.log:signal received: 6
>>> data1-gluster.log:signal received: 11
>>> data2-gluster.log:signal received: 6
>>>
>>> There's no apparent correlation in times or usage levels that we could
>>> see. The issue was occurring on a wide array of hardware, spread across the
>>> globe (but always talking to local - i.e. LAN - peers). All the same, disks
>>> were checked, RAM checked etc.
>>>
>>> Digging through the logs, we were able to find the lines logged just as the
>>> crash occurs:
>>>
>>> [2019-07-07 06:37:00.213490] I [MSGID: 108031] [afr-common.c:2547:afr_local_discovery_cbk] 0-shared-replicate-1: selecting local read_child shared-client-2
>>> [2019-07-07 06:37:03.544248] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a9565e4b-9148-4969-91e8-ba816aea8f6a: split-brain observed. [Input/output error]
>>> [2019-07-07 06:37:03.544312] W [MSGID: 0] [dht-inode-write.c:1156:dht_non_mds_setattr_cbk] 0-shared-dht: subvolume shared-replicate-1 returned -1
>>> [2019-07-07 06:37:03.545317] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a8dd2910-ff64-4ced-81ef-01852b7094ae: split-brain observed. [Input/output error]
>>> [2019-07-07 06:37:03.545382] W [fuse-bridge.c:1583:fuse_setattr_cbk] 0-glusterfs-fuse: 2241437: SETATTR() /lfd/benfusetestlfd/_logs => -1 (Input/output error)
>>>
>>> But it's not the first time that had occurred, so it may be completely
>>> unrelated.
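>>>
>>> The split-brain entries themselves can be listed, and if need be resolved,
>>> from the heal CLI - e.g., using the volume and path from the log above, and
>>> latest-mtime as just one of the available resolution policies:
>>>
>>> # list files currently in split-brain on the volume
>>> gluster volume heal shared info split-brain
>>> # resolve a specific file by keeping the copy with the newest mtime
>>> gluster volume heal shared split-brain latest-mtime /lfd/benfusetestlfd/_logs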
>>>
>>> When this happens, restarting gluster buys some time. It may just be
>>> coincidental, but our searches through the logs showed *only* the first
>>> brick process dying; processes for other bricks (some of the boxes have 4)
>>> don't appear to be affected by this.
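>>>
>>> As an aside, a full service restart isn't strictly needed to get the dead
>>> brick back - a force start of the volume respawns any brick processes that
>>> aren't currently running, without touching the healthy ones:
>>>
>>> # respawns only the brick processes that are down
>>> gluster volume start shared force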
>>>
>>> As we had lots and lots of Gluster machines failing across the network,
>>> at this point we stopped investigating and I came up with a downgrade
>>> procedure so that we could get production back into a usable state.
>>> Machines running Gluster 6 were downgraded to Gluster 5.9 and the issue
>>> just went away. Unfortunately other demands came up, so no-one was able to
>>> follow up on it.
>>>
>>> Tonight, though, there's been a brick process failure on a 5.10 machine with
>>> an all-too-familiar-looking BT:
>>>
>>> [2019-12-10 17:20:01.708601] I [MSGID: 115029] [server-handshake.c:537:server_setvolume] 0-shared-server: accepted client from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0 (version: 5.10)
>>> [2019-12-10 17:20:01.745940] I [MSGID: 115036] [server.c:469:server_rpc_notify] 0-shared-server: disconnecting connection from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
>>> [2019-12-10 17:20:01.746090] I [MSGID: 101055] [client_t.c:435:gf_client_unref] 0-shared-server: Shutting down connection CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
>>> pending frames:
>>> patchset: git://git.gluster.org/glusterfs.git
>>> signal received: 11
>>> time of crash:
>>> 2019-12-10 17:21:36
>>> configuration details:
>>> argp 1
>>> backtrace 1
>>> dlfcn 1
>>> libpthread 1
>>> llistxattr 1
>>> setfsid 1
>>> spinlock 1
>>> epoll.h 1
>>> xattr.h 1
>>> st_atim.tv_nsec 1
>>> package-string: glusterfs 5.10
>>> /lib64/libglusterfs.so.0(+0x26650)[0x7f6a1c6f3650]
>>> /lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f6a1c6fdc04]
>>> /lib64/libc.so.6(+0x363b0)[0x7f6a1ad543b0]
>>> /usr/lib64/glusterfs/5.10/rpc-transport/socket.so(+0x9e3b)[0x7f6a112dae3b]
>>> /lib64/libglusterfs.so.0(+0x8aab9)[0x7f6a1c757ab9]
>>> /lib64/libpthread.so.0(+0x7e65)[0x7f6a1b556e65]
>>> /lib64/libc.so.6(clone+0x6d)[0x7f6a1ae1c88d]
>>> ---------
>>>
>>>
>>> Versions this time are:
>>>
>>> # rpm -qa | grep glus
>>> glusterfs-server-5.10-1.el7.x86_64
>>> centos-release-gluster5-1.0-1.el7.centos.noarch
>>> glusterfs-fuse-5.10-1.el7.x86_64
>>> glusterfs-libs-5.10-1.el7.x86_64
>>> glusterfs-client-xlators-5.10-1.el7.x86_64
>>> glusterfs-api-5.10-1.el7.x86_64
>>> glusterfs-5.10-1.el7.x86_64
>>> glusterfs-cli-5.10-1.el7.x86_64
>>>
>>>
>>> These boxes have been running 5.10 for less than 48 hours.
>>>
>>> Has anyone else run into this? Assuming the root cause is the same (it's a
>>> fairly limited BT, so it's hard to say for sure), was something from 6
>>> backported into 5.10?
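>>>
>>> (If it helps narrow that down, the socket-transport changes between the two
>>> release tags can be listed straight from the source tree - something like:)
>>>
>>> git clone https://github.com/gluster/glusterfs.git && cd glusterfs
>>> # commits touching the socket transport between the 5.9 and 5.10 tags
>>> git log --oneline v5.9..v5.10 -- rpc/rpc-transport/socket/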
>>>
>>> Thanks
>>>
>>> Ben