[Bugs] [Bug 1783227] New: GlusterFS brick process crash

bugzilla at redhat.com bugzilla at redhat.com
Fri Dec 13 11:38:06 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1783227

            Bug ID: 1783227
           Summary: GlusterFS brick process crash
           Product: GlusterFS
           Version: 7
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: rpc
          Keywords: Triaged
          Severity: urgent
          Priority: urgent
          Assignee: bugs at gluster.org
          Reporter: jahernan at redhat.com
                CC: bugs at gluster.org, jahernan at redhat.com,
                    joe.chan at neuralt.com, pasik at iki.fi
        Depends On: 1782495
            Blocks: 1739884, 1749625
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1782495 +++

+++ This bug was initially created as a clone of Bug #1749625 +++

Description of problem:
The gluster client somehow crashes after running for a period (1~2 weeks). I
would like to know whether this is a bug in GlusterFS 6.1 or some other
problem.

First, GlusterFS was installed using the RPMs from
http://mirror.centos.org/centos/7/storage/x86_64/gluster-4.0/

client rpm:
glusterfs-6.1-1.el7.x86_64.rpm

server rpm:
glusterfs-server-6.1-1.el7.x86_64.rpm

Below is the structure of the GlusterFS in our servers:

Volume Name: k8s-volume
Type: Replicate
Volume ID: d5d673d6-a1bb-4d14-bc91-1ceab7ad761d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: hk2hp057:/gfs/k8s_pv
Brick2: hk2hp058:/gfs/k8s_pv
Brick3: hk2hp059:/gfs/k8s_pv
Options Reconfigured:
auth.reject: 172.20.117.144
auth.allow: 172.20.117.*
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on


As shown above, there are three servers in total using a replicated volume.
Three replicas were created in pursuit of better reliability and data
redundancy, in line with the official recommendation.
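
For reference, a volume with this layout would be created roughly as follows
(a sketch reconstructed from the volume info above; the exact commands and
options used may have differed):

# create the 3-way replicated volume across the three bricks
gluster volume create k8s-volume replica 3 transport tcp \
    hk2hp057:/gfs/k8s_pv hk2hp058:/gfs/k8s_pv hk2hp059:/gfs/k8s_pv
gluster volume start k8s-volume

# apply the options listed under "Options Reconfigured"
gluster volume set k8s-volume auth.allow '172.20.117.*'
gluster volume set k8s-volume auth.reject 172.20.117.144
gluster volume set k8s-volume nfs.disable on
gluster volume set k8s-volume performance.client-io-threads on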


Version-Release number of selected component (if applicable):


However, the gluster process goes down after running for a period. This is
easy to see by executing the "gluster" command - $ gluster volume status

Status of volume: k8s-volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hk2hp057:/gfs/k8s_pv                  49152     0          Y       25141
Brick hk2hp058:/gfs/k8s_pv                  N/A       N/A        N       N/A
Brick hk2hp059:/gfs/k8s_pv                  49152     0          Y       6031
Self-heal Daemon on localhost               N/A       N/A        Y       6065
Self-heal Daemon on hk2hp057                N/A       N/A        Y       25150
Self-heal Daemon on hk2hp059                N/A       N/A        Y       6048

Task Status of Volume k8s-volume
------------------------------------------------------------------------------
There are no active volume tasks

The brick on "hk2hp058" is offline.
When I check the log file under "/var/log/glusterfs/bricks/", the messages are
as below:

[2019-09-04 07:33:50.004661] W [MSGID: 113117]
[posix-metadata.c:627:posix_set_ctime] 0-k8s-volume-posix: posix set mdata
failed, No ctime :
/gfs/k8s_pv/.glusterfs/dd/56/dd56eca4-e7e8-4208-b589-763a428408e1
gfid:dd56eca4-e7e8-4208-b589-763a428408e1
[2019-09-04 07:33:50.004776] W [MSGID: 113117]
[posix-metadata.c:627:posix_set_ctime] 0-k8s-volume-posix: posix set mdata
failed, No ctime :
/gfs/k8s_pv/.glusterfs/dd/56/dd56eca4-e7e8-4208-b589-763a428408e1
gfid:dd56eca4-e7e8-4208-b589-763a428408e1
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-09-04 10:53:02
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.1
/lib64/libglusterfs.so.0(+0x26db0)[0x7fa1fa2d1db0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fa1fa2dc7b4]
/lib64/libc.so.6(+0x36340)[0x7fa1f8911340]
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7fa1ee6964cc]
/lib64/libglusterfs.so.0(+0x8c286)[0x7fa1fa337286]
/lib64/libpthread.so.0(+0x7dd5)[0x7fa1f9111dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fa1f89d902d]
---------

The glusterfs process is restored after manually restarting the "glusterd"
process.
[root at hk2hp058 /var/log/glusterfs/bricks] systemctl restart glusterd
[root at hk2hp058 /var/log/glusterfs/bricks] gluster volume status
Status of volume: k8s-volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hk2hp057:/gfs/k8s_pv                  49152     0          Y       25141
Brick hk2hp058:/gfs/k8s_pv                  49152     0          Y       18814
Brick hk2hp059:/gfs/k8s_pv                  49152     0          Y       6031
Self-heal Daemon on localhost               N/A       N/A        Y       18846
Self-heal Daemon on hk2hp059                N/A       N/A        Y       6048
Self-heal Daemon on hk2hp057                N/A       N/A        Y       25150

Task Status of Volume k8s-volume
------------------------------------------------------------------------------
There are no active volume tasks
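
Note that restarting glusterd, as above, respawns any brick process that has
died on that node. Assuming glusterd itself is still running, an alternative
that restarts only the offline brick is a forced volume start (a sketch, not
what was actually run here):

gluster volume start k8s-volume force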

Thus, is this issue caused by a bug or by something else?

--- Additional comment from  on 2019-09-06 05:07:07 CEST ---



--- Additional comment from  on 2019-09-06 05:07:35 CEST ---



--- Additional comment from Nithya Balachandran on 2019-09-09 05:24:46 CEST ---

Do you have a coredump? If yes, please install the debuginfo and send the bt
for the core.

--- Additional comment from  on 2019-09-09 06:03:38 CEST ---

Thank you for your reply.
Unfortunately, I cannot find any coredump file on the servers matching the
pattern "core.*".

--- Additional comment from Nithya Balachandran on 2019-09-12 05:12:40 CEST ---

In that case, please use the steps in
https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-backtrace.sh
and the backtrace in the log file to get more information as to what happened.
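
For example, the frame lines could be copied out of the brick log and fed to
the script roughly like this (a sketch; the log and RPM paths are illustrative,
and the script expects a file containing only the frame lines):

# keep only the stack frame lines (they start with "/") from the crash block
sed -n '/^package-string:/,/^---------/p' /var/log/glusterfs/bricks/gfs-k8s_pv.log \
    | grep '^/' > /tmp/backtrace.txt

# resolve the frames against the matching debuginfo package
./print-backtrace.sh glusterfs-debuginfo-6.1-1.el7.x86_64.rpm /tmp/backtrace.txt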

--- Additional comment from  on 2019-09-12 08:46:08 CEST ---

(In reply to Nithya Balachandran from comment #5)
> In that case please use the steps in
> https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-
> backtrace.sh and the backtrace in the  log file to get more information as
> to what happened.

Thank you for your suggestion; however, I encountered an error while running
the corresponding script. Is there something wrong?

[hk2hp060 at projadm /home/projadm/user/joechan/gfs]  ./print-backtrace_v2.sh
/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64.rpm
glusterfs/bricks/20190912.log
eu-addr2line --functions
--exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/usr/lib/debug/usr/sbin/glusterfsd.debug
/usr/sbin/glusterfsd
eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
/usr/sbin/glusterfsd
eu-addr2line --functions
--exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/usr/lib/debug/usr/sbin/glusterfsd.debug
/usr/sbin/glusterfsd
eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
/usr/sbin/glusterfsd

--- Additional comment from  on 2019-09-12 08:52:57 CEST ---

(In reply to joe.chan from comment #6)
> (In reply to Nithya Balachandran from comment #5)
> > In that case please use the steps in
> > https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-
> > backtrace.sh and the backtrace in the  log file to get more information as
> > to what happened.
> 
> Thank you for your suggestion, however, I have encountered the error while
> running the corresponding script. Is there something wrong?
> 
> [hk2hp060 at projadm /home/projadm/user/joechan/gfs]  ./print-backtrace_v2.sh
> /home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64.rpm
> glusterfs/bricks/20190912.log
> eu-addr2line --functions
> --exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/
> usr/lib/debug/usr/sbin/glusterfsd.debug /usr/sbin/glusterfsd
> eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
> /usr/sbin/glusterfsd
> eu-addr2line --functions
> --exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/
> usr/lib/debug/usr/sbin/glusterfsd.debug /usr/sbin/glusterfsd
> eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
> /usr/sbin/glusterfsd

Sorry, that was my mistake. Below is the backtrace from the log file:

[hk2hp060 at projadm /home/projadm/user/joechan/gfs]  ./print-backtrace.sh
/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64.rpm
glusterfs/bricks/20190912.log
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f5061bdf4cc]
socket_is_connected inlined at
/usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2887 in
socket_event_handler
/usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2619

--- Additional comment from Nithya Balachandran on 2019-09-12 09:55:33 CEST ---

Can you confirm that you have copied the stacktrace alone to a different file
and run the script on that file? You should see one output line per line in
the stacktrace.

--- Additional comment from  on 2019-09-12 10:10:09 CEST ---

(In reply to Nithya Balachandran from comment #8)
> Can you confirm that you have copied the stacktrace alone to a different
> file and run the script on thta file? You should be seeing one line per line
> in the stacktrace.

Yes, I can confirm that the procedure is correct.

stacktrace from log:
/lib64/libglusterfs.so.0(+0x26db0)[0x7f506d81adb0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f506d8257b4]
/lib64/libc.so.6(+0x36340)[0x7f506be5a340]
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f5061bdf4cc]
/lib64/libglusterfs.so.0(+0x8c286)[0x7f506d880286]
/lib64/libpthread.so.0(+0x7dd5)[0x7f506c65add5]
/lib64/libc.so.6(clone+0x6d)[0x7f506bf2202d]

only one line while running the script:
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f5061bdf4cc]
socket_is_connected inlined at
/usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2887 in
socket_event_handler
/usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2619

--- Additional comment from  on 2019-09-12 10:35:50 CEST ---

Attached are more debug results from manually executing the "eu-addr2line"
command:

/lib64/libglusterfs.so.0(+0x26db0)[0x7f506d81adb0] _gf_msg_backtrace_nomem
/usr/src/debug/glusterfs-6.1/libglusterfs/src/logging.c:1124

/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f506d8257b4] sprintf inlined
at /usr/src/debug/glusterfs-6.1/libglusterfs/src/common-utils.c:954 in
gf_print_trace /usr/include/bits/stdio2.h:33

/lib64/libglusterfs.so.0(+0x8c286)[0x7f506d880286] event_dispatch_epoll_handler
inlined at /usr/src/debug/glusterfs-6.1/libglusterfs/src/event-epoll.c:761 in
event_dispatch_epoll_worker
/usr/src/debug/glusterfs-6.1/libglusterfs/src/event-epoll.c:648


I am not able to find "/lib64/libc.so.6" and "/lib64/libpthread.so.0" under
"glusterfs-debuginfo-6.1-1.el7.x86_64/usr/lib/".

--- Additional comment from Nithya Balachandran on 2019-09-13 05:46:41 CEST ---

libc and libpthread are standard libraries and not part of gluster.

I'm moving this to the rpc component for someone to take a look.

--- Additional comment from  on 2019-09-23 06:35:12 CEST ---

Attached the coredump files.

--- Additional comment from  on 2019-09-23 12:17:58 CEST ---

Program terminated with signal 11, Segmentation fault.
#0  socket_is_connected (this=0x7f90d80018a0) at socket.c:2619
2619        if (priv->use_ssl) {
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-260.el7_6.6.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.15.1-37.el7_6.x86_64 libacl-2.2.51-14.el7.x86_64
libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64
libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7_6.2.x86_64
libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-59.el7_6.1.x86_64
openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64
zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  socket_is_connected (this=0x7f90d80018a0) at socket.c:2619
#1  socket_event_handler (fd=5, idx=11, gen=16, data=0x7f90d80018a0, poll_in=1,
poll_out=0, poll_err=0, event_thread_died=0 '\000')
    at socket.c:2887
#2  0x00007f90ec34c286 in event_dispatch_epoll_handler (event=0x7f90d7ffee70,
event_pool=0x5632b0727560) at event-epoll.c:648
#3  event_dispatch_epoll_worker (data=0x5632b077a420) at event-epoll.c:761
#4  0x00007f90eb126dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f90ea9ee02d in clone () from /lib64/libc.so.6

--- Additional comment from Xavi Hernandez on 2019-12-09 17:16:33 CET ---

@joe, do you have any port scanning tool that could try to probe gluster ports
?

--- Additional comment from  on 2019-12-11 04:41:04 CET ---

(In reply to Xavi Hernandez from comment #14)
> @joe, do you have any port scanning tool that could try to probe gluster
> ports ?

Yes, I have used a security scanner called Rapid7 to ensure the security
hardening of the whole system. I think the root cause of this crash is most
likely the scanning, as the crash can be reproduced that way.
Thank you.
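
For reference, such a scan amounts to a bare TCP connect to the brick port with
no RPC handshake, so something along these lines should exercise the same code
path (a sketch of the suspected trigger, not a verified reproducer; 49152 is
the brick port reported by "gluster volume status"):

# probe the brick port the way a scanner would
nmap -sT -p 49152 hk2hp058

# or simply open and immediately close a plain TCP connection
nc -z hk2hp058 49152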

--- Additional comment from Worker Ant on 2019-12-11 18:37:47 CET ---

REVIEW: https://review.gluster.org/23861 (socket: fix error handling) posted
(#1) for review on master by Xavi Hernandez

--- Additional comment from Xavi Hernandez on 2019-12-11 18:44:36 CET ---

I've found the problem and have sent a patch to fix it. As soon as it is
accepted, I will backport it so that it is fixed in the next release.


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1739884
[Bug 1739884] glusterfsd process crashes with SIGSEGV
https://bugzilla.redhat.com/show_bug.cgi?id=1749625
[Bug 1749625] [GlusterFS 6.1] GlusterFS brick process crash
https://bugzilla.redhat.com/show_bug.cgi?id=1782495
[Bug 1782495] GlusterFS brick process crash
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

