[Bugs] [Bug 1720733] New: glusterfs 4.1.7 client crash

bugzilla at redhat.com
Fri Jun 14 17:50:10 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1720733

            Bug ID: 1720733
           Summary: glusterfs 4.1.7 client crash
           Product: GlusterFS
           Version: 4.1
                OS: Linux
            Status: NEW
         Component: libglusterfsclient
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: danny.lee at appian.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Created attachment 1580779
  --> https://bugzilla.redhat.com/attachment.cgi?id=1580779&action=edit
Gluster Client Log

Description of problem:
During a large write, a 42-second disconnect error appeared in the logs. This
happens from time to time and normally recovers, but this time the client
glusterfs process crashed about 10 seconds later. The client log showed the
following:

[2019-06-11 15:31:42.794126] I [MSGID: 114018]
[client.c:2254:client_rpc_notify] 0-somecompany-client-1: disconnected from
somecompany-client-1. Client process will keep trying to connect to glusterd
until brick's port is available
pending frames:
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(WRITE)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-06-11 15:31:53
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.1.6
/lib64/libglusterfs.so.0(+0x25940)[0x7f66fd4ee940]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f66fd4f88a4]
/lib64/libc.so.6(+0x36280)[0x7f66fbb53280]
/usr/lib64/glusterfs/4.1.6/xlator/protocol/client.so(+0x615e3)[0x7f66f60e35e3]
/lib64/libgfrpc.so.0(+0xec20)[0x7f66fd2bbc20]
/lib64/libgfrpc.so.0(+0xefb3)[0x7f66fd2bbfb3]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f66fd2b7e93]
/usr/lib64/glusterfs/4.1.6/rpc-transport/socket.so(+0x7636)[0x7f66f83cb636]
/usr/lib64/glusterfs/4.1.6/rpc-transport/socket.so(+0xa107)[0x7f66f83ce107]
/lib64/libglusterfs.so.0(+0x890c4)[0x7f66fd5520c4]
/lib64/libpthread.so.0(+0x7dd5)[0x7f66fc352dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f66fbc1aead]
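Each backtrace frame above encodes a library path plus a hex offset (or
symbol+offset). To turn a frame into a file:line location, the offset can be
fed to addr2line, assuming the matching glusterfs-debuginfo package is
installed. The helper below is an illustrative sketch (not part of the
report); it only extracts the pieces and prints the addr2line command it
would run:

```shell
# Split a crash frame such as
#   /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f66fd2b7e93]
# into the addr2line invocation that would resolve it. Resolving to
# file:line requires the debuginfo package for the library.
frame_to_addr2line() {
  lib=${1%%\(*}                        # library path, before "("
  inner=${1#*\(}; inner=${inner%%\)*}  # text inside "(...)", e.g. "+0x615e3"
  off=${inner##*+}                     # hex offset after "+"
  echo "addr2line -f -e $lib $off"
}

frame_to_addr2line '/usr/lib64/glusterfs/4.1.6/xlator/protocol/client.so(+0x615e3)[0x7f66f60e35e3]'
# prints: addr2line -f -e /usr/lib64/glusterfs/4.1.6/xlator/protocol/client.so 0x615e3
```

The crashing frame here is the one inside protocol/client.so, offset 0x615e3.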


Version-Release number of selected component (if applicable):
Gluster 4.1.7
Centos 7.6.1810 (Core)

How reproducible:
Not really sure, but we believe it has something to do with a very large write
(~1-3GBs).  During that time, either the IO or the network was busy, causing
the 42 second disconnect.
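The 42-second figure matches GlusterFS's default network.ping-timeout, which
controls how long a client waits before declaring a brick unreachable. A
hedged sketch of inspecting or adjusting it (the volume name "somecompany" is
inferred from the client log and may differ):

```shell
# Show the current ping timeout for the volume (42 s is the default).
gluster volume get somecompany network.ping-timeout

# Raising it can mask transient I/O or network stalls, at the cost of
# slower detection of genuinely dead bricks.
gluster volume set somecompany network.ping-timeout 60
```

This only addresses the disconnect itself; the subsequent crash is a separate
problem.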

This was a 3-brick setup with one of the bricks being an arbiter brick. The
primary EC2 instance had a data brick and the arbiter brick, and the secondary
had the other data brick. Both had a FUSE-client mount connected to the
volume.

The primary server was the one doing the large write at the time, and the
primary's glusterfs client was the one that crashed; afterwards we could not
access the files on its mount ("Transport endpoint is not connected"). The
secondary's glusterfs client could still access the files, and "gluster
volume status" showed all the bricks up and running.

We were able to unmount and remount the client later, but at that point we
were unsure whether the services using the mount held stale file handles, so
we restarted the servers to make sure everything was okay. Sadly, the coredump
was corrupted and not recoverable (for unrelated reasons).
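For reference, the remount recovery described above amounts to something like
the following (hostname, volume name, and mountpoint are placeholders, not
taken from the report):

```shell
# Force-detach the dead FUSE mount; a plain umount typically fails with
# "Transport endpoint is not connected" after the client crashes.
umount -l /mnt/gluster

# Remount the volume through a reachable server.
mount -t glusterfs primary-server:/somecompany /mnt/gluster
```

Note this only restores the mountpoint; processes that held open file
descriptors on the old mount still need to reopen their files.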

Steps to Reproduce:
1. N/A

Actual results:
Client glusterfs process crashed and did not recover, so we were unable to
access the files on the mount.

Expected results:
Client glusterfs process does not crash, so that we can access the files on
the mount; or, if it does crash, there is a way to recover the mount without
having to remount.

Additional info:
Servers have been up for a few weeks with similar load, but have had no issues
until now.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

