[Bugs] [Bug 1303259] New: cyclic NFS daemon crash when stopping a volume with active NFS connections in 3.7.5

bugzilla at redhat.com bugzilla at redhat.com
Sat Jan 30 00:21:39 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1303259

            Bug ID: 1303259
           Summary: cyclic NFS daemon crash when stopping a volume with
                    active NFS connections in 3.7.5
           Product: GlusterFS
           Version: 3.7.5
         Component: nfs
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: kris.laib at nwea.org
                CC: bugs at gluster.org



Description of problem:
=======================
We recently found a reproducible issue in 3.7.5 that causes the NFS daemon to
crash repeatedly, taking the NFS service offline, when an in-use volume is stopped.


How reproducible:
=================
100%


Methods of reproducing:
=======================

A) Have an active NFS mount from a Linux client, and while data is being either
read from or written to that mount, issue a "volume stop" on gluster. To
simulate I/O, I'm using a simple dd from /dev/zero (a command sketch covering
both methods follows after B).

B) Similar to A, but instead of having active data movement, simply leave a
shell on the client sitting in the mounted directory. Once the volume is
stopped, perform an "ls" from the client to trigger the crash. This only works
if the shell was already in the mounted directory when the stop was issued.
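
For reference, the exact commands I'm using for both methods look roughly like
this (hostnames, volume name, mount point, and dd sizes are from my environment
and will differ elsewhere):

# client: mount the volume over NFSv3
mount -t nfs -o vers=3 gfs-int02.mgmt:/res_temp /mnt/res_temp

# method A: generate sustained I/O on the mount
dd if=/dev/zero of=/mnt/res_temp/testfile bs=1M count=4096

# method B: park a shell inside the mount instead of generating I/O
cd /mnt/res_temp

# on a gluster node: stop the volume while the client is still active;
# for method B, then run "ls" from the parked shell on the client
gluster volume stop res_temp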


Actual results:
===============

For either A or B, the NFS service on the gluster node the client was connected
to will keep crashing at roughly five-minute intervals if it is manually brought
back online after each crash. This continues until the offending hung process
on the client is killed, or the gluster volume is brought back online.
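
For reference, the two ways we have been breaking the crash loop amount to
roughly the following (mount point and volume name are from my setup):

# on the client: find and kill the hung process holding the stale mount,
# then force-detach the mount if necessary
fuser -vm /mnt/res_temp
kill <pid-of-hung-process>
umount -f -l /mnt/res_temp

# or, on a gluster node: bring the volume back online
gluster volume start res_temp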

Each time the NFS service crashes, a large core dump is left in "/" on the
gluster node the NFS client was communicating with. The dump from this test was
641MB.
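
If a symbolic backtrace from one of these cores would help, I can pull one along
these lines (the core file name below is illustrative, since the actual name
depends on kernel.core_pattern, and glusterfs-debuginfo would be needed for full
symbols):

# identify which binary produced the core
file /core.12345

# the gluster NFS daemon runs as a glusterfs process, so load the core
# against that binary and dump the backtrace
gdb /usr/sbin/glusterfs /core.12345 -ex 'bt' -ex 'quit'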



Log information:
===============
(from nfs.log)

[2016-01-29 23:48:58.996528] E [nfs3.c:2303:nfs3_write] 0-nfs-nfsv3: Failed to
map FH to vol: client=10.1.254.125:872,
exportid=d9c54d47-26ed-4305-9650-042d28e79234,
gfid=f38a51a5-9977-4de5-a12b-792b6bfd30a0
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-01-29 23:48:58
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f30494309b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x32f)[0x7f304945051f]
/lib64/libc.so.6(+0x326a0)[0x7f3047dd06a0]
/usr/lib64/glusterfs/3.7.5/xlator/nfs/server.so(nfs3_write+0x244)[0x7f303b1ea724]
/usr/lib64/glusterfs/3.7.5/xlator/nfs/server.so(nfs3svc_write+0xbc)[0x7f303b1eab6c]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x314)[0x7f30491f9f74]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x7f30491fa173]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f30491fbb28]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so(+0xabd5)[0x7f303df82bd5]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so(+0xc7bd)[0x7f303df847bd]
/usr/lib64/libglusterfs.so.0(+0x8b180)[0x7f3049496180]
/lib64/libpthread.so.0(+0x7a51)[0x7f304851ca51]
/lib64/libc.so.6(clone+0x6d)[0x7f3047e8693d]
---------


Environment Info:
================

This is a 3-node cluster: node 1 is only for quorum, and nodes 2/3 serve data
from 1x2 replicated volumes. We use CTDB for NFS HA.

This failure has been reproduced several times on 2 identically set up clusters
in different datacenters.

"ctdb status" and "peer status" show healthy prior to starting the tests

Underlying bricks are XFS, backed by iscsi SAN LUNs, carved up via LVM.

This is reproducible on newly created volumes.
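
For anyone trying to reproduce from scratch, a throwaway test volume can be
created along these lines (brick paths and hostnames match the volume info
below; the quorum ratio is a cluster-wide option):

gluster volume create res_temp replica 2 \
    gfs-int02.mgmt:/data/glusterfs/res_temp_brick1/brick1 \
    gfs-int03.mgmt:/data/glusterfs/res_temp_brick1/brick1
gluster volume set res_temp network.ping-timeout 5
gluster volume set all cluster.server-quorum-ratio 51%
gluster volume start res_temp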

(this is the volume I was using when generating the above nfs.log error)
[root@gfs-int02.mgmt ~]$ gluster volume info res_temp

Volume Name: res_temp
Type: Replicate
Volume ID: d9c54d47-26ed-4305-9650-042d28e79234
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs-int02.mgmt:/data/glusterfs/res_temp_brick1/brick1
Brick2: gfs-int03.mgmt:/data/glusterfs/res_temp_brick1/brick1
Options Reconfigured:
nfs.rpc-auth-allow: 10.123.12.47,10.1.254.125
performance.readdir-ahead: on
nfs.export-volumes: on
nfs.addr-namelookup: Off
nfs.disable: off
network.ping-timeout: 5
cluster.server-quorum-type: server
cluster.server-quorum-ratio: 51%

[root@gfs-int02.mgmt ~]$ xfs_info /dev/mapper/int-res_temp_brick1
meta-data=/dev/mapper/int-res_temp_brick1 isize=512    agcount=4,
agsize=25600000 blks
         =                       sectsz=4096  attr=2, projid32bit=0
data     =                       bsize=4096   blocks=102400000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=50000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


[root@gfs-int02.mgmt ~]$ cat /etc/issue
CentOS release 6.7 (Final)
Kernel \r on an \m

[root@gfs-int02.mgmt ~]$ uname -a
Linux gfs-int02.mgmt 2.6.32-573.7.1.el6.x86_64 #1 SMP Tue Sep 22 22:00:00 UTC
2015 x86_64 x86_64 x86_64 GNU/Linux

[root@gfs-int02.mgmt ~]$ yum list installed | grep gluster
glusterfs.x86_64                   3.7.5-1.el6              @nwea-util          
glusterfs-api.x86_64               3.7.5-1.el6              @nwea-util          
glusterfs-cli.x86_64               3.7.5-1.el6              @nwea-util          
glusterfs-client-xlators.x86_64    3.7.5-1.el6              @nwea-util          
glusterfs-fuse.x86_64              3.7.5-1.el6              @nwea-util          
glusterfs-geo-replication.x86_64   3.7.5-1.el6              @nwea-util          
glusterfs-libs.x86_64              3.7.5-1.el6              @nwea-util          
glusterfs-server.x86_64            3.7.5-1.el6              @nwea-util  


Please let me know if further information or specific full log files would be
helpful.
