[Bugs] [Bug 1303259] New: cyclic NFS daemon crash when stopping a volume with active NFS connections in 3.7.5
bugzilla at redhat.com
Sat Jan 30 00:21:39 UTC 2016
https://bugzilla.redhat.com/show_bug.cgi?id=1303259
Bug ID: 1303259
Summary: cyclic NFS daemon crash when stopping a volume with
active NFS connections in 3.7.5
Product: GlusterFS
Version: 3.7.5
Component: nfs
Severity: high
Assignee: bugs at gluster.org
Reporter: kris.laib at nwea.org
CC: bugs at gluster.org
Description of problem:
=======================
We recently found a reproducible issue in 3.7.5 which causes the NFS service to
get repeatedly taken offline when an in-use volume is stopped.
How reproducible:
=================
100%
Methods of reproducing:
=======================
A) Have an active NFS mount from a Linux client, and while data is being either
read from or written to that mount, issue a "volume stop" on gluster. To
simulate I/O, I'm using a simple dd from /dev/zero.
B) Similar to A, but instead of having active data movement, simply leave a
shell on the client sitting in the mounted directory. Once the volume is
stopped, perform an "ls" from the client to trigger the crash. This only
works if you were already in the mounted directory when the stop was issued.
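
Method A can be sketched as the shell helpers below. This is a minimal sketch,
not the exact commands used for this report: the mount point, the NFS mount
options, the dd size, and the helper names are assumptions; the server and
volume names are taken from the volume info further down. With DRY_RUN=1 (the
default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Server and volume come from this report; everything else is assumed.
SERVER="${SERVER:-gfs-int02.mgmt}"
VOLUME="${VOLUME:-res_temp}"
MNT="${MNT:-/mnt/res_temp}"      # assumed mount point on the client
DRY_RUN="${DRY_RUN:-1}"          # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the NFS client: mount the volume over NFSv3 (mount options assumed).
repro_client_mount() {
    run mkdir -p "$MNT"
    run mount -t nfs -o vers=3 "$SERVER:/$VOLUME" "$MNT"
}

# On the NFS client: generate write I/O with dd (size is arbitrary).
repro_client_io() {
    run dd if=/dev/zero of="$MNT/ddtest" bs=1M count=4096
}

# On a gluster node: stop the volume without the interactive prompt.
repro_server_stop() {
    run gluster --mode=script volume stop "$VOLUME"
}
```

Run repro_client_mount and repro_client_io on the client, then
repro_server_stop on a gluster node while dd is still writing. For method B,
skip the dd, cd into the mount point before the stop, and run "ls" afterwards.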
Actual results:
===============
For either A or B, the NFS service on the gluster node the client was connected
to will keep crashing at a regular interval (~5 min) if it is manually brought
back online after each crash. This continues until the offending hung process
on the client is killed, or the gluster volume is brought back online.
Each time the NFS service crashes, a large core dump is left in "/" on the
gluster node the NFS client was communicating with. The dump from this test
was 641MB.
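
To check the interval between crashes, the crash timestamps can be pulled out
of nfs.log. A minimal sketch, assuming each crash is logged as a
"time of crash:" line followed by the timestamp on the next line (as in the
excerpt under "Log information"); crash_times is a hypothetical helper name,
and /var/log/glusterfs/nfs.log is the usual default log path, not something
stated in this report.

```shell
#!/bin/sh
# Print the timestamp of every crash recorded in the given Gluster NFS log.
# For each "time of crash:" line, the following line holds the timestamp.
crash_times() {
    awk '/^time of crash:/ { if ((getline line) > 0) print line }' "$1"
}
```

Usage: crash_times /var/log/glusterfs/nfs.log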
Log information:
===============
(from nfs.log)
[2016-01-29 23:48:58.996528] E [nfs3.c:2303:nfs3_write] 0-nfs-nfsv3: Failed to
map FH to vol: client=10.1.254.125:872,
exportid=d9c54d47-26ed-4305-9650-042d28e79234,
gfid=f38a51a5-9977-4de5-a12b-792b6bfd30a0
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-01-29 23:48:58
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f30494309b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x32f)[0x7f304945051f]
/lib64/libc.so.6(+0x326a0)[0x7f3047dd06a0]
/usr/lib64/glusterfs/3.7.5/xlator/nfs/server.so(nfs3_write+0x244)[0x7f303b1ea724]
/usr/lib64/glusterfs/3.7.5/xlator/nfs/server.so(nfs3svc_write+0xbc)[0x7f303b1eab6c]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x314)[0x7f30491f9f74]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x7f30491fa173]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f30491fbb28]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so(+0xabd5)[0x7f303df82bd5]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so(+0xc7bd)[0x7f303df847bd]
/usr/lib64/libglusterfs.so.0(+0x8b180)[0x7f3049496180]
/lib64/libpthread.so.0(+0x7a51)[0x7f304851ca51]
/lib64/libc.so.6(clone+0x6d)[0x7f3047e8693d]
---------
Environment Info:
================
This is a 3 node cluster: node 1 is only for quorum, and nodes 2/3 serve data
from 1x2 replicated vols. We utilize CTDB for NFS HA.
This failure has been repeated several times in 2 identically set up clusters
in different datacenters.
"ctdb status" and "peer status" show healthy prior to starting the tests.
Underlying bricks are XFS, backed by iSCSI SAN LUNs, carved up via LVM.
This is reproducible on newly created volumes.
(this is the volume I was using when generating the above nfs.log error)
[root at gfs-int02.mgmt ~]$ gluster volume info res_temp
Volume Name: res_temp
Type: Replicate
Volume ID: d9c54d47-26ed-4305-9650-042d28e79234
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs-int02.mgmt:/data/glusterfs/res_temp_brick1/brick1
Brick2: gfs-int03.mgmt:/data/glusterfs/res_temp_brick1/brick1
Options Reconfigured:
nfs.rpc-auth-allow: 10.123.12.47,10.1.254.125
performance.readdir-ahead: on
nfs.export-volumes: on
nfs.addr-namelookup: Off
nfs.disable: off
network.ping-timeout: 5
cluster.server-quorum-type: server
cluster.server-quorum-ratio: 51%
[root at gfs-int02.mgmt ~]$ xfs_info /dev/mapper/int-res_temp_brick1
meta-data=/dev/mapper/int-res_temp_brick1 isize=512    agcount=4, agsize=25600000 blks
         =                       sectsz=4096  attr=2, projid32bit=0
data     =                       bsize=4096   blocks=102400000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=50000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root at gfs-int02.mgmt ~]$ cat /etc/issue
CentOS release 6.7 (Final)
Kernel \r on an \m
[root at gfs-int02.mgmt ~]$ uname -a
Linux gfs-int02.mgmt 2.6.32-573.7.1.el6.x86_64 #1 SMP Tue Sep 22 22:00:00 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root at gfs-int02.mgmt ~]$ yum list installed | grep gluster
glusterfs.x86_64 3.7.5-1.el6 @nwea-util
glusterfs-api.x86_64 3.7.5-1.el6 @nwea-util
glusterfs-cli.x86_64 3.7.5-1.el6 @nwea-util
glusterfs-client-xlators.x86_64 3.7.5-1.el6 @nwea-util
glusterfs-fuse.x86_64 3.7.5-1.el6 @nwea-util
glusterfs-geo-replication.x86_64 3.7.5-1.el6 @nwea-util
glusterfs-libs.x86_64 3.7.5-1.el6 @nwea-util
glusterfs-server.x86_64 3.7.5-1.el6 @nwea-util
Please let me know if further information or specific full log files would be
helpful.
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.