[Gluster-users] NFS crashes under load

Sun Nov 7 23:53:17 UTC 2010

I upgraded to GlusterFS 3.1 a couple of weeks ago and overall I am very
impressed; I think it is a big step forward.  Unfortunately there is one
"feature" that is causing me a big problem - the NFS process crashes
every few hours  when under load.  I have pasted the relevant error
messages from nfs.log at the end of this message.  The rest of the log
file is swamped with these messages incidentally.

[2010-11-06 23:07:04.977055] E [rpcsvc.c:1249:nfs_rpcsvc_program_actor]
nfsrpc: RPC program not available

There are no apparent problems while these errors are being produced so
this issue probably isn't relevant to the crashes.

To give an indication of what I mean by "under load", we have a small
HPC cluster that is used for running ocean models.  A typical model run
involves 20 processors, all needing to read simultaneously from the same
input data files at regular intervals during the run.  There are roughly
20 files, each ~1GB in size.  At the same time this is going on several
people, typically, are processing output from previous runs from this
and other (much bigger) clusters, chugging through hundreds of GB and
tens of thousands of files every few hours.  I don't think the
Gluster-NFS crashes are purely load dependant because they seem to occur
at different load levels, which is what leads me to suspect something
subtle related to the cluster's 20-processor model runs.  I would prefer
to use the GlusterFS client on the cluster's compute nodes, but
unfortunately the pre-FUSE Linux kernel has been customised in a way
that has thwarted all my attempts to build a FUSE module that the kernel
will accept (see
http://gluster.org/pipermail/gluster-users/2010-April/004538.html)

The servers that are exporting NFS are all running CentOS 5.5 with
GlusterFS installed from RPMs, and the GlusterFS volumes are distributed
(not repicated).  Two of the servers with GlusterFS bricks are actually
running SuSE Enterprise 10; I don't know if this is relevant.  I used
previous GlusterFS versions with SLES10 without any problems, but as
RPMs are not provided for SuSE I presume it is not an officially
supported distro.  For that reason I am only using the CentOS machines
as NFS servers for the GlusterFS volumes.

I would be very grateful for any suggested solutions or workarounds that
might help to prevent these NFS crashes.

-Dan.
nfs.log extract
------------------
[2010-11-06 23:07:10.380744] E [fd.c:506:fd_unref_unbind]
(-->/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)
[0x2aaaab30813e]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)
[0x2aaaab9a6da1]
(-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)
[0x2aaaab9b0bdd]))) : Assertion failed: fd->refcount
pending frames:

patchset: v3.1.0
signal received: 11
time of crash: 2010-11-06 23:07:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0
/lib64/libc.so.6[0x35746302d0]
/lib64/libpthread.so.0(pthread_spin_lock+0x2)[0x357520b722]
/usr/lib64/libglusterfs.so.0(fd_unref_unbind+0x3d)[0x38f223511d]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)[0x2aaaab9b0bdd]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)[0x2aaaab9a6da1]
/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)[0x2aaaab30813e]
/usr/lib64/libglusterfs.so.0(default_fstat_cbk+0x79)[0x38f2222a69]
/usr/lib64/glusterfs/3.1.0/xlator/performance/read-ahead.so(ra_attr_cbk+0x79)[0x2aaaaaeec459]
/usr/lib64/glusterfs/3.1.0/xlator/performance/write-behind.so(wb_fstat_cbk+0x9f)[0x2aaaaace402f]
/usr/lib64/glusterfs/3.1.0/xlator/cluster/distribute.so(dht_attr_cbk+0xf4)[0x2aaaab521d24]
/usr/lib64/glusterfs/3.1.0/xlator/protocol/client.so(client3_1_fstat_cbk+0x287)[0x2aaaaaacd2b7]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x38f1a0f2e2]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x38f1a0f4dd]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x2c)[0x38f1a0a77c]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaac3eb435f]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_handler+0x168)[0x2aaac3eb44e8]
/usr/lib64/libglusterfs.so.0[0x38f2236ee7]
/usr/sbin/glusterfs(main+0x37d)[0x4046ad]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x357461d994]
/usr/sbin/glusterfs[0x402dc9]
---------

-- 
Mr. D.A. Bretherton
Computer System Manager
Environmental Systems Science Centre
Harry Pitt Building
3 Earley Gate
University of Reading
Reading, RG6 6AL
UK

Tel. +44 118 378 5205
Fax: +44 118 378 6413