[Gluster-users] NFS crashes on Gluster 3.4.0

Tue Sep 24 12:20:02 UTC 2013

Hi,

Since we switched to NFS (because of the many small files), we have been having serious problems with Gluster's NFS daemon. About once a day, the Gluster NFS process crashes on one of the machines and doesn't come back up until I restart the Gluster daemon on that node. Sometimes the crashed node even crashes again after the restart.

We have a ~2TB volume with 6 bricks on 5 servers, accessed by 12 NFS clients and one FUSE client.

The NFS log contains something like the following:

tail -n 100 /var/log/glusterfs/nfs.log
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
[...]
frame : type(0) op(0)

signal received: 11
time of crash: 2013-08-15 14:08:39
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0
/lib/x86_64-linux-gnu/libc.so.6(+0x364c0)[0x7fac361904c0]
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_spin_lock+0x0)[0x7fac36523a50]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(fd_unref+0x36)[0x7fac36b96966]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.0/xlator/protocol/client.so(client_local_wipe+0x28)[0x7fac31f6a4f8]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.0/xlator/protocol/client.so(client3_3_opendir_cbk+0x19c)[0x7fac31f8353c]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7fac36957bd5]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xc5)[0x7fac36957f35]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7fac36954627]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.0/rpc-transport/socket.so(+0xa1d1)[0x7fac32e091d1]
/usr/lib/x86_64-linux-gnu/glusterfs/3.4.0/rpc-transport/socket.so(+0xa81c)[0x7fac32e0981c]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x5e553)[0x7fac36bbd553]
/usr/sbin/glusterfs(main+0x3e3)[0x7fac37007883]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fac3617b76d]
/usr/sbin/glusterfs(+0x5c79)[0x7fac37007c79]
---------
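
To compare crashes across nodes, a crash block like the one above can be reduced to its signal number and the named stack frames. The following is only a rough sketch: the function name `summarize_crash` is made up for illustration, and it assumes every crash block has the same line layout as the excerpt above (a `signal received:` line followed by `path(symbol+offset)[address]` frames).

```python
import re

# Matches backtrace lines such as
#   /usr/lib/.../libglusterfs.so.0(fd_unref+0x36)[0x7fac36b96966]
# The symbol part may be empty, e.g. libc.so.6(+0x364c0)[...].
FRAME_RE = re.compile(
    r"^(?P<obj>/\S+?)\((?P<sym>\w*)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]$"
)


def summarize_crash(log_text):
    """Return (signal, frames) for the last crash block found in log_text.

    frames is a list of (library file name, symbol) tuples, innermost
    frame first; frames without a symbol name are skipped.
    """
    signal = None
    frames = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("signal received:"):
            signal = int(line.split(":", 1)[1])
            frames = []  # a new crash block starts here
            continue
        match = FRAME_RE.match(line)
        if match and match.group("sym"):
            library = match.group("obj").rsplit("/", 1)[-1]
            frames.append((library, match.group("sym")))
    return signal, frames
```

Run against the excerpt above, this gives signal 11 with `pthread_spin_lock` as the innermost named frame, called from `fd_unref`, `client_local_wipe` and `client3_3_opendir_cbk`.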

Is there anything we could do to prevent this, or at least to find the cause? At the moment our ugly workaround is a cron job that checks the NFS status and restarts the server if necessary, but that is not something we consider suitable for larger deployments.
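
For reference, the cron workaround amounts to roughly the following sketch. Several details are assumptions rather than our exact setup: the volume name `myvol` is a placeholder, the status table of `gluster volume status <volume> nfs` is assumed to end each row with the columns Port, Online, Pid, and the restart command `service glusterfs-server restart` is assumed to be the right one for the distribution in use.

```python
import subprocess

VOLUME = "myvol"  # placeholder volume name (assumption)
# Assumed restart command; the service name differs between distributions.
RESTART_CMD = ["service", "glusterfs-server", "restart"]


def nfs_server_online(status_text, host="localhost"):
    """Return True if the status table lists the NFS server on host as online.

    Assumes rows of the form
        NFS Server on <host>   <port>   <Y|N>   <pid>
    """
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:3] == ["NFS", "Server", "on"] and len(fields) >= 7:
            if fields[3] == host:
                return fields[-2] == "Y"
    return False  # no matching row: treat as not running


def check_and_restart(volume=VOLUME):
    """Query the NFS status once and restart the daemon if it is down."""
    result = subprocess.run(
        ["gluster", "volume", "status", volume, "nfs"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0 or not nfs_server_online(result.stdout):
        subprocess.run(RESTART_CMD, check=False)
        return True  # a restart was attempted
    return False
```

A small script that calls `check_and_restart()` can then be run from cron every few minutes. This only hides the crash; it does nothing about the cause.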