[Gluster-users] NFS service dying

Wed Jan 11 18:58:29 UTC 2017

I'm running into an issue where the Gluster NFS service keeps dying on a
new cluster I set up recently. We've been using Gluster on several
other clusters for about a year and I have never seen this issue
before, nor have I been able to find anything remotely similar to it
while searching online. I initially used the latest version in the
Gluster Debian repository for Jessie, 3.9.0-1, and then tried the next
one down, 3.8.7-1. Both behave the same for me.

What I'm seeing is that the nfs service on the NAS server suddenly dies
after a number of processes have run on the app server I connected to
the new NAS servers for testing. (We're upgrading the NAS servers for
this cluster to newer hardware with expanded storage; the current
production NAS servers use nfs-kernel-server with no clustering of the
data.) I checked the logs, but all they showed me was something that
looks like a stack trace in nfs.log, and glustershd.log showed the nfs
service disconnecting. I turned on debugging but it didn't give me much
more, and certainly nothing that helps me identify the source of my
issue. The service dies pretty consistently shortly after I mount the
file system on the servers and start testing, usually within 15-30
minutes. But if nothing is using the file system, mounted or not, the
service stays running for days. I tried mounting it using the Gluster
client, and that works fine, but I can't use it because of the
performance penalty: it slows the websites down by a few seconds at a
minimum.

Here is the output from the logs one of the times it died:

glustershd.log:

[2017-01-10 19:06:20.265918] W [socket.c:588:__socket_rwv] 0-nfs: readv
on /var/run/gluster/a921bec34928e8380280358a30865cee.socket failed (No
data available)
[2017-01-10 19:06:20.265964] I [MSGID: 106006]
[glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management:
nfs has disconnected from glusterd.

nfs.log:

[2017-01-10 19:06:20.135430] D [name.c:168:client_fill_address_family]
0-NLM-client: address-family not specified, marking it as unspec for
getaddrinfo to resolve from (remote-host: 10.20.5.13)
[2017-01-10 19:06:20.135531] D [MSGID: 0]
[common-utils.c:335:gf_resolve_ip6] 0-resolver: returning ip-10.20.5.13
(port-48963) for hostname: 10.20.5.13 and port: 48963
[2017-01-10 19:06:20.136569] D [logging.c:1764:gf_log_flush_extra_msgs]
0-logging-infra: Log buffer size reduced. About to flush 5 extra log
messages
[2017-01-10 19:06:20.136630] D [logging.c:1767:gf_log_flush_extra_msgs]
0-logging-infra: Just flushed 5 extra log messages
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-01-10 19:06:20
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.9.0
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xac)[0x7f891f0846ac]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x324)[0x7f891f08dcc4]
/lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f891db870e0]
/lib/x86_64-linux-gnu/libc.so.6(+0x91d8a)[0x7f891dbe3d8a]
/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3a352)[0x7f8918682352]
/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3cc15)[0x7f8918684c15]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x2aa)[0x7f891ee4e4da]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f891ee4a7e3]
/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/rpc-transport/socket.so(+0x4b33)[0x7f8919eadb33]
/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/rpc-transport/socket.so(+0x8f07)[0x7f8919eb1f07]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7e836)[0x7f891f0d9836]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f891e3010a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f891dc3a62d]
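
Several frames in the backtrace above have no symbol name, only a
module path and an offset, e.g. server.so(+0x3a352). A minimal sketch
of how those frames could be turned into addr2line lookups follows. It
assumes binutils' addr2line is available and that debug symbols
matching the exact installed build are present; without them addr2line
will only print "??". The helper names parse_frames and
addr2line_commands are made up for illustration and are not part of
Gluster.

```python
import re

# Matches backtrace frames such as:
#   /path/lib.so(symbol+0xac)[0x7f891f0846ac]
#   /path/lib.so(+0x3a352)[0x7f8918682352]
FRAME_RE = re.compile(
    r"^(?P<module>/\S+?)"
    r"\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(text):
    """Return (module, symbol_or_None, offset) for each frame line in text."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            # An empty symbol means the frame was not resolved to a name.
            frames.append(
                (match["module"], match["symbol"] or None, match["offset"])
            )
    return frames


def addr2line_commands(frames):
    """Build addr2line commands for frames that have no symbol name.

    For those frames the offset is relative to the start of the shared
    object, so it can be passed to addr2line directly. Frames that do
    have a symbol carry an offset relative to that symbol and are skipped.
    """
    return [
        f"addr2line -f -C -e {module} {offset}"
        for module, symbol, offset in frames
        if symbol is None
    ]


if __name__ == "__main__":
    # Two frames copied from the nfs.log backtrace in this message.
    sample = (
        "/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/"
        "server.so(+0x3a352)[0x7f8918682352]\n"
        "/usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/"
        "server.so(+0x3cc15)[0x7f8918684c15]\n"
    )
    for command in addr2line_commands(parse_frames(sample)):
        print(command)
```

Running the printed commands on the NAS server itself matters, since
the offsets are only meaningful against the same server.so build that
crashed.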

The IP showing in nfs.log is actually for a web server I was also
testing with, not the app server, but it doesn't appear to me that this
would cause the nfs service to die. I'm at a loss as to what is going
on, and I need to get this fixed pretty quickly; I was hoping to have
this in production last Friday. If anyone has any ideas I'd be very
grateful.

-- 

Paul Allen

Inetz System Administrator