[Gluster-users] NFS service dying
Niels de Vos
ndevos at redhat.com
Fri Jan 13 11:39:11 UTC 2017
On Wed, Jan 11, 2017 at 11:58:29AM -0700, Paul Allen wrote:
> I'm running into an issue where the gluster nfs service keeps dying on a
> new cluster I have setup recently. We've been using Gluster on several
> other clusters now for about a year or so and I have never seen this
> issue before, nor have I been able to find anything remotely similar to
> it while searching on-line. I initially was using the latest version in
> the Gluster Debian repository for Jessie, 3.9.0-1, and then I tried
> using the next one down, 3.8.7-1. Both behave the same for me.
> What I was seeing was after a while the nfs service on the NAS server
> would suddenly die after a number of processes had run on the app server
> I had connected to the new NAS servers for testing (we're upgrading the
> NAS servers for this cluster to newer hardware and expanded storage, the
> current production NAS servers are using nfs-kernel-server with no type
> of clustering of the data). I checked the logs but all it showed me was
> something that looked like a stack trace in the nfs.log and the
> glustershd.log showed the nfs service disconnecting. I turned on
> debugging but it didn't give me a whole lot more, and certainly nothing
> that helps me identify the source of my issue. It is pretty consistent
> in dying shortly after I mount the file system on the servers and start
> testing, usually within 15-30 minutes. But if I have nothing using the
> file system, mounted or no, the service stays running for days. I tried
> mounting it using the gluster client, and it works fine, but I can't use
> that due to the performance penalty, it slows the websites down by a few
> seconds at a minimum.
This seems to be related to the NLM protocol that Gluster/NFS provides.
Earlier this week one of our Red Hat quality engineers also reported
this (or a very similar) problem.
At the moment I suspect that this is related to re-connects of some
kind, but I have not been able to identify the cause sufficiently to be
sure. This definitely is a coding problem in Gluster/NFS, but the more I
look at the NLM implementation, the more potential issues I see with it.
If the workload does not require locking operations, you may be able to
work around the problem by mounting with "-o nolock". Depending on the
application, this can be safe, or it can cause data corruption...
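As a sketch of that workaround (the hostname "nas1", volume "gv0" and
mount point are placeholders, not from your setup):

```
# One-off NFSv3 mount with NLM locking disabled:
mount -t nfs -o vers=3,nolock nas1:/gv0 /mnt/gluster

# Equivalent /etc/fstab entry:
# nas1:/gv0  /mnt/gluster  nfs  vers=3,nolock  0  0
```

With "nolock", fcntl()/flock() locks are satisfied locally on each
client instead of going through NLM, so it is only safe when no two
clients ever lock the same files.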
Another alternative is to use NFS-Ganesha instead of Gluster/NFS.
Ganesha is more mature than Gluster/NFS and is more actively developed.
Gluster/NFS is being deprecated in favour of NFS-Ganesha.
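For reference, a minimal sketch of what a Ganesha export for a Gluster
volume looks like (the volume name "gv0", the paths and the hostname are
assumptions for illustration, to be adapted to your deployment):

```
# /etc/ganesha/ganesha.conf (fragment)
EXPORT {
    Export_Id = 1;            # unique id for this export
    Path = "/gv0";
    Pseudo = "/gv0";          # path clients mount
    Access_Type = RW;

    FSAL {
        Name = GLUSTER;       # use the Gluster FSAL
        Hostname = "localhost";  # a gluster server for this volume
        Volume = "gv0";       # gluster volume name (example)
    }
}
```

Ganesha then serves the volume over NFSv3/NFSv4 in userspace, with its
own lock manager, so the Gluster/NFS NLM code is not involved at all.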
> Here is the output from the logs one of the times it died:
> [2017-01-10 19:06:20.265918] W [socket.c:588:__socket_rwv] 0-nfs: readv
> on /var/run/gluster/a921bec34928e8380280358a30865cee.socket failed (No
> data available)
> [2017-01-10 19:06:20.265964] I [MSGID: 106006]
> [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management:
> nfs has disconnected from glusterd.
> [2017-01-10 19:06:20.135430] D [name.c:168:client_fill_address_family]
> 0-NLM-client: address-family not specified, marking it as unspec for
> getaddrinfo to resolve from (remote-host: 10.20.5.13)
> [2017-01-10 19:06:20.135531] D [MSGID: 0]
> [common-utils.c:335:gf_resolve_ip6] 0-resolver: returning ip-10.20.5.13
> (port-48963) for hostname: 10.20.5.13 and port: 48963
> [2017-01-10 19:06:20.136569] D [logging.c:1764:gf_log_flush_extra_msgs]
> 0-logging-infra: Log buffer size reduced. About to flush 5 extra log
> [2017-01-10 19:06:20.136630] D [logging.c:1767:gf_log_flush_extra_msgs]
> 0-logging-infra: Just flushed 5 extra log messages
> pending frames:
> frame : type(0) op(0)
> patchset: git://git.gluster.com/glusterfs.git
> signal received: 11
> time of crash:
> 2017-01-10 19:06:20
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.9.0
> The IP showing in the nfs.log is actually for a web server I was also
> testing with, not the app server, but it doesn't appear to me that would
> be the cause for the nfs service dying. I'm at a loss as to what is
> going on, and I need to try and get this fixed pretty quickly here, I
> was hoping to have this in production last Friday. If anyone has any
> ideas I'd be very grateful.
> Paul Allen
> Inetz System Administrator
> Gluster-users mailing list
> Gluster-users at gluster.org