[Gluster-users] gluster NFS hang observed mounting or umounting at scale

Erik Jacobson erik.jacobson at hpe.com
Sat Jan 25 20:36:09 UTC 2020


Hello. I'm sorry this got long :(

So we are using gluster, as mentioned in posts several months ago, in
some cluster management solutions. The solutions use a typical head
node, and a set of leader nodes to manage the cluster.

Compute nodes, sometimes thousands of them, mount NFS roots from one of
several leader nodes. Each leader node is a gluster server and a gluster
NFS server. While we have some bugs to work out, I have not often had
systems big enough to test with, which has made it hard to report the
issues to you and get help.

Recently, we extended our RO NFS solution to include RW NFS mounts. To
make them fast, we create image files, put XFS filesystems on them, and
mount them as loopback. Then we over-mount the writable paths like
/etc, /var, and similar into this XFS-on-top-of-NFS image. It has been
scaling great.
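
At a high level the per-node setup looks roughly like this (a minimal
sketch; the export name, mount points, and sizes here are assumptions for
illustration, not our real tooling):

# minimal sketch -- leader1, /cm_shared, /tmp/mnt1 and /writable are
# placeholder names, and this assumes the RW area is reachable as a
# subdirectory of the gNFS export
mount -t nfs -o vers=3 leader1:/cm_shared/image/images_rw_nfs /tmp/mnt1

IMG=/tmp/mnt1/r41c4t3n3/rhel8.0/xfs-test.img
mkdir -p /writable
mount -o loop "$IMG" /writable          # loopback-mount the XFS image

# over-mount the writable paths into the XFS-on-top-of-NFS image
mount --bind /writable/etc /etc
mount --bind /writable/var /var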

However, once we got to around 1,000 nodes, using both a 3x1 (more
common) and a 3x24 (some testing) gluster setup, some nodes get stuck.
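
For reference, the 3x1 layout is a single replica-3 subvolume and the
3x24 test setup distributes across 24 of them; the 3x1 case would be
created with something like the following (brick paths are placeholders,
the real settings are in the attached volume info):

# 3x1: one replica-3 subvolume; placeholder brick paths
gluster volume create cm_shared replica 3 \
    leader1:/data/brick1/cm_shared \
    leader2:/data/brick1/cm_shared \
    leader3:/data/brick1/cm_shared
gluster volume set cm_shared nfs.disable off    # serve it over gluster NFS (gNFS)
gluster volume start cm_shared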

This problem shows up as NFS Server Not responding for the read-write
mount point.  It also shows up as a hang trying to mount the XFS image
file (that resides on the NFS server).

I wrote some test scripts to try to duplicate this problem on a smaller
cluster and I had some success - one node showed the failure mode. I
have 367 compute nodes, 3 leader/gluster/nfs servers, and one head
node.

In this example, /tmp/mnt1 is a read-write NFS mount that contains
an XFS image:

/tmp/mnt1/r41c4t3n3/rhel8.0/xfs-test.img

I blasted out some scripts that try to emulate what our cluster manager
does: creating the sparse file, creating an XFS image on it, and growing it.
I ran them a few times and one node developed the problem and could not
unmount /tmp/mnt1. It reported NFS Server Not Responding.
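
The core of what the scripts exercise looks roughly like this (sizes and
mount points are made up for illustration):

IMG=/tmp/mnt1/r41c4t3n3/rhel8.0/xfs-test.img
mkdir -p "$(dirname "$IMG")" /tmp/xfs-test

truncate -s 2G "$IMG"            # sparse file on the RW NFS mount
mkfs.xfs -q -f "$IMG"            # XFS filesystem inside the image
mount -o loop "$IMG" /tmp/xfs-test

truncate -s 4G "$IMG"            # grow the backing file...
LOOPDEV=$(losetup -j "$IMG" | cut -d: -f1)
losetup -c "$LOOPDEV"            # ...tell the loop device about the new size
xfs_growfs /tmp/xfs-test         # ...and grow the filesystem into it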

The gluster NFS log has these entries:
[2020-01-25 19:07:33.085806] E [MSGID: 109040] [dht-helper.c:1388:dht_migration_complete_check_task] 0-cm_shared-dht: 19bd72f0-6863-4f1d-80dc-a426db8670b8: failed to lookup the file on cm_shared-dht [Stale file handle]
[2020-01-25 19:07:33.085848] W [MSGID: 112199] [nfs3-helpers.c:3578:nfs3_log_commit_res] 0-nfs-nfsv3: /image/images_rw_nfs/r41c4t1n1/rhel8.0/xfs-test.img => (XID: cb501b58, COMMIT: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), wverf: 1579988225


Using a gluster mount, xfs-test.img does not seem to exist at that path
(even though nothing should have deleted it). It does not show up in
"ls" from the NFS client or from a gluster mount. However, if you do
something like this on the file from an NFS path:

test -r /tmp/mnt1/r41c4t3n3/rhel8.0/xfs-test.img

The command hangs.

The same is true from a different compute node/NFS client using the same
path above (and the same gluster NFS server).
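
For what it's worth, this is roughly how I poke at the stuck file (the
FUSE mount point and brick path are placeholders; the GFID is the one
from the nfs.log line above):

# NFS client: the file is missing from a listing, but touching it hangs
ls /tmp/mnt1/r41c4t3n3/rhel8.0/
test -r /tmp/mnt1/r41c4t3n3/rhel8.0/xfs-test.img     # hangs

# FUSE gluster mount of the same volume: also not visible
mount -t glusterfs leader1:/cm_shared /mnt/gluster
ls /mnt/gluster/image/images_rw_nfs/r41c4t3n3/rhel8.0/

# on a brick, the GFID from nfs.log can be checked directly, since each
# brick keeps a hardlink under .glusterfs/<first2>/<next2>/<gfid>
ls -l /data/brick1/cm_shared/.glusterfs/19/bd/19bd72f0-6863-4f1d-80dc-a426db8670b8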


My workaround is to purge everything in the gluster shared storage under
.../r41c4t3n3/* and reboot the node (r41c4t3n3 in this case). Everything
is then re-created and the hang is resolved. However, this isn't desirable
because the node's writable persistent storage is destroyed in the process
(meaning data loss).
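
Concretely, the recovery is something like the following, run from a
gluster mount on a leader (destructive; paths are illustrative):

# WARNING: throws away the node's writable persistent data
rm -rf /mnt/gluster/image/images_rw_nfs/r41c4t3n3
# then reboot the affected node so it rebuilds its RW image on boot
ssh r41c4t3n3 reboot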

So this is an unconventional solution, but it has been working very well
for us, and one deployment will soon grow to 5,120 servers with it. I'm
hoping you can help me work through this.

I'm not an expert on gluster or gluster NFS internals, and I know I
haven't sent enough information. I'm hoping you can tell me what
additional information would be helpful. I have attached a compressed
tarball with some data.

My nfs.log on each of the three servers is more than 1.3GB in size even
though I set the servers up today, so I included a 10,000-line tail of one
of them in the tarball (the gluster server acting as the NFS server for
the failing test case). I included all the glusterfs logs except nfs.log
(from all servers), plus the volume status and settings. I also included
my test case so you can see how I duplicated the problem the factory
reported to us on the bigger system.
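
For reference, the tarball was put together roughly like this (file and
directory names are approximate):

mkdir -p collection
tail -n 10000 /var/log/glusterfs/nfs.log > collection/nfs.log.tail   # full log is >1.3GB
rsync -a --exclude nfs.log /var/log/glusterfs/ collection/glusterfs-logs/
gluster volume status > collection/volume-status.txt
gluster volume info   > collection/volume-info.txt
tar cJf collection.tar.xz collection/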

Gluster version info:

glusterfs 4.1.6
I have one patch for STACK_UNWIND_STRICT/MDC_STACK_UNWIND that addresses
an md-cache statfs issue that was fixed in a newer release.
I included the patch in the attachment.

Note: I am happy to try newer versions of gluster but this has been
stable. I've tried newer versions in the past and lost stability at scale
(and we don't have in-house systems the size customers have) --
therefore I tread carefully. Now is a good time to try newer versions if
the community thinks it would help as I may get one day with a 1,000
node system (possibly Monday).

Ganesha was not stable for this workload, which is why I'm still using
gluster NFS (gNFS).

For the test case details (a rough sketch of what the scripts do follows this list)...
- node-mount-rw-and-rw-image.sh:
  * helper script copied to all the test nodes/clients (367 nodes)
  * Mounts the RW NFS path to a temp location
  * Creates the sparse file, puts an XFS filesystem on it, and loopback-mounts it

- mount-rw-and-rw-image.sh:
  * runs on head node
  * Copies the above test script to all 367 compute nodes
  * runs the test script in parallel

- umount-rw-and-rw-image.sh:
  * Unmounts the test mounts
  * After running mount-rw-and-rw-image.sh and this script several
    times, one node of the 367 developed the problem. We hit it in
    larger numbers on the 1,000-node systems in the factory, which I
    can't debug on.
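
A stripped-down sketch of what the three scripts do (hostnames, the node
list, and paths are placeholders; the real scripts are in the tarball):

# node-mount-rw-and-rw-image.sh -- runs on each compute node
ME=$(hostname -s)
mkdir -p /tmp/mnt1 /tmp/xfs-test
mount -t nfs -o vers=3 leader1:/cm_shared/image/images_rw_nfs /tmp/mnt1
IMG=/tmp/mnt1/$ME/rhel8.0/xfs-test.img
mkdir -p "$(dirname "$IMG")"
[ -e "$IMG" ] || { truncate -s 2G "$IMG"; mkfs.xfs -q "$IMG"; }
mount -o loop "$IMG" /tmp/xfs-test

# mount-rw-and-rw-image.sh -- runs on the head node
while read -r n; do
  scp node-mount-rw-and-rw-image.sh "$n":/tmp/ &&
    ssh "$n" bash /tmp/node-mount-rw-and-rw-image.sh &
done < compute-nodes.txt
wait

# umount-rw-and-rw-image.sh -- undoes the test mounts on every node
while read -r n; do
  ssh "$n" 'umount /tmp/xfs-test; umount /tmp/mnt1' &
done < compute-nodes.txt
wait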
   
Erik
-------------- next part --------------
A non-text attachment was scrubbed...
Name: collection.tar.xz
Type: application/x-xz
Size: 447296 bytes
Desc: not available
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20200125/8942b4df/attachment-0001.xz>

