[Gluster-users] Setup recommendations

Mon Oct 19 03:56:20 UTC 2020

>Size is not that big, 600GB space with around half of that actually used.  GlusterFS servers themselves each have 4 cores and 12GB memory.  It might also be important to note that these are VMware hosted nodes that make use of  SAN storage for the datastores.

4 cores is quite low, especially when healing.

>Connected to that NFS (ganesha) exported share are just over 100 clients, all RHEL6 and RHEL7, some spanning 10 network hops away.  All of those clients are (currently) using the same virtual-IP, so all end up on the same server.

Why not FUSE? Ganesha is suitable for UNIX and BSD systems that do not support FUSE.

>Note that I mentioned 'should', since at times it had anywhere between 250.000 and 1 million files in it (which of course is not advised).  Using some kind of hashing (subfolders spread per day/hour etc) was also already advised.
If you have multiple subvolumes (going from replicate to distributed-replicate), you can also spread the load, yet 'find' won't be faster :)
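
The per-day/hour hashing that was already advised could look roughly like the sketch below. It assumes the application writing the files can choose the target path; the function name, the directory layout and the fanout of 256 are all hypothetical, not something taken from your setup.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import PurePosixPath


def bucketed_path(root, filename, when, fanout=256):
    """Return root/YYYY-MM-DD/HH/<bucket>/filename (hypothetical layout).

    The date and hour levels keep each directory small over time, and the
    hash bucket spreads files that arrive within the same hour.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % fanout  # stable for a given filename
    return (
        PurePosixPath(root)
        / when.strftime("%Y-%m-%d")
        / when.strftime("%H")
        / f"{bucket:02x}"
        / filename
    )


if __name__ == "__main__":
    # Example only: "/mnt/share" and "report.csv" are placeholders.
    stamp = datetime(2020, 10, 19, 3, 56, 20, tzinfo=timezone.utc)
    print(bucketed_path("/mnt/share", "report.csv", stamp))
```

With a layout like this no single directory has to hold hundreds of thousands of entries, which is what hurts listing and lookup operations the most.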

Problems that are often seen:
>- Any kind of operation on VMware such as a vMotion, creating a VM snapshot etc. on the node that has these 100+ clients connected causes such a temporary pause that pacemaker decides to switch the resources (causing a failover of the virtual IP address, thus clients connected suffer delay).  
RH corosync defaults are not suitable for VMs. I prefer SUSE's defaults.
Consider increasing 'token' and 'consensus' to more meaningful values; start with a 10s token, for example.
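
As a rough illustration of the values involved: corosync expects both timeouts in milliseconds, and per corosync.conf(5) 'consensus' must be at least 1.2 * token (it is calculated as 1.2 * token when left unset). The small helper below is only a sketch with a hypothetical name. It prints the two lines that would go into the existing totem section of corosync.conf; it is not a complete configuration.

```python
def totem_timeouts(token_ms=10000, consensus_ms=None):
    """Render the token/consensus lines of a corosync totem section.

    Values are in milliseconds. If consensus_ms is omitted it is set to
    1.2 * token, rounded up, which is the minimum corosync accepts.
    """
    minimum = (token_ms * 12 + 9) // 10  # ceil(1.2 * token_ms) in integers
    if consensus_ms is None:
        consensus_ms = minimum
    if consensus_ms < minimum:
        raise ValueError("consensus must be at least 1.2 * token")
    return (
        "totem {\n"
        f"    token: {token_ms}\n"
        f"    consensus: {consensus_ms}\n"
        "}\n"
    )


if __name__ == "__main__":
    # 10s token as a starting point; tune for your environment.
    print(totem_timeouts())
```

Whatever values you pick, test them against a vMotion or snapshot of the node to see whether the pause still triggers a failover.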

>One would expect this to last just shy under a minute, then clients would happily continue.  However connected clients are stuck with a non-working mountpoint (commands as df, ls, find etc simply hang.. they go into an uninterruptible sleep).
In regular HA NFS, there is a "notify" resource that notifies the clients about the failover. The stale mount happens because your IP is brought up before the NFS export is ready. As you haven't provided the HA details, I can't help much there.

>Mount are 'hard' mounts to insure guaranteed writes.
That's good. It is also needed for HA to work properly.

>- Once the number of files are over the 100.000 mark (again into a single, unhashed, folder) any operation on that share becomes very sluggish (even a df, on a client, would take 20/30 seconds,  a find command would take minutes to complete).
I think it's expected...

>If anyone can spot any ideas for improvement ?
I would first try switching to 'replica 3 arbiter 1', as the current setup is wasting storage, and then switch the clients to FUSE.
For performance improvements, I would add some SSDs to the mix (tier 1+ storage) and use the SSD-based LUNs for LVM caching.

Best Regards,
Strahil Nikolov