[Gluster-users] Gluster usage scenarios in HPC cluster management

Mon Mar 22 15:54:37 UTC 2021

> > The stuff I work on doesn't use containers much (unlike a different
> > system also at HPE).
> By "pods" I meant "glusterd instance", a server hosting a collection of
> bricks.

Oh ok. The term is overloaded in my world.

> > I don't have a recipe, they've just always been beefy enough for
> > gluster. Sorry I don't have a more scientific answer.
> Seems that 64GB RAM are not enough for a pod with 26 glusterfsd
> instances and no other services (except sshd for management). What do
> you mean by "beefy enough"? 128GB RAM or 1TB?

We are currently using replica-3 but may also support replica-5 in the
future.

So if you had 24 leaders like HLRS, there would be 8 replica-3 sets at
the bottom layer, with the data then distributed across them
(distributed-replicated volumes).
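
To make the 8x3 arithmetic concrete, here is a minimal sketch of how 24
leaders fall into replica-3 sets. The host names (leader1..leader24), the
volume name, and the brick path are made up for illustration; they are
not what our tools actually use. The only gluster behaviour relied on is
that "gluster volume create ... replica 3" groups consecutive bricks on
the command line into one replica set.

```python
from typing import List


def replica_sets(hosts: List[str], replica: int = 3) -> List[List[str]]:
    """Split hosts into consecutive groups of `replica` members each."""
    if len(hosts) % replica != 0:
        raise ValueError("host count must be a multiple of the replica count")
    return [hosts[i:i + replica] for i in range(0, len(hosts), replica)]


def create_command(volume: str, hosts: List[str], brick_path: str,
                   replica: int = 3) -> str:
    """Build a 'gluster volume create' line with bricks in replica-set order."""
    bricks = [f"{host}:{brick_path}"
              for group in replica_sets(hosts, replica)
              for host in group]
    return f"gluster volume create {volume} replica {replica} " + " ".join(bricks)


# Hypothetical names: 24 leaders, as in the HLRS example.
leaders = [f"leader{n}" for n in range(1, 25)]
sets = replica_sets(leaders)
print(len(sets), "replica sets; first set:", sets[0])
print(create_command("example_vol", leaders, "/data/brick/example_vol"))
```

Each of the 4 volumes would get the same kind of layout, just with its
own brick directory on every leader.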

So we would have 24 leader nodes, and each leader would have a disk
serving 4 bricks (one of which is simply a lock FS for CTDB, one is
sharded, one is for logs, and one is heavily optimized for non-object
expanded tree NFS). The term "disk" is loose.

So each SU Leader (or gluster server) serves the 4 volumes in an 8x3
configuration. In our world the leaders differ somewhat in CPU type,
memory, and storage depending on order, preferences, and timing (things
always move forward).

On an SU Leader, we typically do 2 RAID10 volumes with a RAID
controller including cache. However, we have moved to RAID1 in some cases with
better disks. Leaders store a lot of non-gluster stuff on "root" and
then gluster has a dedicated disk/LUN. We have been trying to improve
our helper tools so they can completely wheel out a bad leader (say it
melted into the floor) and replace it. Once we have that solid, and
because our monitoring data on the "root" drive is already redundant, we
plan to move newer servers to two NVMe drives without RAID: one for
gluster and one for the OS. If a leader melts into the floor, we have a
procedure to discover a new node for that, install the base OS including
gluster/CTDB/etc, and then run a tool to re-integrate it into the
cluster as an SU Leader node again and do the healing. Separately,
monitoring data outside of gluster will heal.

PS: I will note that I have a mini-SU-leader cluster on my desktop
(qemu/libvirt) for development. It is a 1x3 set of SU Leaders, one head node,
and one compute node. I make an adjustment to reduce the gluster cache to fit
in the memory space. Works fine. Not real fast but good enough for development.

Specs of a leader node at a customer site:
 * 256G RAM
 * Storage: 
   - MR9361-8i controller
   - 7681GB root LUN (RAID1)
   - 15.4 TB for gluster bricks (RAID10)
   - 6 SATA SSD MZ7LH7T6HMLA-00005
 * AMD EPYC 7702 64-Core Processor
   - CPU(s):              128
   - On-line CPU(s) list: 0-127
   - Thread(s) per core:  2
   - Core(s) per socket:  64
   - Socket(s):           1
   - NUMA node(s):        4
 * Management Ethernet
   - Gluster and cluster management co-mingled
   - 2x40G (but 2x10G would be fine)
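
As a sanity check on the numbers above, a small sketch of the usable
capacity and CPU arithmetic. It assumes each of the 6 SSDs is 7.68 TB
(inferred from the LUN sizes, not stated above) and that 2 drives back
the RAID1 root LUN while the other 4 back the RAID10 gluster LUN; the
function names are made up for this example.

```python
def raid1_usable_tb(drive_tb: float) -> float:
    """RAID1 mirrors the drives, so usable space is one drive's capacity."""
    return drive_tb


def raid10_usable_tb(drive_tb: float, drives: int) -> float:
    """RAID10 stripes across mirrored pairs: half the raw capacity is usable."""
    if drives < 4 or drives % 2 != 0:
        raise ValueError("RAID10 needs an even number of drives, at least 4")
    return drive_tb * drives / 2


def logical_cpus(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Logical CPU count as reported on the 'CPU(s)' line of lscpu."""
    return sockets * cores_per_socket * threads_per_core


DRIVE_TB = 7.68  # assumption: per-drive capacity, not stated in the spec list

print("root LUN (RAID1, 2 drives):", round(raid1_usable_tb(DRIVE_TB), 2), "TB")
print("gluster LUN (RAID10, 4 drives):", round(raid10_usable_tb(DRIVE_TB, 4), 2), "TB")
print("logical CPUs:", logical_cpus(1, 64, 2))
```

Under those assumptions the results line up with the 7681GB root LUN,
the 15.4 TB brick LUN, and the 128 CPUs listed above.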