[Gluster-users] Gluster usage scenarios in HPC cluster management
Erik Jacobson
erik.jacobson at hpe.com
Fri Mar 19 15:03:20 UTC 2021
A while back I was asked to write a blog post or something similar to
discuss how the team I work on (HPCM cluster management) at HPE uses
gluster.
If you are not interested in reading about what I'm up to, just delete
this and move on.
I really don't have a public blogging mechanism so I'll just describe
what we're up to here. Some of this was posted in some form in the past.
Since this contains the raw materials, I could make a wiki-ized version
if there were a public place to put it.
We currently use gluster in two parts of cluster management.
In fact, gluster in our management node infrastructure is helping us to
provide scaling and consistency to some of the largest clusters in the
world, clusters on the TOP100 list. While I can get into trouble by
sharing too much, I will just say that trends are continuing and the
future may hold some exciting announcements about where on the TOP100
certain new giant systems end up in the coming 1-2 years.
At HPE, HPCM is the "traditional cluster manager." There is another team
that develops a more cloud-like solution and I am not discussing that
solution here.
Use Case #1: Leader Nodes and Scale Out
------------------------------------------------------------------------------
- Why?
* Scale out
* Redundancy (combined with CTDB, any leader can fail)
* Consistency (All servers and compute agree on what the content is)
- Cluster manager has an admin or head node and zero or more leader nodes
- Leader nodes are provisioned in groups of 3 to use distributed
replica-3 volumes (although at least one customer has interest
in replica-5)
- We configure a few different volumes for different use cases
- We still use Gluster NFS because, over a year ago, Ganesha was not
  working with our workload and we haven't had time to re-test and
  engage with the community. No blame - part of that work is on us to
  make sure our settings are right.
- We use CTDB for a measure of HA and IP alias management. We use this
instead of pacemaker to reduce complexity.
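  A rough sketch of the CTDB side (the IPs, interface, and lock path
  here are illustrative, not our real settings):

      # /etc/ctdb/nodes -- fixed private address of each leader
      10.1.1.1
      10.1.1.2
      10.1.1.3

      # /etc/ctdb/public_addresses -- floating IP aliases that CTDB
      # moves between healthy leaders; compute nodes point at these
      10.1.2.1/16 bond0
      10.1.2.2/16 bond0
      10.1.2.3/16 bond0

      # older-style /etc/sysconfig/ctdb -- recovery lock kept on the
      # small gluster volume dedicated to the CTDB lock
      CTDB_RECOVERY_LOCK=/gluster/ctdb/.ctdb.lock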
- The volume use cases are:
* Image sharing for diskless compute nodes (sometimes 6,000 nodes)
-> Normally squashfs image files, exported over NFS, for speed/efficiency
-> Expanded ("chrootable") traditional NFS trees for people who
prefer that, but they don't scale as well and are slower to boot
-> Squashfs images sit on a sharded volume, while a traditional
(non-sharded) volume is used for the expanded trees.
* TFTP/HTTP for network boot/PXE including miniroot
-> Spread across leaders too so that one node is not saturated with
PXE/DHCP requests
-> Miniroot is a "fatter initrd" that has our CM toolchain
* Logs/consoles
-> For traditional logs and consoles (HPCM also uses
elasticsearch/kafka/friends but we don't put that in gluster)
-> A separate volume so we can use settings friendlier to non-cached
workloads
* 4 volumes in total (one sharded, one heavily optimized for
caching, one for the CTDB lock, and one traditional for logging/etc)
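  To make this concrete, creating the sharded image volume looks
  roughly like the following. Hostnames and brick paths are invented
  for illustration; with more leaders, the brick list simply grows in
  sets of three to form a distributed replica-3 volume:

      gluster volume create images replica 3 \
          lead1:/data/glusterfs/images/brick \
          lead2:/data/glusterfs/images/brick \
          lead3:/data/glusterfs/images/brick
      gluster volume set images features.shard on
      gluster volume set images features.shard-block-size 64MB
      gluster volume set images nfs.disable off   # we still use Gluster NFS
      gluster volume start images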
- Leader Setup
* Admin node installs the leaders like any other compute nodes
* A setup tool then runs to configure the gluster volumes and CTDB
* When ready, an admin/head node can be engaged with the leaders
* At that point, certain paths on the admin become gluster fuse mounts
and bind mounts to gluster fuse mounts, roughly as sketched below.
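  Roughly, with invented volume and path names, that step amounts to:

      # On the admin node once leaders are engaged; fall back to the
      # other leaders if the first volfile server is down
      mount -t glusterfs -o backup-volfile-servers=lead2:lead3 \
          lead1:/images /gluster/images
      # Bind the gluster mount over the path our tools already use
      # (path is hypothetical)
      mount --bind /gluster/images /opt/clmgr/image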
- How images are deployed (squashfs mode)
* User creates an image using image creation tools that make a
chrootable tree style image on the admin/head node
* mksquashfs will generate a squashfs image file onto a shared
gluster storage mount
* Nodes will mount the filesystem with the squashfs images and then
loop mount the squashfs as part of the boot process.
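  In command terms, with invented paths, the flow is something like:

      # Admin node: build the image object directly onto shared storage
      mksquashfs /opt/clmgr/image/rhel8-tree /gluster/images/rhel8.squashfs

      # Compute node, early in boot: NFS-mount the share from its
      # leader's IP alias, then loop-mount the image object read-only
      mount -t nfs -o ro lead-alias:/images /mnt/images
      mount -t squashfs -o loop,ro /mnt/images/rhel8.squashfs /rootfs.ro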
- How are compute nodes tied to leaders
* We simply have a variable in our database where human or automated
discovery tools can assign a given node to a given IP alias. This
works better for us than trying to play routing tricks or load
balance tricks
* When compute nodes PXE boot, the DHCP response includes next-server,
and the compute node uses the leader IP alias for the tftp/http
requests that fetch the boot loader. DHCP config files are on shared
storage to facilitate future scaling of DHCP services (a sketch
follows this list).
* ipxe or grub2 network config files then fetch the kernel and initrd
* initrd has a small update to load a miniroot (install environment)
which has more tooling
* Node is installed (for nodes with root disks) or does a network boot
cycle.
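  Conceptually, the per-node assignment boils down to something like
  the following ISC dhcpd host entry (all values invented); the key
  detail is that next-server carries the leader IP alias, which CTDB
  can move to another leader on failure:

      host n0042 {
          hardware ethernet 52:54:00:12:34:56;
          fixed-address 10.1.0.42;
          next-server 10.1.2.2;                    # leader IP alias
          filename "boot/grub2/grubnetx64.efi";    # or an ipxe binary
      }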
- Gluster sizing
* We typically size by compute nodes per leader, but this is not
driven by gluster per se. Squashfs image objects are very efficient
and would probably be fine at 2k nodes per leader. Leader nodes
provide other services too, including console logs, system logs, and
monitoring services.
* Our biggest deployment at a customer site right now has 24 leader
nodes. Bigger systems are coming.
- Startup scripts - Getting all the gluster mounts and the many bind
  mounts used in the solution in place, and ensuring the gluster
  mounts and CTDB lock are available before ctdb starts, was too
  painful for my brain. So we have systemd startup scripts that
  sanity-test and start things gracefully.
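  A minimal sketch of the ordering idea (the unit and helper names are
  invented; the real scripts do more checking):

      # /etc/systemd/system/cm-gluster-ready.service
      [Unit]
      Description=Verify gluster mounts and CTDB lock before CTDB starts
      Requires=glusterd.service
      After=glusterd.service network-online.target
      Before=ctdb.service

      [Service]
      Type=oneshot
      RemainAfterExit=yes
      # hypothetical helper: mounts the volumes and waits until the
      # CTDB lock path is reachable before returning
      ExecStart=/usr/local/sbin/cm-wait-for-gluster

      [Install]
      WantedBy=multi-user.target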
- Future: The team is starting to test what a 96-leader setup (96
  gluster servers) might look like for future exascale systems.
- Future: Some customers have interest in replica-5 instead of
replica-3. We want to test performance implications.
- Future: Move to Ganesha, work with community if needed
- Future: Work with Gluster community to make gluster fuse mounts
efficient instead of NFS (may be easier with image objects than it was
the old way with fully expanded trees for images!)
- Challenges:
* Every month someone tells me gluster is going to die because of Red
Hat vs IBM and I have to justify things. It is getting harder.
* Giant squashfs images fail - mksquashfs reports an error - at around
90GB on sles15sp2 and sles15sp3; rhel8 does not suffer from this. We
don't have the bandwidth to dig in right now, but we have one upset
customer. Workarounds were provided to move that customer's
development environment out of the operating system image.
* Since we have our own build and special use cases, we're on our own
for support (by "on our own" I mean no paid support, community help
only). Our complex situations can produce some cases that you guys
don't see and debugging them can take a month or more with the
volunteer nature of the community. Paying for support is harder,
even if it were viable politically, since we support 6 distros and 3
distro providers. Of course, paying for support is never the
preference of management. It might be an interesting thing to
investigate.
* Any gluster update is terror. We don't update often because it is
hard to find a gluster version that is stable for all our use cases
PLUS test it at scale, which means thousands of nodes. We made some
internal improvements here, emulating a 2500-node cluster using
virtual machines on a much smaller cluster, but it isn't ideal. So we
start lagging the community over time until some problem forces us to
refresh. Then we tiptoe into the update. We
most recently updated to gluster79 and it solved several problems
related to use case #2 below.
* Due to lack of HW, testing the SU Leader solution is hard because of
the limited number of internal clusters. However, I recently moved my
primary development to my beefy desktop where I have a full cluster
stack including 3 leader nodes with gluster running in virtual
machines. So we have eliminated an excuse preventing internal people
from playing with the solution stack.
* Growing volumes, replacing bricks, and replacing servers work.
However, the process is very involved and quirky for us. I have
complicated code that has to do more than I'd really like to do to
simply wheel in a complete leader replacement for a failed one. Even
with our tools, we often end up with some glusterd's not starting
right and have to restart a few times to get a node or brick
replacement going. I wish the process were just as simple as running
a single command or set of commands and having it do the right
thing.
-> Finding two leaders to get a complete set of peer files
-> Wedging in the replacement node with the UUID the node in that
position had before
-> Don't accidentally give a gluster peer its own peer file
-> Then go through an involved replace brick procedure
-> "replace" vs " reset"
-> A complicated dance with XATTRs that I still don't understand
-> Ensuring indices and .glusterfs pre-exist with right
permissions
My hope is that I just misunderstand some of this and will bring it
up in a future discussion.
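For reference, the documented path we wish were sufficient on its own
is roughly the following (volume, host, and brick names illustrative):

    # Replacement node with a new hostname: probe it, swap the brick
    gluster peer probe newlead
    gluster volume replace-brick images \
        deadlead:/data/glusterfs/images/brick \
        newlead:/data/glusterfs/images/brick commit force
    gluster volume heal images full

    # Same hostname, rebuilt server/disk: reset the brick instead
    gluster volume reset-brick images lead3:/data/glusterfs/images/brick start
    gluster volume reset-brick images \
        lead3:/data/glusterfs/images/brick \
        lead3:/data/glusterfs/images/brick commit force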
Use Case #2: Admin/Head Node High Availability
------------------------------------------------------------------------------
- We used to have an admin node HA solution that consisted of two
  servers and an external storage device. A VM hosted by the two
  servers served as the "real admin node".
- This solution was expensive due to the external storage
- This solution was less optimal due to not having true quorum
- Building on our gluster experience, we removed the external storage
and added a 3rd server.
(Due to some previous experience with DRBD we elected to stay with
gluster here)
- We build a gluster volume shared across the 3 servers, sharded, which
primarily holds a virtual machine image file used by the admin node VM
- The physical nodes use bonding for network redundancy and bridges to
  feed the bonds into the VM.
- We use pacemaker in this case to manage the setup
- It's pretty simple - pacemaker rules manage a VirtualDomain instance
  and a simple ssh monitor makes sure we can get into it.
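  The pacemaker side is on the order of a single resource definition;
  the names and file paths below are invented, and our actual rules
  add the ssh monitor and ordering constraints:

      pcs resource create admin_vm ocf:heartbeat:VirtualDomain \
          hypervisor="qemu:///system" \
          config="/etc/cluster/admin_vm.xml" \
          op monitor interval=30s timeout=90s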
- We typically have a single 3-12T image sitting on sharded gluster
  shared storage used by the virtual machine, which forms the true
  admin node.
- We set this up with manual instructions but tooling is coming soon to
aid in automated setup of this solution
- This solution is actively in use in at least 4 very large
  supercomputers.
- I am impressed by the speed of the solution on the sharded volume.
  VM image creation, using libvirt to talk to the image file hosted on
  a sharded gluster volume, works slick.
(We use the fuse mount because we don't want to build our own
qemu/libvirt, which would be needed at least for SLES and maybe
RHEL too since we have our own gluster build)
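  For a sense of the mechanics (path and size invented): the VM disk
  is just one big file on the fuse-mounted sharded volume, so creating
  it is no more than:

      # Sparse raw image; gluster shards it under the covers
      qemu-img create -f raw /adminvm/images/admin.img 6T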
- Challenges:
* Suddenly not being able to boot was a terror for us (at random
times the VM would only see the disk size as the size of a single
shard).
-> Thankfully the community helped guide us to gluster79, and that
resolved it.
* People keep asking for a 2-node version but I push back. Especially
with gluster, but honestly with other solutions too, "don't cheap
out on arbitration" is what I try to tell people.
Network Utilization
------------------------------------------------------------------------------
- We typically have 2x 10G bonds on leaders
- For simplicity, we co-mingle gluster and compute node traffic together
- It is very rare that we approach 20G full utilization even in very
intensive operations
- Newer solutions may increase the speed of the bonds, but it isn't due
to a current pressing need.
- Locality Challenge:
* For future Exascale systems, we may need to become concerned about
locality
* Certain compute nodes are far closer to some gluster servers than
others
* The gluster servers also need to talk among themselves, but could
be stretched across the topology
* We have tools to monitor switch link utilization and so far have
not hit a scary zone
* Somewhat complicated by fault tolerance. It would be sad to place
the leaders such that one bad PDU costs you quorum because all 3
leaders of the same replica-3 set were on that PDU
* But spreading them out may have locality implications
* This is a future area of study for us. We have various ideas in
mind if a problem strikes.