[Gluster-users] Gluster usage scenarios in HPC cluster management

Erik Jacobson erik.jacobson at hpe.com
Fri Mar 19 15:03:20 UTC 2021


A while back I was asked to make a blog or something similar to discuss
the use cases the team I work on (HPCM cluster management) at HPE.

If you are not interested in reading about what I'm up to, just delete
this and move on.

I really don't have a public blogging mechanism so I'll just describe
what we're up to here. Some of this was posted in some form in the past.
Since this contains the raw materials, I could make a wiki-ized version
if there were a public place to put it.



We currently use gluster in two parts of cluster management.

In fact, gluster in our management node infrastructure is helping us to
provide scaling and consistency to some of the largest clusters in the
world, clusters in the TOP100 list. While I can get in to trouble by
sharing too much, I will just say that trends are continuing and the
future may have some exciting announcements on where on TOP100 certain
new giant systems may end up in the coming 1-2 years.

At HPE, HPCM is the "traditional cluster manager." There is another team
that develops a more cloud-like solution and I am not discussing that
solution here.


Use Case #1: Leader Nodes and Scale Out
------------------------------------------------------------------------------
- Why?
  * Scale out
  * Redundancy (combined with CTDB, any leader can fail)
  * Consistency (All servers and compute agree on what the content is)

- Cluster manager has an admin or head node and zero or more leader nodes

- Leader nodes are provisioned in groups of 3 to use distributed
  replica-3 volumes (although at least one customer has interest
  in replica-5)

- We configure a few different volumes for different use cases

- We use Gluster NFS still because, over a year ago, Ganesha was not
  working with our workload and we haven't had time to re-test and
  engage with the community. No blame - we would also owe making sure
  our settings are right.

- We use CTDB for a measure of HA and IP alias management. We use this
  instead of pacemaker to reduce complexity.

- The volume use cases are:
  * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
    -> Normally squashFS image files for speed/efficiency exported NFS
    -> Expanded ("chrootable") traditional NFS trees for people who
       prefer that, but they don't scale as well and are slower to boot
    -> Squashfs images sit on a sharded volume while traditional gluster
       used for expanded tree.
  * TFTP/HTTP for network boot/PXE including miniroot
    -> Spread across leaders too due so one node is not saturated with
       PXE/DHCP requests
    -> Miniroot is a "fatter initrd" that has our CM toolchain
  * Logs/consoles
    -> For traditional logs and consoles (HCPM also uses
       elasticsearch/kafka/friends but we don't put that in gluster)
    -> Separate volume to have more non-cached friendly settings
  * 4 total volumes used (one sharded, one heavily optimized for
    caching, one for ctdb lock, and one traditional for logging/etc)

- Leader Setup
  * Admin node installs the leaders like any other compute nodes
  * A setup tool operates that configures gluster volumes and CTDB
  * When ready, an admin/head node can be engaged with the leaders
  * At that point, certain paths on the admin become gluster fuse mounts
    and bind mounts to gluster fuse mounts.

- How images are deployed (squashfs mode)
  * User creates an image using image creation tools that make a
    chrootable tree style image on the admin/head node
  * mksquashfs will generate a squashfs image file on to a shared
    storage gluster mount
  * Nodes will mount the filesystem with the squashfs images and then
    loop mount the squashfs as part of the boot process.

- How are compute nodes tied to leaders
  * We simply have a variable in our database where human or automated
    discovery tools can assign a given node to a given IP alias. This
    works better for us than trying to play routing tricks or load
    balance tricks
  * When leaders PXE, the DHCP response includes next-server and the
    compute node uses the leader IP alias for the tftp/http for
    getting the boot loader DHCP config files are on shared storage
    to facilitate future scaling of DHCP services.
  * ipxe or grub2 network config files then fetch the kernel, initrd
  * initrd has a small update to load a miniroot (install environment)
     which has more tooling
  * Node is installed (for nodes with root disks) or does a network boot
    cycle.

- Gluster sizing
  * We typically state compute nodes per leader but this is not for
    gluster per-se. Squashfs image objects are very efficient and
    probably would be fine for 2k nodes per leader. Leader nodes provide
    other services including console logs, system logs, and monitoring
    services.
  * Our biggest deployment at a customer site right now has 24 leader
    nodes. Bigger systems are coming.

- Startup scripts - Getting all the gluster mounts and many bind mounts
  used in the solution, as well as ensuring gluster mounts and ctdb lock
  is available before ctdb start was too painful for my brian. So we
  have systemd startup scripts that sanity test and start things
  gracefully.

- Future: The team is starting to test what a 96-leader (96 gluster
  servers) might look like for future exascale systems.

- Future: Some customers have interest in replica-5 instead of
  replica-3. We want to test performance implications.

- Future: Move to Ganesha, work with community if needed

- Future: Work with Gluster community to make gluster fuse mounts
  efficient instead of NFS (may be easier with image objects than it was
  the old way with fully expanded trees for images!)

- Challenges:
  * Every month someone tells me gluster is going to die because of Red
    Hat vs IBM and I have to justify things. It is getting harder.
  * Giant squashfs images fail - mksquashfs reports error - at around
    90GB on sles15sp2 and sles15sp3. rhel8 does not suffer. Don't have
    the bandwidth to dig in right now but one upset customer. Work
    arounds provided to move development environment for that customer
    out of the operating system image.
  * Since we have our own build and special use cases, we're on our own
    for support (by "on our own" I mean no paid support, community help
    only). Our complex situations can produce some cases that you guys
    don't see and debugging them can take a month or more with the
    volunteer nature of the community. Paying for support is harder,
    even if it were viable politically, since we support 6 distros and 3
    distro providers. Of course, paying for support is never the
    preference of management. It might be an interesting thing to
    investigate.
  * Any gluster update is terror. We don't update much because finding a
    gluster version that is stable for all our use cases PLUS being able to
    test at scale which means thousands of nodes is hard. We did some
    internal improvements here where we emulate a 2500-node-cluster
    using virtual machines on a much smaller cluster. However, it isn't
    ideal. So we start lagging the community over time until some
    problem forces us to refresh. Then we tip-toe in to the update. We
    most recently updated to gluster79 and it solved several problems
    related to use case #2 below.
  * Due to lack of HW, testing the SU Leader solution is hard because of
    the number of internal clusters. However, I recently moved my
    primary development to my beefy desktop where I have a full cluster
    stack including 3 leader nodes with gluster running in virtual
    machines. So we have eliminated an excuse preventing internal people
    from playing with the solution stack.
  * Growing volumes, replacing bricks, and replacing servers work.
    However, the process is very involved and quirky for us. I have
    complicated code that has to do more than I'd really like to do to
    simply wheel in a complete leader replacement for a failed one. Even
    with our tools, we often send up with some glusterd's not starting
    right and have to restart a few times to get a node or brick
    replacement going. I wish the process were just as simple as running
    a single command or set of commands and having it do the right
    thing.
       -> Finding two leaders to get a complete set of peer files
       -> Wedging in the replacement node with the UUID the node in that
          position had before
       -> Don't accidentally give a gluster peer it's own peer file
       -> Then go through an involved replace brick procedure
       -> "replace" vs " reset"
       -> A complicated dance with XATTRs that I still don't understand
       -> Ensuring indices and .glusterfs pre-exist with right
          permissions
    My hope is I just misunderstand and will bring this up in a future
    discussion.




Use Case #2: Admin/Head Node High Availability
------------------------------------------------------------------------------
- We used to have an admin node HA solution that contained two servers
  and an external storage device. A VM was used for the "real admin
  node" provided by the two servers.

- This solution was expensive due to the external storage

- This solution was less optimal due to not having true quorum

- Building on our gluster experience, we removed the external storage
  and added a 3rd server.
     (Due to some previous experience with DRBD we elected to stay with
      gluster here)

- We build a gluster volume shared across the 3 servers, sharded, which
  primarily holds a virtual machine image file used by the admin node VM

- The physical nodes use bonding for network redundancy and bridges to
  feed them in to the VM.

- We use pacemaker in this case to manage the setup

- It's pretty simple - pacemaker rules manage a VirtualDomain instance
  and a simple ssh monitor makes sure we can get in to it.

- We typically have a 3-12T single image sitting on gluster sharded
  shared storage used by the virtual machine, which forms the true admin
  node.

- We set this up with manual instructions but tooling is coming soon to
  aid in automated setup of this solution

- This solution is in use actively in at least 4 very large
  supercomputers.
  
- I am impressed by the speed of the solution on the sharded volume. The
  VM image creation speed using libvirt to talk to the image file hosted
  on a sharded gluster volume works slick.
    (We use the fuse mount because we don't want to build our own
     qemu/libvirt, which would be needed at least for SLES and maybe
     RHEL too since we have our own gluster build)

- Challenges:
  * Not being able to boot all of a sudden was a terror for us (where
    the VM would only see the disk size as the size of a shard at random
    times).
    -> Thankfully community helped guide us to gluster79 and that resolved it

  * People keep asking to make a 2-node version but I push back.
    Especially with gluster but honestly with other solutions too, don't
    cheap out with arbitration is what I try to tell people.


Network Utilization
------------------------------------------------------------------------------
- We tyhpically have 2x 10G bonds on leaders
- For simplicity, we co-mingle gluster and compute node traffic together
- It is very rare that we approach 20G full utilization even in very
  intensive operations
- Newer solutions may increase the speed of the bonds, but it isn't due
  to a current pressing need.
- Locality Challenge:
   * For future Exascale systems, we may need to become concerned about
     locality
   * Certain compute nodes are far closer to some gluster servers than
     others
   * And gluster servers themselves need to talk among themselves but
     could be stretched across the topology
   * We have tools to monitor switch link utilization and so far have
     not hit a scary zone
   * Somewhat complicated by fault tolerance. It would be sad to design
     the leaders such that a PDU goes bad so you lose qourum because
     3 leaders in the same replica-3 were on the same PDU
   * But spreading them out may have locality implications
   * This is a future area of study for us. We have various ideas in
     mind if a problem strikes.


More information about the Gluster-users mailing list