[Gluster-users] Gluster usage scenarios in HPC cluster management
Olaf Buitelaar
olaf.buitelaar at gmail.com
Mon Apr 5 15:54:12 UTC 2021
Hi Erik,
Thanks for sharing your unique use-case and solution. It was very
interesting to read your write-up.
I agree with the last point under your use case #1: "Growing volumes,
replacing bricks, and replacing servers work. However, the process is very
involved and quirky for us. I have..."
I seem to suffer from similar issues where glusterd just doesn't want to
start up correctly the first time; maybe also see
https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
as one possible cause for not starting up correctly.
And secondly, for a "molten to the floor" server it would indeed be great
if gluster, instead of the current replace-brick commands, had something to
replace a complete host: it would simply recreate all the bricks it expects
on that host (or better yet, only the missing bricks, in case some survived
on another RAID/disk) and heal them all. Right now this process is quite
involved, and sometimes feels a bit like performing black magic.
Best, Olaf
On Fri 19 Mar 2021 at 16:10, Erik Jacobson <erik.jacobson at hpe.com> wrote:
> A while back I was asked to make a blog or something similar to discuss
> the use cases of the team I work on (HPCM cluster management) at HPE.
>
> If you are not interested in reading about what I'm up to, just delete
> this and move on.
>
> I really don't have a public blogging mechanism so I'll just describe
> what we're up to here. Some of this was posted in some form in the past.
> Since this contains the raw materials, I could make a wiki-ized version
> if there were a public place to put it.
>
>
>
> We currently use gluster in two parts of cluster management.
>
> In fact, gluster in our management node infrastructure is helping us to
> provide scaling and consistency to some of the largest clusters in the
> world, clusters in the TOP100 list. While I can get into trouble by
> sharing too much, I will just say that trends are continuing and the
> future may have some exciting announcements on where on TOP100 certain
> new giant systems may end up in the coming 1-2 years.
>
> At HPE, HPCM is the "traditional cluster manager." There is another team
> that develops a more cloud-like solution and I am not discussing that
> solution here.
>
>
> Use Case #1: Leader Nodes and Scale Out
>
> ------------------------------------------------------------------------------
> - Why?
> * Scale out
> * Redundancy (combined with CTDB, any leader can fail)
> * Consistency (All servers and compute agree on what the content is)
>
> - Cluster manager has an admin or head node and zero or more leader nodes
>
> - Leader nodes are provisioned in groups of 3 to use distributed
> replica-3 volumes (although at least one customer has interest
> in replica-5)
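>
>   As a rough sketch, a 6-leader distributed replica-3 volume ends up
>   looking something like this (hostnames and brick paths here are just
>   placeholders, not our real layout):
>
>     gluster volume create images replica 3 \
>         leader1:/data/brick/images leader2:/data/brick/images \
>         leader3:/data/brick/images leader4:/data/brick/images \
>         leader5:/data/brick/images leader6:/data/brick/images
>     gluster volume start images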
>
> - We configure a few different volumes for different use cases
>
> - We use Gluster NFS still because, over a year ago, Ganesha was not
>   working with our workload and we haven't had time to re-test and
>   engage with the community. No blame - it would also be on us to make
>   sure our settings are right.
>
> - We use CTDB for a measure of HA and IP alias management. We use this
> instead of pacemaker to reduce complexity.
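>
>   The CTDB piece is just the usual two config files, roughly like this
>   (addresses and the interface name are made up for the example):
>
>     # /etc/ctdb/nodes - private addresses of the leaders in the group
>     10.1.0.1
>     10.1.0.2
>     10.1.0.3
>
>     # /etc/ctdb/public_addresses - floating aliases compute nodes use
>     10.1.1.101/16 bond0
>     10.1.1.102/16 bond0
>     10.1.1.103/16 bond0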
>
> - The volume use cases are:
> * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
>     -> Normally squashfs image files, exported over NFS, for
>        speed/efficiency
>     -> Expanded ("chrootable") traditional NFS trees for people who
>        prefer that, but they don't scale as well and are slower to boot
>     -> Squashfs images sit on a sharded volume, while a traditional
>        gluster volume is used for the expanded trees (example volume
>        options follow this list)
> * TFTP/HTTP for network boot/PXE including miniroot
>     -> Spread across leaders too so that no single node is saturated
>        with PXE/DHCP requests
> -> Miniroot is a "fatter initrd" that has our CM toolchain
> * Logs/consoles
>     -> For traditional logs and consoles (HPCM also uses
>        elasticsearch/kafka/friends but we don't put that in gluster)
>     -> A separate volume so it can have settings friendlier to
>        non-cached access
> * 4 total volumes used (one sharded, one heavily optimized for
> caching, one for ctdb lock, and one traditional for logging/etc)
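>
>   Roughly speaking (exact tuning omitted, and the volume name is just a
>   placeholder), the sharded image volume and Gluster NFS get options
>   along these lines:
>
>     # shard the squashfs image volume so large image files spread out
>     gluster volume set images features.shard on
>     gluster volume set images features.shard-block-size 64MB
>     # we still rely on the built-in Gluster NFS server, so re-enable it
>     gluster volume set images nfs.disable off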
>
> - Leader Setup
>   * Admin node installs the leaders like any other compute node
>   * A setup tool then configures the gluster volumes and CTDB
> * When ready, an admin/head node can be engaged with the leaders
> * At that point, certain paths on the admin become gluster fuse mounts
> and bind mounts to gluster fuse mounts.
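>
>   Hypothetical paths, but the mechanism is simply fuse mounts plus bind
>   mounts so existing tools keep their usual paths, e.g.:
>
>     mount -t glusterfs leader1:/images /opt/clmgr/gluster/images
>     mount --bind /opt/clmgr/gluster/images/tftp /var/lib/tftpboot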
>
> - How images are deployed (squashfs mode)
> * User creates an image using image creation tools that make a
> chrootable tree style image on the admin/head node
>   * mksquashfs will generate a squashfs image file onto a
>     shared-storage gluster mount
> * Nodes will mount the filesystem with the squashfs images and then
> loop mount the squashfs as part of the boot process.
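>
>   In shell terms it boils down to something like this (paths and image
>   names are illustrative):
>
>     # on the admin node: build the image object onto shared storage
>     mksquashfs /var/lib/images/rhel8-compute \
>         /opt/clmgr/gluster/images/rhel8-compute.squashfs -comp xz
>
>     # on the compute node at boot: NFS-mount the image volume from the
>     # node's leader alias, then loop mount the squashfs read-only
>     mount -t nfs -o ro lead-alias:/images /images
>     mount -o loop,ro /images/rhel8-compute.squashfs /rootfs.ro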
>
> - How are compute nodes tied to leaders
> * We simply have a variable in our database where human or automated
> discovery tools can assign a given node to a given IP alias. This
> works better for us than trying to play routing tricks or load
> balance tricks
>   * When compute nodes PXE boot, the DHCP response includes next-server,
>     and the compute node uses the leader IP alias for TFTP/HTTP to fetch
>     the boot loader. DHCP config files are on shared storage to
>     facilitate future scaling of DHCP services (a sketch of a host
>     entry follows this list).
> * ipxe or grub2 network config files then fetch the kernel, initrd
> * initrd has a small update to load a miniroot (install environment)
> which has more tooling
> * Node is installed (for nodes with root disks) or does a network boot
> cycle.
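>
>   The per-node DHCP side is ordinary ISC dhcpd configuration; the values
>   below are invented for the example:
>
>     host n0001 {
>       hardware ethernet 00:11:22:33:44:55;
>       fixed-address 10.1.2.1;
>       # the leader IP alias assigned to this node
>       next-server 10.1.1.101;
>       filename "boot/grub2/grubx64.efi";
>     }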
>
> - Gluster sizing
>   * We typically size by compute nodes per leader, but this is not
>     driven by gluster per se. Squashfs image objects are very efficient
>     and probably would be fine for 2k nodes per leader. Leader nodes provide
> other services including console logs, system logs, and monitoring
> services.
> * Our biggest deployment at a customer site right now has 24 leader
> nodes. Bigger systems are coming.
>
> - Startup scripts - Getting all the gluster mounts and many bind mounts
>   used in the solution in place, as well as ensuring the gluster mounts
>   and ctdb lock are available before ctdb starts, was too painful for my
>   brain. So we have systemd startup scripts that sanity test and start
>   things gracefully (a stripped-down sketch follows).
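>
>   A stripped-down sketch of the idea (unit and script names are made up
>   here; the real thing does more checking):
>
>     # /etc/systemd/system/cm-gluster-ready.service
>     [Unit]
>     Description=Mount gluster volumes and wait for the CTDB lock volume
>     After=network-online.target glusterd.service
>     Before=ctdb.service
>
>     [Service]
>     Type=oneshot
>     RemainAfterExit=yes
>     # script does the fuse mounts, bind mounts, and sanity checks
>     ExecStart=/opt/clmgr/bin/cm-gluster-ready
>
>     [Install]
>     WantedBy=multi-user.target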
>
> - Future: The team is starting to test what a 96-leader setup (96
>   gluster servers) might look like for future exascale systems.
>
> - Future: Some customers have interest in replica-5 instead of
> replica-3. We want to test performance implications.
>
> - Future: Move to Ganesha, work with community if needed
>
> - Future: Work with Gluster community to make gluster fuse mounts
> efficient instead of NFS (may be easier with image objects than it was
> the old way with fully expanded trees for images!)
>
> - Challenges:
> * Every month someone tells me gluster is going to die because of Red
> Hat vs IBM and I have to justify things. It is getting harder.
>   * Giant squashfs images fail - mksquashfs reports an error - at around
>     90GB on sles15sp2 and sles15sp3. rhel8 does not suffer. We don't have
>     the bandwidth to dig in right now, but it left one customer upset.
>     Workarounds were provided to move that customer's development
>     environment out of the operating system image.
> * Since we have our own build and special use cases, we're on our own
> for support (by "on our own" I mean no paid support, community help
>     only). Our complex situations can produce some cases that you guys
>     don't see, and debugging them can take a month or more given the
>     volunteer nature of the community. Paying for support is harder,
> even if it were viable politically, since we support 6 distros and 3
> distro providers. Of course, paying for support is never the
> preference of management. It might be an interesting thing to
> investigate.
>   * Any gluster update is terror. We don't update much because finding a
>     gluster version that is stable for all our use cases, PLUS being able
>     to test at scale (which means thousands of nodes), is hard. We did some
> internal improvements here where we emulate a 2500-node-cluster
> using virtual machines on a much smaller cluster. However, it isn't
> ideal. So we start lagging the community over time until some
>     problem forces us to refresh. Then we tiptoe into the update. We
> most recently updated to gluster79 and it solved several problems
> related to use case #2 below.
>   * Due to lack of HW, testing the SU Leader solution is hard given the
>     limited number of internal clusters. However, I recently moved my
> primary development to my beefy desktop where I have a full cluster
> stack including 3 leader nodes with gluster running in virtual
> machines. So we have eliminated an excuse preventing internal people
> from playing with the solution stack.
> * Growing volumes, replacing bricks, and replacing servers work.
> However, the process is very involved and quirky for us. I have
> complicated code that has to do more than I'd really like to do to
> simply wheel in a complete leader replacement for a failed one. Even
>     with our tools, we often end up with some glusterd daemons not starting
> right and have to restart a few times to get a node or brick
> replacement going. I wish the process were just as simple as running
> a single command or set of commands and having it do the right
> thing.
>     -> Finding two leaders to get a complete set of peer files
>     -> Wedging in the replacement node with the UUID the node in that
>        position had before
>     -> Don't accidentally give a gluster peer its own peer file
>     -> Then go through an involved replace-brick procedure
>     -> "replace" vs. "reset"
>     -> A complicated dance with XATTRs that I still don't understand
>     -> Ensuring indices and .glusterfs pre-exist with the right
>        permissions
> My hope is I just misunderstand and will bring this up in a future
> discussion.
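>
>   To be concrete, per brick the dance today is roughly (volume, host,
>   and brick paths are placeholders):
>
>     # same hostname, fresh disks: reset the brick in place
>     gluster volume reset-brick images leader2:/data/brick/images start
>     gluster volume reset-brick images leader2:/data/brick/images \
>         leader2:/data/brick/images commit force
>
>     # or point the volume at a brick on a replacement host
>     gluster volume replace-brick images leader2:/data/brick/images \
>         leader9:/data/brick/images commit force
>
>     gluster volume heal images full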
>
>
>
>
> Use Case #2: Admin/Head Node High Availability
>
> ------------------------------------------------------------------------------
> - We used to have an admin node HA solution that consisted of two servers
>   and an external storage device. A VM hosted by the two servers served
>   as the "real admin node".
>
> - This solution was expensive due to the external storage
>
> - This solution was suboptimal because it lacked true quorum
>
> - Building on our gluster experience, we removed the external storage
> and added a 3rd server.
> (Due to some previous experience with DRBD we elected to stay with
> gluster here)
>
> - We build a gluster volume shared across the 3 servers, sharded, which
> primarily holds a virtual machine image file used by the admin node VM
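>
>   A minimal sketch of that volume (names and paths invented for the
>   example):
>
>     gluster volume create adminvm replica 3 \
>         node1:/data/brick/adminvm node2:/data/brick/adminvm \
>         node3:/data/brick/adminvm
>     # the stock "virt" option group applies sharding and the usual
>     # VM-image tunings (or set features.shard and friends by hand)
>     gluster volume set adminvm group virt
>     gluster volume start adminvm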
>
> - The physical nodes use bonding for network redundancy and bridges to
>   feed them into the VM.
>
> - We use pacemaker in this case to manage the setup
>
> - It's pretty simple - pacemaker rules manage a VirtualDomain instance
>   and a simple ssh monitor makes sure we can get into it.
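>
>   In pcs terms it is roughly just this, plus the small ssh-check
>   resource alongside it (names and paths are placeholders):
>
>     pcs resource create admin-vm ocf:heartbeat:VirtualDomain \
>         hypervisor="qemu:///system" \
>         config="/etc/cluster/adminvm.xml" \
>         op monitor interval=30s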
>
> - We typically have a single 3-12T image sitting on sharded gluster
>   shared storage, used by the virtual machine, which forms the true
>   admin node.
>
> - We set this up with manual instructions but tooling is coming soon to
> aid in automated setup of this solution
>
> - This solution is in use actively in at least 4 very large
> supercomputers.
>
> - I am impressed by the speed of the solution on the sharded volume.
>   Creating the VM image with libvirt against an image file hosted on a
>   sharded gluster volume is slick.
> (We use the fuse mount because we don't want to build our own
> qemu/libvirt, which would be needed at least for SLES and maybe
> RHEL too since we have our own gluster build)
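>
>   Concretely, the hosts just fuse-mount the volume and the VM image
>   lives there as a plain file, e.g. (size and paths illustrative):
>
>     mount -t glusterfs node1:/adminvm /adminvm
>     qemu-img create -f raw /adminvm/admin.img 6T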
>
> - Challenges:
>   * Suddenly not being able to boot was a terror for us (at random
>     times the VM would only see a disk size equal to the size of a
>     single shard).
>     -> Thankfully the community helped guide us to gluster79 and that
>        resolved it
>
>   * People keep asking to make a 2-node version but I push back.
>     Especially with gluster, but honestly with other solutions too,
>     "don't cheap out on arbitration" is what I try to tell people.
>
>
> Network Utilization
>
> ------------------------------------------------------------------------------
> - We typically have 2x 10G bonds on leaders
> - For simplicity, we co-mingle gluster and compute node traffic together
> - It is very rare that we approach 20G full utilization even in very
> intensive operations
> - Newer solutions may increase the speed of the bonds, but it isn't due
> to a current pressing need.
> - Locality Challenge:
> * For future Exascale systems, we may need to become concerned about
> locality
> * Certain compute nodes are far closer to some gluster servers than
> others
> * And gluster servers themselves need to talk among themselves but
> could be stretched across the topology
> * We have tools to monitor switch link utilization and so far have
> not hit a scary zone
>   * Somewhat complicated by fault tolerance. It would be sad to lay out
>     the leaders such that one bad PDU costs you quorum because all 3
>     leaders in the same replica-3 set were on the same PDU
> * But spreading them out may have locality implications
> * This is a future area of study for us. We have various ideas in
> mind if a problem strikes.