[Gluster-users] Gluster usage scenarios in HPC cluster management

Fri Mar 19 17:19:54 UTC 2021

Erik:

My apologies for not being more clear originally.

What I meant to say was that I use GlusterFS for HPC jobs. My understanding is that most HPC environments tend to use NVMe SSDs, for example, for their high-speed storage tier, but even those have a finite write endurance limit.

For a large corporation, consuming the write endurance of the NVMe SSDs and replacing them would normally just be a cost of doing business, but for a home lab, I can't afford to spend that kind of money whenever the drives wear out like that.

This is what drove me to test a GlusterFS distributed striped volume, built on tmpfs and exported over NFS over RDMA, so that the RAM is used both for executing the jobs and as high-speed scratch disk space during job execution. That way the scratch space is subject neither to the write endurance limits of NAND flash SSDs (NVMe or otherwise) nor to the significantly slower performance of mechanically rotating hard disk drives.
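
To put rough numbers on that, here is a minimal sketch of the scratch capacity such a setup yields. It assumes one tmpfs brick per node and a purely distributed (non-replicated) volume, so usable space is simply the sum of the bricks. The node count and RAM size are hypothetical placeholders, not my actual hardware.

```python
def tmpfs_scratch_capacity_gib(nodes, ram_per_node_gib, tmpfs_fraction=0.5):
    """Estimate scratch space and remaining job memory for RAM-backed bricks.

    Assumes one tmpfs brick per node and no replication, so the usable
    capacity of the volume is the sum of the brick sizes.
    """
    if not 0.0 < tmpfs_fraction < 1.0:
        raise ValueError("tmpfs_fraction must be between 0 and 1")
    brick_gib = ram_per_node_gib * tmpfs_fraction
    return {
        "brick_gib": brick_gib,
        "scratch_total_gib": brick_gib * nodes,
        "ram_left_per_node_gib": ram_per_node_gib - brick_gib,
    }


# Hypothetical example: 4 nodes with 128 GiB of RAM each, half given to tmpfs.
estimate = tmpfs_scratch_capacity_gib(nodes=4, ram_per_node_gib=128)
print(estimate)
```

The trade-off is visible in the last field: every GiB handed to the scratch volume is a GiB the jobs themselves can no longer use.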

So, I was talking about using GlusterFS for HPC as well, but in the context of job execution rather than the "management" tasks/operations that you described in your message below.

Thank you.

Sincerely,
Ewen

________________________________
From: Erik Jacobson <erik.jacobson at hpe.com>
Sent: March 19, 2021 12:24 PM
To: Ewen Chan <alpha754293 at hotmail.com>
Cc: Erik Jacobson <erik.jacobson at hpe.com>; gluster-users at gluster.org <gluster-users at gluster.org>
Subject: Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

> But I've also tested using tmpfs (allocating half of the RAM per compute node)
> and exporting that as a distributed striped GlusterFS volume over NFS over
> RDMA to the 100 Gbps IB network so that the "ramdrives" can be used as a high
> speed "scratch disk space" that doesn't have the write endurance limits that
> NAND based flash memory SSDs have.

In my world, we leave the high speed networks to jobs so I don't have
much to offer. In our test SU Leader setup where we may not have disks,
we do carve gluster bricks out of TMPFS mounts. However, in that test
case, designed to test the tooling and not the workload, I use iscsi to
emulate disks to test the true solution.

I will just mention that the cluster manager use of squashfs image
objects sitting on NFS mounts is very fast even on top of 20G (2x10G)
mgmt infrastructure. If you combine it with a TMPFS overlay, which is
our default, you will have a writable area in TMPFS that doesn't
persist. You will have low memory usage.
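
As a rough illustration of that layering (a sketch under assumptions, not the HPCM tooling): the function below only builds the mount commands for a read-only squashfs image on an NFS mount with a TMPFS overlay on top. The export name, image name, mount points under /rootfs, and tmpfs size are all hypothetical, and the mount points are assumed to exist already.

```python
import shlex
from typing import List


def overlay_root_commands(nfs_export: str, image_name: str,
                          tmpfs_size: str = "2g") -> List[List[str]]:
    """Return the mount commands for a squashfs-on-NFS root with a tmpfs overlay.

    Nothing is executed here; the commands are only constructed so they can
    be inspected. All paths are hypothetical placeholders.
    """
    base = "/rootfs"
    nfs_mnt, ro, rw, merged = (f"{base}/nfs", f"{base}/ro",
                               f"{base}/rw", f"{base}/merged")
    overlay_opts = f"lowerdir={ro},upperdir={rw}/upper,workdir={rw}/work"
    return [
        # 1. Mount the NFS export that holds the squashfs image objects.
        ["mount", "-t", "nfs", "-o", "ro", nfs_export, nfs_mnt],
        # 2. Loop mount the squashfs image read-only.
        ["mount", "-t", "squashfs", "-o", "loop,ro", f"{nfs_mnt}/{image_name}", ro],
        # 3. Mount a tmpfs to hold the writable, non-persistent layer.
        ["mount", "-t", "tmpfs", "-o", f"size={tmpfs_size}", "tmpfs", rw],
        # 4. overlayfs needs upperdir and workdir on the same filesystem.
        ["mkdir", "-p", f"{rw}/upper", f"{rw}/work"],
        # 5. Combine the read-only image and the tmpfs into one writable tree.
        ["mount", "-t", "overlay", "-o", overlay_opts, "overlay", merged],
    ]


commands = overlay_root_commands("leader-alias:/images", "compute.squashfs")
for cmd in commands:
    print(shlex.join(cmd))
```

Only blocks that are actually written land in the tmpfs, which is why memory use stays low compared with holding the whole image in RAM.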

For a 4-node cluster, you probably don't need to bother with squashfs
even and just mount the directory tree for the image at the right time.

By using tmpfs overlay and some post-boot configuration, you can perhaps
avoid the memory usage of what you are doing. As long as you don't need
to beat the crap out of root, an NFS root is fine and using gluster
backed disks is fine. Note that if you use exported trees with gnfs
instead of image objects, there are lots of volume tweaks you can make
to push efficiency up. For squashfs, I used a sharded volume.

It's easy for me to write this since we have the install environment.
While nothing is "Hard" in there, it's a bunch of code developed over
time. That said, if you wanted to experiment, I can share some pieces of
what we do. I just fear it's too complicated.

I will note that some customers advocate for a tiny root - say 1.5G --
that could fit in TMPFS easily and then attach in workloads (other
filesystems with development environments over the network, or container
environments, etc). That would be another way to keep memory use low for
a diskless cluster.

(we use gnfs because we're not ready to switch to ganesha yet. It's on
our list to move if we can get it working for our load).

> Yes, it isn't as reliable and certainly not highly available (if the power
> goes down and the battery backup is exhausted, the data is lost because it
> sat in RAM), but it's meant to solve these problems: mechanically rotating
> hard drives are too slow, NAND flash based SSDs have finite write endurance
> limits, and RAM drives, whilst in theory faster, are also the most expensive
> on a $/GB basis compared to the other storage solutions.
>
> It's rather unfortunate that you have these different "tiers" of storage, and
> there's really nothing else in between that can help address all of these
> issues simultaneously.
>
> Thank you for sharing your thoughts.
>
> Sincerely,
>
> Ewen Chan
>
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on
> behalf of Erik Jacobson <erik.jacobson at hpe.com>
> Sent: March 19, 2021 11:03 AM
> To: gluster-users at gluster.org <gluster-users at gluster.org>
> Subject: [Gluster-users] Gluster usage scenarios in HPC cluster management
>
> A while back I was asked to make a blog or something similar to discuss
> the use cases of the team I work on (HPCM cluster management) at HPE.
>
> If you are not interested in reading about what I'm up to, just delete
> this and move on.
>
> I really don't have a public blogging mechanism so I'll just describe
> what we're up to here. Some of this was posted in some form in the past.
> Since this contains the raw materials, I could make a wiki-ized version
> if there were a public place to put it.
>
>
>
> We currently use gluster in two parts of cluster management.
>
> In fact, gluster in our management node infrastructure is helping us to
> provide scaling and consistency to some of the largest clusters in the
> world, clusters in the TOP100 list. While I can get in to trouble by
> sharing too much, I will just say that trends are continuing and the
> future may have some exciting announcements on where on TOP100 certain
> new giant systems may end up in the coming 1-2 years.
>
> At HPE, HPCM is the "traditional cluster manager." There is another team
> that develops a more cloud-like solution and I am not discussing that
> solution here.
>
>
> Use Case #1: Leader Nodes and Scale Out
> ------------------------------------------------------------------------------
> - Why?
>   * Scale out
>   * Redundancy (combined with CTDB, any leader can fail)
>   * Consistency (All servers and compute agree on what the content is)
>
> - Cluster manager has an admin or head node and zero or more leader nodes
>
> - Leader nodes are provisioned in groups of 3 to use distributed
>   replica-3 volumes (although at least one customer has interest
>   in replica-5)
>
> - We configure a few different volumes for different use cases
>
> - We use Gluster NFS still because, over a year ago, Ganesha was not
>   working with our workload and we haven't had time to re-test and
>   engage with the community. No blame - we would also owe making sure
>   our settings are right.
>
> - We use CTDB for a measure of HA and IP alias management. We use this
>   instead of pacemaker to reduce complexity.
>
> - The volume use cases are:
>   * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
>     -> Normally squashFS image files for speed/efficiency exported NFS
>     -> Expanded ("chrootable") traditional NFS trees for people who
>        prefer that, but they don't scale as well and are slower to boot
>     -> Squashfs images sit on a sharded volume while a traditional gluster
>        volume is used for the expanded tree.
>   * TFTP/HTTP for network boot/PXE including miniroot
>     -> Spread across leaders too so one node is not saturated with
>        PXE/DHCP requests
>     -> Miniroot is a "fatter initrd" that has our CM toolchain
>   * Logs/consoles
>     -> For traditional logs and consoles (HPCM also uses
>        elasticsearch/kafka/friends but we don't put that in gluster)
>     -> Separate volume to have more non-cached friendly settings
>   * 4 total volumes used (one sharded, one heavily optimized for
>     caching, one for ctdb lock, and one traditional for logging/etc)
>
> - Leader Setup
>   * Admin node installs the leaders like any other compute nodes
>   * A setup tool operates that configures gluster volumes and CTDB
>   * When ready, an admin/head node can be engaged with the leaders
>   * At that point, certain paths on the admin become gluster fuse mounts
>     and bind mounts to gluster fuse mounts.
>
> - How images are deployed (squashfs mode)
>   * User creates an image using image creation tools that make a
>     chrootable tree style image on the admin/head node
>   * mksquashfs will generate a squashfs image file on to a shared
>     storage gluster mount
>   * Nodes will mount the filesystem with the squashfs images and then
>     loop mount the squashfs as part of the boot process.
>
> - How are compute nodes tied to leaders
>   * We simply have a variable in our database where human or automated
>     discovery tools can assign a given node to a given IP alias. This
>     works better for us than trying to play routing tricks or load
>     balance tricks
>   * When leaders PXE, the DHCP response includes next-server and the
>     compute node uses the leader IP alias for the tftp/http for
>     getting the boot loader. DHCP config files are on shared storage
>     to facilitate future scaling of DHCP services.
>   * ipxe or grub2 network config files then fetch the kernel, initrd
>   * initrd has a small update to load a miniroot (install environment)
>      which has more tooling
>   * Node is installed (for nodes with root disks) or does a network boot
>     cycle.
>
> - Gluster sizing
>   * We typically state compute nodes per leader but this is not for
>     gluster per-se. Squashfs image objects are very efficient and
>     probably would be fine for 2k nodes per leader. Leader nodes provide
>     other services including console logs, system logs, and monitoring
>     services.
>   * Our biggest deployment at a customer site right now has 24 leader
>     nodes. Bigger systems are coming.
>
> - Startup scripts - Getting all the gluster mounts and many bind mounts
>   used in the solution, as well as ensuring gluster mounts and ctdb lock
>   is available before ctdb start, was too painful for my brain. So we
>   have systemd startup scripts that sanity test and start things
>   gracefully.
>
> - Future: The team is starting to test what a 96-leader (96 gluster
>   servers) might look like for future exascale systems.
>
> - Future: Some customers have interest in replica-5 instead of
>   replica-3. We want to test performance implications.
>
> - Future: Move to Ganesha, work with community if needed
>
> - Future: Work with Gluster community to make gluster fuse mounts
>   efficient instead of NFS (may be easier with image objects than it was
>   the old way with fully expanded trees for images!)
>
> - Challenges:
>   * Every month someone tells me gluster is going to die because of Red
>     Hat vs IBM and I have to justify things. It is getting harder.
>   * Giant squashfs images fail - mksquashfs reports error - at around
>     90GB on sles15sp2 and sles15sp3. rhel8 does not suffer. Don't have
>     the bandwidth to dig in right now but one upset customer. Work
>     arounds provided to move development environment for that customer
>     out of the operating system image.
>   * Since we have our own build and special use cases, we're on our own
>     for support (by "on our own" I mean no paid support, community help
>     only). Our complex situations can produce some cases that you guys
>     don't see and debugging them can take a month or more with the
>     volunteer nature of the community. Paying for support is harder,
>     even if it were viable politically, since we support 6 distros and 3
>     distro providers. Of course, paying for support is never the
>     preference of management. It might be an interesting thing to
>     investigate.
>   * Any gluster update is terror. We don't update much because finding a
>     gluster version that is stable for all our use cases PLUS being able to
>     test at scale which means thousands of nodes is hard. We did some
>     internal improvements here where we emulate a 2500-node-cluster
>     using virtual machines on a much smaller cluster. However, it isn't
>     ideal. So we start lagging the community over time until some
>     problem forces us to refresh. Then we tip-toe in to the update. We
>     most recently updated to gluster79 and it solved several problems
>     related to use case #2 below.
>   * Due to lack of HW, testing the SU Leader solution is hard because of
>     the number of internal clusters. However, I recently moved my
>     primary development to my beefy desktop where I have a full cluster
>     stack including 3 leader nodes with gluster running in virtual
>     machines. So we have eliminated an excuse preventing internal people
>     from playing with the solution stack.
>   * Growing volumes, replacing bricks, and replacing servers work.
>     However, the process is very involved and quirky for us. I have
>     complicated code that has to do more than I'd really like to do to
>     simply wheel in a complete leader replacement for a failed one. Even
>     with our tools, we often end up with some glusterd's not starting
>     right and have to restart a few times to get a node or brick
>     replacement going. I wish the process were just as simple as running
>     a single command or set of commands and having it do the right
>     thing.
>        -> Finding two leaders to get a complete set of peer files
>        -> Wedging in the replacement node with the UUID the node in that
>           position had before
>        -> Don't accidentally give a gluster peer its own peer file
>        -> Then go through an involved replace brick procedure
>        -> "replace" vs "reset"
>        -> A complicated dance with XATTRs that I still don't understand
>        -> Ensuring indices and .glusterfs pre-exist with right
>           permissions
>     My hope is I just misunderstand and will bring this up in a future
>     discussion.
>
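A minimal sketch of how leaders provisioned in groups of 3 map onto a distributed replica-3 volume. It relies on the gluster CLI convention that consecutive bricks on the `gluster volume create` command line form one replica set. The volume name, host names, and brick path are hypothetical, and the command is only built here, not run.

```python
from typing import List


def replica_sets(leaders: List[str], replica: int = 3) -> List[List[str]]:
    """Split the ordered leader list into consecutive replica groups."""
    if not leaders or len(leaders) % replica != 0:
        raise ValueError(f"leader count must be a non-zero multiple of {replica}")
    return [leaders[i:i + replica] for i in range(0, len(leaders), replica)]


def volume_create_command(volume: str, leaders: List[str],
                          brick_path: str, replica: int = 3) -> List[str]:
    """Build a 'gluster volume create' command for a distributed-replicate volume.

    Brick order matters: each run of `replica` consecutive bricks becomes
    one replica set, and files are distributed across those sets.
    """
    bricks = [f"{host}:{brick_path}"
              for group in replica_sets(leaders, replica)
              for host in group]
    return ["gluster", "volume", "create", volume, "replica", str(replica), *bricks]


# Hypothetical example: six leaders give two replica-3 sets.
leaders = [f"leader{i}" for i in range(1, 7)]
groups = replica_sets(leaders)
cmd = volume_create_command("images", leaders, "/data/brick/images")
print(groups)
print(" ".join(cmd))
```

Growing the volume then means adding leaders three at a time, so that each new group forms a complete replica set.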
>
>
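The node-to-leader binding described above is just a per-node value naming an IP alias. A sketch of what an automated assignment could look like follows. The round-robin policy, node names, and alias addresses are assumptions for illustration, and the real discovery tools may choose differently.

```python
from collections import Counter
from itertools import cycle
from typing import Dict, List


def assign_nodes_to_aliases(nodes: List[str], aliases: List[str]) -> Dict[str, str]:
    """Map each compute node to one leader IP alias, round-robin.

    The alias, not a physical leader, is what the node is tied to, so the
    alias can move to another leader (for example via CTDB) on failure.
    """
    if not aliases:
        raise ValueError("at least one leader IP alias is required")
    return dict(zip(nodes, cycle(aliases)))


# Hypothetical example: six nodes spread over three aliases.
nodes = [f"n{i:03d}" for i in range(1, 7)]
aliases = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
mapping = assign_nodes_to_aliases(nodes, aliases)
load = Counter(mapping.values())
print(mapping)
print(load)
```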
>
> Use Case #2: Admin/Head Node High Availability
> ------------------------------------------------------------------------------
> - We used to have an admin node HA solution that contained two servers
>   and an external storage device. A VM was used for the "real admin
>   node" provided by the two servers.
>
> - This solution was expensive due to the external storage
>
> - This solution was less optimal due to not having true quorum
>
> - Building on our gluster experience, we removed the external storage
>   and added a 3rd server.
>      (Due to some previous experience with DRBD we elected to stay with
>       gluster here)
>
> - We build a gluster volume shared across the 3 servers, sharded, which
>   primarily holds a virtual machine image file used by the admin node VM
>
> - The physical nodes use bonding for network redundancy and bridges to
>   feed them in to the VM.
>
> - We use pacemaker in this case to manage the setup
>
> - It's pretty simple - pacemaker rules manage a VirtualDomain instance
>   and a simple ssh monitor makes sure we can get in to it.
>
> - We typically have a 3-12T single image sitting on gluster sharded
>   shared storage used by the virtual machine, which forms the true admin
>   node.
>
> - We set this up with manual instructions but tooling is coming soon to
>   aid in automated setup of this solution
>
> - This solution is in use actively in at least 4 very large
>   supercomputers.
>
> - I am impressed by the speed of the solution on the sharded volume. The
>   VM image creation speed using libvirt to talk to the image file hosted
>   on a sharded gluster volume works slick.
>     (We use the fuse mount because we don't want to build our own
>      qemu/libvirt, which would be needed at least for SLES and maybe
>      RHEL too since we have our own gluster build)
>
> - Challenges:
>   * Not being able to boot all of a sudden was a terror for us (where
>     the VM would only see the disk size as the size of a shard at random
>     times).
>     -> Thankfully community helped guide us to gluster79 and that resolved it
>
>   * People keep asking to make a 2-node version but I push back.
>     Especially with gluster but honestly with other solutions too, don't
>     cheap out with arbitration is what I try to tell people.
>
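The push-back on a 2-node version comes down to majority arithmetic. A small sketch using a plain strict-majority rule is below; the actual quorum options in gluster and pacemaker are more configurable than this, so treat it as an illustration only.

```python
def has_quorum(total_servers: int, alive_servers: int) -> bool:
    """True when strictly more than half of the servers are still up."""
    return alive_servers > total_servers // 2


def tolerated_failures(total_servers: int) -> int:
    """Number of servers that can fail while a strict majority remains."""
    return (total_servers - 1) // 2


for n in (2, 3, 5):
    print(f"{n} servers: survives {tolerated_failures(n)} failure(s)")
```

With two servers, a single failure leaves one of two, which is not a majority, so the survivor cannot tell a dead peer from a network split. A third server is the smallest change that fixes this.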
>
> Network Utilization
> ------------------------------------------------------------------------------
> - We typically have 2x 10G bonds on leaders
> - For simplicity, we co-mingle gluster and compute node traffic together
> - It is very rare that we approach 20G full utilization even in very
>   intensive operations
> - Newer solutions may increase the speed of the bonds, but it isn't due
>   to a current pressing need.
> - Locality Challenge:
>    * For future Exascale systems, we may need to become concerned about
>      locality
>    * Certain compute nodes are far closer to some gluster servers than
>      others
>    * And gluster servers themselves need to talk among themselves but
>      could be stretched across the topology
>    * We have tools to monitor switch link utilization and so far have
>      not hit a scary zone
>    * Somewhat complicated by fault tolerance. It would be sad to design
>      the leaders such that a PDU goes bad so you lose quorum because
>      3 leaders in the same replica-3 were on the same PDU
>    * But spreading them out may have locality implications
>    * This is a future area of study for us. We have various ideas in
>      mind if a problem strikes.
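The PDU concern can be checked mechanically. The sketch below flags any replica set in which losing a single PDU would leave no strict majority of its leaders. The leader names and PDU labels are hypothetical, and the strict-majority rule is a simplifying assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple


def pdu_quorum_risks(replica_sets: List[List[str]],
                     pdu_of: Dict[str, str]) -> List[Tuple[Tuple[str, ...], str]]:
    """Return (replica set, PDU) pairs where one PDU failure breaks quorum."""
    risks = []
    for group in replica_sets:
        per_pdu = Counter(pdu_of[host] for host in group)
        for pdu, lost in per_pdu.items():
            survivors = len(group) - lost
            # Quorum needs a strict majority of the replica set to survive.
            if survivors <= len(group) // 2:
                risks.append((tuple(group), pdu))
    return risks


# Hypothetical layout: the second replica set has two leaders on pduA.
sets = [["leader1", "leader2", "leader3"], ["leader4", "leader5", "leader6"]]
pdus = {"leader1": "pduA", "leader2": "pduB", "leader3": "pduC",
        "leader4": "pduA", "leader5": "pduA", "leader6": "pduB"}
risks = pdu_quorum_risks(sets, pdus)
print(risks)
```

Note that under this rule two leaders of a replica-3 set sharing a PDU is already enough to lose quorum, not only all three.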