[Gluster-users] Gluster usage scenarios in HPC cluster management
Olaf Buitelaar
olaf.buitelaar at gmail.com
Mon Apr 5 15:54:12 UTC 2021
Hi Erik,
Thanks for sharing your unique use-case and solution. It was very
interesting to read your write-up.
I agree with the last point under your use case #1: "Growing volumes,
replacing bricks, and replacing servers work. However, the process is very
involved and quirky for us. I have..."
I seem to suffer from similar issues where glusterd just doesn't want to
start up correctly the first time; maybe also see
https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
as one possible cause for not starting up correctly.
And secondly, for a "molten to the floor" server it would indeed be great
if gluster, instead of the current replace-brick commands, had something to
replace a complete host: it would simply recreate all the bricks it expects
on that host (or better yet, only the missing bricks, in case some survived
on another RAID/disk) and heal them all. Right now this process is quite
involved, and sometimes feels a bit like performing black magic.
Best, Olaf
On Fri 19 Mar 2021 at 16:10, Erik Jacobson <erik.jacobson at hpe.com> wrote:
> A while back I was asked to make a blog or something similar to discuss
> the use cases of the team I work on (HPCM cluster management) at HPE.
>
> If you are not interested in reading about what I'm up to, just delete
> this and move on.
>
> I really don't have a public blogging mechanism so I'll just describe
> what we're up to here. Some of this was posted in some form in the past.
> Since this contains the raw materials, I could make a wiki-ized version
> if there were a public place to put it.
>
>
>
> We currently use gluster in two parts of cluster management.
>
> In fact, gluster in our management node infrastructure is helping us to
> provide scaling and consistency to some of the largest clusters in the
> world, clusters in the TOP100 list. While I can get into trouble by
> sharing too much, I will just say that trends are continuing and the
> future may have some exciting announcements on where on TOP100 certain
> new giant systems may end up in the coming 1-2 years.
>
> At HPE, HPCM is the "traditional cluster manager." There is another team
> that develops a more cloud-like solution and I am not discussing that
> solution here.
>
>
> Use Case #1: Leader Nodes and Scale Out
>
> ------------------------------------------------------------------------------
> - Why?
> * Scale out
> * Redundancy (combined with CTDB, any leader can fail)
> * Consistency (All servers and compute agree on what the content is)
>
> - Cluster manager has an admin or head node and zero or more leader nodes
>
> - Leader nodes are provisioned in groups of 3 to use distributed
> replica-3 volumes (although at least one customer has interest
> in replica-5)
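>
>   As a rough sketch, a 6-leader distributed replica-3 volume ends up
>   looking something like this (hostnames and brick paths here are just
>   placeholders, not our real layout):
>
>     gluster volume create images replica 3 \
>         leader1:/data/brick/images leader2:/data/brick/images \
>         leader3:/data/brick/images leader4:/data/brick/images \
>         leader5:/data/brick/images leader6:/data/brick/images
>     gluster volume start images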
>
> - We configure a few different volumes for different use cases
>
> - We use Gluster NFS still because, over a year ago, Ganesha was not
>   working with our workload and we haven't had time to re-test and
>   engage with the community. No blame - it would also be on us to make
>   sure our settings are right.
>
> - We use CTDB for a measure of HA and IP alias management. We use this
> instead of pacemaker to reduce complexity.
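>
>   The CTDB piece is just the usual two config files, roughly like this
>   (addresses and the interface name are made up for the example):
>
>     # /etc/ctdb/nodes - private addresses of the leaders in the group
>     10.1.0.1
>     10.1.0.2
>     10.1.0.3
>
>     # /etc/ctdb/public_addresses - floating aliases compute nodes use
>     10.1.1.101/16 bond0
>     10.1.1.102/16 bond0
>     10.1.1.103/16 bond0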
>
> - The volume use cases are:
> * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
>     -> Normally squashfs image files, exported over NFS, for
>        speed/efficiency
>     -> Expanded ("chrootable") traditional NFS trees for people who
>        prefer that, but they don't scale as well and are slower to boot
>     -> Squashfs images sit on a sharded volume, while a traditional
>        gluster volume is used for the expanded trees (example volume
>        options follow this list)
> * TFTP/HTTP for network boot/PXE including miniroot
>     -> Spread across leaders too so that no single node is saturated
>        with PXE/DHCP requests
> -> Miniroot is a "fatter initrd" that has our CM toolchain
> * Logs/consoles
>     -> For traditional logs and consoles (HPCM also uses
>        elasticsearch/kafka/friends but we don't put that in gluster)
>     -> A separate volume so it can have settings friendlier to
>        non-cached access
> * 4 total volumes used (one sharded, one heavily optimized for
> caching, one for ctdb lock, and one traditional for logging/etc)
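>
>   Roughly speaking (exact tuning omitted, and the volume name is just a
>   placeholder), the sharded image volume and Gluster NFS get options
>   along these lines:
>
>     # shard the squashfs image volume so large image files spread out
>     gluster volume set images features.shard on
>     gluster volume set images features.shard-block-size 64MB
>     # we still rely on the built-in Gluster NFS server, so re-enable it
>     gluster volume set images nfs.disable off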
>
> - Leader Setup
>   * Admin node installs the leaders like any other compute node
>   * A setup tool then configures the gluster volumes and CTDB
> * When ready, an admin/head node can be engaged with the leaders
> * At that point, certain paths on the admin become gluster fuse mounts
> and bind mounts to gluster fuse mounts.
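>
>   Hypothetical paths, but the mechanism is simply fuse mounts plus bind
>   mounts so existing tools keep their usual paths, e.g.:
>
>     mount -t glusterfs leader1:/images /opt/clmgr/gluster/images
>     mount --bind /opt/clmgr/gluster/images/tftp /var/lib/tftpboot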
>
> - How images are deployed (squashfs mode)
> * User creates an image using image creation tools that make a
> chrootable tree style image on the admin/head node
>   * mksquashfs will generate a squashfs image file onto a
>     shared-storage gluster mount
> * Nodes will mount the filesystem with the squashfs images and then
> loop mount the squashfs as part of the boot process.
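>
>   In shell terms it boils down to something like this (paths and image
>   names are illustrative):
>
>     # on the admin node: build the image object onto shared storage
>     mksquashfs /var/lib/images/rhel8-compute \
>         /opt/clmgr/gluster/images/rhel8-compute.squashfs -comp xz
>
>     # on the compute node at boot: NFS-mount the image volume from the
>     # node's leader alias, then loop mount the squashfs read-only
>     mount -t nfs -o ro lead-alias:/images /images
>     mount -o loop,ro /images/rhel8-compute.squashfs /rootfs.ro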
>
> - How are compute nodes tied to leaders
> * We simply have a variable in our database where human or automated
> discovery tools can assign a given node to a given IP alias. This
> works better for us than trying to play routing tricks or load
> balance tricks
>   * When compute nodes PXE boot, the DHCP response includes next-server,
>     and the compute node uses the leader IP alias for TFTP/HTTP to fetch
>     the boot loader. DHCP config files are on shared storage to
>     facilitate future scaling of DHCP services (a sketch of a host
>     entry follows this list).
> * ipxe or grub2 network config files then fetch the kernel, initrd
> * initrd has a small update to load a miniroot (install environment)
> which has more tooling
> * Node is installed (for nodes with root disks) or does a network boot
> cycle.
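>
>   The per-node DHCP side is ordinary ISC dhcpd configuration; the values
>   below are invented for the example:
>
>     host n0001 {
>       hardware ethernet 00:11:22:33:44:55;
>       fixed-address 10.1.2.1;
>       # the leader IP alias assigned to this node
>       next-server 10.1.1.101;
>       filename "boot/grub2/grubx64.efi";
>     }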
>
> - Gluster sizing
>   * We typically size by compute nodes per leader, but this is not
>     driven by gluster per se. Squashfs image objects are very efficient
>     and probably would be fine for 2k nodes per leader. Leader nodes provide
> other services including console logs, system logs, and monitoring
> services.
> * Our biggest deployment at a customer site right now has 24 leader
> nodes. Bigger systems are coming.
>
> - Startup scripts - Getting all the gluster mounts and many bind mounts
>   used in the solution in place, as well as ensuring the gluster mounts
>   and ctdb lock are available before ctdb starts, was too painful for my
>   brain. So we have systemd startup scripts that sanity test and start
>   things gracefully (a stripped-down sketch follows).
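>
>   A stripped-down sketch of the idea (unit and script names are made up
>   here; the real thing does more checking):
>
>     # /etc/systemd/system/cm-gluster-ready.service
>     [Unit]
>     Description=Mount gluster volumes and wait for the CTDB lock volume
>     After=network-online.target glusterd.service
>     Before=ctdb.service
>
>     [Service]
>     Type=oneshot
>     RemainAfterExit=yes
>     # script does the fuse mounts, bind mounts, and sanity checks
>     ExecStart=/opt/clmgr/bin/cm-gluster-ready
>
>     [Install]
>     WantedBy=multi-user.target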
>
> - Future: The team is starting to test what a 96-leader setup (96
>   gluster servers) might look like for future exascale systems.
>
> - Future: Some customers have interest in replica-5 instead of
> replica-3. We want to test performance implications.
>
> - Future: Move to Ganesha, work with community if needed
>
> - Future: Work with Gluster community to make gluster fuse mounts
> efficient instead of NFS (may be easier with image objects than it was
> the old way with fully expanded trees for images!)
>
> - Challenges:
> * Every month someone tells me gluster is going to die because of Red
> Hat vs IBM and I have to justify things. It is getting harder.
>   * Giant squashfs images fail - mksquashfs reports an error - at around
>     90GB on sles15sp2 and sles15sp3. rhel8 does not suffer. We don't have
>     the bandwidth to dig in right now, but it left one customer upset.
>     Workarounds were provided to move that customer's development
>     environment out of the operating system image.
> * Since we have our own build and special use cases, we're on our own
> for support (by "on our own" I mean no paid support, community help
>     only). Our complex situations can produce some cases that you guys
>     don't see, and debugging them can take a month or more given the
>     volunteer nature of the community. Paying for support is harder,
> even if it were viable politically, since we support 6 distros and 3
> distro providers. Of course, paying for support is never the
> preference of management. It might be an interesting thing to
> investigate.
>   * Any gluster update is terror. We don't update much because finding a
>     gluster version that is stable for all our use cases, PLUS being able
>     to test at scale (which means thousands of nodes), is hard. We did some
> internal improvements here where we emulate a 2500-node-cluster
> using virtual machines on a much smaller cluster. However, it isn't
> ideal. So we start lagging the community over time until some
>     problem forces us to refresh. Then we tiptoe into the update. We
> most recently updated to gluster79 and it solved several problems
> related to use case #2 below.
>   * Due to lack of HW, testing the SU Leader solution is hard given the
>     limited number of internal clusters. However, I recently moved my
> primary development to my beefy desktop where I have a full cluster
> stack including 3 leader nodes with gluster running in virtual
> machines. So we have eliminated an excuse preventing internal people
> from playing with the solution stack.
> * Growing volumes, replacing bricks, and replacing servers work.
> However, the process is very involved and quirky for us. I have
> complicated code that has to do more than I'd really like to do to
> simply wheel in a complete leader replacement for a failed one. Even
>     with our tools, we often end up with some glusterd daemons not starting
> right and have to restart a few times to get a node or brick
> replacement going. I wish the process were just as simple as running
> a single command or set of commands and having it do the right
> thing.
>     -> Finding two leaders to get a complete set of peer files
>     -> Wedging in the replacement node with the UUID the node in that
>        position had before
>     -> Don't accidentally give a gluster peer its own peer file
>     -> Then go through an involved replace-brick procedure
>     -> "replace" vs. "reset"
>     -> A complicated dance with XATTRs that I still don't understand
>     -> Ensuring indices and .glusterfs pre-exist with the right
>        permissions
> My hope is I just misunderstand and will bring this up in a future
> discussion.
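>
>   To be concrete, per brick the dance today is roughly (volume, host,
>   and brick paths are placeholders):
>
>     # same hostname, fresh disks: reset the brick in place
>     gluster volume reset-brick images leader2:/data/brick/images start
>     gluster volume reset-brick images leader2:/data/brick/images \
>         leader2:/data/brick/images commit force
>
>     # or point the volume at a brick on a replacement host
>     gluster volume replace-brick images leader2:/data/brick/images \
>         leader9:/data/brick/images commit force
>
>     gluster volume heal images full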
>
>
>
>
> Use Case #2: Admin/Head Node High Availability
>
> ------------------------------------------------------------------------------
> - We used to have an admin node HA solution that consisted of two servers
>   and an external storage device. A VM hosted by the two servers served
>   as the "real admin node".
>
> - This solution was expensive due to the external storage
>
> - This solution was suboptimal because it lacked true quorum
>
> - Building on our gluster experience, we removed the external storage
> and added a 3rd server.
> (Due to some previous experience with DRBD we elected to stay with
> gluster here)
>
> - We build a gluster volume shared across the 3 servers, sharded, which
> primarily holds a virtual machine image file used by the admin node VM
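>
>   A minimal sketch of that volume (names and paths invented for the
>   example):
>
>     gluster volume create adminvm replica 3 \
>         node1:/data/brick/adminvm node2:/data/brick/adminvm \
>         node3:/data/brick/adminvm
>     # the stock "virt" option group applies sharding and the usual
>     # VM-image tunings (or set features.shard and friends by hand)
>     gluster volume set adminvm group virt
>     gluster volume start adminvm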
>
> - The physical nodes use bonding for network redundancy and bridges to
>   feed them into the VM.
>
> - We use pacemaker in this case to manage the setup
>
> - It's pretty simple - pacemaker rules manage a VirtualDomain instance
>   and a simple ssh monitor makes sure we can get into it.
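>
>   In pcs terms it is roughly just this, plus the small ssh-check
>   resource alongside it (names and paths are placeholders):
>
>     pcs resource create admin-vm ocf:heartbeat:VirtualDomain \
>         hypervisor="qemu:///system" \
>         config="/etc/cluster/adminvm.xml" \
>         op monitor interval=30s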
>
> - We typically have a single 3-12T image sitting on sharded gluster
>   shared storage, used by the virtual machine, which forms the true
>   admin node.
>
> - We set this up with manual instructions but tooling is coming soon to
> aid in automated setup of this solution
>
> - This solution is in use actively in at least 4 very large
> supercomputers.
>
> - I am impressed by the speed of the solution on the sharded volume.
>   Creating the VM image with libvirt against an image file hosted on a
>   sharded gluster volume is slick.
> (We use the fuse mount because we don't want to build our own
> qemu/libvirt, which would be needed at least for SLES and maybe
> RHEL too since we have our own gluster build)
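>
>   Concretely, the hosts just fuse-mount the volume and the VM image
>   lives there as a plain file, e.g. (size and paths illustrative):
>
>     mount -t glusterfs node1:/adminvm /adminvm
>     qemu-img create -f raw /adminvm/admin.img 6T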
>
> - Challenges:
>   * Suddenly not being able to boot was a terror for us (at random
>     times the VM would only see a disk size equal to the size of a
>     single shard).
>     -> Thankfully the community helped guide us to gluster79 and that
>        resolved it
>
>   * People keep asking to make a 2-node version but I push back.
>     Especially with gluster, but honestly with other solutions too,
>     "don't cheap out on arbitration" is what I try to tell people.
>
>
> Network Utilization
>
> ------------------------------------------------------------------------------
> - We typically have 2x 10G bonds on leaders
> - For simplicity, we co-mingle gluster and compute node traffic together
> - It is very rare that we approach 20G full utilization even in very
> intensive operations
> - Newer solutions may increase the speed of the bonds, but it isn't due
> to a current pressing need.
> - Locality Challenge:
> * For future Exascale systems, we may need to become concerned about
> locality
> * Certain compute nodes are far closer to some gluster servers than
> others
> * And gluster servers themselves need to talk among themselves but
> could be stretched across the topology
> * We have tools to monitor switch link utilization and so far have
> not hit a scary zone
>   * Somewhat complicated by fault tolerance. It would be sad to lay out
>     the leaders such that one bad PDU costs you quorum because all 3
>     leaders in the same replica-3 set were on the same PDU
> * But spreading them out may have locality implications
> * This is a future area of study for us. We have various ideas in
> mind if a problem strikes.