[Gluster-users] Gluster usage scenarios in HPC cluster management
Zeeshan Ali Shah
javaclinic at gmail.com
Tue Aug 23 06:33:10 UTC 2022
Hi All,
Adding my two cents: we have two kinds of storage, the first based on SSD
(600TB) and the second on spinning disks (20PB).
For the SSD storage I tried GlusterFS and BeeGFS and compared them with
Lustre. BeeGFS failed because of hard-link (ln) issues at that time (2019).
GlusterFS was very promising, but somehow Lustre on ZFS showed better
performance.
On the 20PB storage we are running Lustre on ZFS and are happy with it.
I am writing a whitepaper on the different parameters for tuning Lustre
performance, from the disks, to ZFS, to Lustre itself. This could help the
HPC community, especially in life science.
/Zee
Section head of IT Infrastructure,
Centre for Genomic Medicine, KFSHRC, Riyadh
On Mon, Apr 5, 2021 at 6:54 PM Olaf Buitelaar <olaf.buitelaar at gmail.com>
wrote:
> Hi Erik,
>
> Thanks for sharing your unique use-case and solution. It was very
> interesting to read your write-up.
>
> I agree with your point in the last item of your use case #1: "* Growing
> volumes, replacing bricks, and replacing servers work. However, the
> process is very involved and quirky for us. I have....."
>
> I do seem to suffer from similar issues where glusterd just doesn't want
> to start up correctly the first time; see also
> https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
> for one possible cause of it not starting up correctly.
> And secondly, for a "molten to the floor" server it would indeed be great
> if gluster, instead of the current replace-brick commands, had something to
> replace a complete host: it would simply recreate all the bricks it expects
> on that host (or better yet, only the missing bricks, in case some survived
> on another RAID/disk) and heal them all.
> Right now this process is quite involved and sometimes feels a bit like
> performing black magic.
>
> Best Olaf
>
> On Fri, 19 Mar 2021 at 16:10, Erik Jacobson <erik.jacobson at hpe.com> wrote:
>
>> A while back I was asked to make a blog post or something similar to
>> discuss the use cases of the team I work on (HPCM cluster management) at
>> HPE.
>>
>> If you are not interested in reading about what I'm up to, just delete
>> this and move on.
>>
>> I really don't have a public blogging mechanism so I'll just describe
>> what we're up to here. Some of this was posted in some form in the past.
>> Since this contains the raw materials, I could make a wiki-ized version
>> if there were a public place to put it.
>>
>>
>>
>> We currently use gluster in two parts of cluster management.
>>
>> In fact, gluster in our management node infrastructure is helping us to
>> provide scaling and consistency to some of the largest clusters in the
>> world, clusters in the TOP100 list. While I can get into trouble by
>> sharing too much, I will just say that trends are continuing and the
>> future may have some exciting announcements on where on TOP100 certain
>> new giant systems may end up in the coming 1-2 years.
>>
>> At HPE, HPCM is the "traditional cluster manager." There is another team
>> that develops a more cloud-like solution and I am not discussing that
>> solution here.
>>
>>
>> Use Case #1: Leader Nodes and Scale Out
>>
>> ------------------------------------------------------------------------------
>> - Why?
>> * Scale out
>> * Redundancy (combined with CTDB, any leader can fail)
>> * Consistency (All servers and compute agree on what the content is)
>>
>> - Cluster manager has an admin or head node and zero or more leader nodes
>>
>> - Leader nodes are provisioned in groups of 3 to use distributed
>> replica-3 volumes (although at least one customer has interest
>> in replica-5)
>>
>> - We configure a few different volumes for different use cases
>>
>> - We use Gluster NFS still because, over a year ago, Ganesha was not
>> working with our workload and we haven't had time to re-test and
>> engage with the community. No blame - we would also need to make sure
>> our settings are right.
>>
>> - We use CTDB for a measure of HA and IP alias management. We use this
>> instead of pacemaker to reduce complexity.
>>
>> - The volume use cases are:
>> * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
>> -> Normally squashfs image files for speed/efficiency, exported via NFS
>> -> Expanded ("chrootable") traditional NFS trees for people who
>> prefer that, but they don't scale as well and are slower to boot
>> -> Squashfs images sit on a sharded volume, while a traditional gluster
>> volume is used for the expanded trees.
>> * TFTP/HTTP for network boot/PXE including miniroot
>> -> Spread across leaders too so that one node is not saturated with
>> PXE/DHCP requests
>> -> Miniroot is a "fatter initrd" that has our CM toolchain
>> * Logs/consoles
>> -> For traditional logs and consoles (HPCM also uses
>> elasticsearch/kafka/friends but we don't put that in gluster)
>> -> Separate volume to have more non-cached friendly settings
>> * 4 total volumes used (one sharded, one heavily optimized for
>> caching, one for the ctdb lock, and one traditional for logging/etc);
>> a rough sketch of how such volumes might be created follows this list
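>>
>> (Not our actual setup tooling; just a minimal sketch, using the standard
>> gluster CLI, of how a sharded replica-3 volume and a cache-tuned volume
>> might be created. The host names, brick paths, volume names, and option
>> values are made up for illustration.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch: create a sharded replica-3 volume for squashfs
>>     # image files plus a second volume with a larger read cache.
>>     import subprocess
>>
>>     LEADERS = ["leader1", "leader2", "leader3"]   # assumed host names
>>
>>     def gluster(*args):
>>         # Run a gluster CLI command and raise if it fails.
>>         subprocess.run(["gluster", *args], check=True)
>>
>>     def bricks(volume):
>>         return [f"{h}:/data/brick/{volume}" for h in LEADERS]
>>
>>     # Sharded volume holding large squashfs image files.
>>     gluster("volume", "create", "images", "replica", "3", *bricks("images"))
>>     gluster("volume", "set", "images", "features.shard", "on")
>>     gluster("volume", "set", "images", "features.shard-block-size", "64MB")
>>     gluster("volume", "start", "images")
>>
>>     # Volume with more aggressive caching for mostly-read-only content.
>>     gluster("volume", "create", "boot", "replica", "3", *bricks("boot"))
>>     gluster("volume", "set", "boot", "performance.cache-size", "1GB")
>>     gluster("volume", "start", "boot")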
>>
>> - Leader Setup
>> * The admin node installs the leaders like any other compute node
>> * A setup tool then runs to configure the gluster volumes and CTDB
>> * When ready, an admin/head node can be engaged with the leaders
>> * At that point, certain paths on the admin become gluster fuse mounts
>> and bind mounts to gluster fuse mounts.
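>>
>> (A minimal sketch of that last step under assumed names, not the actual
>> setup tool: FUSE-mounting a gluster volume on the admin node and
>> bind-mounting a subdirectory over the path the tooling expects. The
>> volume and directory names are made up.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch: mount a gluster volume via FUSE, then bind-mount
>>     # part of it over an existing admin-node path.
>>     import subprocess
>>
>>     def mount_glusterfs(volfile_server, volume, mountpoint):
>>         # Equivalent to: mount -t glusterfs <server>:/<volume> <mountpoint>
>>         subprocess.run(["mount", "-t", "glusterfs",
>>                         f"{volfile_server}:/{volume}", mountpoint], check=True)
>>
>>     def bind_mount(src, dst):
>>         # Equivalent to: mount --bind <src> <dst>
>>         subprocess.run(["mount", "--bind", src, dst], check=True)
>>
>>     mount_glusterfs("leader1", "images", "/mnt/gluster/images")  # made-up paths
>>     bind_mount("/mnt/gluster/images/sles15", "/opt/clmgr/image/sles15")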
>>
>> - How images are deployed (squashfs mode)
>> * User creates an image using image creation tools that make a
>> chrootable tree style image on the admin/head node
>> * mksquashfs will generate a squashfs image file onto a shared-storage
>> gluster mount
>> * Nodes will mount the filesystem with the squashfs images and then
>> loop mount the squashfs as part of the boot process.
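>>
>> (Again a rough sketch under made-up paths, not the actual HPCM tooling:
>> squashing an image tree onto the gluster mount on the admin side, and
>> the NFS + loop mount a diskless node would do at boot.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch: build the squashfs onto shared storage, then
>>     # mount it the way a diskless compute node would.
>>     import subprocess
>>
>>     IMAGE_TREE = "/opt/clmgr/image/sles15sp3"              # made-up image tree
>>     SQUASH_OUT = "/mnt/gluster/images/sles15sp3.squashfs"  # on the sharded volume
>>
>>     # Admin side: write the squashfs image file directly onto the gluster mount.
>>     subprocess.run(["mksquashfs", IMAGE_TREE, SQUASH_OUT, "-noappend"], check=True)
>>
>>     # Compute node side (normally done in the initrd/miniroot): NFS-mount the
>>     # image directory from a leader IP alias, then loop-mount the squashfs
>>     # read-only to use as the root image.
>>     subprocess.run(["mount", "-t", "nfs", "lead1-alias:/images", "/images"],
>>                    check=True)
>>     subprocess.run(["mount", "-o", "loop,ro",
>>                     "/images/sles15sp3.squashfs", "/rootimg"], check=True)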
>>
>> - How are compute nodes tied to leaders
>> * We simply have a variable in our database where human or automated
>> discovery tools can assign a given node to a given IP alias. This
>> works better for us than trying to play routing tricks or load
>> balance tricks
>> * When nodes PXE boot, the DHCP response includes next-server, and the
>> compute node uses the leader IP alias for the tftp/http that fetches
>> the boot loader. DHCP config files are on shared storage to
>> facilitate future scaling of DHCP services.
>> * ipxe or grub2 network config files then fetch the kernel and initrd
>> * initrd has a small update to load a miniroot (install environment)
>> which has more tooling
>> * Node is installed (for nodes with root disks) or does a network boot
>> cycle.
>>
>> - Gluster sizing
>> * We typically size by compute nodes per leader, but this is not for
>> gluster per se. Squashfs image objects are very efficient and
>> probably would be fine at 2k nodes per leader. Leader nodes provide
>> other services including console logs, system logs, and monitoring
>> services.
>> * Our biggest deployment at a customer site right now has 24 leader
>> nodes. Bigger systems are coming.
>>
>> - Startup scripts - Getting all the gluster mounts and many bind mounts
>> used in the solution, as well as ensuring the gluster mounts and ctdb
>> lock are available before ctdb starts, was too painful for my brain. So
>> we have systemd startup scripts that sanity test and start things
>> gracefully (a rough sketch of that kind of check follows).
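>>
>> (Not the actual HPCM startup script; just a minimal sketch of the kind
>> of sanity test meant here: wait until the gluster FUSE mounts and the
>> ctdb lock volume are present before starting ctdb. The mount points and
>> lock path are made up.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical pre-start check: only start ctdb once the gluster
>>     # mounts are live and the volume holding the recovery lock is mounted.
>>     import os
>>     import subprocess
>>     import sys
>>     import time
>>
>>     REQUIRED_MOUNTS = ["/mnt/gluster/images", "/mnt/gluster/ctdb"]  # made up
>>     CTDB_LOCK = "/mnt/gluster/ctdb/.ctdb/reclock"                   # made up
>>
>>     for _ in range(30):                                  # wait up to ~5 minutes
>>         mounts_ok = all(os.path.ismount(m) for m in REQUIRED_MOUNTS)
>>         lock_ok = os.path.isdir(os.path.dirname(CTDB_LOCK))
>>         if mounts_ok and lock_ok:
>>             subprocess.run(["systemctl", "start", "ctdb"], check=True)
>>             sys.exit(0)
>>         time.sleep(10)
>>
>>     sys.exit("gluster mounts or ctdb lock volume not ready; not starting ctdb")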
>>
>> - Future: The team is starting to test what a 96-leader (96 gluster
>> servers) might look like for future exascale systems.
>>
>> - Future: Some customers have interest in replica-5 instead of
>> replica-3. We want to test performance implications.
>>
>> - Future: Move to Ganesha, work with community if needed
>>
>> - Future: Work with Gluster community to make gluster fuse mounts
>> efficient instead of NFS (may be easier with image objects than it was
>> the old way with fully expanded trees for images!)
>>
>> - Challenges:
>> * Every month someone tells me gluster is going to die because of Red
>> Hat vs IBM and I have to justify things. It is getting harder.
>> * Giant squashfs images fail (mksquashfs reports an error) at around
>> 90GB on sles15sp2 and sles15sp3; rhel8 does not suffer from this. We
>> don't have the bandwidth to dig in right now, but we have one upset
>> customer. Workarounds were provided to move that customer's
>> development environment out of the operating system image.
>> * Since we have our own build and special use cases, we're on our own
>> for support (by "on our own" I mean no paid support, community help
>> only). Our complex situations can produce some cases that you guys
>> don't see, and debugging them can take a month or more given the
>> volunteer nature of the community. Paying for support is harder,
>> even if it were viable politically, since we support 6 distros and 3
>> distro providers. Of course, paying for support is never the
>> preference of management. It might be an interesting thing to
>> investigate.
>> * Any gluster update is terror. We don't update much because finding a
>> gluster version that is stable for all our use cases, PLUS being able
>> to test at scale (which means thousands of nodes), is hard. We did some
>> internal improvements here where we emulate a 2500-node-cluster
>> using virtual machines on a much smaller cluster. However, it isn't
>> ideal. So we start lagging the community over time until some
>> problem forces us to refresh. Then we tip-toe in to the update. We
>> most recently updated to gluster79 and it solved several problems
>> related to use case #2 below.
>> * Due to lack of HW, testing the SU Leader solution is hard because of
>> the limited number of internal clusters. However, I recently moved my
>> primary development to my beefy desktop where I have a full cluster
>> stack including 3 leader nodes with gluster running in virtual
>> machines. So we have eliminated an excuse preventing internal people
>> from playing with the solution stack.
>> * Growing volumes, replacing bricks, and replacing servers work.
>> However, the process is very involved and quirky for us. I have
>> complicated code that has to do more than I'd really like to do to
>> simply wheel in a complete leader replacement for a failed one. Even
>> with our tools, we often end up with some glusterd's not starting
>> right and have to restart a few times to get a node or brick
>> replacement going. I wish the process were just as simple as running
>> a single command or set of commands and having it do the right
>> thing.
>> -> Finding two leaders to get a complete set of peer files
>> -> Wedging in the replacement node with the UUID the node in that
>> position had before
>> -> Don't accidentally give a gluster peer its own peer file
>> -> Then go through an involved replace-brick procedure
>> -> "replace" vs "reset"
>> -> A complicated dance with XATTRs that I still don't understand
>> -> Ensuring indices and .glusterfs pre-exist with right
>> permissions
>> My hope is that I just misunderstand, and I will bring this up in a
>> future discussion; a rough sketch of the replace-brick portion is below.
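>>
>> (To make the replace-brick step concrete, here is a minimal, heavily
>> simplified sketch in CLI terms; it is not our replacement tooling and
>> it skips the peer-file/UUID and xattr gymnastics described above. The
>> volume, host, and brick names are made up.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch of only the brick-replacement step: swap the dead
>>     # brick for one on a replacement leader, then trigger a full heal.
>>     import subprocess
>>
>>     VOLUME    = "images"                          # made-up volume name
>>     OLD_BRICK = "leader2:/data/brick/images"      # brick on the failed leader
>>     NEW_BRICK = "leader2-new:/data/brick/images"  # brick on the replacement
>>
>>     def gluster(*args):
>>         subprocess.run(["gluster", *args], check=True)
>>
>>     # Point the volume at the new brick in place of the dead one.
>>     gluster("volume", "replace-brick", VOLUME, OLD_BRICK, NEW_BRICK,
>>             "commit", "force")
>>
>>     # Ask the self-heal daemon to rebuild the new brick from its replicas.
>>     gluster("volume", "heal", VOLUME, "full")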
>>
>>
>>
>>
>> Use Case #2: Admin/Head Node High Availability
>>
>> ------------------------------------------------------------------------------
>> - We used to have an admin node HA solution that contained two servers
>> and an external storage device. A VM, hosted by the two servers, was
>> used as the "real admin node".
>>
>> - This solution was expensive due to the external storage
>>
>> - This solution was less than optimal due to not having true quorum
>>
>> - Building on our gluster experience, we removed the external storage
>> and added a 3rd server.
>> (Due to some previous experience with DRBD we elected to stay with
>> gluster here)
>>
>> - We build a sharded gluster volume shared across the 3 servers, which
>> primarily holds a virtual machine image file used by the admin node VM
>>
>> - The physical nodes use bonding for network redundancy and bridges to
>> feed them into the VM.
>>
>> - We use pacemaker in this case to manage the setup
>>
>> - It's pretty simple - pacemaker rules manage a VirtualDomain instance
>> and a simple ssh monitor makes sure we can get into it (a rough sketch
>> of such a resource definition follows).
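>>
>> (Not our exact configuration; just a minimal sketch of registering a
>> VirtualDomain resource with pcs plus a trivial one-shot ssh probe. The
>> resource name, domain XML path, host name, and timings are made up.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch: define the admin VM as a pacemaker VirtualDomain
>>     # resource and check that we can still ssh into it.
>>     import subprocess
>>
>>     def pcs(*args):
>>         subprocess.run(["pcs", *args], check=True)
>>
>>     # VirtualDomain resource agent pointed at the admin VM's libvirt XML.
>>     pcs("resource", "create", "admin_vm", "ocf:heartbeat:VirtualDomain",
>>         "config=/etc/cluster/admin-vm.xml", "hypervisor=qemu:///system",
>>         "op", "monitor", "interval=30s")
>>
>>     # One-shot stand-in for the ssh monitor: can we reach the VM at all?
>>     subprocess.run(["ssh", "-o", "ConnectTimeout=10", "admin", "true"],
>>                    check=True)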
>>
>> - We typically have a single 3-12T image sitting on sharded gluster
>> shared storage, used by the virtual machine, which forms the true admin
>> node (see the sketch below for how such an image might be created).
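>>
>> (A minimal sketch under assumed paths, not the HPCM tooling: creating a
>> large raw image file on the FUSE-mounted sharded volume for the VM to
>> use. The path and size are illustrative.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch: create a raw VM disk image on the gluster FUSE
>>     # mount; the shard translator splits it into shards behind the scenes.
>>     import subprocess
>>
>>     IMAGE = "/mnt/gluster/adminvm/admin-node.img"  # made-up path on the mount
>>
>>     # qemu-img creates a sparse raw image; gluster shards it transparently.
>>     subprocess.run(["qemu-img", "create", "-f", "raw", IMAGE, "4T"], check=True)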
>>
>> - We set this up with manual instructions but tooling is coming soon to
>> aid in automated setup of this solution
>>
>> - This solution is in use actively in at least 4 very large
>> supercomputers.
>>
>> - I am impressed by the speed of the solution on the sharded volume.
>> VM image creation, using libvirt to talk to the image file hosted
>> on a sharded gluster volume, works slick.
>> (We use the fuse mount because we don't want to build our own
>> qemu/libvirt, which would be needed at least for SLES and maybe
>> RHEL too, since we have our own gluster build.)
>>
>> - Challenges:
>> * Not being able to boot all of a sudden was a terror for us (the
>> VM would only see the disk size as the size of a single shard at
>> random times).
>> -> Thankfully the community helped guide us to gluster79 and that
>> resolved it
>>
>> * People keep asking for a 2-node version, but I push back.
>> Especially with gluster, but honestly with other solutions too,
>> what I try to tell people is: don't cheap out on arbitration.
>>
>>
>> Network Utilization
>>
>> ------------------------------------------------------------------------------
>> - We typically have 2x 10G bonds on leaders
>> - For simplicity, we co-mingle gluster and compute node traffic together
>> - It is very rare that we approach 20G full utilization even in very
>> intensive operations
>> - Newer solutions may increase the speed of the bonds, but it isn't due
>> to a current pressing need.
>> - Locality Challenge:
>> * For future Exascale systems, we may need to become concerned about
>> locality
>> * Certain compute nodes are far closer to some gluster servers than
>> others
>> * And gluster servers themselves need to talk among themselves but
>> could be stretched across the topology
>> * We have tools to monitor switch link utilization and so far have
>> not hit a scary zone
>> * Somewhat complicated by fault tolerance. It would be sad to design
>> the leader placement such that one PDU going bad loses you quorum
>> because the 3 leaders of the same replica-3 set were on the same PDU
>> * But spreading them out may have locality implications
>> * This is a future area of study for us. We have various ideas in
>> mind if a problem strikes.