[Gluster-users] Gluster usage scenarios in HPC cluster management
Zeeshan Ali Shah
javaclinic at gmail.com
Tue Aug 23 06:33:10 UTC 2022
Hi All,
Adding my two cents: we have two kinds of storage, the first based on SSD
(600TB) and the second on spinning disks (20PB).
For the SSD storage I tried GlusterFS and BeeGFS and compared them with
Lustre. BeeGFS failed because of hard-link (ln) issues at that time (2019).
GlusterFS was very promising, but somehow Lustre on ZFS showed better
performance.
On the 20PB storage we are running Lustre on ZFS and are happy with it.
I am writing a whitepaper on the different parameters for tuning Lustre
performance, from the disks, to ZFS, to Lustre itself. This could help the
HPC community, especially in life science.
/Zee
Section head of IT Infrastructure,
Centre for Genomic Medicine, KFSHRC, Riyadh
On Mon, Apr 5, 2021 at 6:54 PM Olaf Buitelaar <olaf.buitelaar at gmail.com>
wrote:
> Hi Erik,
>
> Thanks for sharing your unique use-case and solution. It was very
> interesting to read your write-up.
>
> I agree with your point in the last item of your use case #1: "* Growing
> volumes, replacing bricks, and replacing servers work. However, the
> process is very involved and quirky for us. I have....."
>
> I do seem to suffer from similar issues where glusterd just doesn't want
> to start up correctly the first time; see also
> https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
> for one possible cause of it not starting up correctly.
> And secondly, for a "molten to the floor" server it would indeed be great
> if gluster, instead of the current replace-brick commands, had something to
> replace a complete host: it would simply recreate all the bricks it expects
> on that host (or better yet, only the missing bricks, in case some survived
> on another RAID/disk) and heal them all.
> Right now this process is quite involved and sometimes feels a bit like
> performing black magic.
>
> Best Olaf
>
> On Fri, 19 Mar 2021 at 16:10, Erik Jacobson <erik.jacobson at hpe.com> wrote:
>
>> A while back I was asked to make a blog post or something similar to
>> discuss the use cases of the team I work on (HPCM cluster management) at
>> HPE.
>>
>> If you are not interested in reading about what I'm up to, just delete
>> this and move on.
>>
>> I really don't have a public blogging mechanism so I'll just describe
>> what we're up to here. Some of this was posted in some form in the past.
>> Since this contains the raw materials, I could make a wiki-ized version
>> if there were a public place to put it.
>>
>>
>>
>> We currently use gluster in two parts of cluster management.
>>
>> In fact, gluster in our management node infrastructure is helping us to
>> provide scaling and consistency to some of the largest clusters in the
>> world, clusters in the TOP100 list. While I can get into trouble by
>> sharing too much, I will just say that trends are continuing and the
>> future may have some exciting announcements on where on TOP100 certain
>> new giant systems may end up in the coming 1-2 years.
>>
>> At HPE, HPCM is the "traditional cluster manager." There is another team
>> that develops a more cloud-like solution and I am not discussing that
>> solution here.
>>
>>
>> Use Case #1: Leader Nodes and Scale Out
>>
>> ------------------------------------------------------------------------------
>> - Why?
>> * Scale out
>> * Redundancy (combined with CTDB, any leader can fail)
>> * Consistency (All servers and compute agree on what the content is)
>>
>> - Cluster manager has an admin or head node and zero or more leader nodes
>>
>> - Leader nodes are provisioned in groups of 3 to use distributed
>> replica-3 volumes (although at least one customer has interest
>> in replica-5)
>>
>> - We configure a few different volumes for different use cases
>>
>> - We use Gluster NFS still because, over a year ago, Ganesha was not
>> working with our workload and we haven't had time to re-test and
>> engage with the community. No blame - we would also need to make sure
>> our settings are right.
>>
>> - We use CTDB for a measure of HA and IP alias management. We use this
>> instead of pacemaker to reduce complexity.
>>
>> - The volume use cases are:
>> * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
>> -> Normally squashfs image files for speed/efficiency, exported via NFS
>> -> Expanded ("chrootable") traditional NFS trees for people who
>> prefer that, but they don't scale as well and are slower to boot
>> -> Squashfs images sit on a sharded volume, while a traditional gluster
>> volume is used for the expanded trees.
>> * TFTP/HTTP for network boot/PXE including miniroot
>> -> Spread across leaders too so that one node is not saturated with
>> PXE/DHCP requests
>> -> Miniroot is a "fatter initrd" that has our CM toolchain
>> * Logs/consoles
>> -> For traditional logs and consoles (HPCM also uses
>> elasticsearch/kafka/friends but we don't put that in gluster)
>> -> Separate volume to have more non-cached friendly settings
>> * 4 total volumes used (one sharded, one heavily optimized for
>> caching, one for the ctdb lock, and one traditional for logging/etc);
>> a rough sketch of how such volumes might be created follows this list
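>>
>> (Not our actual setup tooling; just a minimal sketch, using the standard
>> gluster CLI, of how a sharded replica-3 volume and a cache-tuned volume
>> might be created. The host names, brick paths, volume names, and option
>> values are made up for illustration.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch: create a sharded replica-3 volume for squashfs
>>     # image files plus a second volume with a larger read cache.
>>     import subprocess
>>
>>     LEADERS = ["leader1", "leader2", "leader3"]   # assumed host names
>>
>>     def gluster(*args):
>>         # Run a gluster CLI command and raise if it fails.
>>         subprocess.run(["gluster", *args], check=True)
>>
>>     def bricks(volume):
>>         return [f"{h}:/data/brick/{volume}" for h in LEADERS]
>>
>>     # Sharded volume holding large squashfs image files.
>>     gluster("volume", "create", "images", "replica", "3", *bricks("images"))
>>     gluster("volume", "set", "images", "features.shard", "on")
>>     gluster("volume", "set", "images", "features.shard-block-size", "64MB")
>>     gluster("volume", "start", "images")
>>
>>     # Volume with more aggressive caching for mostly-read-only content.
>>     gluster("volume", "create", "boot", "replica", "3", *bricks("boot"))
>>     gluster("volume", "set", "boot", "performance.cache-size", "1GB")
>>     gluster("volume", "start", "boot")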
>>
>> - Leader Setup
>> * The admin node installs the leaders like any other compute node
>> * A setup tool then runs to configure the gluster volumes and CTDB
>> * When ready, an admin/head node can be engaged with the leaders
>> * At that point, certain paths on the admin become gluster fuse mounts
>> and bind mounts to gluster fuse mounts.
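>>
>> (A minimal sketch of that last step under assumed names, not the actual
>> setup tool: FUSE-mounting a gluster volume on the admin node and
>> bind-mounting a subdirectory over the path the tooling expects. The
>> volume and directory names are made up.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch: mount a gluster volume via FUSE, then bind-mount
>>     # part of it over an existing admin-node path.
>>     import subprocess
>>
>>     def mount_glusterfs(volfile_server, volume, mountpoint):
>>         # Equivalent to: mount -t glusterfs <server>:/<volume> <mountpoint>
>>         subprocess.run(["mount", "-t", "glusterfs",
>>                         f"{volfile_server}:/{volume}", mountpoint], check=True)
>>
>>     def bind_mount(src, dst):
>>         # Equivalent to: mount --bind <src> <dst>
>>         subprocess.run(["mount", "--bind", src, dst], check=True)
>>
>>     mount_glusterfs("leader1", "images", "/mnt/gluster/images")  # made-up paths
>>     bind_mount("/mnt/gluster/images/sles15", "/opt/clmgr/image/sles15")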
>>
>> - How images are deployed (squashfs mode)
>> * User creates an image using image creation tools that make a
>> chrootable tree style image on the admin/head node
>> * mksquashfs will generate a squashfs image file onto a shared-storage
>> gluster mount
>> * Nodes will mount the filesystem with the squashfs images and then
>> loop mount the squashfs as part of the boot process.
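>>
>> (Again a rough sketch under made-up paths, not the actual HPCM tooling:
>> squashing an image tree onto the gluster mount on the admin side, and
>> the NFS + loop mount a diskless node would do at boot.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch: build the squashfs onto shared storage, then
>>     # mount it the way a diskless compute node would.
>>     import subprocess
>>
>>     IMAGE_TREE = "/opt/clmgr/image/sles15sp3"              # made-up image tree
>>     SQUASH_OUT = "/mnt/gluster/images/sles15sp3.squashfs"  # on the sharded volume
>>
>>     # Admin side: write the squashfs image file directly onto the gluster mount.
>>     subprocess.run(["mksquashfs", IMAGE_TREE, SQUASH_OUT, "-noappend"], check=True)
>>
>>     # Compute node side (normally done in the initrd/miniroot): NFS-mount the
>>     # image directory from a leader IP alias, then loop-mount the squashfs
>>     # read-only to use as the root image.
>>     subprocess.run(["mount", "-t", "nfs", "lead1-alias:/images", "/images"],
>>                    check=True)
>>     subprocess.run(["mount", "-o", "loop,ro",
>>                     "/images/sles15sp3.squashfs", "/rootimg"], check=True)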
>>
>> - How are compute nodes tied to leaders
>> * We simply have a variable in our database where human or automated
>> discovery tools can assign a given node to a given IP alias. This
>> works better for us than trying to play routing tricks or load
>> balance tricks
>> * When nodes PXE boot, the DHCP response includes next-server, and the
>> compute node uses the leader IP alias for the tftp/http that fetches
>> the boot loader. DHCP config files are on shared storage to
>> facilitate future scaling of DHCP services.
>> * ipxe or grub2 network config files then fetch the kernel and initrd
>> * initrd has a small update to load a miniroot (install environment)
>> which has more tooling
>> * Node is installed (for nodes with root disks) or does a network boot
>> cycle.
>>
>> - Gluster sizing
>> * We typically size by compute nodes per leader, but this is not for
>> gluster per se. Squashfs image objects are very efficient and
>> probably would be fine at 2k nodes per leader. Leader nodes provide
>> other services including console logs, system logs, and monitoring
>> services.
>> * Our biggest deployment at a customer site right now has 24 leader
>> nodes. Bigger systems are coming.
>>
>> - Startup scripts - Getting all the gluster mounts and many bind mounts
>> used in the solution, as well as ensuring the gluster mounts and ctdb
>> lock are available before ctdb starts, was too painful for my brain. So
>> we have systemd startup scripts that sanity test and start things
>> gracefully (a rough sketch of that kind of check follows).
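>>
>> (Not the actual HPCM startup script; just a minimal sketch of the kind
>> of sanity test meant here: wait until the gluster FUSE mounts and the
>> ctdb lock volume are present before starting ctdb. The mount points and
>> lock path are made up.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical pre-start check: only start ctdb once the gluster
>>     # mounts are live and the volume holding the recovery lock is mounted.
>>     import os
>>     import subprocess
>>     import sys
>>     import time
>>
>>     REQUIRED_MOUNTS = ["/mnt/gluster/images", "/mnt/gluster/ctdb"]  # made up
>>     CTDB_LOCK = "/mnt/gluster/ctdb/.ctdb/reclock"                   # made up
>>
>>     for _ in range(30):                                  # wait up to ~5 minutes
>>         mounts_ok = all(os.path.ismount(m) for m in REQUIRED_MOUNTS)
>>         lock_ok = os.path.isdir(os.path.dirname(CTDB_LOCK))
>>         if mounts_ok and lock_ok:
>>             subprocess.run(["systemctl", "start", "ctdb"], check=True)
>>             sys.exit(0)
>>         time.sleep(10)
>>
>>     sys.exit("gluster mounts or ctdb lock volume not ready; not starting ctdb")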
>>
>> - Future: The team is starting to test what a 96-leader (96 gluster
>> servers) might look like for future exascale systems.
>>
>> - Future: Some customers have interest in replica-5 instead of
>> replica-3. We want to test performance implications.
>>
>> - Future: Move to Ganesha, work with community if needed
>>
>> - Future: Work with Gluster community to make gluster fuse mounts
>> efficient instead of NFS (may be easier with image objects than it was
>> the old way with fully expanded trees for images!)
>>
>> - Challenges:
>> * Every month someone tells me gluster is going to die because of Red
>> Hat vs IBM and I have to justify things. It is getting harder.
>> * Giant squashfs images fail (mksquashfs reports an error) at around
>> 90GB on sles15sp2 and sles15sp3; rhel8 does not suffer from this. We
>> don't have the bandwidth to dig in right now, but we have one upset
>> customer. Workarounds were provided to move that customer's
>> development environment out of the operating system image.
>> * Since we have our own build and special use cases, we're on our own
>> for support (by "on our own" I mean no paid support, community help
>> only). Our complex situations can produce some cases that you guys
>> don't see, and debugging them can take a month or more given the
>> volunteer nature of the community. Paying for support is harder,
>> even if it were viable politically, since we support 6 distros and 3
>> distro providers. Of course, paying for support is never the
>> preference of management. It might be an interesting thing to
>> investigate.
>> * Any gluster update is terror. We don't update much because finding a
>> gluster version that is stable for all our use cases, PLUS being able
>> to test at scale (which means thousands of nodes), is hard. We did some
>> internal improvements here where we emulate a 2500-node-cluster
>> using virtual machines on a much smaller cluster. However, it isn't
>> ideal. So we start lagging the community over time until some
>> problem forces us to refresh. Then we tip-toe in to the update. We
>> most recently updated to gluster79 and it solved several problems
>> related to use case #2 below.
>> * Due to lack of HW, testing the SU Leader solution is hard because of
>> the limited number of internal clusters. However, I recently moved my
>> primary development to my beefy desktop where I have a full cluster
>> stack including 3 leader nodes with gluster running in virtual
>> machines. So we have eliminated an excuse preventing internal people
>> from playing with the solution stack.
>> * Growing volumes, replacing bricks, and replacing servers work.
>> However, the process is very involved and quirky for us. I have
>> complicated code that has to do more than I'd really like to do to
>> simply wheel in a complete leader replacement for a failed one. Even
>> with our tools, we often end up with some glusterd's not starting
>> right and have to restart a few times to get a node or brick
>> replacement going. I wish the process were just as simple as running
>> a single command or set of commands and having it do the right
>> thing.
>> -> Finding two leaders to get a complete set of peer files
>> -> Wedging in the replacement node with the UUID the node in that
>> position had before
>> -> Don't accidentally give a gluster peer its own peer file
>> -> Then go through an involved replace-brick procedure
>> -> "replace" vs "reset"
>> -> A complicated dance with XATTRs that I still don't understand
>> -> Ensuring indices and .glusterfs pre-exist with right
>> permissions
>> My hope is that I just misunderstand, and I will bring this up in a
>> future discussion; a rough sketch of the replace-brick portion is below.
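>>
>> (To make the replace-brick step concrete, here is a minimal, heavily
>> simplified sketch in CLI terms; it is not our replacement tooling and
>> it skips the peer-file/UUID and xattr gymnastics described above. The
>> volume, host, and brick names are made up.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch of only the brick-replacement step: swap the dead
>>     # brick for one on a replacement leader, then trigger a full heal.
>>     import subprocess
>>
>>     VOLUME    = "images"                          # made-up volume name
>>     OLD_BRICK = "leader2:/data/brick/images"      # brick on the failed leader
>>     NEW_BRICK = "leader2-new:/data/brick/images"  # brick on the replacement
>>
>>     def gluster(*args):
>>         subprocess.run(["gluster", *args], check=True)
>>
>>     # Point the volume at the new brick in place of the dead one.
>>     gluster("volume", "replace-brick", VOLUME, OLD_BRICK, NEW_BRICK,
>>             "commit", "force")
>>
>>     # Ask the self-heal daemon to rebuild the new brick from its replicas.
>>     gluster("volume", "heal", VOLUME, "full")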
>>
>>
>>
>>
>> Use Case #2: Admin/Head Node High Availability
>>
>> ------------------------------------------------------------------------------
>> - We used to have an admin node HA solution that contained two servers
>> and an external storage device. A VM, hosted by the two servers, was
>> used as the "real admin node".
>>
>> - This solution was expensive due to the external storage
>>
>> - This solution was less than optimal due to not having true quorum
>>
>> - Building on our gluster experience, we removed the external storage
>> and added a 3rd server.
>> (Due to some previous experience with DRBD we elected to stay with
>> gluster here)
>>
>> - We build a sharded gluster volume shared across the 3 servers, which
>> primarily holds a virtual machine image file used by the admin node VM
>>
>> - The physical nodes use bonding for network redundancy and bridges to
>> feed them into the VM.
>>
>> - We use pacemaker in this case to manage the setup
>>
>> - It's pretty simple - pacemaker rules manage a VirtualDomain instance
>> and a simple ssh monitor makes sure we can get into it (a rough sketch
>> of such a resource definition follows).
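>>
>> (Not our exact configuration; just a minimal sketch of registering a
>> VirtualDomain resource with pcs plus a trivial one-shot ssh probe. The
>> resource name, domain XML path, host name, and timings are made up.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch: define the admin VM as a pacemaker VirtualDomain
>>     # resource and check that we can still ssh into it.
>>     import subprocess
>>
>>     def pcs(*args):
>>         subprocess.run(["pcs", *args], check=True)
>>
>>     # VirtualDomain resource agent pointed at the admin VM's libvirt XML.
>>     pcs("resource", "create", "admin_vm", "ocf:heartbeat:VirtualDomain",
>>         "config=/etc/cluster/admin-vm.xml", "hypervisor=qemu:///system",
>>         "op", "monitor", "interval=30s")
>>
>>     # One-shot stand-in for the ssh monitor: can we reach the VM at all?
>>     subprocess.run(["ssh", "-o", "ConnectTimeout=10", "admin", "true"],
>>                    check=True)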
>>
>> - We typically have a single 3-12T image sitting on sharded gluster
>> shared storage, used by the virtual machine, which forms the true admin
>> node (see the sketch below for how such an image might be created).
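>>
>> (A minimal sketch under assumed paths, not the HPCM tooling: creating a
>> large raw image file on the FUSE-mounted sharded volume for the VM to
>> use. The path and size are illustrative.)
>>
>>     #!/usr/bin/env python3
>>     # Hypothetical sketch: create a raw VM disk image on the gluster FUSE
>>     # mount; the shard translator splits it into shards behind the scenes.
>>     import subprocess
>>
>>     IMAGE = "/mnt/gluster/adminvm/admin-node.img"  # made-up path on the mount
>>
>>     # qemu-img creates a sparse raw image; gluster shards it transparently.
>>     subprocess.run(["qemu-img", "create", "-f", "raw", IMAGE, "4T"], check=True)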
>>
>> - We set this up with manual instructions but tooling is coming soon to
>> aid in automated setup of this solution
>>
>> - This solution is in use actively in at least 4 very large
>> supercomputers.
>>
>> - I am impressed by the speed of the solution on the sharded volume.
>> VM image creation, using libvirt to talk to the image file hosted
>> on a sharded gluster volume, works slick.
>> (We use the fuse mount because we don't want to build our own
>> qemu/libvirt, which would be needed at least for SLES and maybe
>> RHEL too, since we have our own gluster build.)
>>
>> - Challenges:
>> * Not being able to boot all of a sudden was a terror for us (the
>> VM would only see the disk size as the size of a single shard at
>> random times).
>> -> Thankfully the community helped guide us to gluster79 and that
>> resolved it
>>
>> * People keep asking for a 2-node version, but I push back.
>> Especially with gluster, but honestly with other solutions too,
>> what I try to tell people is: don't cheap out on arbitration.
>>
>>
>> Network Utilization
>>
>> ------------------------------------------------------------------------------
>> - We typically have 2x 10G bonds on leaders
>> - For simplicity, we co-mingle gluster and compute node traffic together
>> - It is very rare that we approach 20G full utilization even in very
>> intensive operations
>> - Newer solutions may increase the speed of the bonds, but it isn't due
>> to a current pressing need.
>> - Locality Challenge:
>> * For future Exascale systems, we may need to become concerned about
>> locality
>> * Certain compute nodes are far closer to some gluster servers than
>> others
>> * And gluster servers themselves need to talk among themselves but
>> could be stretched across the topology
>> * We have tools to monitor switch link utilization and so far have
>> not hit a scary zone
>> * Somewhat complicated by fault tolerance. It would be sad to design
>> the leader placement such that one PDU going bad loses you quorum
>> because the 3 leaders of the same replica-3 set were on the same PDU
>> * But spreading them out may have locality implications
>> * This is a future area of study for us. We have various ideas in
>> mind if a problem strikes.