<div dir="ltr">Hi Erik,<div><br></div><div>Thanks for sharing your unique use-case and solution. It was very interesting to read your write-up.</div><div><br></div><div>I agree with the last point of your use case 1: &quot; * Growing volumes, replacing bricks, and replacing servers work.</div>    However, the process is very involved and quirky for us. I have.....&quot;<div><br></div><div>I seem to suffer from similar issues, where glusterd just doesn&#39;t want to start up correctly the first time. See also <a href="https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html" target="_blank">https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html</a> for one possible cause of it not starting up correctly.</div><div>Secondly, for a &quot;molten to the floor&quot; server, it would indeed be great if gluster, instead of the current replace-brick commands, had something to replace a complete host: it would recreate all the bricks it expects on that host (or better yet, only the missing ones, in case some survived on another RAID/disk) and heal them all.</div><div>Right now this process is indeed quite involved, and sometimes feels a bit like performing black magic.</div><div><br></div><div>Best, Olaf</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 19 Mar 2021 at 16:10, Erik Jacobson &lt;<a href="mailto:erik.jacobson@hpe.com">erik.jacobson@hpe.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">A while back I was asked to make a blog or something similar to discuss<br>

the use cases of the team I work on (HPCM cluster management) at HPE.<br>

<br>

If you are not interested in reading about what I&#39;m up to, just delete<br>

this and move on.<br>

<br>

I really don&#39;t have a public blogging mechanism so I&#39;ll just describe<br>

what we&#39;re up to here. Some of this was posted in some form in the past.<br>

Since this contains the raw materials, I could make a wiki-ized version<br>

if there were a public place to put it.<br>

<br>

<br>

<br>

We currently use gluster in two parts of cluster management.<br>

<br>

In fact, gluster in our management node infrastructure is helping us to<br>

provide scaling and consistency to some of the largest clusters in the<br>

world, clusters in the TOP100 list. While I can get into trouble by<br>

sharing too much, I will just say that trends are continuing and the<br>

future may have some exciting announcements on where on TOP100 certain<br>

new giant systems may end up in the coming 1-2 years.<br>

<br>

At HPE, HPCM is the &quot;traditional cluster manager.&quot; There is another team<br>

that develops a more cloud-like solution and I am not discussing that<br>

solution here.<br>

<br>

<br>

Use Case #1: Leader Nodes and Scale Out<br>

------------------------------------------------------------------------------<br>

- Why?<br>

  * Scale out<br>

  * Redundancy (combined with CTDB, any leader can fail)<br>

  * Consistency (All servers and compute agree on what the content is)<br>

<br>

- Cluster manager has an admin or head node and zero or more leader nodes<br>

<br>

- Leader nodes are provisioned in groups of 3 to use distributed<br>

  replica-3 volumes (although at least one customer has interest<br>

  in replica-5)<br>

<br>

- We configure a few different volumes for different use cases<br>

<br>

- We use Gluster NFS still because, over a year ago, Ganesha was not<br>

  working with our workload and we haven&#39;t had time to re-test and<br>

  engage with the community. No blame - we would also owe making sure<br>

  our settings are right.<br>

<br>

- We use CTDB for a measure of HA and IP alias management. We use this<br>

  instead of pacemaker to reduce complexity.<br>

<br>

- The volume use cases are:<br>

  * Image sharing for diskless compute nodes (sometimes 6,000 nodes)<br>

    -&gt; Normally squashFS image files for speed/efficiency exported NFS<br>

    -&gt; Expanded (&quot;chrootable&quot;) traditional NFS trees for people who<br>

       prefer that, but they don&#39;t scale as well and are slower to boot<br>

    -&gt; Squashfs images sit on a sharded volume while traditional gluster<br>

       used for expanded tree.<br>

  * TFTP/HTTP for network boot/PXE including miniroot<br>

    -&gt; Spread across leaders too, so one node is not saturated with<br>

       PXE/DHCP requests<br>

    -&gt; Miniroot is a &quot;fatter initrd&quot; that has our CM toolchain<br>

  * Logs/consoles<br>

    -&gt; For traditional logs and consoles (HPCM also uses<br>

       elasticsearch/kafka/friends but we don&#39;t put that in gluster)<br>

    -&gt; Separate volume to have more non-cached friendly settings<br>

  * 4 total volumes used (one sharded, one heavily optimized for<br>

    caching, one for ctdb lock, and one traditional for logging/etc)<br>

<br>

- Leader Setup<br>

  * Admin node installs the leaders like any other compute nodes<br>

  * A setup tool operates that configures gluster volumes and CTDB<br>

  * When ready, an admin/head node can be engaged with the leaders<br>

  * At that point, certain paths on the admin become gluster fuse mounts<br>

    and bind mounts to gluster fuse mounts.<br>

<br>

- How images are deployed (squashfs mode)<br>

  * User creates an image using image creation tools that make a<br>

    chrootable tree style image on the admin/head node<br>

  * mksquashfs will generate a squashfs image file on to a shared<br>

    storage gluster mount<br>

  * Nodes will mount the filesystem with the squashfs images and then<br>

    loop mount the squashfs as part of the boot process.<br>
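
<br>

A minimal sketch of the squashfs flow above, written as the commands involved.<br>

It only builds and prints the command lines; it runs nothing. All paths, the<br>

leader IP alias, the export name and the helper names are hypothetical, and the<br>

real boot process does more than this.<br>

<pre>
```python
import shlex

def build_image_cmd(tree, image_file):
    """mksquashfs SOURCE DEST: pack the chrootable tree into one image file."""
    return ["mksquashfs", tree, image_file]

def boot_mount_cmds(leader_alias, export, nfs_mnt, image_name, root_mnt):
    """Commands a node could run at boot: NFS mount, then loop mount the image."""
    return [
        ["mount", "-t", "nfs", "-o", "ro", f"{leader_alias}:{export}", nfs_mnt],
        ["mount", "-t", "squashfs", "-o", "loop,ro",
         f"{nfs_mnt}/{image_name}", root_mnt],
    ]

# Hypothetical example values, for illustration only.
cmds = [build_image_cmd("/images/sles15", "/gluster/images/sles15.squashfs")]
cmds += boot_mount_cmds("10.0.0.101", "/images", "/mnt/nfs",
                        "sles15.squashfs", "/mnt/root")
for cmd in cmds:
    print(shlex.join(cmd))
```
</pre>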

<br>

- How are compute nodes tied to leaders<br>

  * We simply have a variable in our database where human or automated<br>

    discovery tools can assign a given node to a given IP alias. This<br>

    works better for us than trying to play routing tricks or load<br>

    balance tricks<br>

  * When leaders PXE, the DHCP response includes next-server and the<br>

    compute node uses the leader IP alias for the tftp/http for<br>

    getting the boot loader. DHCP config files are on shared storage<br>

    to facilitate future scaling of DHCP services.<br>

  * ipxe or grub2 network config files then fetch the kernel, initrd<br>

  * initrd has a small update to load a miniroot (install environment)<br>

     which has more tooling<br>

  * Node is installed (for nodes with root disks) or does a network boot<br>

    cycle.<br>
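
<br>

A rough sketch of the node-to-alias assignment described above. The dict stands<br>

in for the database variable, and the round-robin policy, node names and alias<br>

addresses are assumptions made for illustration, not the actual HPCM tooling.<br>

<pre>
```python
from collections import Counter
from itertools import cycle

def assign_round_robin(nodes, leader_aliases):
    """Map each compute node to one leader IP alias, spreading nodes evenly."""
    if not leader_aliases:
        raise ValueError("at least one leader IP alias is required")
    alias_iter = cycle(leader_aliases)
    return {node: next(alias_iter) for node in nodes}

def next_server_for(node, assignments):
    """Return the alias a DHCP response would hand out as next-server."""
    return assignments[node]

# Hypothetical node names and alias addresses.
nodes = [f"n{i:04d}" for i in range(1, 7)]
aliases = ["10.0.0.101", "10.0.0.102", "10.0.0.103"]
assignments = assign_round_robin(nodes, aliases)
load = Counter(assignments.values())  # nodes served per alias
```
</pre>

Because the assignment is plain data, a human or a discovery tool can override<br>

a single entry without touching routing or load balancing.<br>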

<br>

- Gluster sizing<br>

  * We typically state compute nodes per leader but this is not for<br>

    gluster per-se. Squashfs image objects are very efficient and<br>

    probably would be fine for 2k nodes per leader. Leader nodes provide<br>

    other services including console logs, system logs, and monitoring<br>

    services.<br>

  * Our biggest deployment at a customer site right now has 24 leader<br>

    nodes. Bigger systems are coming.<br>

<br>

- Startup scripts - Getting all the gluster mounts and many bind mounts<br>

  used in the solution, as well as ensuring gluster mounts and ctdb lock<br>

  is available before ctdb start was too painful for my brain. So we<br>

  have systemd startup scripts that sanity test and start things<br>

  gracefully.<br>
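
<br>

The kind of sanity test such a startup script performs could look roughly like<br>

this. It is a sketch under assumptions: the mount points and helper names are<br>

hypothetical, and the real scripts also handle bind mounts and retries.<br>

<pre>
```python
import os

def missing_mounts(required_paths):
    """Return the required paths that are not currently mount points."""
    return [p for p in required_paths if not os.path.ismount(p)]

def ready_to_start_ctdb(gluster_mounts, lock_file):
    """Allow ctdb to start only if all mounts exist and the lock dir is there."""
    if missing_mounts(gluster_mounts):
        return False
    return os.path.isdir(os.path.dirname(lock_file))

# Hypothetical mount points; a real deployment would list its own.
REQUIRED = ["/gluster/shared", "/gluster/ctdb-lock"]

if __name__ == "__main__":
    print("missing mounts:", missing_mounts(REQUIRED))
```
</pre>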

<br>

- Future: The team is starting to test what a 96-leader (96 gluster<br>

  servers) might look like for future exascale systems.<br>

<br>

- Future: Some customers have interest in replica-5 instead of<br>

  replica-3. We want to test performance implications.<br>

<br>

- Future: Move to Ganesha, work with community if needed<br>

<br>

- Future: Work with Gluster community to make gluster fuse mounts<br>

  efficient instead of NFS (may be easier with image objects than it was<br>

  the old way with fully expanded trees for images!)<br>

<br>

- Challenges:<br>

  * Every month someone tells me gluster is going to die because of Red<br>

    Hat vs IBM and I have to justify things. It is getting harder.<br>

  * Giant squashfs images fail - mksquashfs reports error - at around<br>

    90GB on sles15sp2 and sles15sp3. rhel8 does not suffer. Don&#39;t have<br>

    the bandwidth to dig in right now but one upset customer. Workarounds<br>

    provided to move development environment for that customer<br>

    out of the operating system image.<br>

  * Since we have our own build and special use cases, we&#39;re on our own<br>

    for support (by &quot;on our own&quot; I mean no paid support, community help<br>

    only). Our complex situations can produce some cases that you guys<br>

    don&#39;t see and debugging them can take a month or more with the<br>

    volunteer nature of the community. Paying for support is harder,<br>

    even if it were viable politically, since we support 6 distros and 3<br>

    distro providers. Of course, paying for support is never the<br>

    preference of management. It might be an interesting thing to<br>

    investigate.<br>

  * Any gluster update is terror. We don&#39;t update much because finding a<br>

    gluster version that is stable for all our use cases PLUS being able to<br>

    test at scale which means thousands of nodes is hard. We did some<br>

    internal improvements here where we emulate a 2500-node-cluster<br>

    using virtual machines on a much smaller cluster. However, it isn&#39;t<br>

    ideal. So we start lagging the community over time until some<br>

    problem forces us to refresh. Then we tip-toe in to the update. We<br>

    most recently updated to gluster79 and it solved several problems<br>

    related to use case #2 below.<br>

  * Due to lack of HW, testing the SU Leader solution is hard because of<br>

    the number of internal clusters. However, I recently moved my<br>

    primary development to my beefy desktop where I have a full cluster<br>

    stack including 3 leader nodes with gluster running in virtual<br>

    machines. So we have eliminated an excuse preventing internal people<br>

    from playing with the solution stack.<br>

  * Growing volumes, replacing bricks, and replacing servers work.<br>

    However, the process is very involved and quirky for us. I have<br>

    complicated code that has to do more than I&#39;d really like to do to<br>

    simply wheel in a complete leader replacement for a failed one. Even<br>

    with our tools, we often end up with some glusterd&#39;s not starting<br>

    right and have to restart a few times to get a node or brick<br>

    replacement going. I wish the process were just as simple as running<br>

    a single command or set of commands and having it do the right<br>

    thing.<br>

       -&gt; Finding two leaders to get a complete set of peer files<br>

       -&gt; Wedging in the replacement node with the UUID the node in that<br>

          position had before<br>

       -&gt; Don&#39;t accidentally give a gluster peer its own peer file<br>

       -&gt; Then go through an involved replace brick procedure<br>

       -&gt; &quot;replace&quot; vs &quot;reset&quot;<br>

       -&gt; A complicated dance with XATTRs that I still don&#39;t understand<br>

       -&gt; Ensuring indices and .glusterfs pre-exist with right<br>

          permissions<br>

    My hope is I just misunderstand and will bring this up in a future<br>

    discussion.<br>

<br>

<br>

<br>

<br>

Use Case #2: Admin/Head Node High Availability<br>

------------------------------------------------------------------------------<br>

- We used to have an admin node HA solution that contained two servers<br>

  and an external storage device. A VM was used for the &quot;real admin<br>

  node&quot; provided by the two servers.<br>

<br>

- This solution was expensive due to the external storage<br>

<br>

- This solution was less optimal due to not having true quorum<br>

<br>

- Building on our gluster experience, we removed the external storage<br>

  and added a 3rd server.<br>

     (Due to some previous experience with DRBD we elected to stay with<br>

      gluster here)<br>

<br>

- We build a gluster volume shared across the 3 servers, sharded, which<br>

  primarily holds a virtual machine image file used by the admin node VM<br>

<br>

- The physical nodes use bonding for network redundancy and bridges to<br>

  feed them in to the VM.<br>

<br>

- We use pacemaker in this case to manage the setup<br>

<br>

- It&#39;s pretty simple - pacemaker rules manage a VirtualDomain instance<br>

  and a simple ssh monitor makes sure we can get in to it.<br>

<br>

- We typically have a 3-12T single image sitting on gluster sharded<br>

  shared storage used by the virtual machine, which forms the true admin<br>

  node.<br>

<br>

- We set this up with manual instructions but tooling is coming soon to<br>

  aid in automated setup of this solution<br>

<br>

- This solution is in use actively in at least 4 very large<br>

  supercomputers.<br>

<br>

- I am impressed by the speed of the solution on the sharded volume. The<br>

  VM image creation speed using libvirt to talk to the image file hosted<br>

  on a sharded gluster volume works slick.<br>

    (We use the fuse mount because we don&#39;t want to build our own<br>

     qemu/libvirt, which would be needed at least for SLES and maybe<br>

     RHEL too since we have our own gluster build)<br>

<br>

- Challenges:<br>

  * Not being able to boot all of a sudden was a terror for us (where<br>

    the VM would only see the disk size as the size of a shard at random<br>

    times).<br>

    -&gt; Thankfully community helped guide us to gluster79 and that resolved it<br>

<br>

  * People keep asking to make a 2-node version but I push back.<br>

    Especially with gluster but honestly with other solutions too, don&#39;t<br>

    cheap out with arbitration is what I try to tell people.<br>

<br>

<br>

Network Utilization<br>

------------------------------------------------------------------------------<br>

- We typically have 2x 10G bonds on leaders<br>

- For simplicity, we co-mingle gluster and compute node traffic together<br>

- It is very rare that we approach 20G full utilization even in very<br>

  intensive operations<br>

- Newer solutions may increase the speed of the bonds, but it isn&#39;t due<br>

  to a current pressing need.<br>

- Locality Challenge:<br>

   * For future Exascale systems, we may need to become concerned about<br>

     locality<br>

   * Certain compute nodes are far closer to some gluster servers than<br>

     others<br>

   * And gluster servers themselves need to talk among themselves but<br>

     could be stretched across the topology<br>

   * We have tools to monitor switch link utilization and so far have<br>

     not hit a scary zone<br>

   * Somewhat complicated by fault tolerance. It would be sad to design<br>

     the leaders such that a PDU goes bad so you lose quorum because<br>

     3 leaders in the same replica-3 were on the same PDU<br>

   * But spreading them out may have locality implications<br>

   * This is a future area of study for us. We have various ideas in<br>

     mind if a problem strikes.<br>
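
<br>

The PDU concern above can be checked mechanically. The sketch below assumes<br>

that consecutive leaders form a replica set (as with brick order at volume<br>

creation) and that a set needs a majority of its members to keep quorum. The<br>

leader names, PDU labels and function names are hypothetical.<br>

<pre>
```python
from collections import Counter

def replica_sets(leaders, replica=3):
    """Group leaders into consecutive replica sets of the given size."""
    if len(leaders) % replica != 0:
        raise ValueError("leader count must be a multiple of the replica count")
    return [leaders[i:i + replica] for i in range(0, len(leaders), replica)]

def sets_at_risk(leaders, pdu_of, replica=3):
    """Return replica sets that would lose quorum if a single PDU failed."""
    majority = replica // 2 + 1
    risky = []
    for group in replica_sets(leaders, replica):
        # Largest number of members of this set sharing one PDU.
        worst_loss = max(Counter(pdu_of[name] for name in group).values())
        if majority > replica - worst_loss:
            risky.append(group)
    return risky

# Hypothetical layout: the second set has two leaders on PDU "A".
leaders = ["l1", "l2", "l3", "l4", "l5", "l6"]
pdu_of = {"l1": "A", "l2": "B", "l3": "C", "l4": "A", "l5": "A", "l6": "B"}
risky = sets_at_risk(leaders, pdu_of)
```
</pre>

Under the majority assumption, two members of a replica-3 set on one PDU is<br>

already enough to flag the set, not only all three.<br>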


</blockquote></div>