<div dir="ltr"><div>Hi Erik,</div><div><br></div><div>It&#39;s great to hear positive feedback! Thanks for taking the time to send this email. It means a lot to us :) <br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Apr 9, 2020 at 10:55 AM Strahil Nikolov &lt;<a href="mailto:hunter86_bg@yahoo.com">hunter86_bg@yahoo.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On April 8, 2020 10:15:27 PM GMT+03:00, Erik Jacobson &lt;<a href="mailto:erik.jacobson@hpe.com" target="_blank">erik.jacobson@hpe.com</a>&gt; wrote:<br>

&gt;I wanted to share some positive news with the group here.<br>

&gt;<br>

&gt;Summary: Using sharding and squashfs image files instead of expanded<br>

&gt;directory trees for RO NFS OS images has led to impressive boot times<br>

&gt;for 2k-node diskless clusters using 12 servers for gluster+tftp+etc.<br>

&gt;<br>

&gt;Details:<br>

&gt;<br>

&gt;As you may have seen in some of my other posts, we have been using<br>

&gt;gluster to boot giant clusters, some of which are in the top500 list of<br>

&gt;HPC resources. The compute nodes are diskless.<br>

&gt;<br>

&gt;Up until now, we have done this by pushing an operating system from our<br>

&gt;head node to the storage cluster, which is made up of one or more<br>

&gt;3-server/(3-brick) subvolumes in a distributed/replicate configuration.<br>

&gt;The servers are also PXE-boot and tftpboot servers and also serve the<br>

&gt;&quot;miniroot&quot; (basically a fat initrd with a cluster manager toolchain).<br>

&gt;We also locate other management functions there unrelated to boot and<br>

&gt;root.<br>

&gt;<br>

&gt;This copy of the operating system is simply a directory tree<br>

&gt;representing the whole operating system image. You could &#39;chroot&#39; into<br>

&gt;it, for example.<br>

&gt;<br>

&gt;So this operating system is a read-only NFS mount point used as a base<br>

&gt;by all compute nodes to use as their root filesystem.<br>

&gt;<br>

&gt;This has been working well, getting us boot times (not including BIOS<br>

&gt;startup) of between 10 and 15 minutes for a 2,000 node cluster.<br>

&gt;Typically a cluster like this would have 12 gluster/nfs servers in 3<br>

&gt;subvolumes. On simple RHEL8 images without much customization, I tend<br>

&gt;to get 10 minutes.<br>

&gt;<br>

&gt;We have observed some slow-downs with certain job launch workloads for<br>

&gt;customers who have very metadata-intensive job launches. The metadata<br>

&gt;load of such a launch is very heavy, with giant loads being observed<br>

&gt;on the gluster servers.<br>

&gt;<br>

&gt;We recently started supporting RW NFS as opposed to TMPFS for this<br>

&gt;solution for the writable components of root. Our customers tend to<br>

&gt;prefer to keep every byte of memory for jobs. We came up with a<br>

&gt;solution of hosting sparse files for RW NFS, with XFS filesystems on<br>

&gt;top, from a writable area in gluster for NFS. This makes the RW NFS<br>

&gt;solution very fast because it reduces the per-node RW NFS metadata.<br>

&gt;Boot times didn&#39;t go up significantly (but our first attempt, using<br>

&gt;just a directory tree, was a slow disaster, hitting the worst-case<br>

&gt;workload of lots of small-file writes plus lots of metadata). So we<br>

&gt;solved that problem with XFS FS images on RW NFS.<br>

&gt;<br>

&gt;Building on that idea, we have in our development branch a version of<br>

&gt;the solution that changes the RO NFS image to a squashfs file on a<br>

&gt;sharding volume. That is, instead of each operating system being many<br>

&gt;thousands of files and being (slowly) synced to the gluster servers,<br>

&gt;the head node makes a squashfs file out of the image and pushes that.<br>

&gt;Then all the compute nodes mount the squashfs image from the NFS mount<br>

&gt;  (mount RO NFS mount, loop-mount squashfs image).<br>

&gt;<br>

&gt;On a 2,000 node cluster I had access to for a time, our prototype got<br>

&gt;us boot times of 5 minutes -- including RO NFS with squashfs and the RW<br>

&gt;NFS for writable areas like /etc, /var, etc. (on an XFS image file).<br>

&gt;  * We also tried RW NFS with OVERLAY and had no problems there<br>

&gt;<br>

&gt;I expect, for people who prefer the squashfs non-expanded format, we<br>

&gt;can reduce the leader per compute density.<br>

&gt;<br>

&gt;Now, not all customers will want squashfs. Some want to be able to edit<br>

&gt;a file and see it instantly on all nodes. However, customers looking<br>

&gt;for fast boot times, or who are suffering slowness on metadata-intensive<br>

&gt;job launch workloads, will have a new fast option.<br>

&gt;<br>

&gt;Therefore, it&#39;s very important we still solve the bug we&#39;re working on<br>

&gt;in another thread. But I wanted to share something positive.<br>

&gt;<br>

&gt;So now I&#39;ve said something positive instead of only asking for help :)<br>

&gt;<br>

&gt;Erik<br>

&gt;________<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt;Community Meeting Calendar:<br>

&gt;<br>

&gt;Schedule -<br>

&gt;Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC<br>

&gt;Bridge: <a href="https://bluejeans.com/441850968" rel="noreferrer" target="_blank">https://bluejeans.com/441850968</a><br>

&gt;<br>

&gt;Gluster-users mailing list<br>

&gt;<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>

&gt;<a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a><br>

<br>

Good job, Erik!<br>

<br>

Best Regards,<br>

Strahil Nikolov<br>


</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature">Regards,<br>Hari Gowtham.</div>