[Gluster-users] Impressive boot times for big clusters: NFS, Image Objects, and Sharding

Hari Gowtham hgowtham at redhat.com
Thu Apr 9 09:02:40 UTC 2020

Hi Erik,

It's great to hear positive feedback! Thanks for taking the time to send
this email. It means a lot to us :)

On Thu, Apr 9, 2020 at 10:55 AM Strahil Nikolov <hunter86_bg at yahoo.com>
wrote:
> On April 8, 2020 10:15:27 PM GMT+03:00, Erik Jacobson <
> erik.jacobson at hpe.com> wrote:
> >I wanted to share some positive news with the group here.
> >
> >Summary: Using sharding and squashfs image files instead of expanded
> >directory trees for RO NFS OS images have led to impressive boot times
> >of
> >2k diskless node clusters using 12 servers for gluster+tftp+etc+etc.
> >
> >Details:
> >
> >As you may have seen in some of my other posts, we have been using
> >gluster to boot giant clusters, some of which are in the top500 list of
> >HPC resources. The compute nodes are diskless.
> >
> >Up until now, we have done this by pushing an operating system from our
> >head node to the storage cluster, which is made up of one or more
> >3-server/(3-brick) subvolumes in a distributed/replicate configuration.
> >The servers are also PXE-boot and tftpboot servers, and they also serve
> >the "miniroot" (basically a fat initrd with a cluster-manager toolchain).
> >We also locate other management functions there unrelated to boot and
> >root.
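A distributed-replicate layout like the one described might be created along these lines. This is a hedged sketch: the server names, brick paths, and volume name are illustrative, not taken from this thread.

```shell
# Sketch: a distributed-replicate volume built from 3-server/3-brick
# replica subvolumes (here two subvolumes across six servers).
# All names (leader1..leader6, /data/brick, cm_shared) are hypothetical.
gluster volume create cm_shared replica 3 \
  leader1:/data/brick leader2:/data/brick leader3:/data/brick \
  leader4:/data/brick leader5:/data/brick leader6:/data/brick
gluster volume start cm_shared
# Files are distributed across the replica sets; each file is stored
# in full on all three bricks of one subvolume.
```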
> >
> >This copy of the operating system is simply a directory tree
> >representing the whole operating system image. You could 'chroot' into
> >it, for example.
> >
> >This operating system is exported as a read-only NFS mount point that
> >all compute nodes use as their root filesystem.
> >
> >This has been working well, getting us boot times (not including BIOS
> >startup) of between 10 and 15 minutes for a 2,000-node cluster.
> >Typically a cluster like this would have 12 gluster/nfs servers in 3
> >subvolumes. On simple RHEL8 images without much customization, I tend
> >to get 10 minutes.
> >
> >We have observed some slow-downs for customers whose job-launch
> >workloads are very metadata intensive; such operations put giant loads
> >on the gluster servers.
> >
> >We recently started supporting RW NFS, as opposed to TMPFS, for the
> >writable components of root. Our customers tend to prefer to keep
> >every byte of memory for jobs. Our solution hosts sparse files,
> >carrying XFS filesystems, in a writable gluster area exported over RW
> >NFS. This makes the RW NFS solution very fast because it reduces the
> >per-node RW NFS metadata traffic. Boot times didn't go up
> >significantly (though our first attempt, which just used a directory
> >tree, was a slow disaster, hitting the worst-case
> >lots-of-small-file-writes plus lots-of-metadata workload). So we
> >solved that problem with XFS filesystem images on RW NFS.
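The XFS-image-on-RW-NFS idea above might look roughly like the following. This is a hedged sketch under assumed names: the paths, sizes, and export names are illustrative, not from the post.

```shell
# Sketch: per-node sparse XFS image hosted on a RW NFS export backed by
# gluster. Paths (/gluster/rwnfs, node001.img, server:/rwnfs) are
# hypothetical.
#
# On the server side, create a sparse file and format it as XFS:
truncate -s 10G /gluster/rwnfs/node001.img
mkfs.xfs /gluster/rwnfs/node001.img
#
# On the compute node, mount the RW NFS export and loop-mount the image.
# Writes now land inside one big file, so per-file NFS metadata
# operations collapse into block I/O within the image:
mount -t nfs -o rw server:/rwnfs /mnt/rwnfs
mount -o loop /mnt/rwnfs/node001.img /writable
```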
> >
> >Building on that idea, we have in our development branch a version of
> >the solution that changes the RO NFS image to a squashfs file on a
> >sharding volume. That is, instead of each operating system being many
> >thousands of files that are (slowly) synced to the gluster servers,
> >the head node makes a squashfs file out of the image and pushes that.
> >Then all the compute nodes mount the squashfs image from the NFS mount
> >(mount the RO NFS export, then loop-mount the squashfs image).
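The squashfs-on-sharding flow might be sketched as below. Hedged assumptions: the volume name, shard block size, image paths, and compression choice are all illustrative; sharding splits each large file into fixed-size shards so it spreads across bricks instead of living on one replica set.

```shell
# Sketch: enable sharding on the image volume (names are hypothetical).
gluster volume set cm_images features.shard on
gluster volume set cm_images features.shard-block-size 64MB
#
# Head node: build one squashfs file from the expanded OS tree and
# copy that single file into the sharded volume:
mksquashfs /images/rhel8-tree /gluster/cm_images/rhel8.squashfs -comp xz
#
# Compute node: mount the RO NFS export, then loop-mount the squashfs
# to get the read-only root image:
mount -t nfs -o ro server:/cm_images /mnt/images
mount -t squashfs -o loop /mnt/images/rhel8.squashfs /rootfs
```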
> >
> >On a 2,000-node cluster I had access to for a time, our prototype got
> >us boot times of 5 minutes -- including RO NFS with squashfs and the
> >RW NFS for writable areas like /etc and /var (on an XFS image file).
> >  * We also tried RW NFS with OVERLAY and no problem there
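The OVERLAY variant mentioned above could be assembled roughly as follows, with the read-only squashfs root as the lower layer and the per-node writable area as the upper layer. The mount points are hypothetical.

```shell
# Sketch: overlayfs combining the RO squashfs root (/rootfs) with a
# writable upper layer on the loop-mounted XFS image (/writable).
# upperdir and workdir must be on the same (writable) filesystem.
mkdir -p /writable/upper /writable/work
mount -t overlay overlay \
  -o lowerdir=/rootfs,upperdir=/writable/upper,workdir=/writable/work \
  /newroot
# /newroot now presents the OS image with per-node writes captured
# in the upper layer.
```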
> >
> >I expect that, for people who prefer the non-expanded squashfs format,
> >we can reduce the leader-per-compute-node density (that is, fewer
> >leader servers needed per compute node).
> >
> >Now, not all customers will want squashfs. Some want to be able to
> >edit a file and see it instantly on all nodes. However, customers
> >looking for fast boot times, or who are suffering slowness on
> >metadata-intensive job-launch workloads, will have a new fast option.
> >
> >Therefore, it's very important we still solve the bug we're working on
> >in another thread. But I wanted to share something positive.
> >
> >So now I've said something positive instead of only asking for help :)
> >:)
> >
> >Erik
> >________
> >
> >
> >
> >Community Meeting Calendar:
> >
> >Schedule -
> >Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> >Bridge: https://bluejeans.com/441850968
> >
> >Gluster-users mailing list
> >Gluster-users at gluster.org
> >https://lists.gluster.org/mailman/listinfo/gluster-users
> Good Job Erik!
> Best Regards,
> Strahil Nikolov

Hari Gowtham.