[Gluster-users] Quick update on glusterd's volume scalability improvements

Sat Mar 30 02:36:21 UTC 2019

On Fri, Mar 29, 2019 at 6:42 AM Atin Mukherjee <amukherj at redhat.com> wrote:

> All,
>
> As many of you already know that the design logic with which GlusterD
> (here on to be referred as GD1) was implemented has some fundamental
> scalability bottlenecks at design level, especially around it's way of
> handshaking configuration meta data and replicating them across all the
> peers. While the initial design was adopted with a factor in mind that GD1
> will have to deal with just few tens of nodes/peers and volumes, the
> magnitude of the scaling bottleneck this design can bring in was never
> realized and estimated.
>
> Ever since Gluster has been adopted in container storage land as one of
> the storage backends, the business needs have changed. From tens of
> volumes, the requirements have translated to hundreds and now to thousands.
> We introduced brick multiplexing which had given some relief to have a
> better control on the memory footprint when having many number of
> bricks/volumes hosted in the node, but this wasn't enough. In one of our (I
> represent Red Hat) customer's deployment  we had seen on a 3 nodes cluster,
> whenever the number of volumes go beyond ~1500 and for some reason if one
> of the storage pods get rebooted, the overall time it takes to complete the
> overall handshaking (not only in a factor of n X n peer handshaking but
> also the number of volume iterations, building up the dictionary and
> sending it over the write) consumes a huge time as part of the handshaking
> process, the hard timeout of an rpc request which is 10 minutes gets
> expired and we see cluster going into a state where none of the cli
> commands go through and get stuck.
>
> With such problem being around and more demand of volume scalability, we
> started looking into these areas in GD1 to focus on improving (a) volume
> scalability (b) node scalability. While (b) is a separate topic for some
> other day we're going to focus on more on (a) today.
>
> While looking into this volume scalability problem with a deep dive, we
> realized that most of the bottleneck which was causing the overall delay in
> the friend handshaking and exchanging handshake packets between peers in
> the cluster was iterating over the in-memory data structures of the
> volumes, putting them into the dictionary sequentially. With 2k like
> volumes the function glusterd_add_volumes_to_export_dict () was quite
> costly and most time consuming. From pstack output when glusterd instance
> was restarted in one of the pods, we could always see that control was
> iterating in this function. Based on our testing on a 16 vCPU, 32 GB RAM 3
> nodes cluster, this function itself took almost *7.5 minutes . *The
> bottleneck is primarily because of sequential iteration of volumes,
> sequentially updating the dictionary with lot of (un)necessary keys.
>
> So what we tried out was making this loop to work on a worker thread model
> so that multiple threads can process a range of volume list and not all of
> them so that we can get more parallelism within glusterd. But with that we
> still didn't see any improvement and the primary reason for that was our
> dictionary APIs need locking. So the next idea was to actually make threads
> work on multiple dictionaries and then once all the volumes are iterated
> the subsequent dictionaries to be merged into a single one. Along with
> these changes there are few other improvements done on skipping comparison
> of snapshots if there's no snap available, excluding tiering keys if the
> volume type is not tier. With this enhancement [1] we see the overall time
> it took to complete building up the dictionary from the in-memory structure
> is *2 minutes 18 seconds* which is close*  ~3x* improvement. We firmly
> believe that with this improvement, we should be able to scale up to 2000
> volumes on a 3 node cluster and that'd help our users to get benefited with
> supporting more PVCs/volumes.
>
> Patch [1] is still in testing and might undergo few minor changes. But we
> welcome you for review and comment on it. We plan to get this work
> completed, tested and release in glusterfs-7.
>
> Last but not the least, I'd like to give a shout to Mohit Agrawal (In cc)
> for all the work done on this for last few days. Thank you Mohit!
>
>

This sounds good! Thank you for the update on this work.

Did you ever consider using etcd with GD1 (like as it is used with GD2)?
Having etcd as a backing store for configuration could remove expensive
handshaking as well as persistence of configuration on every node. I am
interested in understanding if you are aware of any drawbacks with that
approach. If there haven't been any thoughts in that direction, it might be
a fun experiment to try.

Thanks,
Vijay
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20190329/776fb03f/attachment.html>