[Gluster-devel] Quick update on glusterd's volume scalability improvements

Fri Mar 29 13:41:35 UTC 2019

All,

As many of you already know that the design logic with which GlusterD (here
on to be referred as GD1) was implemented has some fundamental scalability
bottlenecks at design level, especially around it's way of handshaking
configuration meta data and replicating them across all the peers. While
the initial design was adopted with a factor in mind that GD1 will have to
deal with just few tens of nodes/peers and volumes, the magnitude of the
scaling bottleneck this design can bring in was never realized and
estimated.

Ever since Gluster has been adopted in container storage land as one of the
storage backends, the business needs have changed. From tens of volumes,
the requirements have translated to hundreds and now to thousands. We
introduced brick multiplexing which had given some relief to have a better
control on the memory footprint when having many number of bricks/volumes
hosted in the node, but this wasn't enough. In one of our (I represent Red
Hat) customer's deployment  we had seen on a 3 nodes cluster, whenever the
number of volumes go beyond ~1500 and for some reason if one of the storage
pods get rebooted, the overall time it takes to complete the overall
handshaking (not only in a factor of n X n peer handshaking but also the
number of volume iterations, building up the dictionary and sending it over
the write) consumes a huge time as part of the handshaking process, the
hard timeout of an rpc request which is 10 minutes gets expired and we see
cluster going into a state where none of the cli commands go through and
get stuck.

With such problem being around and more demand of volume scalability, we
started looking into these areas in GD1 to focus on improving (a) volume
scalability (b) node scalability. While (b) is a separate topic for some
other day we're going to focus on more on (a) today.

While looking into this volume scalability problem with a deep dive, we
realized that most of the bottleneck which was causing the overall delay in
the friend handshaking and exchanging handshake packets between peers in
the cluster was iterating over the in-memory data structures of the
volumes, putting them into the dictionary sequentially. With 2k like
volumes the function glusterd_add_volumes_to_export_dict () was quite
costly and most time consuming. From pstack output when glusterd instance
was restarted in one of the pods, we could always see that control was
iterating in this function. Based on our testing on a 16 vCPU, 32 GB RAM 3
nodes cluster, this function itself took almost *7.5 minutes . *The
bottleneck is primarily because of sequential iteration of volumes,
sequentially updating the dictionary with lot of (un)necessary keys.

So what we tried out was making this loop to work on a worker thread model
so that multiple threads can process a range of volume list and not all of
them so that we can get more parallelism within glusterd. But with that we
still didn't see any improvement and the primary reason for that was our
dictionary APIs need locking. So the next idea was to actually make threads
work on multiple dictionaries and then once all the volumes are iterated
the subsequent dictionaries to be merged into a single one. Along with
these changes there are few other improvements done on skipping comparison
of snapshots if there's no snap available, excluding tiering keys if the
volume type is not tier. With this enhancement [1] we see the overall time
it took to complete building up the dictionary from the in-memory structure
is *2 minutes 18 seconds* which is close*  ~3x* improvement. We firmly
believe that with this improvement, we should be able to scale up to 2000
volumes on a 3 node cluster and that'd help our users to get benefited with
supporting more PVCs/volumes.

Patch [1] is still in testing and might undergo few minor changes. But we
welcome you for review and comment on it. We plan to get this work
completed, tested and release in glusterfs-7.

Last but not the least, I'd like to give a shout to Mohit Agrawal (In cc)
for all the work done on this for last few days. Thank you Mohit!

[1] https://review.gluster.org/#/c/glusterfs/+/22445/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20190329/c10e5e21/attachment-0001.html>