[Gluster-users] Quick update on glusterd's volume scalability improvements

Sat Mar 30 04:16:29 UTC 2019

On Sat, 30 Mar 2019 at 08:06, Vijay Bellur <vbellur at redhat.com> wrote:

>
>
> On Fri, Mar 29, 2019 at 6:42 AM Atin Mukherjee <amukherj at redhat.com>
> wrote:
>
>> All,
>>
>> As many of you already know that the design logic with which GlusterD
>> (here on to be referred as GD1) was implemented has some fundamental
>> scalability bottlenecks at design level, especially around it's way of
>> handshaking configuration meta data and replicating them across all the
>> peers. While the initial design was adopted with a factor in mind that GD1
>> will have to deal with just few tens of nodes/peers and volumes, the
>> magnitude of the scaling bottleneck this design can bring in was never
>> realized and estimated.
>>
>> Ever since Gluster has been adopted in container storage land as one of
>> the storage backends, the business needs have changed. From tens of
>> volumes, the requirements have translated to hundreds and now to thousands.
>> We introduced brick multiplexing which had given some relief to have a
>> better control on the memory footprint when having many number of
>> bricks/volumes hosted in the node, but this wasn't enough. In one of our (I
>> represent Red Hat) customer's deployment  we had seen on a 3 nodes cluster,
>> whenever the number of volumes go beyond ~1500 and for some reason if one
>> of the storage pods get rebooted, the overall time it takes to complete the
>> overall handshaking (not only in a factor of n X n peer handshaking but
>> also the number of volume iterations, building up the dictionary and
>> sending it over the write) consumes a huge time as part of the handshaking
>> process, the hard timeout of an rpc request which is 10 minutes gets
>> expired and we see cluster going into a state where none of the cli
>> commands go through and get stuck.
>>
>> With such problem being around and more demand of volume scalability, we
>> started looking into these areas in GD1 to focus on improving (a) volume
>> scalability (b) node scalability. While (b) is a separate topic for some
>> other day we're going to focus on more on (a) today.
>>
>> While looking into this volume scalability problem with a deep dive, we
>> realized that most of the bottleneck which was causing the overall delay in
>> the friend handshaking and exchanging handshake packets between peers in
>> the cluster was iterating over the in-memory data structures of the
>> volumes, putting them into the dictionary sequentially. With 2k like
>> volumes the function glusterd_add_volumes_to_export_dict () was quite
>> costly and most time consuming. From pstack output when glusterd instance
>> was restarted in one of the pods, we could always see that control was
>> iterating in this function. Based on our testing on a 16 vCPU, 32 GB RAM 3
>> nodes cluster, this function itself took almost *7.5 minutes . *The
>> bottleneck is primarily because of sequential iteration of volumes,
>> sequentially updating the dictionary with lot of (un)necessary keys.
>>
>> So what we tried out was making this loop to work on a worker thread
>> model so that multiple threads can process a range of volume list and not
>> all of them so that we can get more parallelism within glusterd. But with
>> that we still didn't see any improvement and the primary reason for that
>> was our dictionary APIs need locking. So the next idea was to actually make
>> threads work on multiple dictionaries and then once all the volumes are
>> iterated the subsequent dictionaries to be merged into a single one. Along
>> with these changes there are few other improvements done on skipping
>> comparison of snapshots if there's no snap available, excluding tiering
>> keys if the volume type is not tier. With this enhancement [1] we see the
>> overall time it took to complete building up the dictionary from the
>> in-memory structure is *2 minutes 18 seconds* which is close*  ~3x*
>> improvement. We firmly believe that with this improvement, we should be
>> able to scale up to 2000 volumes on a 3 node cluster and that'd help our
>> users to get benefited with supporting more PVCs/volumes.
>>
>> Patch [1] is still in testing and might undergo few minor changes. But we
>> welcome you for review and comment on it. We plan to get this work
>> completed, tested and release in glusterfs-7.
>>
>> Last but not the least, I'd like to give a shout to Mohit Agrawal (In cc)
>> for all the work done on this for last few days. Thank you Mohit!
>>
>>
>
> This sounds good! Thank you for the update on this work.
>
> Did you ever consider using etcd with GD1 (like as it is used with GD2)?
>

Honestly I had thought about it few times, but the primary reason was not
to go forward with that direction was the bandwidth as such improvements
isn’t a short term and tiny tasks, and also to keep in mind that GD2 tasks
were in our plate too. If any other contributors are willing to take this
up, I am more than happy to collaborate and provide guidance.

Having etcd as a backing store for configuration could remove expensive
> handshaking as well as persistence of configuration on every node. I am
> interested in understanding if you are aware of any drawbacks with that
> approach. If there haven't been any thoughts in that direction, it might be
> a fun experiment to try.
>
> Thanks,
> Vijay
>
-- 
- Atin (atinm)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20190330/cb658165/attachment.html>