[Gluster-devel] Multiplexing - good news, bad news, and a plea for help

Tue Sep 20 08:10:13 UTC 2016

On 19/09/16 15:26, Jeff Darcy wrote:
> I have brick multiplexing[1] functional to the point that it passes all basic AFR, EC, and quota tests.  There are still some issues with tiering, and I wouldn't consider snapshots functional at all, but it seemed like a good point to see how well it works.  I ran some *very simple* tests with 20 volumes, each 2x distribute on top of 2x replicate.
>
> First, the good news: it worked!  Getting 80 bricks to come up in the same process, and then run I/O correctly across all of those, is pretty cool.  Also, memory consumption is *way* down.  RSS size went from 1.1GB before (total across 80 processes) to about 400MB (one process) with multiplexing.  Each process seems to consume approximately 8MB globally plus 5MB per brick, so (8+5)*80 = 1040 vs. 8+(5*80) = 408.  Just considering the amount of memory, this means we could support about three times as many bricks as before.  When memory *contention* is considered, the difference is likely to be even greater.
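
To make the arithmetic in the paragraph above easy to replay, here is a
small sketch of that memory model. The 8MB-per-process and 5MB-per-brick
constants come straight from the message; the function names are made up
for illustration, and the model ignores anything that grows with load.

```python
# Rough constants quoted in the message: ~8 MB per process, ~5 MB per brick.
PER_PROCESS_MB = 8
PER_BRICK_MB = 5


def rss_mb(bricks: int, multiplexed: bool) -> int:
    """Estimated total RSS in MB for `bricks` bricks."""
    # Multiplexing pays the per-process cost once; otherwise once per brick.
    processes = 1 if multiplexed else bricks
    return processes * PER_PROCESS_MB + bricks * PER_BRICK_MB


def bricks_within(budget_mb: int, multiplexed: bool) -> int:
    """Largest brick count whose estimated RSS fits in `budget_mb`."""
    if multiplexed:
        return max(0, (budget_mb - PER_PROCESS_MB) // PER_BRICK_MB)
    return budget_mb // (PER_PROCESS_MB + PER_BRICK_MB)


print(rss_mb(80, multiplexed=False))  # 1040
print(rss_mb(80, multiplexed=True))   # 408
```

Under this model, the 1040MB that holds 80 separate brick processes
would hold 206 multiplexed bricks, which is roughly 2.6 times as many.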
>
> Bad news: some of our code doesn't scale very well in terms of CPU use.  To test performance I ran a test which would create 20,000 files across all 20 volumes, then write and delete them, all using 100 client threads.  This is similar to what smallfile does, but deliberately constructed to use a minimum of disk space - at any given time, only one file per thread (maximum) actually has 4KB worth of data in it.  This allows me to run it against SSDs or even ramdisks even with high brick counts, to factor out slow disks in a study of CPU/memory issues.  Here are some results and observations.
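
For anyone wanting to try something similar, below is a minimal sketch
of a workload with that shape. It is not the actual test program (which
is not shown in this thread), only one reading of the description: each
thread creates its share of empty files, then writes 4KB to each one and
deletes it before touching the next. The function names and the idea of
passing a list of mount-point directories are assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def worker(directory, thread_id, count, size=4096):
    """Create `count` empty files, then write and delete them one at a time."""
    paths = [os.path.join(directory, f"t{thread_id}-f{i}") for i in range(count)]
    for path in paths:                 # phase 1: create empty files
        open(path, "wb").close()
    payload = b"\0" * size
    for path in paths:                 # phase 2: write 4 KB, then delete
        with open(path, "r+b") as f:
            f.write(payload)
        os.unlink(path)                # at most one file per thread holds data
    return count


def run(directories, total_files=20000, threads=100):
    """Spread `threads` workers round-robin over `directories`."""
    per_thread = total_files // threads
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(worker, directories[t % len(directories)], t, per_thread)
            for t in range(threads)
        ]
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    # Small smoke run against a scratch directory instead of real volumes.
    with tempfile.TemporaryDirectory() as scratch:
        print(run([scratch], total_files=200, threads=10))
```

Pointing `directories` at the 20 volume mount points would approximate
the layout described above; the defaults mirror the 20,000 files and 100
threads from the message.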
>
> * On my first run, the multiplexed version of the test took 77% longer to run than the non-multiplexed version (5:42 vs. 3:13).  And that was after I'd done some hacking to use 16 epoll threads.  There's something a bit broken about trying to set that option normally, so that the value you set doesn't actually make it to the place that tries to spawn the threads.  Bumping this up further to 32 threads didn't seem to help.
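
The 77% figure follows directly from the two run times quoted above.
A throwaway helper (names are made up) to check it, or to compare later
runs against the same baseline:

```python
def seconds(mmss: str) -> int:
    """Convert an 'm:ss' run time into seconds."""
    minutes, secs = mmss.split(":")
    return int(minutes) * 60 + int(secs)


def percent_longer(slow: str, fast: str) -> float:
    """How much longer `slow` took than `fast`, as a percentage."""
    return (seconds(slow) / seconds(fast) - 1.0) * 100.0


print(round(percent_longer("5:42", "3:13")))  # 77
```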
>
> * A little profiling showed me that we're spending almost all of our time in pthread_spin_lock.  I disabled the code that uses spinlocks instead of regular mutexes, which immediately improved performance and also reduced CPU time by almost 50%.
>
> * The next round of profiling showed that a lot of the locking is in mem-pool code, and a lot of that in turn is from dictionary code.  Changing the dict code to use malloc/free instead of mem_get/mem_put gave another noticeable boost.

That's weird, since the only purpose of the mem-pool was precisely to 
improve performance of allocation of objects that are frequently 
allocated/released.

>
> At this point run time was down to 4:50, which is 20% better than where I started but still far short of non-multiplexed performance.  I can drive that down still further by converting more things to use malloc/free.  There seems to be a significant opportunity here to improve performance - even without multiplexing - by taking a more careful look at our memory-management strategies:
>
> * Tune the mem-pool implementation to scale better with hundreds of threads.
>
> * Use mem-pools more selectively, or even abandon them altogether.
>
> * Try a different memory allocator such as jemalloc.
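
As an illustration of the first option, here is a sketch contrasting a
pool with a single shared lock against one that keeps a free list per
thread. This is not the GlusterFS mem-pool code; both classes and their
names are hypothetical, and Python's interpreter lock means the sketch
shows the structure of the idea rather than a measurable speedup.

```python
import threading


class SharedPool:
    """One free list guarded by one lock: every get/put contends on it."""

    def __init__(self, factory):
        self._factory = factory
        self._free = []
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if self._free:
                return self._free.pop()
        return self._factory()

    def put(self, obj):
        with self._lock:
            self._free.append(obj)


class PerThreadPool:
    """One free list per thread: get/put never take a lock."""

    def __init__(self, factory, max_cached=64):
        self._factory = factory
        self._max_cached = max_cached
        self._local = threading.local()

    def _free_list(self):
        free = getattr(self._local, "free", None)
        if free is None:
            free = self._local.free = []
        return free

    def get(self):
        free = self._free_list()
        return free.pop() if free else self._factory()

    def put(self, obj):
        free = self._free_list()
        if len(free) < self._max_cached:
            free.append(obj)
        # Otherwise drop the object and let the allocator reclaim it.
```

One trade-off of the per-thread variant: an object released on a
different thread than the one that allocated it ends up cached on the
releasing thread, so the per-thread cap matters.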
>
> I'd certainly appreciate some help/collaboration in studying these options further.  It's a great opportunity to make a large impact on overall performance without a lot of code or specialized knowledge.  Even so, however, I don't think memory management is our only internal scalability problem.  There must be something else limiting parallelism, and quite severely at that.  My first guess is io-threads, so I'll be looking into that first, but if anybody else has any ideas please let me know.  There's no *good* reason why running many bricks in one process should be slower than running them in separate processes.  If it remains slower, then the limit on the number of bricks and volumes we can support will remain unreasonably low.  Also, the problems I'm seeing here probably don't *only* affect multiplexing.  Excessive memory/CPU use and poor parallelism are issues that we kind of need to address anyway, so if anybody has any ideas please let me know.

You have done a really good job :)

Some points I would look into:

* Consider http://review.gluster.org/15036/. With all communications 
going through the same socket, the problem this patch tries to solve 
could become worse.

* We should consider the possibility of implementing a global thread 
pool, which would replace io-threads, epoll threads and maybe others. 
Synctasks should also rely on this thread pool. This has the benefit of 
better controlling the total number of threads. Otherwise when we have 
more threads than processor cores, we waste resources unnecessarily and 
we won't get a real gain. Even worse, performance could start to degrade
due to contention.
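
A rough sketch of the shape of that idea, using Python's standard
executor rather than anything from the GlusterFS tree: one process-wide
pool, sized to the core count, that serves both network events and I/O
requests. The handler names and the choice of one worker per core are
assumptions made for illustration.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

# One process-wide pool; sizing it to the core count is an assumption.
WORKERS = os.cpu_count() or 1
GLOBAL_POOL = ThreadPoolExecutor(max_workers=WORKERS, thread_name_prefix="global")


def handle_network_event(n: int):
    """Stand-in for work that epoll threads would otherwise do."""
    return ("net", n, threading.current_thread().name)


def handle_io_request(n: int):
    """Stand-in for work that io-threads would otherwise do."""
    return ("io", n, threading.current_thread().name)


def run_mixed(count: int):
    """Submit both kinds of work to the same pool and collect the results."""
    futures = []
    for n in range(count):
        futures.append(GLOBAL_POOL.submit(handle_network_event, n))
        futures.append(GLOBAL_POOL.submit(handle_io_request, n))
    return [f.result() for f in futures]
```

However many tasks are submitted, the number of threads doing the work
never exceeds `WORKERS`, which is the control over total thread count
described above.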

* There are *too many* mutexes in the code. We should drastically reduce
their use, sometimes by using better structures that do not require
blocking at all, or even by introducing RCU and/or rwlocks. One case
I've always had doubts about is dict_t. Why does it need locks? One
xlator should not modify a dict_t once it has been passed to another
xlator, and if we assume that a dict can only be modified by a single
xlator at a time, it's very unlikely that the xlator needs to modify it
from multiple threads.
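
To make the single-owner assumption concrete, here is a small sketch of
a dictionary that carries no lock and instead records which component
currently owns it. It is not how dict_t works; the class, the method
names, and the "xlator-a"/"xlator-b" labels are all hypothetical. The
ownership check is the kind of thing that could be enabled in debug
builds to catch violations of the rule.

```python
class OwnedDict:
    """A dict wrapper with no lock: only the current owner may modify it."""

    def __init__(self, owner: str):
        self._owner = owner
        self._data = {}

    def _check(self, who: str) -> None:
        if who != self._owner:
            raise PermissionError(f"{who} touched a dict owned by {self._owner}")

    def set(self, who: str, key: str, value) -> None:
        self._check(who)
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def hand_off(self, who: str, new_owner: str) -> None:
        """Pass the dict on; the previous owner gives up write access."""
        self._check(who)
        self._owner = new_owner


d = OwnedDict("xlator-a")
d.set("xlator-a", "key", 1)
d.hand_off("xlator-a", "xlator-b")   # xlator-a must not modify it any more
d.set("xlator-b", "key", 2)
```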

I'm a bit busy right now, but I'll try to review the patch.

Xavi

>
>
>
> [1] http://review.gluster.org/#/c/14763/