[Gluster-devel] Multiplexing - good news, bad news, and a plea for help

Pranith Kumar Karampuri pkarampu at redhat.com
Tue Sep 20 09:09:51 UTC 2016


Jeff,
        If I understood brick multiplexing correctly, add-brick/remove-brick
now add/remove graphs within a running brick process, right? I don't think the
graph-cleanup code is in good shape, i.e. it could lead to memory leaks etc. Did
you get a chance to think about that?
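
For context, the worry is roughly this: every detached graph carries
per-translator state that has to be released. A rough sketch of the walk that
has to happen for each xlator in a detached graph (using the obvious xlator_t
fields; the real cleanup path has considerably more to it - open fds, inode
tables, mem-accounting, event registrations - so treat this as illustration,
not the actual code):

    #include "xlator.h"

    /* Illustrative only: release the per-xlator state of a detached
       graph.  Anything skipped here leaks on every add-brick/remove-brick,
       which is exactly the multiplexing case. */
    static void
    brick_graph_teardown (xlator_t *first)
    {
            xlator_t *xl = first;

            while (xl) {
                    xlator_t *next = xl->next;

                    if (xl->fini)
                            xl->fini (xl);   /* translator-private state */
                    if (xl->options)
                            dict_unref (xl->options);  /* volfile options */
                    GF_FREE (xl->name);
                    GF_FREE (xl);
                    xl = next;
            }
    }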

On Mon, Sep 19, 2016 at 6:56 PM, Jeff Darcy <jdarcy at redhat.com> wrote:

> I have brick multiplexing[1] functional to the point that it passes all
> basic AFR, EC, and quota tests.  There are still some issues with tiering,
> and I wouldn't consider snapshots functional at all, but it seemed like a
> good point to see how well it works.  I ran some *very simple* tests with
> 20 volumes, each 2x distribute on top of 2x replicate.
>
> First, the good news: it worked!  Getting 80 bricks to come up in the same
> process, and then run I/O correctly across all of those, is pretty cool.
> Also, memory consumption is *way* down.  RSS size went from 1.1GB before
> (total across 80 processes) to about 400MB (one process) with
> multiplexing.  Each process seems to consume approximately 8MB globally
> plus 5MB per brick, so (8+5)*80 = 1040 vs. 8+(5*80) = 408.  Just
> considering the amount of memory, this means we could support about three
> times as many bricks as before.  When memory *contention* is considered,
> the difference is likely to be even greater.
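>
> A quick back-of-the-envelope check of those numbers (the 8MB and 5MB figures
> are rough measurements, so treat them as assumptions rather than exact costs):
>
>     /* RSS model: each process pays a fixed global cost, and each brick
>        adds a per-brick cost to whichever process hosts it. */
>     #include <stdio.h>
>
>     int
>     main (void)
>     {
>             int bricks       = 80;
>             int global_mb    = 8;   /* per-process overhead, approx. */
>             int per_brick_mb = 5;   /* per-brick overhead, approx.   */
>
>             /* one process per brick vs. one process hosting every brick */
>             printf ("separate:    %d MB\n",
>                     bricks * (global_mb + per_brick_mb));
>             printf ("multiplexed: %d MB\n",
>                     global_mb + bricks * per_brick_mb);
>             return 0;
>     }
>
> That prints 1040 MB vs. 408 MB, which lines up with the ~1.1GB and ~400MB RSS
> figures above.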
>
> Bad news: some of our code doesn't scale very well in terms of CPU use.
> To test performance I ran a test which would create 20,000 files across all
> 20 volumes, then write and delete them, all using 100 client threads.  This
> is similar to what smallfile does, but deliberately constructed to use a
> minimum of disk space - at any given time, at most one file per thread
> actually has 4KB worth of data in it.  This allows me to run it against
> SSDs or even ramdisks even with high brick counts, to factor out slow disks
> in a study of CPU/memory issues.  Here are some results and observations.
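>
> (To make the shape of that workload concrete first, here is a stripped-down,
> hypothetical driver.  The mount paths and the per-thread file count are made
> up, and the real script phases the creates, writes, and deletes separately;
> this is just the general idea.)
>
>     #include <fcntl.h>
>     #include <pthread.h>
>     #include <stdio.h>
>     #include <unistd.h>
>
>     #define NTHREADS         100
>     #define FILES_PER_THREAD 200        /* 100 * 200 = 20,000 files */
>
>     static void *
>     worker (void *arg)
>     {
>             char buf[4096] = {0};       /* the only data a file ever holds */
>             char path[256];
>             long id = (long) arg;
>
>             for (int i = 0; i < FILES_PER_THREAD; i++) {
>                     /* spread the files across the 20 test volumes */
>                     snprintf (path, sizeof (path),
>                               "/mnt/testvol%ld/f-%ld-%d", id % 20, id, i);
>                     int fd = open (path, O_CREAT | O_WRONLY, 0644);
>                     if (fd < 0)
>                             continue;
>                     (void) write (fd, buf, sizeof (buf));
>                     close (fd);
>                     unlink (path);      /* nothing accumulates on disk */
>             }
>             return NULL;
>     }
>
>     int
>     main (void)
>     {
>             pthread_t tids[NTHREADS];
>
>             for (long i = 0; i < NTHREADS; i++)
>                     pthread_create (&tids[i], NULL, worker, (void *) i);
>             for (int i = 0; i < NTHREADS; i++)
>                     pthread_join (tids[i], NULL);
>             return 0;
>     }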
>
> * On my first run, the multiplexed version of the test took 77% longer to
> run than the non-multiplexed version (5:42 vs. 3:13).  And that was after
> I'd done some hacking to use 16 epoll threads.  There's something a bit
> broken about trying to set that option normally, so that the value you set
> doesn't actually make it to the place that tries to spawn the threads.
> Bumping this up further to 32 threads didn't seem to help.
>
> * A little profiling showed me that we're spending almost all of our time
> in pthread_spin_lock.  I disabled the code that substitutes spinlocks for
> regular mutexes, which immediately improved performance and also reduced
> CPU time by almost 50%.
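>
> For anyone who hasn't run into this before, the effect is easy to reproduce
> outside Gluster with a toy like the one below (illustrative, not Gluster
> code): with far more runnable threads than cores, a contended spinlock burns
> its whole timeslice busy-waiting, while a mutex puts the waiter to sleep.
>
>     #include <pthread.h>
>     #include <stdio.h>
>
>     #define NTHREADS 100
>     #define ITERS    100000
>
>     static pthread_spinlock_t lock;
>     static long               counter;
>
>     static void *
>     hammer (void *arg)
>     {
>             (void) arg;
>             for (int i = 0; i < ITERS; i++) {
>                     /* busy-waits for as long as the lock is contended */
>                     pthread_spin_lock (&lock);
>                     counter++;
>                     pthread_spin_unlock (&lock);
>             }
>             return NULL;
>     }
>
>     int
>     main (void)
>     {
>             pthread_t tids[NTHREADS];
>
>             pthread_spin_init (&lock, PTHREAD_PROCESS_PRIVATE);
>             for (int i = 0; i < NTHREADS; i++)
>                     pthread_create (&tids[i], NULL, hammer, NULL);
>             for (int i = 0; i < NTHREADS; i++)
>                     pthread_join (tids[i], NULL);
>             printf ("counter = %ld\n", counter);
>             return 0;
>     }
>
> Swap the spin calls for pthread_mutex_lock/pthread_mutex_unlock and the CPU
> time drops sharply, which is essentially the difference the profile showed
> here.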
>
> * The next round of profiling showed that a lot of the locking is in
> mem-pool code, and a lot of that in turn is from dictionary code.  Changing
> the dict code to use malloc/free instead of mem_get/mem_put gave another
> noticeable boost.
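>
> The shape of that change, heavily simplified (this is not the actual dict.c
> diff; mem_get0/mem_put are the real pool entry points, but everything else
> here is boiled down to the allocation pattern):
>
>     #include <stdlib.h>
>     #include "mem-pool.h"        /* struct mem_pool, mem_get0, mem_put */
>
>     static struct mem_pool *pool;  /* shared by every thread in the process */
>
>     void *
>     alloc_obj (size_t size)
>     {
>     #ifdef USE_MEM_POOL
>             (void) size;
>             /* every call serializes on the pool's lock */
>             return mem_get0 (pool);
>     #else
>             /* glibc malloc already has per-thread arenas */
>             return calloc (1, size);
>     #endif
>     }
>
>     void
>     free_obj (void *obj)
>     {
>     #ifdef USE_MEM_POOL
>             mem_put (obj);         /* pool lock again */
>     #else
>             free (obj);
>     #endif
>     }
>
> With ~100 threads creating and destroying dicts constantly, that one shared
> lock per pool is the kind of hot spot the profile keeps pointing at.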
>
> At this point run time was down to 4:50, which is 20% better than where I
> started but still far short of non-multiplexed performance.  I can drive
> that down still further by converting more things to use malloc/free.
> There seems to be a significant opportunity here to improve performance -
> even without multiplexing - by taking a more careful look at our
> memory-management strategies:
>
> * Tune the mem-pool implementation to scale better with hundreds of
> threads (one possible shape of this is sketched after the list).
>
> * Use mem-pools more selectively, or even abandon them altogether.
>
> * Try a different memory allocator such as jemalloc.
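>
> On the first of those, one well-known shape (purely a sketch with hypothetical
> names, not a patch) is to put a small per-thread cache in front of the shared
> pool, so the shared lock is only touched when a thread's cache runs dry or
> overflows:
>
>     #include <stdlib.h>
>
>     #define OBJ_SIZE  256   /* pools hold fixed-size objects, so one size */
>     #define CACHE_MAX 64
>
>     struct obj_cache {
>             void *items[CACHE_MAX];
>             int   count;
>     };
>
>     static __thread struct obj_cache cache;    /* one cache per thread */
>
>     /* Stand-ins for the real locked pool calls; plain malloc/free keeps
>        the sketch self-contained. */
>     static void *shared_pool_get (void) { return malloc (OBJ_SIZE); }
>     static void  shared_pool_put (void *obj) { free (obj); }
>
>     void *
>     cached_get (void)
>     {
>             if (cache.count > 0)
>                     return cache.items[--cache.count];   /* no lock taken */
>             return shared_pool_get ();                   /* shared lock here */
>     }
>
>     void
>     cached_put (void *obj)
>     {
>             if (cache.count < CACHE_MAX) {
>                     cache.items[cache.count++] = obj;    /* no lock taken */
>                     return;
>             }
>             shared_pool_put (obj);                       /* shared lock here */
>     }
>
> jemalloc and tcmalloc get their multi-threaded scalability from the same idea
> (per-thread or per-arena caches), which is part of why the third option is
> worth trying as well.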
>
> I'd certainly appreciate some help/collaboration in studying these options
> further.  It's a great opportunity to make a large impact on overall
> performance without a lot of code or specialized knowledge.  Even so,
> however, I don't think memory management is our only internal scalability
> problem.  There must be something else limiting parallelism, and quite
> severely at that.  My first guess is io-threads, so I'll be looking into
> that first, but if anybody else has any ideas please let me know.  There's
> no *good* reason why running many bricks in one process should be slower
> than running them in separate processes.  If it remains slower, then the
> limit on the number of bricks and volumes we can support will remain
> unreasonably low.  Also, the problems I'm seeing here probably don't *only*
> affect multiplexing.  Excessive memory/CPU use and poor parallelism are
> issues that we kind of need to address anyway, so if anybody has any ideas
> please let me know.
>
>
>
> [1] http://review.gluster.org/#/c/14763/
>



-- 
Pranith