[Gluster-devel] More multiplexing results

Fri Nov 4 03:23:23 UTC 2016

I know y'all are probably getting tired of these updates, but developing out in the open and all that.  Executive summary: the combination of disabling memory pools and using jemalloc makes multiplexing shine.  You can skip forward to ***RESULTS*** if you're not interested in my tired ramblings.

Let's talk about memory pools first.  I had identified this as a problem area a while ago, leading to a new memory-pool implementation[1].  I was rather proud of it, actually, but one of the lessons I've learned is that empirical results trump pride.  Another lesson is that it's really important to test performance-related changes on more than one kind of system.  On my default test system and at scale up to 100 volumes (400 bricks) the new mem-pool implementation was looking really good.  Unfortunately, at about 120 volumes it would run into a limit on the number of keys accessible via pthread_getspecific.  Well, crap.  I made some changes to overcome this limit; they hurt performance a little, but I thought they'd salvage the effort.  Then I realized that there's *no limit* to how many pools a thread might use.  Each brick creates a dozen or so pools, and with multiplexing there's a potentially unlimited number of bricks in a process.  As a worker thread jumps from brick to brick, it might hit all of those pools.  This left three options.
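For anyone who wants to see the thread-specific key limit on their own system, here is a minimal probe in C.  It is only a sketch: count_available_keys and MAX_PROBE are names made up for illustration, not anything from the Gluster tree.  POSIX only guarantees 128 keys per process (_POSIX_THREAD_KEYS_MAX), and glibc on Linux sets PTHREAD_KEYS_MAX to 1024, so a design that needs one key per pool runs out quickly.  Build with -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_PROBE 4096  /* arbitrary cap for this sketch */

/* Hypothetical helper: create thread-specific keys until
 * pthread_key_create() fails (or the cap is reached), then delete
 * them all again.  Returns how many keys could be created. */
size_t count_available_keys(void)
{
    static pthread_key_t keys[MAX_PROBE];
    size_t n = 0;

    while (n < MAX_PROBE && pthread_key_create(&keys[n], NULL) == 0)
        n++;

    /* Give the keys back so the probe leaves the process unchanged. */
    for (size_t i = 0; i < n; i++) {
        int rc = pthread_key_delete(keys[i]);
        assert(rc == 0);
        (void)rc;
    }
    return n;
}
```

The number returned may be a little lower than the documented maximum if the runtime or other libraries have already claimed some keys.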

(1) Bind threads to bricks, which I've already shown is bad for scalability.

(2) Tweak the mem-pool implementation to handle even more (almost infinitely more) thread/pool combinations, adding complexity and hurting performance even more.

(3) Reduce the number of pools by combining all pools for the same size.

Well, (3) sure sounds great, doesn't it?  There are only a couple of dozen sizes we use for pools, therefore only a couple of dozen pools no matter how many threads or bricks we have, and it's all wonderful.  We've also effectively reinvented yet another general-purpose memory allocator at that point, and there are a few of those out there already.  I hate letting my code die as much as anyone, but sometimes that's what the empirical results dictate must happen.  In fact, every developer should go through this periodically to keep them humble.  Lord knows I need that kind of lesson more often.  ;)
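To make option (3) concrete, here is a rough sketch of pools shared by size rather than by owner.  Everything in it is an assumption for illustration: the names (size_class, shared_pool_get, shared_pool_put), the power-of-two size classes, and the 32-byte to 1 MiB range.  It is also not thread-safe, which a real version would have to be.  What it does show is why this ends up looking like a general-purpose allocator: the number of pools depends only on the number of size classes, not on bricks or threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MIN_SHIFT   5   /* smallest class is 32 bytes (assumed) */
#define NUM_CLASSES 16  /* classes cover 32 B .. 1 MiB (assumed) */

struct free_obj {
    struct free_obj *next;
};

/* One free list per size class, shared by every caller. */
static struct free_obj *free_lists[NUM_CLASSES];

/* Index of the smallest class that fits `size`, or -1 if too large. */
int size_class(size_t size)
{
    size_t cap = (size_t)1 << MIN_SHIFT;

    for (int i = 0; i < NUM_CLASSES; i++, cap <<= 1)
        if (size <= cap)
            return i;
    return -1;
}

/* Take an object from the matching free list, or fall back to malloc. */
void *shared_pool_get(size_t size)
{
    int c = size_class(size);

    if (c < 0)
        return malloc(size);            /* oversized: bypass the pools */

    struct free_obj *obj = free_lists[c];
    if (obj != NULL) {
        free_lists[c] = obj->next;      /* reuse a cached object */
        return obj;
    }
    return malloc((size_t)1 << (MIN_SHIFT + c));
}

/* Return an object to the free list for its size class. */
void shared_pool_put(void *ptr, size_t size)
{
    assert(ptr != NULL);
    int c = size_class(size);

    if (c < 0) {
        free(ptr);
        return;
    }
    struct free_obj *obj = ptr;
    obj->next = free_lists[c];
    free_lists[c] = obj;
}
```

An object released by one caller can be handed to any other caller asking for the same class, e.g. a 40-byte object put back and then a 64-byte request both land in the 64-byte class.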

OK, so my own mem-pool implementation was out.  The first experiment was to just disable mem-pools entirely (turn them into plain malloc/free) and see what the results were.  For these tests I used my standard create/write/delete 20,000 files test, on each of two different machines: a 16-core 8GB (artificially constrained) machine in Westford, and a 12-core 32GB machine with a much faster SSD at Digital Ocean.  The results were good on the Westford machine, with multiplexing mostly outperforming the alternative at scales up to 145 volumes (580 bricks).  However, on the DO machine multiplexing performance degraded badly even at 40 volumes.  Remember what I said about testing on multiple kinds of machines?  This kind of result is why that matters.  My io-threads patch[2] seemed to help some, but not much.
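"Turning pools into plain malloc/free" can be as simple as keeping the pool interface and making it a pass-through.  The sketch below is not the actual patch; obj_pool, pool_get and pool_put are hypothetical names, and the only thing the pool remembers is its object size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pool handle: with pooling disabled it only needs to
 * remember how big its objects are. */
struct obj_pool {
    size_t obj_size;
};

/* Same call shape as a pooled allocator, but every request goes
 * straight to the system allocator. */
void *pool_get(struct obj_pool *pool)
{
    assert(pool != NULL);
    return malloc(pool->obj_size);
}

/* Releasing an object is just free(); nothing is cached. */
void pool_put(void *ptr)
{
    free(ptr);
}
```

Callers don't change at all, which is what makes this kind of experiment cheap to run, and it also means whatever allocator is linked in (glibc malloc, jemalloc, ...) takes over all of the work.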

Now it was time to revisit jemalloc.  Last time I looked at it, the benefit seemed minimal at best.  However, with the new load presented by the removal of memory pools, things were different this time.  Now performance remained smooth on the DO configuration with multiplexing up to 220 volumes.  Without multiplexing, I ran into a swap storm at 180 volumes and then everything died.  I mean *everything*; I had to do a hard reboot.  Similarly, on the Westford machine the current code died at 100 volumes while the multiplexing version was still going strong.  We have a winner.  With some more tweaking, I'm pretty confident that we'll be able to support 1000 bricks on a 32GB machine this way - not that anyone will have that many disks, but once we start slicing and dicing physical disks into smaller units for container-type workloads it's pretty easy to get there.

***RESULTS***

Relying on jemalloc instead of our own mem-pools will likely double the number of bricks we can support with the same memory (assuming further fixes to reduce memory over-use).  Also, performance of brick addition/removal is around 2x what it was before, because manipulating the graph in an existing process is a lot cheaper than starting a new one.  On the other hand, multiplexing performance is generally worse than non-multiplexed until we get close to those scalability limits.  We'll probably need to use an "adaptive" approach that will continue to use the current process-per-brick scheme until we get close to maximum capacity.
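One way the "adaptive" idea could look, purely as a sketch: stay with process-per-brick while the brick count is comfortably below the point where that scheme falls over, and switch to multiplexing once we get close.  The function name, its parameters and the percentage threshold are all assumptions of mine; nothing like this exists in the code yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy: returns true if the next brick should be
 * attached to an existing (multiplexed) process instead of getting a
 * process of its own.
 *
 * brick_count    - bricks already running on this node
 * separate_limit - measured brick count at which process-per-brick
 *                  stops being viable on this machine
 * threshold_pct  - how close to that limit (in percent) we allow
 *                  ourselves to get before switching */
bool should_multiplex(size_t brick_count, size_t separate_limit,
                      unsigned threshold_pct)
{
    assert(threshold_pct <= 100);
    return (brick_count + 1) * 100 > separate_limit * threshold_pct;
}
```

For example, with a limit of 720 bricks (180 volumes at four bricks each, which is where the non-multiplexed run fell over on the DO machine) and an 80% threshold, the switch would happen once the node goes past 576 bricks.  The right limit and threshold would have to come from measurement on each kind of machine.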

[1] http://review.gluster.org/#/c/15645/
[2] http://review.gluster.org/#/c/15643/