[Gluster-devel] Memory-pool experiments

Thu Oct 13 14:08:53 UTC 2016

I spent yesterday following up on one of the items from the Berlin
summit: memory usage and allocation-related performance.  Besides being
a general concern, this is a very specific worry with multiplexing
because some of our code in this area scales very poorly to the hundreds
of threads per process that multiplexing can create.  For example, one
fairly straightforward performance test (detailed later) ran so slowly
at first that I stopped it after it had run for three times as long as
without multiplexing, and AFAICT it hadn't even reached its half-way
point.  Even with some fairly significant tweaking (mostly turning
spinlocks into mutexes and raising various thread counts), it still ran
30% slower than normal.  Hence, this investigation.

** BASELINE TESTING **

The test I used for most of this is fairly simple.  First, I create 20
volumes, each 2x2 so 80 bricks total.  Then, a first pass of the test 
creates 1000 empty files on each volume.  Lastly, a second pass writes
(synchronously) and then unlinks each file.  The test is deliberately
constructed so that it doesn't use a ton of disk space, which allows it
to run on ramdisks and isolate CPU/memory performance issues instead of
always waiting for disk.  Because each ramdisk is served by only a
single kernel thread[1], using many of those is also important.

On my test machine, using ramdisks, this test usually runs in about five 
minutes.  That's a good length - long enough to provide meaningful
results but short enough to allow rapid iteration on tuning or new code.
In fact, with current code from master and no tweaking at all, it takes
*exactly* five minutes.  By converting GF_LOCK from a spinlock to a 
mutex I can get a modest gain, down to 4:41.

As I said before, this test effectively couldn't even run with 
multiplexing unless the GF_LOCK change was present, so for the remainder
of this document just assume it's there.  My testing was also done with 
a version of io-threads that shares its thread pool across all instances
(i.e. bricks) instead of allocating a separate pool for each.  This was
a result of previous multiplexing-performance work, and again necessary
for tests to run in reasonable time.

With multiplexing and no other tuning, the time is 6:46.  This is where
I started profiling, and found the following chain:

  Most time was still spent in locking.

  Most of that was from memory-pool code.

  Most of that was from dictionary code, though a couple of other
  culprits (call stubs and stacks/frames) were also noticeable.

[1] Fun fact: there used to be a multi-threaded version of the loopback
driver, before 2.6 IIRC.  Then, for reasons that are unclear and
probably very stupid, a Red Hat engineer (someone whose name many of you
would know) reverted it to being single-threaded.  This particular piece
of kernel lameness was a significant pain point at my last job, but I've 
digressed too long already to explain why. 

** EASY EXPERIMENTS **

With the above observations in mind, I tried two experiments.  First, I
tried disabling memory pools entirely - i.e. just call through to
malloc/free instead.  This got the time down to 5:59.  As an
alternative, just changing some of the major memory-pool users mentioned
above to call malloc/free directly (leaving them in place for other
users) improved things a little bit more.

  Modified dictionary and call-stub code: 6:14

  Modified stack/frame code as well: 5:51

The first result is faster than we started, but still slower than
disabling memory pools entirely.  The second result is faster still.  My
conclusion was that memory pools are still hurting more than they help,
but the benefit from avoiding GF_*ALLOC/GF_FREE as well in these three
cases outweighed the ongoing cost of leaving memory pools in place for
everything else.  Put another way, some hypothetical memory-pool
approach might help by GF_*ALLOC/GF_FREE overhead, but the
implementation we have is insufficient.  I also tried an experiment with
*both* these changes and memory-pools fully disabled.  This yielded a
time of 5:48, further supporting my theory.

At this point, since the result relies more heavily on malloc/free
performance than before, it seemed like a good time to try jemalloc.
By accident, I did this with a version that only included the dictionary
and call-stub code but not the stack/frame code.  This resulted in only
a very modest improvement - 6:11 vs. 6:14 before.  Glad I tried, but so
much for that experiment.

** NEW MEMORY POOL DESIGN **

With all of this preliminary work out of the way, it finally seemed like
it was worth trying to improve the memory-pool code.  For this, I relied
on a design I've used successfully elsewhere:

  Within each pool (e.g. dictionary objects) each thread gets its own
  private sub-pool from which it can draw.

  Each per-thread pool is divided into hot and cold lists.  Every
  allocation first attempts to use the hot list, then the cold list.
  When objects are freed, they always go on the hot list.

  Each global pool has a "pool sweeper" thread, which will wake up every
  N seconds (by default N=10).  For each per-thread pool, it will delete
  everything on the cold list and move the hot list to the cold list
  (hot list is now empty).

This design has several good properties.

  Very low lock contention.  Per-thread pool locks have to exist for
  correctness, but these locks are never contended between normal
  threads.  The only contention is with the pool sweeper, and that's
  still *very* rare.

  On a busy system, visits to the system memory allocator will tend
  toward zero as objects are all found on the per-thread hot list.

  Even on a system with cyclic behavior, where the cycle is less than N
  seconds, system-memory-allocator visits will still tend toward zero as
  items are found on the per-thread cold list.

  Memory usage at idle tends toward zero after 2N seconds, as every
  object gets moved to the cold list on one pool-sweeper iteration and
  then deleted on the next.

  With a "last in, first out" implementation, locality of reference and
  cache warmth (assuming a decent thread scheduler) are very good.

** CONCLUSION **

With an implementation of the above design, time was down to 5:34.  The
fact that this is better than both the memory-pool-disabled time of 5:59
and the modify-common-callers result of of 5:51 indicates that we've
crossed the point of memory pools providing an overall benefit in terms
of reducing calls to either GF_*ALLOC or plain malloc.  I haven't tried
this without multiplexing yet, but since the differences there were
small anyway I expect they'd provide a modest improvement.