[Gluster-devel] Memory-management ideas

Tue Oct 4 20:52:49 UTC 2016

As in any large C codebase, memory leaks have long been a simmering
issue in GlusterFS.  More recently, in the context of my
brick-multiplexing work, I've found severe performance issues related to
memory management.  Almost simultaneously, multiple people have
mentioned that we're allocating and deallocating a lot of memory for
each request.  This is my attempt to collect some thoughts about how to
address these problems.

(1) Preallocate everything

When we launch a request, from FUSE or server or anywhere else, the
number of translators in the current graph (which is fixed for the life
of that request) sets an upper bound on how many stack frames we'll
need.  We could easily preallocate that many frames in one allocation,
reducing the number of frame allocations per request from O(n) in the
number of translators to O(1).  As an extension of this idea, it
would also be possible for us to know the size of each translator's
"local" structure (if any) and preallocate those as well.  We'd still
need some sort of per-translator "destructor" to clean up things that
each local structure points to, but many translators already have those
and we could just call them directly when the request completes instead
of requiring every single fop function to do so.  Besides any
performance improvements, this single-allocation approach makes it
almost impossible to leak memory for the structures that it covers, and
significantly reduces boilerplate at the start/end of every fop in every
translator.  I've used variants of it multiple times in previous
projects, always with positive effects.
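
To make the single-allocation idea concrete, here is a minimal sketch in
C.  Every name in it (xlator_desc_t, frame_t, request_new, and so on) is
hypothetical and chosen for illustration; none of it is taken from the
GlusterFS source.  It assumes each translator can report the size of its
"local" structure and, optionally, a destructor for whatever that
structure points to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-translator description (not a GlusterFS type). */
typedef struct xlator_desc {
    size_t local_size;               /* size of "local", 0 if none */
    void (*local_dtor)(void *local); /* optional cleanup, may be NULL */
} xlator_desc_t;

typedef struct frame {
    void *local;                     /* points into the request block */
    const xlator_desc_t *xl;
} frame_t;

typedef struct request {
    size_t nframes;
    frame_t frames[];                /* one frame per translator */
} request_t;

/* Round sizes up so every local is aligned for any ordinary type. */
typedef union { long long ll; long double ld; void *p; } max_align_u;

size_t align_up(size_t n)
{
    size_t a = sizeof(max_align_u);
    return (n + a - 1) / a * a;
}

/* One calloc covers the request, all frames, and all locals. */
request_t *request_new(const xlator_desc_t *xls, size_t n)
{
    size_t locals_off = align_up(sizeof(request_t) + n * sizeof(frame_t));
    size_t total = locals_off;
    for (size_t i = 0; i < n; i++)
        total += align_up(xls[i].local_size);

    request_t *req = calloc(1, total);
    if (!req)
        return NULL;

    req->nframes = n;
    char *p = (char *)req + locals_off;
    for (size_t i = 0; i < n; i++) {
        req->frames[i].xl = &xls[i];
        if (xls[i].local_size) {
            req->frames[i].local = p;
            p += align_up(xls[i].local_size);
        }
    }
    return req;
}

/* Run each translator's destructor (last frame first), then free once. */
void request_free(request_t *req)
{
    if (!req)
        return;
    for (size_t i = req->nframes; i-- > 0;) {
        frame_t *f = &req->frames[i];
        if (f->local && f->xl->local_dtor)
            f->xl->local_dtor(f->local);
    }
    free(req);
}

/* Example local that owns a heap buffer, with its destructor. */
typedef struct demo_local { char *buf; } demo_local_t;
int demo_dtor_calls;

void demo_local_dtor(void *local)
{
    demo_local_t *l = local;
    free(l->buf);
    demo_dtor_calls++;
}
```

With this layout a fop function never allocates or frees its own frame
or local; it only fills in fields, and request_free() is the single
place where cleanup happens.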

(2) Improve or eliminate memory pools

One surprising discovery in the multiplexing work is that the
memory-pool code has significant *negative* effects on performance due
to excessive locking.  Memory pools were supposed to improve performance
by avoiding the system memory allocator.  It might be possible to
develop new memory-pool code that delivers on that promise, by using
per-thread pools which almost eliminate lock contention and retire items
slowly/lazily into a common pool (another well known technique I've
applied to good effect elsewhere).  Alternatively, especially after
investigating some of the other ideas on this list, we might be better
off just eliminating our own memory pools entirely.
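
As a rough illustration of the per-thread approach, the sketch below
keeps a lock-free free list per thread and only touches the shared
(locked) pool when the local list runs dry or grows past a limit.  The
names, the fixed object size, and the thresholds are all assumptions
made for the example, not existing GlusterFS code, and it ignores what
happens to a thread's cached items when the thread exits.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define OBJ_SIZE  128  /* fixed object size served by this pool */
#define LOCAL_MAX 64   /* per-thread cache limit before spilling */
#define BATCH     32   /* items moved per lock acquisition */

typedef struct item { struct item *next; } item_t;

/* Shared pool: only touched under shared_lock. */
item_t *shared_head;
size_t shared_count;
pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread cache: never needs a lock. */
_Thread_local item_t *local_head;
_Thread_local size_t local_count;

void *pool_get(void)
{
    if (!local_head) {
        /* Local cache empty: refill a batch under one lock hold. */
        pthread_mutex_lock(&shared_lock);
        while (shared_head && local_count < BATCH) {
            item_t *it = shared_head;
            shared_head = it->next;
            shared_count--;
            it->next = local_head;
            local_head = it;
            local_count++;
        }
        pthread_mutex_unlock(&shared_lock);
    }
    if (local_head) {
        item_t *it = local_head;
        local_head = it->next;
        local_count--;
        return it;
    }
    /* Both pools empty: fall back to the system allocator. */
    return malloc(OBJ_SIZE);
}

void pool_put(void *p)
{
    item_t *it = p;
    it->next = local_head;
    local_head = it;
    if (++local_count <= LOCAL_MAX)
        return;

    /* Cache too large: lazily retire a batch to the shared pool. */
    pthread_mutex_lock(&shared_lock);
    for (int i = 0; i < BATCH; i++) {
        item_t *s = local_head;
        local_head = s->next;
        local_count--;
        s->next = shared_head;
        shared_head = s;
        shared_count++;
    }
    pthread_mutex_unlock(&shared_lock);
}
```

In the common case (get and put on the same thread, cache neither empty
nor full) no lock is taken at all; the shared lock is amortized over
BATCH items whenever it is needed.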

(3) Improve dictionaries and/or reduce their use

Despite previous optimizations, the dictionary code remains one of the
largest generators of memory allocation/deallocation calls.  One way to
address this would be to optimize our dictionary implementation further.
Another would be to reduce our reliance on dictionaries, e.g. by
changing code that currently uses xdata to use fixed fields in a new
version of our network protocol(s).

(4) Eliminate our custom memory-management code

Our memory-management code (GF_CALLOC and friends) does a lot of extra
work (and consumes extra space) in an attempt to detect memory leaks.
If we applied some of the other ideas mentioned here to reduce the
possibility of memory leaks, and relied on established tools - e.g.
Coverity, clang, valgrind, Massif - to detect others, the cost of
maintaining our own custom memory-management code is very likely to
exceed its remaining benefit.
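
To show where the extra work and space come from in this style of
allocator, here is a deliberately simplified sketch of an accounting
wrapper.  It is NOT the actual GF_CALLOC implementation; the header
layout, the names, and the statistics table are assumptions made for
illustration.  A thread-safe version would also need a lock or atomic
operation on every counter update, which this sketch omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ACCT_MAGIC 0xCAFEF00Du
#define ACCT_TYPES 4

/* Hidden header placed in front of every allocation.  The union
 * members pad it so the user pointer stays suitably aligned. */
typedef union acct_header {
    struct {
        uint32_t magic;   /* detects frees of foreign or stale pointers */
        uint32_t type;    /* which caller-defined category this is */
        size_t   size;    /* user-visible size in bytes */
    } h;
    long double pad_ld;
    long long   pad_ll;
    void       *pad_p;
} acct_header_t;

typedef struct acct_stats {
    size_t num_allocs;    /* live allocations of this type */
    size_t bytes;         /* live bytes of this type */
} acct_stats_t;

acct_stats_t acct_stats[ACCT_TYPES];

void *acct_calloc(uint32_t type, size_t n, size_t size)
{
    assert(type < ACCT_TYPES);
    /* Reject n * size + header overflowing size_t. */
    if (size != 0 && n > (SIZE_MAX - sizeof(acct_header_t)) / size)
        return NULL;

    size_t bytes = n * size;
    acct_header_t *hdr = calloc(1, sizeof(*hdr) + bytes);
    if (!hdr)
        return NULL;

    hdr->h.magic = ACCT_MAGIC;
    hdr->h.type = type;
    hdr->h.size = bytes;
    acct_stats[type].num_allocs++;
    acct_stats[type].bytes += bytes;
    return hdr + 1;       /* caller sees memory just past the header */
}

void acct_free(void *ptr)
{
    if (!ptr)
        return;
    acct_header_t *hdr = (acct_header_t *)ptr - 1;
    assert(hdr->h.magic == ACCT_MAGIC);
    hdr->h.magic = 0;     /* makes a double free trip the assert */
    acct_stats[hdr->h.type].num_allocs--;
    acct_stats[hdr->h.type].bytes -= hdr->h.size;
    free(hdr);
}
```

Every allocation pays for the header bytes and two counter updates, and
every free pays for a header check and two more updates, on top of
whatever the underlying allocator does.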

(5) Use a different memory allocator

Nobody believes that glibc malloc is the best performer out there,
especially for multithreaded workloads like ours.  Other projects
similar to ours seem to have done well with jemalloc, and there are
other candidate allocators as well.  Switching to one of these might
improve performance and/or predictability, even after other ideas here
are considered.