[Gluster-devel] Brick multiplexing approaches
Jeff Darcy
jdarcy at redhat.com
Mon Jun 13 18:19:31 UTC 2016
"Brick multiplexing" is a new feature, tentatively part of 4.0, that
allows multiple bricks to be served from a single glusterfsd process.
This promises to give us many benefits over the current "process per
brick" approach.
* Lower total memory use, by having only one copy of various global
structures instead of one per brick/process.
* Less CPU contention. Every glusterfsd process involves several
threads. If there are more total threads than physical cores, or if
those cores are also needed for other work on the same system, we'll
thrash pretty badly. As with memory use, managing each thread type
as a single pool (instead of one pool per brick/process) will help.
* Fewer ports. In the extreme case, we need only have one process and
one port per node. This avoids port exhaustion for high node/brick
counts, and can also be more firewall-friendly.
* Better coordination between bricks, e.g. to implement QoS policies.
In short, our current infrastructure just isn't going to let us keep up
with various trends - higher node counts, containers, hyperconvergence,
even erasure coding. The question is: how do we do it? There are two
basic models.
* In the "multiple graph" model, we have multiple separate graphs
(volfiles) in a single process. This allows them to share logging,
worker threads, polling threads, and many other resources -
everything anchored at a glusterfs_ctx_t. It does *not* allow them
to share ports, or anything else anchored by a protocol/server
translator instance.
* In the "single graph" model, we have multiple graphs joined together
at the single protocol/server translator. This allows more things to
be shared, including ports, but does introduce some new problems.
For one thing, it doesn't work if the bricks have different transport
characteristics (e.g. TLS vs. non-TLS). For another, it raises the
possibility of a configuration change for one brick causing a graph
switch that affects all bricks in the process (even if they belong to
separate volumes).
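To make the contrast concrete, here's a rough sketch of the two shapes.
This is not real code - the struct and field names below are invented
for illustration, and the real anchors are glusterfs_ctx_t,
glusterfs_graph_t, and the protocol/server translator mentioned above.

    /* Illustrative only: made-up structures, not the real GlusterFS ones. */
    #define MAX_BRICKS 8

    /* Multiple-graph model: one process and one shared context, but every
     * brick keeps a complete graph with its own protocol/server on top,
     * so every brick still needs its own listener and port. */
    struct multi_graph_process {
        void *shared_ctx;          /* logging, threads, memory pools */
        struct {
            void *server_xl;       /* per-brick protocol/server      */
            int   listen_port;     /* one port per brick             */
            void *brick_stack;     /* posix, io-threads, etc.        */
        } graphs[MAX_BRICKS];
        int   graph_count;
    };

    /* Single-graph model: one protocol/server for the whole process, with
     * each brick's translator stack attached beneath it, so one port
     * serves every brick and the server xlator can coordinate e.g. QoS. */
    struct single_graph_process {
        void *shared_ctx;
        void *server_xl;           /* the only listener              */
        int   listen_port;         /* one port for the whole process */
        void *brick_stacks[MAX_BRICKS];
        int   brick_count;
    };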
I'd prefer to implement the single-graph model, because both the port
conservation/exhaustion and QoS-coordination issues are important.
However, that means solving some of the additional problems. In
particular...
* Most of the incompatible-transport issues can be solved by moving
various things from the server translator's "private" structure
(server_conf_t) into per-connection or per-tenant structures. I've
already done something similar with the inode table for subvolume
mounts (#13659 in Gerrit) and it's a pain but it's feasible. We
might also (eventually) need to consider implementing parts of the
multi-graph model as well, to host bricks in the same process even
when their transports are incompatible. (A rough sketch of the
per-connection/per-tenant split follows after this list.)
* For the graph-switch problem, we'll need to introduce some idea of
sub-graphs or related graphs, so that we can compare and switch only
the part relevant to a single brick. I'd actually like to avoid this
entirely until we get to GlusterD 2.0, but I'm not sure if we'll be
able to get away with that. (A rough sketch of that per-brick
comparison also follows below.)
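For the first point, the kind of split I have in mind looks roughly
like this - again, the field names are invented for illustration and
don't reflect the actual server_conf_t layout:

    /* Illustrative only: invented fields showing the split, not the
     * real contents of server_conf_t. */

    /* State that genuinely belongs to the whole process/listener and
     * can stay in the server translator's private. */
    struct server_shared_state {
        void *rpc_listener;        /* the one transport everyone shares */
        int   inode_lru_limit;
        /* thread pools, timers, and other truly global bits */
    };

    /* State that moves into a per-connection or per-tenant structure so
     * that bricks with different settings can coexist under one server
     * translator. */
    struct server_per_tenant_state {
        int   use_tls;             /* TLS vs. non-TLS, per brick      */
        char *auth_allow;          /* allow/reject lists, per brick   */
        void *itable;              /* per-subvolume inode table, as in
                                      the subvolume-mount change above */
    };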
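For the graph-switch point, the rough idea (a sketch only, not a
design - none of these helpers exist today) is to compare and swap
per-brick sub-graphs instead of the whole graph:

    /* Illustrative pseudo-C: the helpers named here are hypothetical,
     * they just label the operations a per-brick switch would need. */
    extern int  subgraph_differs (void *old_stack, void *new_stack);
    extern void swap_subgraph (void *old_stack, void *new_stack);

    static void
    apply_volfile_change (void *old_stacks[], void *new_stacks[], int count)
    {
        int i;

        for (i = 0; i < count; i++) {
            /* Compare and switch only the brick whose configuration
             * actually changed; every other brick in the process keeps
             * serving without interruption. */
            if (subgraph_differs (old_stacks[i], new_stacks[i]))
                swap_subgraph (old_stacks[i], new_stacks[i]);
        }
    }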
Suggestions, warnings, or other thoughts are welcome.