[Gluster-devel] Brick multiplexing approaches

Shyam srangana at redhat.com
Tue Jun 14 15:19:35 UTC 2016


On 06/13/2016 02:19 PM, Jeff Darcy wrote:
> "Brick multiplexing" is a new feature, tentatively part of 4.0, that
> allows multiple bricks to be served from a single glusterfsd process.
> This promises to give us many benefits over the current "process per
> brick" approach.
>
>  * Lower total memory use, by having only one copy of various global
>    structures instead of one per brick/process.
>
>  * Less CPU contention.  Every glusterfsd process involves several
>    threads.  If there are more total threads than physical cores, or if
>    those cores are also needed for other work on the same system, we'll
>    thrash pretty badly.  As with memory use, managing each thread type
>    as a single pool (instead of one pool per brick/process) will help.
>
>  * Fewer ports.  In the extreme case, we need only have one process and
>    one port per node.  This avoids port exhaustion for high node/brick
>    counts, and can also be more firewall-friendly.
>
>  * Better coordination between bricks e.g. to implement QoS policies.
>
> In short, our current infrastructure just isn't going to let us keep up
> with various trends - higher node counts, containers, hyperconvergence,
> even erasure coding.  The question is: how do we do it?  There are two
> basic models.
>
>  * In the "multiple graph" model, we have multiple separate graphs
>    (volfiles) in a single process.  This allows them to share logging
>    data and threads, polling threads, and many other resources -
>    everything anchored at a glusterfs_ctx_t.  It does *not* allow them
>    to share ports, or anything else anchored by a protocol/server
>    translator instance.
>
>  * In the "single graph" model, we have multiple graphs joined together
>    at the single protocol/server translator.  This allows more things to
>    be shared, including ports, but does introduce some new problems.
>    For one thing, it doesn't work if the bricks have different transport
>    characteristics (e.g. TLS vs. non-TLS).  For another, it raises the
>    possibility of a configuration change for one brick causing a graph
>    switch that affects all bricks in the process (even if they belong to
>    separate volumes).
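
To make sure we are reading the two models the same way, here is a rough 
structural sketch of what each process would own. These are stand-in types 
only; the real structures are glusterfs_ctx_t, glusterfs_graph_t and 
xlator_t, and every field name below is made up for illustration:

    /* Stand-in types, purely illustrative -- not the actual glusterfs
     * headers. */
    typedef struct ctx { int shared; } ctx_t;     /* logs, threads, pools */
    typedef struct xl  { struct xl **children; int nchildren; } xl_t;
    typedef struct     { xl_t *top; } graph_t;    /* top = protocol/server */

    /* "Multiple graph": one ctx shared by N independent graphs; each
     * graph keeps its own protocol/server xlator, hence its own port. */
    typedef struct {
            ctx_t    *ctx;
            graph_t **graphs;
            int       ngraphs;
    } multi_graph_proc_t;

    /* "Single graph": one ctx, one graph, one protocol/server xlator
     * whose children are the per-brick stacks -- the port, and anything
     * else the server xlator owns, is shared by every brick. */
    typedef struct {
            ctx_t *ctx;
            xl_t  *server;        /* server->children[i] = brick i's stack */
    } single_graph_proc_t;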

There is *currently* no graph switch on the bricks (as I understand it). 
Configuration changes happen, yes, but there is no graph switch, since the 
xlator pipeline is fixed; if that pipeline needs to change, the bricks have 
to be restarted. Others can correct me if I am wrong.

Noting the above here, as it may not be that big a deal. Also emphasizing 
the *currently*, as the future could look different.
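
For reference, as I understand it an option change on the brick side today 
lands in the existing xlators' reconfigure() callbacks, so tunables change 
in place and the pipeline never changes shape. Very roughly, with stand-in 
types rather than the real xlator_t/dict_t signature:

    /* Stand-in types; the real callback is each xlator's
     * reconfigure (xlator_t *this, dict_t *options).  The point is only
     * that tunables are updated inside the existing xlator objects --
     * the graph's shape does not change, so no brick restart. */
    typedef struct { int thread_count; } brick_xl_t;    /* hypothetical */
    typedef struct { int thread_count; } brick_opts_t;  /* hypothetical */

    static int
    brick_reconfigure(brick_xl_t *this, brick_opts_t *opts)
    {
            this->thread_count = opts->thread_count;    /* apply in place */
            return 0;                                   /* no graph switch */
    }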

>
> I'd prefer to implement the single-graph model, because both the port
> conservation/exhaustion and QoS-coordination issues are important.
> However, that means solving some of the additional problems.  In
> particular...
>
>  * Most of the incompatible-transport issues can be solved by moving
>    various things from the server translator's "private" structure
>    (server_conf_t) into per-connection or per-tenant structures.  I've
>    already done something similar with the inode table for subvolume
>    mounts (#13659 in Gerrit) and it's a pain but it's feasible.  We
>    might also (eventually) need to consider implementing parts of the
>    multi-graph model as well to host bricks in the same process even
>    when their transports are incompatible.
>
>  * For the graph-switch problem, we'll need to introduce some idea of
>    sub-graphs or related graphs, so that we can compare and switch only
>    the part relevant to a single brick.  I'd actually like to avoid this
>    entirely until we get to GlusterD 2.0, but I'm not sure if we'll be
>    able to get away with that.
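
On the first point, the per-connection/per-tenant split could look roughly 
like this. Every name below is made up; it is only meant to show which 
pieces of today's server_conf_t would move where:

    /* Hypothetical split: anything that can legitimately differ per brick
     * (transport/TLS settings, the inode table) hangs off a per-tenant
     * struct, while the server xlator keeps only the genuinely shared
     * pieces such as the listening port. */
    typedef struct { int use_tls; } transport_opts_t;   /* stand-in */
    typedef struct itable itable_t;                     /* stand-in for
                                                           inode_table_t */

    typedef struct {
            char             *brick_name;
            transport_opts_t  transport;   /* TLS vs non-TLS per tenant */
            itable_t         *itable;      /* per-brick inode table */
    } server_tenant_t;

    typedef struct {
            int              listen_port;  /* shared by all bricks */
            server_tenant_t *tenants;      /* one entry per brick */
            int              ntenants;
    } server_shared_t;                     /* stands in for a slimmed-down
                                              server_conf_t */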

On the second point, the sub-graph model seems the best fit for certain 
other things, like preserving the inode and other tables as-is in the 
master xlator. It does put the onus on the xlators of keeping the inodes 
(and fds) current, though (limiting sub-graphs to the sub-brick level is 
possible, but that would water the concept down). This needs some more 
thought, but I do like the direction.
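
Concretely, "compare and switch only the relevant part" could, purely as a 
sketch, amount to diffing just the subtree hanging off the affected brick's 
child of the server xlator (names below are made up):

    #include <string.h>

    /* Stand-in xlator node; 'options' is a flattened option string just
     * to keep the sketch short. */
    typedef struct sub_xl {
            const char     *name;
            const char     *options;
            struct sub_xl **children;
            int             nchildren;
    } sub_xl_t;

    /* Return 1 if the sub-graph rooted at 'a' differs from 'b'.  Only
     * that brick's subtree would then be switched, and only its inode/fd
     * tables carried over; the other bricks in the process stay put. */
    static int
    subgraph_differs(const sub_xl_t *a, const sub_xl_t *b)
    {
            int i;

            if (strcmp(a->name, b->name) || strcmp(a->options, b->options))
                    return 1;
            if (a->nchildren != b->nchildren)
                    return 1;
            for (i = 0; i < a->nchildren; i++)
                    if (subgraph_differs(a->children[i], b->children[i]))
                            return 1;
            return 0;
    }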

>
> Suggestions, warnings, or other thoughts are welcome.
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>

