[Gluster-devel] Volume management proposal (4.0)

Tue Dec 2 15:07:56 UTC 2014

I've been thinking and experimenting around some of the things we need
in this area to support 4.0 features, especially data classification:

http://www.gluster.org/community/documentation/index.php/Features/data-classification

Before I suggest anything, a little background on how brick and volume
management *currently* works.

(1) Users give us bricks, specified by host:path pairs.

(2) We assign each brick a unique ID, and create a "hidden" directory
structure in .glusterfs to support our needs.

(3) When bricks are combined into a volume, we create a bunch of
volfiles.

(4) There is one volfile per brick, consisting of a linear "stack" of
translators from storage/posix (which interacts with the local file
system) up to protocol/server (which listens for connections from
clients).

(5) When the volume is started, we start one glusterfsd process for each
brick volfile.

(6) There is also a more tree-like volfile for clients, constructed as
follows:

(6a) We start with a protocol/client translator for each brick.

(6b) We combine bricks into N-way sets using AFR, EC, etc.

(6c) We combine those sets using DHT.

(6d) We push a bunch of (mostly performance-related) translators on top.

(7) When a volume is mounted, we fetch the client volfile and
instantiate all of the translators described there, plus mount/fuse to
handle the local file system interface.  For GFAPI it's the same except
that there is no mount/fuse.

(8) There are also volfiles for NFS, self-heal daemons, quota daemons,
snapshots, etc.  I'm going to ignore those for now.
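
To make steps 6a through 6d concrete, here is a minimal sketch of how a
client graph could be assembled as a tree.  This is not the code in
glusterd-volgen.c; the Xlator class, build_client_graph(), and
kinds_top_down() are hypothetical names used only for illustration, and
the naming scheme for the nodes is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Xlator:
    """One translator node in a graph (hypothetical model, not glusterd's)."""
    kind: str                                   # e.g. "protocol/client"
    name: str
    options: Dict[str, str] = field(default_factory=dict)
    children: List["Xlator"] = field(default_factory=list)


def build_client_graph(volname, bricks, replica, perf_kinds):
    """Build a client-side tree following steps 6a through 6d."""
    if replica < 1 or len(bricks) % replica != 0:
        raise ValueError("brick count must be a multiple of the replica count")

    # 6a: one protocol/client translator per host:path brick.
    clients = []
    for i, brick in enumerate(bricks):
        host, path = brick.split(":", 1)
        clients.append(Xlator("protocol/client", f"{volname}-client-{i}",
                              options={"remote-host": host,
                                       "remote-subvolume": path}))

    # 6b: group clients into N-way sets (AFR shown; EC would be analogous).
    sets = [Xlator("cluster/replicate", f"{volname}-replicate-{n}",
                   children=clients[i:i + replica])
            for n, i in enumerate(range(0, len(clients), replica))]

    # 6c: combine the sets with DHT.
    top = Xlator("cluster/distribute", f"{volname}-dht", children=sets)

    # 6d: push performance translators on top, one layer per entry.
    for kind in perf_kinds:
        top = Xlator(kind, f"{volname}-{kind.split('/')[-1]}", children=[top])
    return top


def kinds_top_down(node):
    """List translator types from the top, following first children only."""
    out = []
    while node is not None:
        out.append(node.kind)
        node = node.children[0] if node.children else None
    return out
```

With four bricks and replica 2, the result is two performance layers
over one DHT node, which sits over two replica sets of two clients each.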

The code for all of this is in glusterd-volgen.c, but I don't recommend
looking at it for long because it's one of the ugliest hairballs I've
ever seen.  In fact, you'd be hard pressed to recognize the above
sequence of steps in that code.  Pieces that belong together are
splattered all over.  Pieces that should remain separate are mashed
together.  Pieces that should use common code use copied code instead.
As a prerequisite for adding new functionality, what's already there
needs to be heavily refactored so it makes some sense.

So . . . about that new functionality.  The core idea of data
classification is to apply step 6c repeatedly, with variants of DHT that
do tiering or various other kinds of intelligent placement instead of
the hash-based random placement we do now.  "NUFA" and "switch" are
already examples of this.  In fact, their needs drove some of the code
structure that makes data classification (DC) possible.
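
As a rough illustration of "applying step 6c repeatedly", the sketch
below nests placement layers: a policy-based layer on top picks a
subtree, and a hash-based layer underneath picks a replica set within
it.  Everything here is hypothetical: hash_place() uses CRC32 as a
stand-in and is not DHT's real hash, and tier_place() is an invented
policy, not the tiering prototype's logic.

```python
import zlib


def hash_place(name, children):
    """Hash-based placement standing in for DHT (not DHT's real hash)."""
    return children[zlib.crc32(name.encode()) % len(children)]


def tier_place(is_hot):
    """Return a policy that sends 'hot' files to child 0, others to child 1."""
    def place(name, children):
        return children[0] if is_hot(name) else children[1]
    return place


def locate(node, name):
    """Descend through nested placement layers until a leaf is reached.

    A layer is a (policy, children) tuple; a leaf is a plain string
    naming a replica/EC set.
    """
    while isinstance(node, tuple):
        policy, children = node
        node = policy(name, children)
    return node


# Two hash-distributed subtrees combined by a policy-based layer.
fast = (hash_place, ["ssd-set-0", "ssd-set-1"])
slow = (hash_place, ["hdd-set-0", "hdd-set-1", "hdd-set-2"])
volume = (tier_place(lambda n: n.endswith(".db")), [fast, slow])

print(locate(volume, "index.db"))    # one of the ssd sets
print(locate(volume, "movie.mkv"))   # one of the hdd sets
```

The point is only that each layer has the same shape (pick one child for
a given name), so layers can be stacked to any depth.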

The trickiest question with DC has always been how the user specifies
these complex placement policies, which we then turn into volfiles.  In
the interests of maximizing compatibility with existing scripts and user
habits, what I propose is that we do this by allowing the user to
combine existing volumes into a new higher-level volume.  This is
similar to how the tiering prototype already works, except that
"combining" volumes is more general than "attaching" a cache volume in
that specific context.  There are also some other changes we should make
to do this right.

(A) Each volume has an explicit flag indicating whether it is a
"primary" volume to be mounted etc. directly by users or a "secondary"
volume incorporated into another.

(B) Each volume has a graph representing steps 6a through 6c above (i.e.
up to DHT).  Only primary volumes have a (second) graph representing 6d
and 7 as well.

(C) The graph/volfile for a primary volume might contain references to
secondary volumes.  These references are resolved at the same time that
6d and 7 are applied, yielding a complete graph without references.

(D) Secondary volumes may not be started and stopped by the user.
Instead, a secondary volume is automatically started or stopped along
with its primary.

(E) The user must specify an explicit option to see the status of
secondary volumes.  Without this option, secondary volumes are hidden
and status for their constituent bricks will be shown as though they
were (directly) part of the corresponding primary volume.
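
Items (A) through (C) could be modeled roughly as follows.  This is a
sketch under assumptions, not a proposed implementation: Ref, Node,
Volume, resolve(), and client_graph() are hypothetical names, the
"example/tier" translator type is made up, and the rules that only
secondary volumes may be referenced and that cycles are rejected are my
own assumptions rather than part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Ref:
    """Placeholder for a secondary volume inside another volume's graph."""
    volume: str


@dataclass
class Node:
    kind: str
    children: List[Union["Node", Ref]] = field(default_factory=list)


@dataclass
class Volume:
    name: str
    primary: bool      # (A) explicit primary/secondary flag
    graph: Node        # (B) graph for steps 6a through 6c


def resolve(node, volumes: Dict[str, Volume], seen=()):
    """(C) Return a copy of the graph with every Ref replaced."""
    if isinstance(node, Ref):
        if node.volume in seen:
            raise ValueError(f"reference cycle through {node.volume}")
        target = volumes[node.volume]
        if target.primary:
            # Assumption: a primary volume cannot be nested in another.
            raise ValueError(f"{node.volume} is not a secondary volume")
        return resolve(target.graph, volumes, seen + (node.volume,))
    return Node(node.kind, [resolve(c, volumes, seen) for c in node.children])


def client_graph(vol: Volume, volumes: Dict[str, Volume], perf_kinds):
    """Resolve references, then apply step 6d on top (primary only)."""
    if not vol.primary:
        raise ValueError("only primary volumes get a client graph")
    top = resolve(vol.graph, volumes, (vol.name,))
    for kind in perf_kinds:
        top = Node(kind, [top])
    return top
```

Because resolve() returns a new tree, the stored per-volume graphs keep
their references, and the complete graph is produced only when needed.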

As it turns out, most of the "extra" volfiles in step 8 above also
have their own versions of steps 6d and 7, so implementing (C) will
probably make those paths simpler as well.

The one big remaining question is how this will work in terms of
detecting and responding to volume configuration changes.  Currently we
treat each volfile as a completely independent entity, and just compare
whole graphs.  Instead, what we need to do is track dependencies between
graphs (a graph of graphs?) so that a change to a secondary volume will
"ripple up" to its primary where a new graph can be generated and
compared to its predecessor.
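
One way the "graph of graphs" might look, purely as a sketch: record
which volumes reference which, walk upward from the changed volume, and
regenerate and compare each affected graph.  GraphDeps, affected_by(),
and ripple_up() are hypothetical names, and the generate callback stands
in for whatever volfile generation ends up looking like.

```python
from collections import defaultdict, deque


class GraphDeps:
    """Tracks which volumes incorporate which secondary volumes."""

    def __init__(self):
        self._parents = defaultdict(set)   # secondary -> volumes using it

    def add_reference(self, parent, secondary):
        self._parents[secondary].add(parent)

    def affected_by(self, changed):
        """Return the changed volume plus everything above it, nearest first."""
        order, seen, queue = [], {changed}, deque([changed])
        while queue:
            vol = queue.popleft()
            order.append(vol)
            for parent in sorted(self._parents.get(vol, ())):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return order


def ripple_up(deps, changed, generate, current):
    """Regenerate affected graphs and report which ones actually differ.

    generate(vol) returns a freshly built graph for vol; current maps
    volume name to the graph last handed out, and is updated in place.
    """
    differing = []
    for vol in deps.affected_by(changed):
        new_graph = generate(vol)
        if new_graph != current.get(vol):   # compare to its predecessor
            current[vol] = new_graph
            differing.append(vol)
    return differing
```

This assumes generate() rebuilds each graph from scratch, so the order
in which affected volumes are visited does not matter for correctness.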

Any other thoughts/suggestions?