[Gluster-devel] Volume management proposal (4.0)
Shyam
srangana at redhat.com
Wed Dec 3 14:47:21 UTC 2014
On 12/02/2014 10:07 AM, Jeff Darcy wrote:
> I've been thinking and experimenting around some of the things we need
> in this area to support 4.0 features, especially data classification
>
> http://www.gluster.org/community/documentation/index.php/Features/data-classification
>
> Before I suggest anything, a little background on how brick and volume
> management *currently* works.
>
> (1) Users give us bricks, specified by host:path pairs.
>
> (2) We assign each brick a unique ID, and create a "hidden" directory
> structure in .glusterfs to support our needs.
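For reference, that hidden structure indexes every file by its GFID,
with the first two directory levels taken from the leading bytes of the
GFID; the path and GFID below are made up:

    /bricks/b1/.glusterfs/0a/1b/0a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d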
>
> (3) When bricks are combined into a volume, we create a bunch of
> volfiles.
>
> (4) There is one volfile per brick, consisting of a linear "stack" of
> translators from storage/posix (which interacts with the local file
> system) up to protocol/server (which listens for connections from
> clients).
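A real brick volfile stacks more translators (locks, io-threads,
marker, and so on), but trimmed to its essence it is a linear stack
like this (volume names and paths illustrative):

    volume myvol-posix
        type storage/posix
        option directory /bricks/b1
    end-volume

    volume myvol-server
        type protocol/server
        option transport-type tcp
        subvolumes myvol-posix
    end-volume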
>
> (5) When the volume is started, we start one glusterfsd process for each
> brick volfile.
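Each of those processes is pointed at its own brick volfile by ID,
along these lines (trimmed; the real command line carries more
options):

    glusterfsd -s host1 --volfile-id myvol.host1.bricks-b1 \
        --brick-name /bricks/b1 -l /var/log/glusterfs/bricks/b1.log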
>
> (6) There is also a more tree-like volfile for clients, constructed as
> follows:
>
> (6a) We start with a protocol/client translator for each brick.
>
> (6b) We combine bricks into N-way sets using AFR, EC, etc.
>
> (6c) We combine those sets using DHT.
>
> (6d) We push a bunch of (mostly performance-related) translators on top.
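Putting 6a through 6c together for a two-brick replica volume (with
6d's performance translators elided), the client graph looks roughly
like:

    volume myvol-client-0
        type protocol/client
        option remote-host host1
        option remote-subvolume /bricks/b1
    end-volume

    volume myvol-client-1
        type protocol/client
        option remote-host host2
        option remote-subvolume /bricks/b1
    end-volume

    volume myvol-replicate-0
        type cluster/replicate
        subvolumes myvol-client-0 myvol-client-1
    end-volume

    volume myvol-dht
        type cluster/distribute
        subvolumes myvol-replicate-0
    end-volume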
>
> (7) When a volume is mounted, we fetch the volfile and instantiate all
> of the translators described there, plus mount/fuse to handle the local
> file system interface. For GFAPI it's the same except for mount/fuse.
>
> (8) There are also volfiles for NFS, self-heal daemons, quota daemons,
> snapshots, etc. I'm going to ignore those for now.
>
> The code for all of this is in glusterd-volgen.c, but I don't recommend
> looking at it for long because it's one of the ugliest hairballs I've
> ever seen. In fact, you'd be hard pressed to recognize the above
> sequence of steps in that code. Pieces that belong together are
> splattered all over. Pieces that should remain separate are mashed
> together. Pieces that should use common code use copied code instead.
> As a prerequisite for adding new functionality, what's already there
> needs to be heavily refactored so it makes some sense.
>
> So . . . about that new functionality. The core idea of data
> classification is to apply step 6c repeatedly, with variants of DHT that
> do tiering or various other kinds of intelligent placement instead of
> the hash-based random placement we do now. "NUFA" and "switch" are
> already examples of this. In fact, their needs drove some of the code
> structure that makes data classification (DC) possible.
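For instance, the tiering prototype stacks a DHT variant over two
ordinary DHT subgraphs, roughly like this (translator and volume names
are illustrative):

    volume hot-dht
        type cluster/distribute
        subvolumes hot-replicate-0 hot-replicate-1
    end-volume

    volume cold-dht
        type cluster/distribute
        subvolumes cold-replicate-0 cold-replicate-1
    end-volume

    volume myvol-tier
        type cluster/tier
        subvolumes cold-dht hot-dht
    end-volume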
>
> The trickiest question with DC has always been how the user specifies
> these complex placement policies, which we then turn into volfiles. In
> the interests of maximizing compatibility with existing scripts and user
> habits, what I propose is that we do this by allowing the user to
> combine existing volumes into a new higher-level volume. This is
> similar to how the tiering prototype already works, except that
> "combining" volumes is more general than "attaching" a cache volume in
> that specific context. There are also some other changes we should make
> to do this right.
As I read this, I assume the aim is to ease administration, and not to
ease the code complexity mentioned above, right?
The code complexity needs to be eased too, but I would assume that is a
by-product of this change.
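For concreteness, one could imagine a CLI flow along these lines
("combine" is purely hypothetical, a stand-in for whatever verb gets
chosen; the create commands use today's syntax):

    gluster volume create fast replica 2 ssd1:/b ssd2:/b
    gluster volume create slow disperse 6 redundancy 2 hdd{1..6}:/b
    gluster volume combine tiered fast slow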
>
> (A) Each volume has an explicit flag indicating whether it is a
> "primary" volume to be mounted etc. directly by users or a "secondary"
> volume incorporated into another.
>
> (B) Each volume has a graph representing steps 6a through 6c above (i.e.
> up to DHT). Only primary volumes have a (second) graph representing 6d
> and 7 as well.
Do we intend to break this up into multiple secondary volumes, i.e.,
can an admin create pure replicate secondary volumes and then create a
further secondary volume from these by adding, say, DHT?
I ask this for two reasons:
If we bunch up everything through 6c into a single secondary volume, we
may not reduce admin complexity when creating volumes that involve
multiple tiers, so we should (or at least could) allow creating
secondary volumes from other secondary volumes.
If we do _not_ bunch up, then we would have several secondary volumes,
and the settings (as I think about it) for each secondary volume become
a bit non-intuitive. In other words, we would be dealing with a chain
of secondary volumes, each with its own name, and would initiate admin
operations (like rebalance) on possibly each of these. I am not sure I
am portraying the complexity I see well here; the sketch below may
help.
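To illustrate the chain I mean, with the same hypothetical "combine"
syntax as above:

    gluster volume create r0 replica 2 h1:/b h2:/b    # secondary
    gluster volume create r1 replica 2 h3:/b h4:/b    # secondary
    gluster volume combine d0 r0 r1                   # secondary of secondaries
    gluster volume combine tiered fast d0             # the primary

Each of r0, r1, and d0 is a named secondary volume that an admin might
end up targeting individually for operations like rebalance.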
>
> (C) The graph/volfile for a primary volume might contain references to
> secondary volumes. These references are resolved at the same time that
> 6d and 7 are applied, yielding a complete graph without references.
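One way such a reference could appear in the primary's graph is as a
placeholder node that volgen expands while applying 6d and 7. No such
translator type exists today; this is purely a sketch:

    volume tiered-fast
        type meta/volume-ref
        option volume-name fast
    end-volume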
>
> (D) Secondary volumes may not be started and stopped by the user.
> Instead, a secondary volume is automatically started or stopped along
> with its primary.
>
> (E) The user must specify an explicit option to see the status of
> secondary volumes. Without this option, secondary volumes are hidden
> and status for their constituent bricks will be shown as though they
> were (directly) part of the corresponding primary volume.
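So something like a hypothetical

    gluster volume status tiered --show-secondary

while a plain "gluster volume status tiered" would fold the secondary
volumes' bricks into the primary's listing.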
>
> As it turns out, most of the "extra" volfiles in step 8 above also
> have their own steps 6d and 7, so implementing step C will probably make
> those paths simpler as well.
>
> The one big remaining question is how this will work in terms of
> detecting and responding to volume configuration changes. Currently we
> treat each volfile as a completely independent entity, and just compare
> whole graphs. Instead, what we need to do is track dependencies between
> graphs (a graph of graphs?) so that a change to a secondary volume will
> "ripple up" to its primary where a new graph can be generated and
> compared to its predecessor.
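For example, with the hypothetical volumes above, glusterd would record
that "tiered" depends on "fast" and "slow", so that an add-brick on
"fast" regenerates the graph for "tiered" and compares it against the
one currently in use.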
>
> Any other thoughts/suggestions?
Maybe a brief example of how this works would help clarify some thoughts.
Thanks,
Shyam