[Gluster-devel] Volume management proposal (4.0)

Wed Dec 3 05:01:21 UTC 2014


On 12/02/2014 08:37 PM, Jeff Darcy wrote:
> I've been thinking and experimenting around some of the things we need
> in this area to support 4.0 features, especially data classification
> 
> http://www.gluster.org/community/documentation/index.php/Features/data-classification
> 
> Before I suggest anything, a little background on how brick and volume
> management *currently* works.
> 
> (1) Users give us bricks, specified by host:path pairs.
> 
> (2) We assign each brick a unique ID, and create a "hidden" directory
> structure in .glusterfs to support our needs.
> 
> (3) When bricks are combined into a volume, we create a bunch of
> volfiles.
> 
> (4) There is one volfile per brick, consisting of a linear "stack" of
> translators from storage/posix (which interacts with the local file
> system) up to protocol/server (which listens for connections from
> clients).
> 
> (5) When the volume is started, we start one glusterfsd process for each
> brick volfile.
> 
> (6) There is also a more tree-like volfile for clients, constructed as
> follows:
> 
> (6a) We start with a protocol/client translator for each brick.
> 
> (6b) We combine bricks into N-way sets using AFR, EC, etc.
> 
> (6c) We combine those sets using DHT.
> 
> (6d) We push a bunch of (mostly performance-related) translators on top.
> 
> (7) When a volume is mounted, we fetch the volfile and instantiate all of
> the translators described there, plus mount/fuse to handle the local
> file system interface.  For GFAPI it's the same except for mount/fuse.
> 
> (8) There are also volfiles for NFS, self-heal daemons, quota daemons,
> snapshots, etc.  I'm going to ignore those for now.
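
To make steps 4 and 6a through 6d concrete, here is a minimal Python sketch of how the brick stack and the client graph are assembled. It is an illustration only, not what glusterd-volgen.c does: the Xlator class, the function names, and the choice of two performance translators are assumptions made for the example, and the brick stack omits everything between storage/posix and protocol/server.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Xlator:
    """Hypothetical stand-in for one translator node in a volfile graph."""
    type: str
    name: str
    children: List["Xlator"] = field(default_factory=list)


def brick_stack(brick: str) -> Xlator:
    # Step 4: a linear stack from storage/posix up to protocol/server.
    # Real brick volfiles have more translators in between; omitted here.
    posix = Xlator("storage/posix", f"{brick}-posix")
    return Xlator("protocol/server", f"{brick}-server", [posix])


def client_graph(bricks: List[str], replica: int) -> Xlator:
    # Step 6a: one protocol/client translator per brick.
    clients = [Xlator("protocol/client", f"{b}-client") for b in bricks]
    # Step 6b: group the clients into N-way sets (AFR shown here).
    sets = [
        Xlator("cluster/replicate", f"replicate-{i // replica}",
               clients[i:i + replica])
        for i in range(0, len(clients), replica)
    ]
    # Step 6c: combine the sets using DHT.
    top = Xlator("cluster/distribute", "dht", sets)
    # Step 6d: push performance translators on top (two picked as examples).
    for xl_type in ("performance/write-behind", "performance/io-cache"):
        top = Xlator(xl_type, xl_type.split("/")[1], [top])
    return top


def depth(x: Xlator) -> int:
    # Number of translator layers from this node down to the deepest leaf.
    return 1 + max((depth(c) for c in x.children), default=0)
```

With four bricks and replica 2, client_graph() yields two replicate sets under a single DHT node, with the performance translators stacked above it.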
> 
> The code for all of this is in glusterd-volgen.c, but I don't recommend
> looking at it for long because it's one of the ugliest hairballs I've
> ever seen.  In fact, you'd be hard pressed to recognize the above
> sequence of steps in that code.  Pieces that belong together are
> splattered all over.  Pieces that should remain separate are mashed
> together.  Pieces that should use common code use copied code instead.
> As a prerequisite for adding new functionality, what's already there
> needs to be heavily refactored so it makes some sense.
> 
> So . . . about that new functionality.  The core idea of data
> classification is to apply step 6c repeatedly, with variants of DHT that
> do tiering or various other kinds of intelligent placement instead of
> the hash-based random placement we do now.  "NUFA" and "switch" are
> already examples of this.  In fact, their needs drove some of the code
> structure that makes data classification (DC) possible.
> 
> The trickiest question with DC has always been how the user specifies
> these complex placement policies, which we then turn into volfiles.  In
> the interests of maximizing compatibility with existing scripts and user
> habits, what I propose is that we do this by allowing the user to
> combine existing volumes into a new higher-level volume.  This is
> similar to how the tiering prototype already works, except that
> "combining" volumes is more general than "attaching" a cache volume in
> that specific context.  There are also some other changes we should make
> to do this right.
> 
> (A) Each volume has an explicit flag indicating whether it is a
> "primary" volume to be mounted etc. directly by users or a "secondary"
> volume incorporated into another.
> 
> (B) Each volume has a graph representing steps 6a through 6c above (i.e.
> up to DHT).  Only primary volumes have a (second) graph representing 6d
> and 7 as well.
> 
> (C) The graph/volfile for a primary volume might contain references to
> secondary volumes.  These references are resolved at the same time that
> 6d and 7 are applied, yielding a complete graph without references.
> 
> (D) Secondary volumes may not be started and stopped by the user.
> Instead, a secondary volume is automatically started or stopped along
> with its primary.
> 
> (E) The user must specify an explicit option to see the status of
> secondary volumes.  Without this option, secondary volumes are hidden
> and status for their constituent bricks will be shown as though they
> were (directly) part of the corresponding primary volume.
IIUC, secondary volumes are internal representations that are not
exposed to the user. If so, why do we need to provide an explicit
option to see their status? Correct me if my understanding is wrong.

~Atin
> 
> As it turns out, most of the "extra" volfiles in step 8 above also
> have their own steps 6d and 7, so implementing step C will probably make
> those paths simpler as well.
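
One possible reading of items (A) through (C), sketched in Python. Everything here is hypothetical: the Node, VolRef and Volume classes, the resolve() helper, and the "cluster/placement-variant" type name are made up for illustration and do not correspond to existing glusterd structures. The sketch also assumes that a reference may only point at a secondary volume and that reference cycles are an error, neither of which the proposal states explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Node:
    """Hypothetical translator node; children may be nodes or references."""
    type: str
    children: List[Union["Node", "VolRef"]] = field(default_factory=list)


@dataclass
class VolRef:
    """Placeholder naming a secondary volume (item C)."""
    volname: str


@dataclass
class Volume:
    name: str
    primary: bool  # item A: explicit primary/secondary flag
    graph: Node    # item B: the graph for steps 6a through 6c


def resolve(item: Union[Node, VolRef], volumes: Dict[str, Volume],
            seen: Tuple[str, ...] = ()) -> Node:
    """Return a copy of the graph with every reference replaced (item C)."""
    if isinstance(item, VolRef):
        if item.volname in seen:
            raise ValueError(f"reference cycle through {item.volname}")
        target = volumes[item.volname]
        if target.primary:
            # Assumption: only secondary volumes may be incorporated.
            raise ValueError(f"{item.volname} is a primary volume")
        return resolve(target.graph, volumes, seen + (item.volname,))
    return Node(item.type, [resolve(c, volumes, seen) for c in item.children])


def has_refs(node: Node) -> bool:
    # True if any unresolved reference remains anywhere below this node.
    return any(isinstance(c, VolRef) or has_refs(c) for c in node.children)
```

Steps 6d and 7 would then be applied on top of the Node that resolve() returns for a primary volume.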
> 
> The one big remaining question is how this will work in terms of
> detecting and responding to volume configuration changes.  Currently we
> treat each volfile as a completely independent entity, and just compare
> whole graphs.  Instead, what we need to do is track dependencies between
> graphs (a graph of graphs?) so that a change to a secondary volume will
> "ripple up" to its primary where a new graph can be generated and
> compared to its predecessor.
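
A rough Python sketch of the "graph of graphs" idea, under stated assumptions: the GraphTracker class and its method names are hypothetical, a generated graph is represented as a plain string, and the caller supplies the generate function. It only shows the shape of the bookkeeping: record which primaries depend on which secondaries, and on a change walk upward, regenerating each tracked graph and comparing it to its predecessor.

```python
from typing import Callable, Dict, List, Set


class GraphTracker:
    """Hypothetical tracker for dependencies between volume graphs."""

    def __init__(self, generate: Callable[[str], str]):
        self.generate = generate               # volume name -> generated graph
        self.parents: Dict[str, Set[str]] = {} # secondary -> volumes using it
        self.current: Dict[str, str] = {}      # last generated graph per volume

    def add_dependency(self, primary: str, secondary: str) -> None:
        self.parents.setdefault(secondary, set()).add(primary)

    def track(self, volname: str) -> None:
        # Remember the current graph so later changes can be detected.
        self.current[volname] = self.generate(volname)

    def on_change(self, volname: str) -> List[str]:
        """Ripple a change upward; return tracked volumes whose graph changed."""
        changed: List[str] = []
        pending = [volname]
        visited: Set[str] = set()
        while pending:
            name = pending.pop()
            if name in visited:
                continue
            visited.add(name)
            if name in self.current:
                new_graph = self.generate(name)
                if new_graph != self.current[name]:
                    self.current[name] = new_graph
                    changed.append(name)
            # Continue to every volume that incorporates this one.
            pending.extend(self.parents.get(name, ()))
        return sorted(changed)
```

Only volumes registered with track() are regenerated and compared, so a change to a secondary that does not alter the primary's generated graph produces no notification.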
> 
> Any other thoughts/suggestions?