[Gluster-devel] Volume management proposal (4.0)
Shyam
srangana at redhat.com
Wed Dec 3 14:47:21 UTC 2014
On 12/02/2014 10:07 AM, Jeff Darcy wrote:
> I've been thinking and experimenting around some of the things we need
> in this area to support 4.0 features, especially data classification
>
> http://www.gluster.org/community/documentation/index.php/Features/data-classification
>
> Before I suggest anything, a little background on how brick and volume
> management *currently* works.
>
> (1) Users give us bricks, specified by host:path pairs.
>
> (2) We assign each brick a unique ID, and create a "hidden" directory
> structure in .glusterfs to support our needs.
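For reference, that hidden structure indexes every file by its GFID,
with the first two directory levels taken from the leading bytes of the
GFID; the path and GFID below are made up:

    /bricks/b1/.glusterfs/0a/1b/0a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d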
>
> (3) When bricks are combined into a volume, we create a bunch of
> volfiles.
>
> (4) There is one volfile per brick, consisting of a linear "stack" of
> translators from storage/posix (which interacts with the local file
> system) up to protocol/server (which listens for connections from
> clients).
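A real brick volfile stacks more translators (locks, io-threads,
marker, and so on), but trimmed to its essence it is a linear stack
like this (volume names and paths illustrative):

    volume myvol-posix
        type storage/posix
        option directory /bricks/b1
    end-volume

    volume myvol-server
        type protocol/server
        option transport-type tcp
        subvolumes myvol-posix
    end-volume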
>
> (5) When the volume is started, we start one glusterfsd process for each
> brick volfile.
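Each of those processes is pointed at its own brick volfile by ID,
along these lines (trimmed; the real command line carries more
options):

    glusterfsd -s host1 --volfile-id myvol.host1.bricks-b1 \
        --brick-name /bricks/b1 -l /var/log/glusterfs/bricks/b1.log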
>
> (6) There is also a more tree-like volfile for clients, constructed as
> follows:
>
> (6a) We start with a protocol/client translator for each brick.
>
> (6b) We combine bricks into N-way sets using AFR, EC, etc.
>
> (6c) We combine those sets using DHT.
>
> (6d) We push a bunch of (mostly performance-related) translators on top.
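Putting 6a through 6c together for a two-brick replica volume (with
6d's performance translators elided), the client graph looks roughly
like:

    volume myvol-client-0
        type protocol/client
        option remote-host host1
        option remote-subvolume /bricks/b1
    end-volume

    volume myvol-client-1
        type protocol/client
        option remote-host host2
        option remote-subvolume /bricks/b1
    end-volume

    volume myvol-replicate-0
        type cluster/replicate
        subvolumes myvol-client-0 myvol-client-1
    end-volume

    volume myvol-dht
        type cluster/distribute
        subvolumes myvol-replicate-0
    end-volume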
>
> (7) When a volume is mounted, we fetch the volfile and instantiate all
> of the translators described there, plus mount/fuse to handle the local
> file system interface. For GFAPI it's the same except for mount/fuse.
>
> (8) There are also volfiles for NFS, self-heal daemons, quota daemons,
> snapshots, etc. I'm going to ignore those for now.
>
> The code for all of this is in glusterd-volgen.c, but I don't recommend
> looking at it for long because it's one of the ugliest hairballs I've
> ever seen. In fact, you'd be hard pressed to recognize the above
> sequence of steps in that code. Pieces that belong together are
> splattered all over. Pieces that should remain separate are mashed
> together. Pieces that should use common code use copied code instead.
> As a prerequisite for adding new functionality, what's already there
> needs to be heavily refactored so it makes some sense.
>
> So . . . about that new functionality. The core idea of data
> classification is to apply step 6c repeatedly, with variants of DHT that
> do tiering or various other kinds of intelligent placement instead of
> the hash-based random placement we do now. "NUFA" and "switch" are
> already examples of this. In fact, their needs drove some of the code
> structure that makes data classification (DC) possible.
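For instance, the tiering prototype stacks a DHT variant over two
ordinary DHT subgraphs, roughly like this (translator and volume names
are illustrative):

    volume hot-dht
        type cluster/distribute
        subvolumes hot-replicate-0 hot-replicate-1
    end-volume

    volume cold-dht
        type cluster/distribute
        subvolumes cold-replicate-0 cold-replicate-1
    end-volume

    volume myvol-tier
        type cluster/tier
        subvolumes cold-dht hot-dht
    end-volume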
>
> The trickiest question with DC has always been how the user specifies
> these complex placement policies, which we then turn into volfiles. In
> the interests of maximizing compatibility with existing scripts and user
> habits, what I propose is that we do this by allowing the user to
> combine existing volumes into a new higher-level volume. This is
> similar to how the tiering prototype already works, except that
> "combining" volumes is more general than "attaching" a cache volume in
> that specific context. There are also some other changes we should make
> to do this right.
As I read this, I assume the aim is to ease administration, and not to
ease the code complexity mentioned above, right?
The code complexity needs to be eased too, but I would assume that is a
by-product of this change.
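For concreteness, one could imagine a CLI flow along these lines
("combine" is purely hypothetical, a stand-in for whatever verb gets
chosen; the create commands use today's syntax):

    gluster volume create fast replica 2 ssd1:/b ssd2:/b
    gluster volume create slow disperse 6 redundancy 2 hdd{1..6}:/b
    gluster volume combine tiered fast slow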
>
> (A) Each volume has an explicit flag indicating whether it is a
> "primary" volume to be mounted etc. directly by users or a "secondary"
> volume incorporated into another.
>
> (B) Each volume has a graph representing steps 6a through 6c above (i.e.
> up to DHT). Only primary volumes have a (second) graph representing 6d
> and 7 as well.
Do we intend to break this up into multiple secondary volumes, i.e.,
can an admin create pure replicate secondary volumes and then create a
further secondary volume from these by adding, say, DHT?
I ask this for two reasons:
If we bunch up everything through 6c into a single secondary volume, we
may not reduce admin complexity when creating volumes that involve
multiple tiers, so we should (or at least could) allow creating
secondary volumes from other secondary volumes.
If we do _not_ bunch up, then we would have several secondary volumes,
and the settings (as I think about it) for each secondary volume become
a bit non-intuitive. In other words, we would be dealing with a chain
of secondary volumes, each with its own name, and would initiate admin
operations (like rebalance) on possibly each of these. I am not sure I
am portraying the complexity I see well here; the sketch below may
help.
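To illustrate the chain I mean, with the same hypothetical "combine"
syntax as above:

    gluster volume create r0 replica 2 h1:/b h2:/b    # secondary
    gluster volume create r1 replica 2 h3:/b h4:/b    # secondary
    gluster volume combine d0 r0 r1                   # secondary of secondaries
    gluster volume combine tiered fast d0             # the primary

Each of r0, r1, and d0 is a named secondary volume that an admin might
end up targeting individually for operations like rebalance.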
>
> (C) The graph/volfile for a primary volume might contain references to
> secondary volumes. These references are resolved at the same time that
> 6d and 7 are applied, yielding a complete graph without references.
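One way such a reference could appear in the primary's graph is as a
placeholder node that volgen expands while applying 6d and 7. No such
translator type exists today; this is purely a sketch:

    volume tiered-fast
        type meta/volume-ref
        option volume-name fast
    end-volume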
>
> (D) Secondary volumes may not be started and stopped by the user.
> Instead, a secondary volume is automatically started or stopped along
> with its primary.
>
> (E) The user must specify an explicit option to see the status of
> secondary volumes. Without this option, secondary volumes are hidden
> and status for their constituent bricks will be shown as though they
> were (directly) part of the corresponding primary volume.
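So something like a hypothetical

    gluster volume status tiered --show-secondary

while a plain "gluster volume status tiered" would fold the secondary
volumes' bricks into the primary's listing.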
>
> As it turns out, most of the "extra" volfiles in step 8 above also
> have their own steps 6d and 7, so implementing step C will probably make
> those paths simpler as well.
>
> The one big remaining question is how this will work in terms of
> detecting and responding to volume configuration changes. Currently we
> treat each volfile as a completely independent entity, and just compare
> whole graphs. Instead, what we need to do is track dependencies between
> graphs (a graph of graphs?) so that a change to a secondary volume will
> "ripple up" to its primary where a new graph can be generated and
> compared to its predecessor.
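For example, with the hypothetical volumes above, glusterd would record
that "tiered" depends on "fast" and "slow", so that an add-brick on
"fast" regenerates the graph for "tiered" and compares it against the
one currently in use.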
>
> Any other thoughts/suggestions?
Maybe a brief example of how this works would help clarify some thoughts.
Thanks,
Shyam