[Gluster-devel] Glusterd 'Management Volume' proposal
Shyam
srangana at redhat.com
Wed Dec 3 17:49:45 UTC 2014
Top posting, as these are mostly queries rather than comments on the MV
proposal described below.
1) With the current scheme in glusterd, the O(N^2) cost is because the
configuration is replicated to every peer in the cluster, correct?
- In the new approach (either MV or otherwise), the idea is to maintain
a configuration cluster, or a set of nodes that have configuration
related information in them, correct?
- The rest of the peers get the latest configuration as it changes
(the watch functionality that Jeff brings out); this part of the
requirement is not covered in the proposal. It would help if this were
elaborated as well.
- We do have the limitation today that some clients _may_ not have the
latest graph (one of the configuration items here). With the new
proposal, is there any thought on resolving this? Is it required? I
assume brick nodes have this strict enforcement today, and will in the
future as well.
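To make the watch requirement concrete, here is a minimal sketch of the pattern being asked about: peers register callbacks with the configuration store and are pushed changes as they happen. All names here are hypothetical; none of this is actual glusterd code.

```python
# Minimal sketch of the "watch" pattern under discussion: peers register
# callbacks with the configuration store and are notified on every change.
# All names are hypothetical, invented for this illustration.

class ConfigStore:
    def __init__(self):
        self.data = {}        # key -> value
        self.watchers = {}    # key -> list of callbacks

    def watch(self, key, callback):
        self.watchers.setdefault(key, []).append(callback)

    def put(self, key, value):
        self.data[key] = value
        for cb in self.watchers.get(key, []):
            cb(key, value)    # push the change to every watcher

class Peer:
    """A peer caches the latest volume graph it was told about."""
    def __init__(self, name):
        self.name = name
        self.graph = None

    def on_change(self, key, value):
        self.graph = value

store = ConfigStore()
peers = [Peer("peer%d" % i) for i in range(3)]
for p in peers:
    store.watch("volume/vol0/graph", p.on_change)

store.put("volume/vol0/graph", "graph-v2")
# every watching peer now holds graph-v2
```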
2) With a >1000 node setup, is it intended that we have cascade
functionality to handle configuration changes? I.e., is there a defined
set of _watchers_ on the configuration cluster, each of which in turn
serves a set of peers for their _watch_ functionality?
This may be overkill (i.e., requiring cascading), but is it required
when we consider cases like geo-rep or tiers in different data centers
that need configuration updates? Would all of them watching the
configuration cluster be a problem requiring attention?
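For a sense of the fan-out in question, a back-of-the-envelope model (purely illustrative numbers, not a proposal):

```python
# Toy model of cascaded watchers: instead of all N peers watching the
# small configuration cluster directly, k intermediate watchers each
# relay updates to roughly N/k peers. Numbers are illustrative only.

def direct_watch_load(n_peers):
    # every peer holds its own watch against the config cluster
    return n_peers

def cascaded_watch_load(n_peers, k_watchers):
    # the config cluster serves only the k watchers; each watcher
    # fans the update out to its share of the peers
    per_watcher = -(-n_peers // k_watchers)  # ceiling division
    return k_watchers, per_watcher

print(direct_watch_load(1000))        # 1000 watches on the config cluster
print(cascaded_watch_load(1000, 10))  # (10, 100): 10 watches, 100 peers each
```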
Onto the MV proposal,
- Using a smaller, pure replicate gluster volume, sans a few xlators,
with locking enforced by its consumers, seems like a good way to get
replication and consistency, and hence availability, of the
configuration information.
- And as you mention, a POSIX-y interface with an application on top of
it seems heavyweight for the key-value store that a configuration
volume presents.
- We still need watcher functionality, and possibly cascading support.
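As an illustration of "locking enforced by the consumers": since the MV is a POSIX-y filesystem, writers could serialize updates with an advisory lock on a well-known file. A minimal sketch; the paths and layout here are invented for the example:

```python
# Sketch of consumer-enforced locking over a shared POSIX filesystem
# (a temp directory stands in for the mounted MV). All paths are
# hypothetical, not an actual glusterd layout.
import fcntl
import json
import os
import tempfile

mv_root = tempfile.mkdtemp()            # stand-in for the mounted MV
lock_path = os.path.join(mv_root, "cluster.lock")
conf_path = os.path.join(mv_root, "vols.json")

def update_config(mutate):
    """Read-modify-write the shared config under an exclusive lock."""
    with open(lock_path, "w") as lockf:
        fcntl.flock(lockf, fcntl.LOCK_EX)   # exclusive across consumers
        try:
            conf = {}
            if os.path.exists(conf_path):
                with open(conf_path) as f:
                    conf = json.load(f)
            mutate(conf)
            with open(conf_path, "w") as f:
                json.dump(conf, f)
        finally:
            fcntl.flock(lockf, fcntl.LOCK_UN)

update_config(lambda c: c.update({"vol0": {"replica": 3}}))
```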
I am _not_ well enough aware of the internals of etcd (or the other
frameworks being discussed) to compare what we could leverage from
them, what functionality they lack, or the production-worthiness of the
code.
Going by your initial statements, the concern seems to be the
dependency on another component, in terms of releases and required bug
fixes. I would go on to state that if the infrastructure is production
ready, then managing the dependency would be relatively easy. The real
challenge is how much effort needs to be spent understanding the
internals, and whether we need to do so, for us to be able to support
this in gluster deployments. Any clues or ideas on this, to help make a
decision?
Shyam
On 11/19/2014 02:22 AM, Krishnan Parthasarathi wrote:
> All,
>
> We have been thinking of many approaches to address some of Glusterd's correctness
> (during failures and at scale) and scalability concerns. A recent email thread on
> Glusterd-2.0 was along these lines. While that discussion is still valid, we have been
> considering dogfooding as a viable option to solve our problems. This is not the first
> time this has been mentioned but for various reasons didn't really take off. The following
> proposal solves Glusterd's requirement for a distributed (consistent) store using a GlusterFS
> volume. Then who manages that GlusterFS volume? To find answers to that and
> more, read further.
>
> [The following content is also available here: https://gist.github.com/krisis/945e45e768ef1c4e446d
> Please keep the discussions on the mailing list and _not_ in github, for
> tractability reasons.]
>
>
> ##Abstract
>
> Glusterd, the management daemon for GlusterFS, maintains the volume and cluster
> configuration store using a home-grown replication algorithm. Some of its
> shortcomings are as follows.
>
> - Involves O(N^2) (in number of nodes) network messages to replicate
> configuration changes for every command
>
> - Doesn't rely on quorum and is not resilient to network partitions
>
> - Recovery of nodes that come back online can choke the network at scale
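> To put a rough number on the first point, a toy comparison of message counts
> under full peer-to-peer replication versus a small replicated store plus
> per-peer notifications (illustrative arithmetic only, not a measurement):

```python
# Illustrative message counts behind the O(N^2) claim: with the current
# scheme each configuration change is exchanged among all peers, while a
# 3-way replicated store takes a constant number of writes plus one
# notification per peer. Purely illustrative arithmetic.

def full_mesh_messages(n):
    # each of the n peers ends up exchanging state with every other peer
    return n * (n - 1)

def config_store_messages(n, replica=3):
    # one write replicated 'replica' times, then one notification per peer
    return replica + n

for n in (10, 100, 1000):
    print(n, full_mesh_messages(n), config_store_messages(n))
# at n=1000: 999000 versus 1003 messages
```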
>
> The thousand node glusterd proposal[1], one of the more mature proposals
> addressing the above problems, recommends using a consistent distributed
> store like consul/etcd for maintaining the volume and cluster configuration.
> While the technical merits of this approach make it compelling, operational
> challenges like coordinating between the two communities for releases and
> bug fixes could get out of hand. An alternate approach[2] is to use a
> replicated GlusterFS volume as the distributed store instead. The remainder of
> this email explains how a GlusterFS volume could be used to store configuration
> information.
>
>
> ##Technical details
>
> We will refer to the replicated GlusterFS volume used for storing configuration
> as the Management volume (MV). The following section describes how MV would be
> managed.
>
>
> ###MV management
>
> To begin with, we can restrict the MV to a pure replicated volume with a maximum
> of 3 bricks on 3 different nodes[3]. The brick path can be stored in glusterd.vol
> which is packaged. MV will come into existence only after the first peer probe
> or first volume create operation.
>
> The following example of setting up a GlusterFS storage cluster illustrates how
> things would work under the proposed scheme.
>
> - Install glusterfs server packages on a storage node.
>
> - Start glusterd service.
>
> - Create a volume. --> Now, the MV is created with one brick and mounted under
> /var/lib/glusterd
>
> - Add a peer to the cluster --> Now, MV is expanded to a 2-way replicated
> volume with the second brick in the new peer. MV is mounted in the new peer
> under /var/lib/glusterd.
>
> - Create more volumes.
>
> - Add the third peer to the cluster --> MV is expanded to a 3-way replicated
> volume with the third brick in the new peer. MV is mounted under
> /var/lib/glusterd in the new peer. This is the last time MV is expanded.
>
> - Any further peers added to the cluster would only mount the MV under
> /var/lib/glusterd.
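> The lifecycle above can be modeled in a few lines; the cap of 3 bricks follows
> the walkthrough, while the names and structure are invented for illustration:

```python
# Toy model of the MV lifecycle described above: the first three peers
# to join contribute an MV brick each; every later peer only mounts the
# volume. Names are hypothetical.
MAX_MV_REPLICA = 3

class Cluster:
    def __init__(self):
        self.mv_bricks = []   # peers hosting an MV brick
        self.mounted = []     # peers with MV mounted at /var/lib/glusterd

    def add_peer(self, peer):
        if len(self.mv_bricks) < MAX_MV_REPLICA:
            self.mv_bricks.append(peer)   # expand the MV replica set
        self.mounted.append(peer)         # every peer mounts the MV

c = Cluster()
for name in ["n1", "n2", "n3", "n4", "n5"]:
    c.add_peer(name)

print(c.mv_bricks)     # ['n1', 'n2', 'n3'] -- MV capped at 3 bricks
print(len(c.mounted))  # 5 -- all peers mount it
```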
>
> The above restrictions placed on MV allow us to escape the need for a robust distributed
> store for MV's volume information and volume files.
>
> ###Configuration details of MV
> - Peers that host bricks for MV would have a boolean option in glusterd.vol.
> E.g., something like,
> option mv_host on
>
> - The brick path for MV would have a default from the packaged glusterd.vol.
> E.g.,
> option mv_brick /mv/brick
>
> - Replica count. This could be stored as part of glusterd.vol too.
> E.g.,
> option mv_replica 3
>
> - The ports for MV bricks could be reserved by glusterd's port mapper. E.g.,
> 49152 could be reserved for the MV brick on each node, given that we would
> have only one MV brick per peer.
>
> - MV would benefit from volume options such as client-side quorum, server-side
> quorum and (optionally) proactive self-heal. These could be preset (packaged)
> as part of glusterd.vol too.
>
> - With the brick path, ports and volume options present in glusterd.vol (or
> preset), we can build the in-memory volume info representation on
> initialization of glusterd. This means we can generate MV's volume file
> dynamically on each MV hosting peer when needed and store it in a 'known'
> location on local disk.
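> Putting the options above together, a glusterd.vol on an MV-hosting peer might
> look like the following sketch (the mv_* option names are the ones proposed
> above; the exact syntax is still to be settled):

```
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    # proposed MV options, with packaged defaults
    option mv_host on
    option mv_brick /mv/brick
    option mv_replica 3
end-volume
```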
>
>
> ###Changes in glusterd command execution
>
> Each peer modifies its configuration in /var/lib/glusterd in the commit phase
> of every command execution. With the introduction of MV, the peer on which the
> command is executed will perform the modifications to the configuration in
> /var/lib/glusterd after the commit phase completes on the remaining available
> peers. Note that the other nodes don't perform any updates to MV.
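> A minimal sketch of that flow, with hypothetical names (a shared dict stands
> in for the mounted MV):

```python
# Sketch of the proposed command flow: every available peer runs the
# commit phase, but only the originator persists the result to the MV,
# so the configuration is written exactly once. Names are hypothetical.

class Peer:
    def __init__(self, name, mv):
        self.name = name
        self.mv = mv          # shared dict standing in for the mounted MV
        self.committed = []

    def commit(self, command):
        # apply the in-memory state change on this peer
        self.committed.append(command["key"])

    def execute(self, command, peers):
        for p in peers:       # commit phase on all available peers
            p.commit(command)
        # only the originator updates the configuration in the MV
        self.mv[command["key"]] = command["value"]

mv = {}
peers = [Peer("n%d" % i, mv) for i in range(3)]
originator = peers[0]
originator.execute({"key": "vol0.replica", "value": "3"}, peers)
print(mv)  # {'vol0.replica': '3'} -- written once, by the originator
```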
>
>
> ###How to replace a 'dead' server/peer?
>
> At the moment, I haven't thought of an automatic (or even semi-automatic) way
> of replacing a 'dead' peer. The manual steps would be as follows,
>
> - If the 'dead' peer doesn't host MV bricks, the procedure is the same as in
> previous versions. This proposal doesn't change anything there.
>
> - Provision a new server. Install glusterfs packages.
>
> - Modify the glusterd.vol to have
> option mv_host on
> option mv_replica 3 #as the case may be
>
> - Probe the peer into the cluster. glusterd, on initialization, would replace
> the corresponding MV brick, and replication's self-heal should bring over the
> configuration.
>
> N.B. This procedure assumes the default MV config parameters. For a non-default
> configuration, the brick path should also be updated in glusterd.vol on the new peer.
>
>
> ###How to upgrade from current version?
>
> The steps would be as follows,
>
> - Stop all gluster{d,fs,fsd} processes by stopping the corresponding services.
>
> - Upgrade to this version of glusterfs packages.
>
> - Choose at most 3 servers/peers to host the MV. On these nodes, create the
> default brick directories and modify the (new) glusterd.vol to have
> option mv_host on
> Set the replica count in each peer's glusterd.vol
> option mv_replica 3 #say
>
> - Move the /var/lib/glusterd contents on each peer to a temporary directory.
> Say, /var/lib/glusterd.bkp
>
> - Start the glusterd service on one of the nodes, in 'upgrade' mode. In this
> mode, glusterd would start the MV bricks and mount the MV on /var/lib/glusterd.
> It will not serve cli or mount requests.
>
> - Copy the contents of /var/lib/glusterd.bkp on to (the mounted)
> /var/lib/glusterd.
>
> - Repeat this on all nodes in the cluster.
>
> - Stop glusterd on all nodes. Start glusterd service on all nodes (in 'normal'
> mode).
>
> - Now the storage cluster should be ready for improved operations.
>
>
> ###How to upgrade from this version to future versions?
>
> This is trickier than it should be, given that we are holding MV's configuration
> in glusterd.vol, which is packaged. I would like to hear suggestions from the
> community on this.
>
>
> ###References
> [1] - http://www.gluster.org/community/documentation/index.php/Features/thousand-node-glusterd.
>
> [2] - This approach was initially recommended by Jeff Darcy, who is also the
> author of [1].
>
> [3] - It shouldn't be hard to allow expanding MV beyond 3 bricks, but most
> distributed configuration stores recommend 3- or 5-way replication. At the
> least, this could be made configurable via glusterd.vol.
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
>