[Gluster-devel] Glusterd 'Management Volume' proposal

Krishnan Parthasarathi kparthas at redhat.com
Mon Nov 24 06:43:21 UTC 2014


All,

It would be really helpful to hear some feedback on this proposal. It is important
that we solve Glusterd's configuration replication problem at scale. Getting this
right would help us be better prepared for the kind of changes planned for GlusterFS 4.0.

thanks,
kp

----- Original Message -----
> All,
> 
> We have been thinking of many approaches to address some of Glusterd's
> correctness (during failures and at scale) and scalability concerns. A recent
> email thread on Glusterd-2.0 was along these lines. While that discussion is
> still valid, we have been considering dogfooding as a viable option to solve
> our problems. This is not the first time this has been mentioned, but for
> various reasons it didn't really take off. The following proposal addresses
> Glusterd's requirement for a distributed (consistent) store by using a
> GlusterFS volume. Then who manages that GlusterFS volume? To find answers to
> that and more, read further.
> 
> [The following content is also available here:
> https://gist.github.com/krisis/945e45e768ef1c4e446d
> Please keep the discussions on the mailing list and _not_ in github, for
> traceability reasons.]
>   
> 
> ##Abstract
> 
> Glusterd, the management daemon for GlusterFS, maintains its volume and
> cluster configuration store using a home-grown replication algorithm. Some of
> its shortcomings are as follows.
> 
> - Involves O(N^2) (in the number of nodes) network messages to replicate
>   configuration changes for every command; for example, a 100-node cluster
>   exchanges on the order of 10,000 messages per command
> 
> - Doesn't rely on quorum and is not resilient to network partitions
> 
> - Recovery of nodes that come back online can choke the network at scale
> 
> The thousand-node glusterd proposal[1], one of the more mature proposals
> addressing the above problems, recommends using a consistent distributed
> store like consul/etcd for maintaining the volume and cluster configuration.
> While the technical merits of this approach make it compelling, operational
> challenges like coordinating between the two communities for releases and
> bug-fixes could get out of hand. An alternate approach[2] is to use a
> replicated GlusterFS volume as the distributed store instead. The remainder
> of this email explains how a GlusterFS volume could be used to store
> configuration information.
> 
> 
> ##Technical details
> 
> We will refer to the replicated GlusterFS volume used for storing
> configuration as the Management Volume (MV). The following section describes
> how the MV would be managed.
> 
> 
> ###MV management
> 
> To begin with, we can restrict the MV to a pure replicated volume with a
> maximum of 3 bricks on 3 different nodes[3]. The brick path can be stored in
> glusterd.vol, which is packaged. The MV will come into existence only after
> the first peer probe or the first volume create operation.
> 
> The following example of setting up a GlusterFS storage cluster highlights
> how things would work under the proposed scheme (a CLI sketch follows the
> list below).
> 
> - Install glusterfs server packages on a storage node.
> 
> - Start glusterd service.
> 
> - Create a volume. --> Now, the MV is created with one brick and mounted
>   under /var/lib/glusterd.
> 
> - Add a peer to the cluster --> Now, MV is expanded to a 2-way replicated
>   volume with the second brick in the new peer. MV is mounted in the new peer
>   under /var/lib/glusterd.
> 
> - Create more volumes.
> 
> - Add the third peer to the cluster --> MV is expanded to a 3-way replicated
>   volume with the third brick in the new peer. MV is mounted under
>   /var/lib/glusterd in the new peer. This is the last time MV is expanded.
> 
> - Any further peers added to the cluster would only mount the MV under
>   /var/lib/glusterd.
> 
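> Below is a minimal CLI sketch of the walkthrough above. The hostnames
> (server1..server4), the brick path and the volume name are hypothetical; the
> gluster commands are the existing CLI, and the comments describe what, under
> this proposal, glusterd would do with the MV behind the scenes.
> 
>         # on server1
>         service glusterd start
>         gluster volume create data-vol server1:/bricks/b1
>         #   -> glusterd creates a 1-brick MV and mounts it on /var/lib/glusterd
>         gluster peer probe server2
>         #   -> MV expanded to a 2-way replica, mounted on server2 at /var/lib/glusterd
>         gluster peer probe server3
>         #   -> MV expanded to a 3-way replica; this is the last expansion
>         gluster peer probe server4
>         #   -> server4 only mounts the existing MV under /var/lib/glusterd
> 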
> The above restrictions placed on the MV allow us to escape the need for a
> robust distributed store for the MV's own volume information and volume
> files.
> 
> ###Configuration details of MV
> - Peers that host bricks for the MV would have a boolean option in
>   glusterd.vol. For example, something like,
>         option mv_host on
> 
> - The brick path for the MV would have a default from the packaged
>   glusterd.vol. For example,
>         option mv_brick /mv/brick
> 
> - Replica count. This could be stored as part of glusterd.vol too. For
>   example,
>         option mv_replica 3
> 
> - The ports for MV bricks could be reserved by glusterd's port mapper. For
>   example, 49152 could be reserved for the MV brick on each node, given that
>   we would have only one MV brick per peer.
> 
> - Options to be set on the volume: client-quorum, and optionally proactive
>   self-heal enabled.
> 
> - The MV would benefit from client-side quorum, server-side quorum and other
>   options. These could be preset (packaged) as part of glusterd.vol too.
> 
> - With the brick path, ports and volume options present in glusterd.vol or
>   preset, we can build the in-memory volume info representation when glusterd
>   initializes. This means we can generate the MV's volume file dynamically on
>   each MV-hosting peer when needed and store it in a 'known' location on
>   local disk. (A combined glusterd.vol sketch follows this list.)
> 
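> Putting the options above together, a minimal sketch of glusterd.vol on an
> MV-hosting peer might look as follows. The mv_* option names are proposed,
> not options that exist today; the surrounding volume block mirrors the
> packaged glusterd.vol.
> 
>         volume management
>             type mgmt/glusterd
>             option working-directory /var/lib/glusterd
>             # proposed MV options (illustrative)
>             option mv_host on
>             option mv_brick /mv/brick
>             option mv_replica 3
>         end-volume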
> 
> ###Changes in glusterd command execution
> 
> Today, each peer modifies its configuration in /var/lib/glusterd in the
> commit phase of every command execution. With the introduction of the MV,
> only the peer on which the command is executed will perform the modifications
> to the configuration in /var/lib/glusterd, after the commit phase has
> completed on the remaining available peers. Note that the other nodes don't
> perform any updates to the MV.
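> 
> As an illustration of the effect (the volume name 'data-vol' and the peer
> 'server2' are hypothetical):
> 
>         # run on the originating peer
>         gluster volume set data-vol performance.cache-size 256MB
>         # only this peer writes the updated configuration under the MV mount
>         ls /var/lib/glusterd/vols/data-vol/
>         # other peers see the same files via the replicated MV, with no
>         # glusterd-level replication of the change
>         ssh server2 ls /var/lib/glusterd/vols/data-vol/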
> 
> 
> ###How to replace a 'dead' server/peer?
> 
> At the moment, I haven't thought of an automatic (or even semi-automatic) way
> of replacing a 'dead' peer. The manual steps would be as follows,
> 
> - If the 'dead' peer doesn't host MV bricks, the procedure is the same as in
>   previous versions; this proposal doesn't change anything there.
> 
> - Provision a new server. Install glusterfs packages.
> 
> - Modify the glusterd.vol to have
>         option mv_host on
>         option mv_replica 3 #as the case may be
> 
> - Probe the peer into the cluster. On initialization, glusterd on the new
>   peer would replace the dead peer's brick in the MV with its own, and
>   replication's self-heal should bring over the configuration.
> 
> N.B. This procedure assumes default MV configuration parameters. For a
> non-default configuration, the brick path should also be updated in
> glusterd.vol on the new peer.
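> 
> A rough shell sketch of the above for a peer that did host MV bricks (the
> hostname 'new-server' and the package/service commands are illustrative;
> mv_host/mv_replica are the proposed glusterd.vol options):
> 
>         # on the replacement server
>         yum install glusterfs-server       # or the distribution's equivalent
>         mkdir -p /mv/brick                 # default MV brick path
>         # edit /etc/glusterfs/glusterd.vol to add, inside the management block:
>         #     option mv_host on
>         #     option mv_replica 3
>         service glusterd start
> 
>         # from any surviving peer
>         gluster peer probe new-server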
> 
> 
> ###How to upgrade from current version?
> 
> The steps would be as follows (a shell sketch for one MV-hosting node is
> given after the list):
> 
> - Stop all gluster{d,fs,fsd} processes by stopping the corresponding
>   services.
> 
> - Upgrade to this version of glusterfs packages.
> 
> - Choose at most 3 servers/peers to host the MV. On these nodes, create the
>   default brick directories and modify the (new) glusterd.vol to have
>         option mv_host on
>   Set the replica count in each peer's glusterd.vol as well,
>         option mv_replica 3 #say
> 
> - Move the /var/lib/glusterd contents on each peer to a temporary directory,
>   say, /var/lib/glusterd.bkp.
> 
> - Start the glusterd service on one of the nodes, in 'upgrade' mode. In this
>   mode, glusterd would start the MV bricks and mount the MV on
>   /var/lib/glusterd. It will not serve CLI or mount requests.
> 
> - Copy the contents of /var/lib/glusterd.bkp on to (the mounted)
>   /var/lib/glusterd.
> 
> - Repeat this on all nodes in the cluster.
> 
> - Stop glusterd on all nodes. Start the glusterd service on all nodes (in
>   'normal' mode).
> 
> - Now the storage cluster should be ready for improved operations.
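> 
> A rough shell sketch of the above on one of the chosen MV-hosting nodes. The
> package commands are illustrative, and since the proposal doesn't fix how
> 'upgrade' mode is invoked, the '--mv-upgrade' flag below is a hypothetical
> placeholder:
> 
>         service glusterd stop
>         pkill glusterfs; pkill glusterfsd    # stop remaining gluster processes
>         yum update glusterfs-server          # upgrade to the MV-capable release
>         mkdir -p /mv/brick                   # default MV brick directory
>         # add 'option mv_host on' and 'option mv_replica 3' to glusterd.vol
>         mv /var/lib/glusterd /var/lib/glusterd.bkp
>         mkdir -p /var/lib/glusterd
>         glusterd --mv-upgrade                # hypothetical 'upgrade' mode flag
>         cp -a /var/lib/glusterd.bkp/. /var/lib/glusterd/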
> 
> 
> ###How to upgrade from this version to future versions?
> 
> This is trickier than it should be, given that we are holding the MV's
> configuration in glusterd.vol, which is packaged. I would like to hear
> suggestions from the community on this.
> 
> 
> ###References
> [1] - http://www.gluster.org/community/documentation/index.php/Features/thousand-node-glusterd
> 
> [2] - This approach was initially recommended by Jeff Darcy, who is also the
>       author of [1].
> 
> [3] - It shouldn't be hard to allow expanding the MV beyond 3 bricks, but
>       most distributed configuration stores recommend 3- or 5-way
>       replication. At the least, this could be made configurable via
>       glusterd.vol.
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 

