[Gluster-devel] Glusterd 'Management Volume' proposal

Wed Nov 19 07:22:36 UTC 2014

All,

We have been thinking of many approaches to address some of Glusterd's correctness
(during failures and at scale) and scalability concerns. A recent email thread on 
Glusterd-2.0 was along these lines. While that discussion is still valid, we have been
considering dogfooding as a viable option to solve our problems. This is not the first
time the idea has come up, but for various reasons it never really took off. The following
proposal meets Glusterd's requirement for a distributed (consistent) store using a GlusterFS
volume. Who, then, manages that GlusterFS volume? For the answer to that and more,
read on.

[The following content is also available here: https://gist.github.com/krisis/945e45e768ef1c4e446d
Please keep the discussion on the mailing list and _not_ on GitHub, for tractability
reasons.]

##Abstract

Glusterd, the management daemon for GlusterFS, maintains the volume and cluster
configuration store using a home-grown replication algorithm. Some of its
shortcomings are as follows.

- Involves O(N^2) (in number of nodes) network messages to replicate
  configuration changes for every command

- Doesn't rely on quorum and is not resilient to network partitions

- Recovery of nodes that come back online can choke the network at scale
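
To get a feel for the first point, here is a rough back-of-the-envelope sketch in
Python. It assumes, purely for illustration, that every node exchanges one message
with every other node per command; the exact message pattern isn't spelled out
above, and `all_to_all_messages` is a made-up name.

```python
def all_to_all_messages(nodes: int) -> int:
    """Messages per command if each node sends one message to every other node.

    Assumption for illustration only: one message per ordered pair of nodes.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes * (nodes - 1)


for n in (3, 10, 100, 1000):
    print(f"{n:>5} nodes -> {all_to_all_messages(n):>7} messages per command")
```

Under that assumption the count grows from 6 messages at 3 nodes to 999000 at
1000 nodes, which is the kind of growth the O(N^2) bound describes.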

The thousand-node glusterd proposal[1], one of the more mature proposals
addressing the above problems, recommends using a consistent distributed
store like consul/etcd to maintain the volume and cluster configuration.
While the technical merits of this approach make it compelling, operational
challenges such as coordinating releases and bug-fixes between the two
communities could get out of hand.  An alternate approach[2] is to use a
replicated GlusterFS volume as the distributed store instead. The remainder of
this email explains how a GlusterFS volume could be used to store configuration
information.

##Technical details

We will refer to the replicated GlusterFS volume used for storing configuration
as the Management volume (MV). The following section describes how MV would be
managed.

###MV management

To begin with, we can restrict the MV to a pure replicated volume with a maximum
of 3 bricks on 3 different nodes[3]. The brick path can be stored in glusterd.vol,
which is packaged. MV will come into existence only after the first peer probe
or first volume create operation.

The following example of setting up a GlusterFS storage cluster shows how
things would work under the proposed scheme.

- Install glusterfs server packages on a storage node.

- Start glusterd service.

- Create a volume. --> Now, the MV is created with one brick and mounted under
  /var/lib/glusterd

- Add a peer to the cluster --> Now, MV is expanded to a 2-way replicated
  volume with the second brick in the new peer. MV is mounted in the new peer
  under /var/lib/glusterd.

- Create more volumes.

- Add the third peer to the cluster --> MV is expanded to a 3-way replicated
  volume with the third brick in the new peer. MV is mounted under
  /var/lib/glusterd in the new peer. This is the last time MV is expanded.

- Any further peers added to the cluster would only mount the MV under
  /var/lib/glusterd.

The above restrictions placed on MV allow us to escape the need for a robust distributed
store for MV's volume information and volume files.
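
The walkthrough above boils down to a simple rule, sketched here in Python. The
function names and the returned action strings are hypothetical and only meant to
illustrate the rule; they are not part of glusterd.

```python
MAX_MV_BRICKS = 3  # the proposal caps MV at 3 bricks on 3 different nodes


def mv_brick_count(peer_count: int) -> int:
    """Number of MV bricks once the cluster has `peer_count` peers."""
    if peer_count < 1:
        raise ValueError("a cluster needs at least one peer")
    return min(peer_count, MAX_MV_BRICKS)


def action_for_new_peer(peer_count_after_add: int) -> str:
    """What happens to MV when a peer joins (hypothetical action names)."""
    if peer_count_after_add <= MAX_MV_BRICKS:
        # The new peer contributes a brick and mounts MV.
        return "add-brick-and-mount"
    # MV is already 3-way replicated; the new peer only mounts it.
    return "mount-only"


for peers in range(1, 6):
    print(peers, mv_brick_count(peers), action_for_new_peer(peers))
```

Because the brick count never exceeds 3 and each step is determined by the peer
count alone, the MV's own layout can be derived locally instead of being looked
up in a separate store.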

###Configuration details of MV

- Peers that host a brick for MV would have a boolean option in glusterd.vol,
  e.g. something like,
        option mv_host on

- The brick path for MV would have a default from the packaged glusterd.vol,
  e.g.,
        option mv_brick /mv/brick

- Replica count. This could be stored in glusterd.vol too, e.g.,
        option mv_replica 3

- The ports for MV bricks could be reserved by glusterd's port mapper.  For
  example, 49152 could be reserved for the MV brick on each node, given that we
  would have only one MV brick per peer.

- Options to be set on the volume: client-quorum and, optionally, proactive
  self-heal.

- MV would benefit from client-side quorum, server-side quorum and other
  options. These could be preset (packaged) as part of glusterd.vol too.

- With the brick path, ports and volume options present in glusterd.vol or preset,
  we can build the in-memory volume info representation on initialization of
  glusterd.  This means we can generate MV's volume file dynamically on each MV
  hosting peer when needed and store it in a 'known' location on local disk.
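
As a rough illustration of the last point, the Python sketch below reads the
proposed options out of a glusterd.vol-style text and builds a small in-memory
description of MV. The `volume management` / `type mgmt/glusterd` wrapper follows
the existing glusterd.vol layout; the `mv_*` options are the ones proposed above
and don't exist yet. The function names, the dictionary keys and the fallback
brick path are assumptions made for this example.

```python
SAMPLE_GLUSTERD_VOL = """\
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option mv_host on
    option mv_brick /mv/brick
    option mv_replica 3 #as the case may be
end-volume
"""

MV_BRICK_PORT = 49152  # example port from the proposal, reserved per node


def parse_options(text: str) -> dict:
    """Collect `option <key> <value>` lines into a dict, ignoring # comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "option":
            options[parts[1]] = parts[2].strip()
    return options


def build_mv_volinfo(options: dict) -> dict:
    """Build a hypothetical in-memory description of MV from parsed options."""
    return {
        "hosts_brick": options.get("mv_host", "off") == "on",
        "brick_path": options.get("mv_brick", "/mv/brick"),  # assumed default
        "replica": int(options["mv_replica"]),
        "brick_port": MV_BRICK_PORT,
    }


print(build_mv_volinfo(parse_options(SAMPLE_GLUSTERD_VOL)))
```

Everything needed here comes from local, packaged configuration, which is what
lets each MV hosting peer regenerate the volume file without asking anyone else.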

###Changes in glusterd command execution

Currently, each peer modifies its own configuration in /var/lib/glusterd in the
commit phase of every command execution. With the introduction of MV, only the
peer on which the command is executed will modify the configuration in
/var/lib/glusterd, and it will do so after the commit phase has completed on the
remaining available peers. Note that the other nodes don't perform any updates
to MV.
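
A minimal Python sketch of that ordering follows. It only models who writes to MV
and when; what the commit phase actually does on the other peers isn't detailed
here, so it is represented by a placeholder. All names are hypothetical.

```python
def execute_command(originator, available_peers, change, mv_config):
    """Run the commit phase on the other peers, then let the originator update MV.

    `mv_config` stands in for the configuration stored on the mounted MV.
    """
    committed_on = []
    for peer in available_peers:
        if peer == originator:
            continue
        # Placeholder for the commit phase on a remote peer; it never touches MV.
        committed_on.append(peer)
    # Only the originator writes the new configuration, and it does so once.
    mv_config.update(change)
    return committed_on


mv = {}
done = execute_command("node1", ["node1", "node2", "node3"], {"vol1": "created"}, mv)
print(done, mv)
```

The point of the sketch is that there is a single writer per command, so the
replicated volume, rather than glusterd, is responsible for getting the change
onto every copy.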

###How to replace a 'dead' server/peer?

At the moment, I haven't thought of an automatic (or near semi-automatic) way
of replacing a 'dead' peer. The manual steps should be as follows:

- If the 'dead' peer doesn't host an MV brick, the procedure is the same as in
  previous versions; this proposal doesn't change anything there. Otherwise,
  follow the steps below.

- Provision a new server. Install glusterfs packages.

- Modify the glusterd.vol to have
        option mv_host on
        option mv_replica 3 #as the case may be

- Probe the peer to the cluster. On initialization, glusterd on the new peer
  would put its own brick in place of the dead peer's brick in MV, and
  replication's self-heal should then copy the configuration onto it.

N.B. This procedure assumes default MV config parameters. For a non-default configuration,
the brick path should also be updated in glusterd.vol on the new peer.

###How to upgrade from current version?

The steps would be as follows:

- Stop all gluster{d,fs,fsd} processes by stopping the corresponding services.

- Upgrade to this version of glusterfs packages.

- Choose at most 3 servers/peers to build MV. In these nodes, create the
  default brick directories; modify the (new) glusterd.vol to have
        option mv_host on
  Set the replica count in each peer's glusterd.vol
        option mv_replica 3 #say

- Move the contents of /var/lib/glusterd on each peer to a temporary directory,
  say /var/lib/glusterd.bkp

- Start the glusterd service on one of the nodes, in 'upgrade' mode. In this mode,
  glusterd would start the MV brick and mount MV on /var/lib/glusterd. It will
  not serve CLI or mount requests.

- Copy the contents of /var/lib/glusterd.bkp on to (the mounted)
  /var/lib/glusterd.

- Repeat this on all nodes in the cluster.

- Stop glusterd on all nodes. Start glusterd service on all nodes (in 'normal'
  mode).

- Now the storage cluster should be ready for improved operations.
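
The two copy steps for a single node could look like the Python sketch below. It
is only an illustration on ordinary directories: it does not stop services, start
glusterd in 'upgrade' mode, or mount MV, and the function names are hypothetical.

```python
import shutil
from pathlib import Path


def stash_config(workdir: Path, backup: Path) -> None:
    """Move everything under workdir (e.g. /var/lib/glusterd) into backup."""
    backup.mkdir(parents=True, exist_ok=True)
    for entry in list(workdir.iterdir()):
        shutil.move(str(entry), str(backup / entry.name))


def restore_config(backup: Path, mounted: Path) -> None:
    """Copy the stashed configuration onto the (now mounted) MV directory."""
    # dirs_exist_ok requires Python 3.8 or later.
    shutil.copytree(backup, mounted, dirs_exist_ok=True)
```

In the procedure above, `stash_config` would run before glusterd is started in
'upgrade' mode and `restore_config` after MV has been mounted on
/var/lib/glusterd.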

###How to upgrade from this version to future versions?

This is trickier than it should be, given that we are holding MV's configuration
in glusterd.vol, which is packaged. I would like to hear suggestions from the
community on this.

###References
[1] - http://www.gluster.org/community/documentation/index.php/Features/thousand-node-glusterd.

[2] - This approach was initially recommended by Jeff Darcy, who is also the
author of [1].

[3] - It shouldn't be hard to allow expanding MV beyond 3 bricks, but most distributed configuration
      stores recommend 3- or 5-way replication. At the least, this could be made configurable
      via glusterd.vol.