[Gluster-devel] Glusterd 'Management Volume' proposal
Jeff Darcy
jdarcy at redhat.com
Mon Nov 24 14:29:58 UTC 2014
> We have been thinking of many approaches to address some of Glusterd's
> correctness (during failures and at scale) and scalability concerns. A
> recent email thread on Glusterd-2.0 was along these lines. While that
> discussion is still valid, we have been considering dogfooding as a
> viable option to solve our problems. This is not the first time this
> has been mentioned, but for various reasons it never really took off.
> The following proposal addresses Glusterd's requirement for a
> distributed (consistent) store by using a GlusterFS volume. Then who
> manages that GlusterFS volume? To find answers to that and more, read on.
The main issue I have with this, and why I didn't suggest it myself, is
that it creates a bit of a "chicken and egg" problem. Any kind of
server-side replication, such as NSR, depends on this subsystem to elect
leaders and store its own metadata. How will these things be done if we
create a dependency in the other direction? Even AFR depends on this
subsystem to manage its self-heal daemons, so it's not immune either.
Note also that the brick daemons for the MV (management volume) won't
be able to rely on glusterd the way current brick daemons do. I think
breaking the dependency
cycle is very likely to involve the creation of a dependency-free
component exactly like what the MV is supposed to avoid.
To be sure, maintaining external daemons such as etcd or consul creates
its own problems. I think the ideal might be to embed a consensus
protocol implementation (Paxos, Raft, or Viewstamped Replication)
directly into glusterd, so it's guaranteed to start up and die exactly
when the glusterd daemons do and to be subject to the same permission
and resource limits. I'm not sure that would even be more work than
managing either an external daemon or a management volume (with its
own daemons).
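For concreteness, here is a rough sketch of what such an embedded
interface might look like. The names and signatures are made up for
illustration, not taken from any existing library:

    /* Hypothetical interface for a consensus module embedded in the
     * glusterd process -- names and signatures are illustrative only.
     * The point is that election state and the replicated log live
     * inside glusterd itself, so they start and die with it and share
     * its permission and resource limits. */

    #include <stddef.h>

    typedef enum { ROLE_FOLLOWER, ROLE_CANDIDATE, ROLE_LEADER } consensus_role_t;

    typedef struct consensus consensus_t;   /* opaque module handle */

    typedef struct {
        /* Invoked once a quorum of peers has durably accepted an entry. */
        void (*on_commit)(void *opaque, const void *entry, size_t len);
        /* Invoked when this node wins or loses a leader election. */
        void (*on_role_change)(void *opaque, consensus_role_t role);
    } consensus_ops_t;

    /* Start the module in-process with a static list of peer addresses. */
    consensus_t *consensus_start(const char **peers, int npeers,
                                 const consensus_ops_t *ops, void *opaque);

    /* Propose an entry (e.g. a serialized config change).  Only the
     * current leader may propose; returns 0 if accepted for replication. */
    int consensus_propose(consensus_t *c, const void *entry, size_t len);

    void consensus_stop(consensus_t *c);

Whether the protocol underneath is Paxos, Raft, or VR matters less than
the fact that nothing outside the glusterd process would need to be
deployed or supervised.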
> - MV would benefit from client-side quorum, server-side quorum and
> other options. These could be preset (packaged) as part of
> glusterd.vol too.
Server-side quorum will probably run into the same circular-dependency
problem mentioned above.
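For reference, the existing per-volume quorum knobs the proposal is
talking about look like this today (mgmt-vol is just a placeholder
name for the MV):

    # Client-side quorum: a client refuses writes unless it can reach
    # a majority of the replicas.
    gluster volume set mgmt-vol cluster.quorum-type auto

    # Server-side quorum: note that enforcement (killing bricks on the
    # partitioned side) is done by glusterd itself, which is exactly
    # where the circular dependency shows up for an MV.
    gluster volume set mgmt-vol cluster.server-quorum-type server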
> ### Changes in glusterd command execution
>
> Each peer modifies its configuration in /var/lib/glusterd in the
> commit phase of every command execution. With the introduction of the
> MV, the peer on which the command is executed will perform the
> modifications to the configuration in /var/lib/glusterd after the
> commit phase completes on the remaining available peers. Note that
> the other nodes don't perform any updates to the MV.
We'll probably need to design some sort of multi-file locking protocol
on top of the POSIX single-file semantics. That's OK, because pretty
much any other alternative will require something similar even if the
data is stored in keys instead of files.
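As a rough illustration of what that could look like, assuming the
config files live under a made-up /mv/config path inside the mounted
MV: take the per-file POSIX locks in a canonical (sorted) order, so
that two transactions touching overlapping sets of files can't
deadlock:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Take an exclusive whole-file POSIX lock, blocking until granted.
     * Returns the fd; closing it releases the lock. */
    static int lock_file(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return -1;

        struct flock fl;
        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;     /* exclusive */
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;            /* length 0 = whole file */

        if (fcntl(fd, F_SETLKW, &fl) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    static int cmp_paths(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    /* Lock every file a transaction touches, in sorted order so that
     * concurrent transactions always acquire locks in the same order. */
    int lock_all(const char *paths[], int n, int fds[])
    {
        qsort(paths, n, sizeof(paths[0]), cmp_paths);
        for (int i = 0; i < n; i++) {
            fds[i] = lock_file(paths[i]);
            if (fds[i] < 0) {
                while (--i >= 0)
                    close(fds[i]);   /* unwind already-held locks */
                return -1;
            }
        }
        return 0;
    }

One caveat worth remembering: with traditional POSIX locks, closing any
fd on a file drops all of the process's locks on it, so the transaction
code has to be careful not to reopen files it has already locked.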
Also, how does notification of change happen? "Watch" functionality is
standard across etcd, consul, and ZooKeeper, and it could be extremely
handy for getting away from glusterd's ad-hoc state machine for
managing notification phases, but the only way the MV could support
this would be to add inotify support (or something like it) to
GlusterFS.
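To make the gap concrete, this is roughly what a watcher would look
like with plain Linux inotify against a made-up /mv/config directory
in the mounted MV. The catch is that inotify only reports changes made
through the local kernel, so a FUSE client would never see writes that
happened on another node; that's the piece GlusterFS would have to add:

    #include <stdio.h>
    #include <sys/inotify.h>
    #include <unistd.h>

    int main(void)
    {
        /* Event buffer, aligned for struct inotify_event as the
         * inotify(7) man page recommends. */
        char buf[4096]
            __attribute__((aligned(__alignof__(struct inotify_event))));

        int fd = inotify_init();
        if (fd < 0) {
            perror("inotify_init");
            return 1;
        }

        /* Fire on files being written out, created, or renamed in. */
        if (inotify_add_watch(fd, "/mv/config",
                              IN_CLOSE_WRITE | IN_CREATE | IN_MOVED_TO) < 0) {
            perror("inotify_add_watch");
            return 1;
        }

        for (;;) {
            ssize_t len = read(fd, buf, sizeof(buf));
            if (len <= 0)
                break;
            /* A single read() may return several packed events. */
            for (char *p = buf; p < buf + len; ) {
                const struct inotify_event *ev =
                    (const struct inotify_event *)p;
                if (ev->len)
                    printf("changed: %s\n", ev->name);
                p += sizeof(*ev) + ev->len;
            }
        }
        close(fd);
        return 0;
    }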