[Gluster-devel] Glusterd 'Management Volume' proposal

Krishnan Parthasarathi kparthas at redhat.com
Tue Nov 25 04:56:38 UTC 2014


> The main issue I have with this, and why I didn't suggest it myself, is
> that it creates a bit of a "chicken and egg" problem.  Any kind of
> server-side replication, such as NSR, depends on this subsystem to elect
> leaders and store its own metadata.  How will these things be done if we
> create a dependency in the other direction?  Even AFR has a dependency
> to manage self-heal daemons, so it's not immune either.  Note also that
> the brick daemons for the MV won't be able to rely on glusterd the same
> way that current brick daemons do.  I think breaking the dependency
> cycle is very likely to involve the creation of a dependency-free
> component exactly like what the MV is supposed to avoid.

We are aware that the MV comes with a "chicken and egg" problem.
The idea was to see if we can resolve the circular dependency by
managing the MV entirely independently of the code that manages
regular volumes. This may mean that only a reduced set of features is
applicable to the MV. For instance, the MV's self-healing would only be
reactive and wouldn't be managed by the self-heal daemon. The MV is not
expected to hold large files or a large number of files in comparison
with regular volumes, which makes disabling "proactive self-healing"
less problematic.
We haven't thought through how NSR would work with this reduced feature
set when it is used as the MV's replication technology. This is
definitely a blocker unless we can work around it somehow.

> 
> To be sure, maintaining external daemons such as etcd or consul creates
> its own problems.  I think the ideal might be to embed a consensus
> protocol implementation (Paxos, Raft, or Viewstamped Replication)
> directly into glusterd, so it's guaranteed to start up and die exactly
> when those daemons do and be subject to the same permission or resource
> limits.  I'm not sure it's even more work than managing either an
> external daemon or a management volume (with its own daemons).

Implementing Raft (or any other consensus algorithm) may be enough to
solve the store consistency problem, but if we want fetching data from
the distributed store (among the servers) to scale, we would also need
'watch' functionality. IMO, it is easier to implement a key-value
interface for the store than a POSIX-y interface. The end solution
would be dangerously (sic) similar to consul/etcd/ZK. This brings us to
whether we piggyback on reasonably mature solutions that provide the
same functionality or build one ourselves. I am torn between the two.
Which approach would be practical? Thoughts? The MV was born while we
were exploring a middle ground.
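
To make the comparison concrete, below is a rough sketch of the kind of
key-value interface I have in mind for the store. The names (kv_put,
kv_watch, etc.) are illustrative only, not an existing API; the point
is that 'watch' becomes a first-class operation instead of something
bolted onto a POSIX namespace:

#include <stddef.h>
#include <stdint.h>

/* Opaque handle; could be backed by an embedded Raft log, the MV, or
 * an external store such as consul/etcd. */
struct kv_store;

typedef void (*kv_watch_cb_t) (const char *key, const char *value,
                               uint64_t version, void *data);

/* Basic operations any of the candidate backends can support. */
int kv_put    (struct kv_store *store, const char *key,
               const char *value, size_t len);
int kv_get    (struct kv_store *store, const char *key,
               char **value, size_t *len, uint64_t *version);
int kv_delete (struct kv_store *store, const char *key);

/* The piece that makes change propagation scale: callers register
 * interest in a key prefix and are called back whenever its version
 * moves past 'since_version', instead of polling the whole store. */
int kv_watch  (struct kv_store *store, const char *key_prefix,
               uint64_t since_version, kv_watch_cb_t cb, void *data);

Whichever backend we pick, keeping glusterd's callers behind an
interface like this would let us switch between building it ourselves
and piggybacking on consul/etcd later.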

> 
> > - MV would benefit from client-side quorum, server-side quorum and
> >   other options. These could be preset (packaged) as part of
> >   glusterd.vol too.
> 
> Server-side quorum will probably run into the same circular dependency
> problem as mentioned above.
> 
> > ###Changes in glusterd command execution
> >
> > Each peer modifies its configuration in /var/lib/glusterd in the
> > commit phase of every command execution. With the introduction of MV,
> > the peer in which the command is executed will perform the
> > modifications to the configuration in /var/lib/glusterd after commit
> > phase on the remaining available peers.  Note, the other nodes don't
> > perform any updates to MV.
> 
> We'll probably need to design some sort of multi-file locking protocol
> on top of the POSIX single-file semantics.  That's OK, because pretty
> much any other alternative will require something similar even if data
> is in keys instead of files.
> 
> Also, how does notification of change happen?  "Watch" functionality is
> standard across things like etcd/consul/ZK, and could be extremely handy
> to get away from relying on glusterd's ad-hoc state machine to manage
> notification phases, but the only way the MV could support this would be
> to add inotify support (or something like it).
> 

Consul's docs[1] describe watches as the agent performing a blocking
HTTP API query on a given key (or key prefix, or other special nodes in
its namespace). We could build similar notification functionality on
top of the MV's shared namespace with:
- a 'server' component that periodically looks for changes in the
  requested set of keys, and
- a 'client' component (the caller) that makes a blocking (or timed)
  request, in a separate thread if required.
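
A rough sketch of that split (not glusterd code; the names and the
mtime-based change detection are placeholders I made up for
illustration) could look like this, with a poller thread standing in
for the 'server' component and a condition-variable wait standing in
for the blocking 'client' request:

#include <limits.h>
#include <pthread.h>
#include <sys/stat.h>
#include <unistd.h>

struct mv_watch {
        char            path[PATH_MAX]; /* key == a file under the MV mount */
        time_t          mtime;          /* last observed modification time */
        unsigned long   gen;            /* bumped on every detected change */
        int             stop;
        pthread_mutex_t lock;
        pthread_cond_t  changed;
};

/* 'server' component: poll the key periodically and wake up blocked
 * waiters whenever the file's mtime moves. A real implementation would
 * watch a set of keys and carry a version/index instead of an mtime. */
static void *
mv_watch_poller (void *arg)
{
        struct mv_watch *w = arg;
        struct stat      st;

        for (;;) {
                pthread_mutex_lock (&w->lock);
                if (w->stop) {
                        pthread_mutex_unlock (&w->lock);
                        return NULL;
                }
                if (stat (w->path, &st) == 0 && st.st_mtime != w->mtime) {
                        w->mtime = st.st_mtime;
                        w->gen++;
                        pthread_cond_broadcast (&w->changed);
                }
                pthread_mutex_unlock (&w->lock);
                sleep (1);              /* poll interval */
        }
}

/* 'client' component (caller): block until the key changes at least
 * once after this call. A timed wait would use pthread_cond_timedwait(). */
static void
mv_watch_wait (struct mv_watch *w)
{
        unsigned long seen;

        pthread_mutex_lock (&w->lock);
        seen = w->gen;
        while (w->gen == seen)
                pthread_cond_wait (&w->changed, &w->lock);
        pthread_mutex_unlock (&w->lock);
}

The main cost relative to consul/etcd is that the 'server' side is
still polling underneath; getting rid of that would need inotify
support (or something like it), as you point out above.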

[1] - https://consul.io/docs/agent/watches.html

