[Gluster-devel] Managing etcd (4.0)

Jeff Darcy jdarcy at redhat.com
Wed Sep 9 20:12:43 UTC 2015


Better get comfortable, everyone, because I might ramble on for a bit.

Over the last few days, I've been looking into how to manage our own instances of etcd (or something similar) as part of our 4.0 configuration store.  This is highly relevant for GlusterD 2.0, which would be both a consumer of the service and (possibly) a manager for the daemons that provide it.  It's also relevant for NSR, which needs a similar highly available, highly consistent store for information about terms.  Just about any other component could take advantage of such a facility if it were available (DHT 2.0 could use it for layout information, for example), and I encourage anyone working on 4.0 to think about how it might make other components simpler.  (BTW, Shyam, that's just a hypothetical example.  Don't take it any more seriously than you want to.)

This is not the first time I've looked into this.  During the previous round of NSR development, I implemented some code to manage etcd daemons from within GlusterD:

    http://review.gluster.org/#/c/8887/

That code's junk.  We shouldn't use anything more than small pieces of it.  Among other problems, it nukes the etcd information when a new node joins.  That was fine for what we were doing with NSR at the time, but clearly can't work in real life.  I've also been looking at the new-ish etcd interfaces for cluster management:

    https://github.com/coreos/etcd/blob/master/Documentation/other_apis.md

I'm pretty sure these didn't exist when I was last looking at this stuff, but I could be wrong.  In any case, they look pretty nice.  Much like our own "probe" mechanism, it looks like we can start a single-node cluster and then add others into that cluster by talking to one of the current members.  In fact, that similarity suggests how we might manage our instances of etcd.
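
To make those interfaces a bit more concrete, here's a minimal sketch of the member-list and member-add calls that doc describes.  Nothing in it is GlusterD-specific: it's plain Go against made-up addresses, and only the /v2/members endpoint itself comes from the etcd documentation.

    package main

    import (
        "bytes"
        "fmt"
        "io/ioutil"
        "net/http"
    )

    func main() {
        existing := "http://x.example:2379" // client URL of any current member (made up)
        newPeer := "http://y.example:2380"  // peer URL of the node we want to add (made up)

        // List the current membership: GET /v2/members
        if resp, err := http.Get(existing + "/v2/members"); err == nil {
            b, _ := ioutil.ReadAll(resp.Body)
            resp.Body.Close()
            fmt.Println(string(b))
        }

        // Ask an existing member to admit the new peer: POST /v2/members
        body := bytes.NewBufferString(fmt.Sprintf(`{"peerURLs":[%q]}`, newPeer))
        resp, err := http.Post(existing+"/v2/members", "application/json", body)
        if err != nil {
            fmt.Println(err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("member add:", resp.Status) // expect "201 Created"
    }

The important detail is that the add goes through a node that's already a member; the new node is then started with --initial-cluster-state=existing so it syncs from the cluster instead of bootstrapping its own data.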

(1) Each GlusterD *initially* starts its own private instance of etcd.

(2) When we probe from a node X to a node Y, the probe message includes information about X's etcd server(s).

(3) Upon receipt of a probe, Y can (depending on a flag) either *use* X's etcd cluster or *join* it.  Either way, it has to shut down its own one-node cluster.  In the JOIN case, this implies that X will send the appropriate etcd member-add command to its local instance (whence it will be propagated to the others); see the sketch after this list.

(4) Therefore, the CLI/REST interfaces to initiate a probe need an option to control this join/use flag.  Default should be JOIN for small clusters, where it's not a problem for all nodes to be etcd servers as well.

(5) For larger clusters, the administrator might start to specify USE instead of JOIN after a while.  There might also need to be separate CLI/REST interfaces to toggle this state without any probe involved.

(6) For detach/deprobe, we simply undo the things we did in (3).
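
To put (3) and (6) together, here's a rough sketch of the JOIN/USE/detach lifecycle on Y.  Everything GlusterD-specific in it is invented for illustration (the function names, the shelling out to start and stop the daemon, the addresses and IDs); only the /v2/members/<id> endpoint and the etcd flags come from etcd's documentation, and the member-add itself is the same call shown in the earlier sketch.

    package main

    import (
        "fmt"
        "net/http"
        "os/exec"
    )

    // join: Y shuts down its private one-node instance, the member-add from
    // the earlier sketch is issued against the existing cluster (by X per
    // step 3), and then Y restarts etcd told to sync from that cluster
    // instead of bootstrapping fresh data.  initialCluster is the usual
    // "name=peerURL,..." string, existing members plus ourselves.
    func join(myName, myPeerURL, initialCluster string) error {
        _ = exec.Command("pkill", "-x", "etcd").Run() // placeholder shutdown
        return exec.Command("etcd",
            "--name", myName,
            "--initial-cluster", initialCluster,
            "--initial-cluster-state", "existing",
            "--initial-advertise-peer-urls", myPeerURL,
            "--listen-peer-urls", myPeerURL,
        ).Start()
    }

    // use: Y keeps no etcd server at all; it just records the cluster's
    // client URLs and talks to it as a pure client from then on.
    func use(clientURLs []string) {
        fmt.Println("acting as an etcd client of", clientURLs)
    }

    // leave (step 6, for a node that JOINed): remove our member ID, found
    // by listing members as in the earlier sketch, then stop the local daemon.
    func leave(anyClientURL, myMemberID string) error {
        req, err := http.NewRequest("DELETE", anyClientURL+"/v2/members/"+myMemberID, nil)
        if err != nil {
            return err
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        resp.Body.Close()
        return exec.Command("pkill", "-x", "etcd").Run() // placeholder shutdown
    }

    func main() {
        const joinFlag = true // the flag from (4); JOIN by default
        if joinFlag {
            _ = join("y", "http://y.example:2380",
                "x=http://x.example:2380,y=http://y.example:2380")
        } else {
            use([]string{"http://x.example:2379"})
        }
        // Later, on detach:
        _ = leave("http://x.example:2379", "272e204152")
    }

The one real constraint worth calling out is ordering: the member-add has to happen before the new etcd is started with --initial-cluster-state=existing, and on detach the member-remove should happen while the rest of the cluster still has quorum.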

With all of this in place, probes would become one-time exchanges.  There's no need for GlusterD daemons to keep probing each other when they can just "check in" with etcd (which is doing something very similar internally).  Instead of constantly sending its own probe/heartbeat messages and keeping track of which other nodes' messages have been missed, each GlusterD would simply use its node UUID to create a time-limited key in etcd, and issue watches on other nodes' keys.  This is not quite as convenient as ZooKeeper's ephemerals, but it's still a lot better than what we're doing now.
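
For what it's worth, here's roughly what that would look like against the v2 keys API.  Another illustrative sketch: the key prefix, the 10-second TTL, and the UUIDs are all invented; only the ttl parameter and the ?wait=true long-poll come from the etcd documentation.

    package main

    import (
        "fmt"
        "net/http"
        "net/url"
        "strings"
        "time"
    )

    // Made-up key prefix for illustration.
    const base = "http://localhost:2379/v2/keys/gluster/peers/"

    // heartbeat keeps re-creating a key named after our UUID with a short
    // TTL; if this GlusterD dies, the key quietly expires.
    func heartbeat(myUUID string) {
        for {
            form := url.Values{"value": {"alive"}, "ttl": {"10"}}
            req, _ := http.NewRequest("PUT", base+myUUID, strings.NewReader(form.Encode()))
            req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
            if resp, err := http.DefaultClient.Do(req); err != nil {
                fmt.Println("heartbeat failed:", err)
            } else {
                resp.Body.Close()
            }
            time.Sleep(5 * time.Second) // refresh at half the TTL
        }
    }

    // watchPeer long-polls another node's key; the poll returns whenever
    // that key changes, including when it expires because its owner died.
    func watchPeer(peerUUID string) {
        for {
            resp, err := http.Get(base + peerUUID + "?wait=true")
            if err != nil {
                time.Sleep(time.Second)
                continue
            }
            resp.Body.Close()
            fmt.Println("state change for peer", peerUUID)
        }
    }

    func main() {
        go heartbeat("11111111-1111-1111-1111-111111111111")
        watchPeer("22222222-2222-2222-2222-222222222222")
    }

The difference from a true ephemeral is that we have to keep refreshing the key ourselves, but the failure-detection effect is the same: when a node goes away, its key expires and everyone watching it hears about it.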

I'd be tempted to implement this myself, but for now it's probably more important to work on NSR itself, and for that I can just use an external etcd cluster.  Maybe later in the 4.0 integration phase, if nobody else has beaten me to it, I'll take a swing at it.  Until then, does anyone else have any thoughts on the proposal?

