[Gluster-devel] Glusterd: A New Hope

Fri Mar 22 14:09:44 UTC 2013

During the Bangalore "architects' summit" a couple of weeks ago, there
was a discussion about making most functions of glusterd into Somebody
Else's Problem.  Examples include cluster membership, storage of volume
configuration, and responding to changes in volume configuration.  For
those who haven't looked at it, glusterd is a bit of a maintenance and
scalability problem with three kinds of RPC (client to glusterd,
glusterd to glusterd, glusterd to glusterfsd) and its own ad-hoc
transaction engine etc.  The need for some change here is keenly felt
right now as we struggle to fix all of the race conditions that have
resulted from the hasty addition of synctasks to make up for poor
performance elsewhere in that 44K lines of C.  Delegating as much as
possible of this functionality to mature code that is mostly maintained
elsewhere would be very beneficial.  I've done some research since those
meetings, and here are some results.

The most basic idea here is to use an existing coordination service to
store cluster configuration and state.  That service would then take
responsibility for maintaining availability and consistency of the data
under its care.  The best known example of such a coordination service
is Apache's ZooKeeper[1], but there are others that don't have the
noxious Java dependency - e.g. doozer[2] written in Go, Arakoon[3]
written in OCaml, ConCoord[4] written in Python.  These all provide a
tightly consistent generally-hierarchical namespace for relatively small
amounts of data.  In addition, there are two other features that might
be useful.

* Watches: register for notification of changes to an object (or
directory/container), without having to poll.

* Ephemerals: certain objects go away when the client that created them
drops its connection to the server(s).

Here's a rough sketch of how we'd use such a service.

* Membership: a certain small set of servers (three or more) would be
manually set up as coordination-service masters, e.g. via "peer probe
xxx as master").  Other servers would connect to these masters, which
would use ephemerals to update a "cluster map" object.  Both clients and
servers could set up watches on the cluster map object to be notified of
servers joining and leaving.

* Configuration: the information we currently store in each volume's
"info" file as the basis for generating volfiles (and perhaps the
volfiles themselves) would be stored in the configuration service.
Again, servers and clients could set watches on these objects to be
notified of changes and do the appropriate graph switches, reconfigures,
quorum actions, etc.

* Maintenance operations: these would still run in glusterd (which isn't
going away).  They would use the coordination for leader election to
make sure the same activity isn't started twice, and to keep status
updated in a way that allows other nodes to watch for changes.

* Status queries: these would be handled entirely by querying objects
within the coordination service.

Of the alternatives available to us, only ZooKeeper directly supports
all of the functionality we'd want.  However, the Java dependency is
decidedly unpleasant for us and would be totally unacceptable to some of
our users.  Doozer seems the closest of the remainder; it supports
watches but not ephemerals, so we'd either have to synthesize those on
top of doozer itself or find another way to handle membership (the only
place where we use that functionality) based on the features it does
have.  The project also seems reasonably mature and active, though we'd
probably still have to devote some time to developing our own local
doozer expertise.

In a similar vein, another possibility would be to use *ourselves* as
the coordination service, via a hand-configured AFR volume.  This is
actually an approach Kaleb and I were seriously considering for HekaFS
at the time of the acquisition, and it's not without its benefits.
Using libgfapi we can prevent this special volume from having to be
mounted, and we already know how to secure the communications paths for
it (something that would require additional work with the other
solutions).  On the other hand, it would probably require additional
translators to provide both ephemerals and watches, and might require
its own non-glusterd solution to issues like failure detection and
self-heal, so it doesn't exactly meet the "make it somebody else's
problem" criterion.

In conclusion, I think our best (long term) way forward would be to
prototype a doozer-based version of glusterd.  I could possibly be
persuaded to try a "gluster on gluster" approach instead, but at this
moment it wouldn't be my first choice.  Are there any other suggestions
or objections before I forge ahead?

[1] http://zookeeper.apache.org/
[2] https://github.com/ha/doozerd
[3] http://arakoon.org/
[4] http://openreplica.org/doc/