[Gluster-devel] Glusterd: A New Hope
J. Bruce Fields
bfields at fieldses.org
Fri Mar 22 20:33:55 UTC 2013
On Fri, Mar 22, 2013 at 10:09:44AM -0400, Jeff Darcy wrote:
> During the Bangalore "architects' summit" a couple of weeks ago, there
> was a discussion about making most functions of glusterd into Somebody
> Else's Problem. Examples include cluster membership, storage of volume
> configuration, and responding to changes in volume configuration.
Have you looked at what GFS2 does for comparison?
> For those who haven't looked at it, glusterd is a bit of a maintenance and
> scalability problem with three kinds of RPC (client to glusterd,
> glusterd to glusterd, glusterd to glusterfsd) and its own ad-hoc
> transaction engine etc. The need for some change here is keenly felt
> right now as we struggle to fix all of the race conditions that have
> resulted from the hasty addition of synctasks to make up for poor
> performance elsewhere in that 44K lines of C. Delegating as much as
> possible of this functionality to mature code that is mostly maintained
> elsewhere would be very beneficial. I've done some research since those
> meetings, and here are some results.
> The most basic idea here is to use an existing coordination service to
> store cluster configuration and state. That service would then take
> responsibility for maintaining availability and consistency of the data
> under its care. The best known example of such a coordination service
> is Apache's ZooKeeper, but there are others that don't have the
> noxious Java dependency - e.g. doozer written in Go, Arakoon
> written in OCaml, ConCoord written in Python. These all provide a
> tightly consistent generally-hierarchical namespace for relatively small
> amounts of data. In addition, there are two other features that might
> be useful.
> * Watches: register for notification of changes to an object (or
> directory/container), without having to poll.
> * Ephemerals: certain objects go away when the client that created them
> drops its connection to the server(s).
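To make the two features concrete, here is a minimal in-memory sketch of watches and ephemerals. It is a toy model only: the class `ToyCoordinationStore`, its methods, and the path `/members/server1` are hypothetical names made up for illustration, not the API of ZooKeeper, doozer, or any other real service, and it ignores replication and consistency entirely.

```python
from collections import defaultdict


class ToyCoordinationStore:
    """In-memory stand-in for a coordination service (hypothetical API)."""

    def __init__(self):
        self.data = {}                     # path -> value
        self.owners = {}                   # ephemeral path -> owning session
        self.watchers = defaultdict(list)  # path -> callbacks

    def watch(self, path, callback):
        # Watch: the caller is told about changes instead of polling.
        self.watchers[path].append(callback)

    def _notify(self, path, value):
        for callback in list(self.watchers[path]):
            callback(path, value)

    def set(self, path, value, session=None):
        # Passing a session makes the object ephemeral, tied to that session.
        self.data[path] = value
        if session is not None:
            self.owners[path] = session
        else:
            self.owners.pop(path, None)
        self._notify(path, value)

    def close_session(self, session):
        # Ephemeral: objects vanish when the creating session goes away.
        for path in [p for p, s in self.owners.items() if s == session]:
            del self.owners[path]
            del self.data[path]
            self._notify(path, None)  # None signals deletion to watchers


events = []
store = ToyCoordinationStore()
store.watch("/members/server1", lambda path, value: events.append((path, value)))
store.set("/members/server1", "up", session="s1")
store.close_session("s1")  # simulates the server dropping its connection
```

After the last line, `events` holds one entry for the creation and one (with value `None`) for the automatic removal, which is the signal a membership watcher would act on.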
> Here's a rough sketch of how we'd use such a service.
> * Membership: a certain small set of servers (three or more) would be
> manually set up as coordination-service masters (e.g. via "peer probe
> xxx as master"). Other servers would connect to these masters, which
> would use ephemerals to update a "cluster map" object. Both clients and
> servers could set up watches on the cluster map object to be notified of
> servers joining and leaving.
> * Configuration: the information we currently store in each volume's
> "info" file as the basis for generating volfiles (and perhaps the
> volfiles themselves) would be stored in the configuration service.
> Again, servers and clients could set watches on these objects to be
> notified of changes and do the appropriate graph switches, reconfigures,
> quorum actions, etc.
> * Maintenance operations: these would still run in glusterd (which isn't
> going away). They would use the coordination service for leader election to
> make sure the same activity isn't started twice, and to keep status
> updated in a way that allows other nodes to watch for changes.
> * Status queries: these would be handled entirely by querying objects
> within the coordination service.
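The leader-election point in the sketch above can be illustrated with a create-if-absent operation on an ephemeral object: whoever creates the object first runs the maintenance task, and the slot frees itself if that server disappears. This is a sketch under assumptions, not glusterd code; `ToyStore`, `try_lead`, the `/leaders/` prefix, and the task name are all hypothetical, and a real service would make the create atomic across its replicas.

```python
class ToyStore:
    """Single-process stand-in for a coordination service (hypothetical API)."""

    def __init__(self):
        self.nodes = {}  # path -> (value, owning session)

    def create_ephemeral(self, path, value, session):
        """Create path only if it does not exist; return True on success."""
        if path in self.nodes:
            return False
        self.nodes[path] = (value, session)
        return True

    def get(self, path):
        entry = self.nodes.get(path)
        return entry[0] if entry else None

    def close_session(self, session):
        # Drop every ephemeral object owned by the departed session.
        self.nodes = {p: e for p, e in self.nodes.items() if e[1] != session}


def try_lead(store, task, server):
    """Return True if this server won the right to run the given task."""
    return store.create_ephemeral("/leaders/" + task, server, session=server)


store = ToyStore()
first = try_lead(store, "rebalance-vol0", "server1")   # wins the election
second = try_lead(store, "rebalance-vol0", "server2")  # loses, task already owned
store.close_session("server1")                         # leader goes away
third = try_lead(store, "rebalance-vol0", "server2")   # slot is free again
leader = store.get("/leaders/rebalance-vol0")
```

A status query in this model is just a `get` on the same object, and other nodes could learn of a leader change by watching it rather than polling.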
> Of the alternatives available to us, only ZooKeeper directly supports
> all of the functionality we'd want. However, the Java dependency is
> decidedly unpleasant for us and would be totally unacceptable to some of
> our users. Doozer seems the closest of the remainder; it supports
> watches but not ephemerals, so we'd either have to synthesize those on
> top of doozer itself or find another way to handle membership (the only
> place where we use that functionality) based on the features it does
> have. The project also seems reasonably mature and active, though we'd
> probably still have to devote some time to developing our own local
> doozer expertise.
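One common way to synthesize ephemerals on a store that lacks them is a lease: each server periodically refreshes a timestamp, and entries that are not refreshed within a time-to-live are removed. The sketch below shows only that bookkeeping. `LeaseTable` and its methods are hypothetical, it does not use doozer's actual API, and it assumes some agreed-upon party calls `expire` (choosing that party is itself a coordination question).

```python
class LeaseTable:
    """Membership via leases: entries disappear unless refreshed (hypothetical)."""

    def __init__(self, ttl):
        self.ttl = ttl        # seconds a heartbeat stays valid
        self.last_seen = {}   # server -> time of its last heartbeat

    def heartbeat(self, server, now):
        # In a real deployment this would be a write to the coordination store.
        self.last_seen[server] = now

    def expire(self, now):
        """Remove and return the servers whose lease has run out."""
        dead = sorted(s for s, t in self.last_seen.items() if now - t > self.ttl)
        for server in dead:
            del self.last_seen[server]
        return dead

    def members(self):
        return sorted(self.last_seen)


leases = LeaseTable(ttl=10)
leases.heartbeat("server-a", now=0)
leases.heartbeat("server-b", now=0)
leases.heartbeat("server-a", now=8)   # server-a keeps refreshing
expired = leases.expire(now=12)       # server-b has been silent for 12s
alive = leases.members()
```

Time is passed in explicitly so the example is deterministic. The trade-off against true ephemerals is detection latency: a dead server stays in the map for up to one TTL plus the sweep interval.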
> In a similar vein, another possibility would be to use *ourselves* as
> the coordination service, via a hand-configured AFR volume. This is
> actually an approach Kaleb and I were seriously considering for HekaFS
> at the time of the acquisition, and it's not without its benefits.
> Using libgfapi we can prevent this special volume from having to be
> mounted, and we already know how to secure the communications paths for
> it (something that would require additional work with the other
> solutions). On the other hand, it would probably require additional
> translators to provide both ephemerals and watches, and might require
> its own non-glusterd solution to issues like failure detection and
> self-heal, so it doesn't exactly meet the "make it somebody else's
> problem" criterion.
> In conclusion, I think our best (long term) way forward would be to
> prototype a doozer-based version of glusterd. I could possibly be
> persuaded to try a "gluster on gluster" approach instead, but at this
> moment it wouldn't be my first choice. Are there any other suggestions
> or objections before I forge ahead?
> ZooKeeper: http://zookeeper.apache.org/
> doozer: https://github.com/ha/doozerd
> Arakoon: http://arakoon.org/
> ConCoord: http://openreplica.org/doc/