[Gluster-devel] Glusterd: A New Hope

Fri Mar 22 17:51:09 UTC 2013

On Fri, Mar 22, 2013 at 7:09 AM, Jeff Darcy <jdarcy at redhat.com> wrote:

> The need for some change here is keenly felt
> right now as we struggle to fix all of the race conditions that have
> resulted from the hasty addition of synctasks to make up for poor
> performance elsewhere in that 44K lines of C.

synctasks were not added for performance at all. glusterd, being
single-threaded, was incapable of serving a volfile for a GETSPEC command or
assigning a port for a PORTMAP query when the request came from the very
process it had spawned (glusterfs/glusterfsd). That process would ask
glusterd and wait for the result before "finishing daemonizing" (so that a
proper exit status could be returned), while glusterd would wait for
glusterfsd to return before it got back to epoll() and picked up the
portmap/getspec request -- resulting in a deadlock.

Making it multi-threaded was inevitable if we wanted to get even "basic"
behavior right - i.e. "gluster volume start" returning success only if
glusterfsd successfully started and failing if it could not start
(previously we would _always_ return success).
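
To make the deadlock concrete, here is a minimal sketch in Python rather than
glusterd's C. Everything in it is hypothetical: run_daemon stands in for
glusterfsd, volume_start for the "volume start" handler, and a Queue for the
requests that epoll() would deliver. Timeouts are used only so the sketch
terminates; the real single-threaded case would simply hang.

```python
import queue
import threading


def run_daemon(requests, started):
    """Hypothetical stand-in for glusterfsd: fetch a volfile, then report startup."""
    reply = queue.Queue()
    requests.put(("GETSPEC", reply))
    try:
        reply.get(timeout=0.5)  # wait for the management daemon's answer
    except queue.Empty:
        return  # never answered, so daemonizing cannot finish
    started.set()  # "finished daemonizing"


def volume_start(requests, use_worker):
    """Return True only if the spawned daemon reported a successful start."""
    started = threading.Event()
    result = []

    def spawn_and_wait():
        child = threading.Thread(target=run_daemon, args=(requests, started))
        child.start()
        result.append(started.wait(timeout=1.0))  # block until the child reports
        child.join()

    if not use_worker:
        # Single-threaded: the event loop itself blocks in spawn_and_wait(),
        # so nobody is left to answer the child's GETSPEC request.
        spawn_and_wait()
        return result[0]

    # Threaded: a worker does the blocking wait while the event loop keeps
    # serving requests, including the one from the child it just spawned.
    worker = threading.Thread(target=spawn_and_wait)
    worker.start()
    while worker.is_alive():
        try:
            _kind, reply = requests.get(timeout=0.05)
        except queue.Empty:
            continue
        reply.put("volfile contents")
    worker.join()
    return result[0]
```

Calling volume_start(queue.Queue(), use_worker=False) reports failure because
the request is never served, while use_worker=True lets the start complete.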

But this is yet another example of how retrofitting threads onto a
single-threaded program can cause problems. It's not unusual to see races.
Most of them are fixable with "general scheme of locking" practices applied
in a few places.

That being said, I'm open to exploring other projects which are a "good fit"
with the rest of glusterfs. It would certainly be nice to make it "someone
else's problem".

Avati

>  Delegating as much as
> possible of this functionality to mature code that is mostly maintained
> elsewhere would be very beneficial.  I've done some research since those
> meetings, and here are some results.
>
> The most basic idea here is to use an existing coordination service to
> store cluster configuration and state.  That service would then take
> responsibility for maintaining availability and consistency of the data
> under its care.  The best known example of such a coordination service
> is Apache's ZooKeeper[1], but there are others that don't have the
> noxious Java dependency - e.g. doozer[2] written in Go, Arakoon[3]
> written in OCaml, ConCoord[4] written in Python.  These all provide a
> tightly consistent generally-hierarchical namespace for relatively small
> amounts of data.  In addition, there are two other features that might
> be useful.
>
> * Watches: register for notification of changes to an object (or
> directory/container), without having to poll.
>
> * Ephemerals: certain objects go away when the client that created them
> drops its connection to the server(s).
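
A minimal sketch of these two primitives, assuming a toy in-memory store. The
class and method names are hypothetical and are not the API of ZooKeeper,
doozer or any other real service, which differ in details such as how watches
are re-armed.

```python
class CoordinationStore:
    """Toy in-memory namespace with watches and ephemeral objects."""

    def __init__(self):
        self.data = {}     # path -> value
        self.owners = {}   # path -> session id, for ephemeral objects only
        self.watches = {}  # path -> list of callbacks

    def watch(self, path, callback):
        """Register callback(path, event) for changes to one object."""
        self.watches.setdefault(path, []).append(callback)

    def _notify(self, path, event):
        for callback in self.watches.get(path, []):
            callback(path, event)

    def set(self, path, value, session=None):
        """Store a value; passing a session makes the object ephemeral."""
        self.data[path] = value
        if session is not None:
            self.owners[path] = session
        else:
            self.owners.pop(path, None)
        self._notify(path, "changed")

    def close_session(self, session):
        """Drop every ephemeral object created by a disconnected client."""
        gone = [p for p, s in self.owners.items() if s == session]
        for path in gone:
            del self.data[path]
            del self.owners[path]
            self._notify(path, "deleted")


store = CoordinationStore()
events = []
store.watch("/members/server1", lambda path, event: events.append((path, event)))
store.set("/members/server1", "up", session="conn-1")  # ephemeral entry
store.close_session("conn-1")                          # connection dropped
```

The watcher is told about both the creation and the disappearance of the
entry without ever polling for it.
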
>
> Here's a rough sketch of how we'd use such a service.
>
> * Membership: a certain small set of servers (three or more) would be
> manually set up as coordination-service masters (e.g. via "peer probe
> xxx as master").  Other servers would connect to these masters, which
> would use ephemerals to update a "cluster map" object.  Both clients and
> servers could set up watches on the cluster map object to be notified of
> servers joining and leaving.
>
> * Configuration: the information we currently store in each volume's
> "info" file as the basis for generating volfiles (and perhaps the
> volfiles themselves) would be stored in the configuration service.
> Again, servers and clients could set watches on these objects to be
> notified of changes and do the appropriate graph switches, reconfigures,
> quorum actions, etc.
>
> * Maintenance operations: these would still run in glusterd (which isn't
> going away).  They would use the coordination service for leader election to
> make sure the same activity isn't started twice, and to keep status
> updated in a way that allows other nodes to watch for changes.
>
> * Status queries: these would be handled entirely by querying objects
> within the coordination service.
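
The leader-election idea for maintenance operations could look roughly like
the sketch below. It assumes the service offers an atomic "create if absent"
on an ephemeral object; a plain dict plays that role here, and LeaderSlot and
the task and server names are hypothetical.

```python
class LeaderSlot:
    """Toy election: the first session to claim a named task leads it."""

    def __init__(self):
        self.leaders = {}  # task name -> session id of the current leader

    def try_lead(self, task, session):
        """Claim the task if nobody holds it; True if this session is the leader."""
        return self.leaders.setdefault(task, session) == session

    def session_closed(self, session):
        """Release every task led by a session whose connection dropped."""
        for task in [t for t, s in self.leaders.items() if s == session]:
            del self.leaders[task]


slots = LeaderSlot()
first = slots.try_lead("rebalance-vol0", "server1")   # server1 claims the task
second = slots.try_lead("rebalance-vol0", "server2")  # already claimed
slots.session_closed("server1")                       # server1 disconnects
retry = slots.try_lead("rebalance-vol0", "server2")   # server2 takes over
```

Only one server runs the activity at a time, and another can take over once
the leader's session ends.
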
>
> Of the alternatives available to us, only ZooKeeper directly supports
> all of the functionality we'd want.  However, the Java dependency is
> decidedly unpleasant for us and would be totally unacceptable to some of
> our users.  Doozer seems the closest of the remainder; it supports
> watches but not ephemerals, so we'd either have to synthesize those on
> top of doozer itself or find another way to handle membership (the only
> place where we use that functionality) based on the features it does
> have.  The project also seems reasonably mature and active, though we'd
> probably still have to devote some time to developing our own local
> doozer expertise.
>
> In a similar vein, another possibility would be to use *ourselves* as
> the coordination service, via a hand-configured AFR volume.  This is
> actually an approach Kaleb and I were seriously considering for HekaFS
> at the time of the acquisition, and it's not without its benefits.
> Using libgfapi we can prevent this special volume from having to be
> mounted, and we already know how to secure the communications paths for
> it (something that would require additional work with the other
> solutions).  On the other hand, it would probably require additional
> translators to provide both ephemerals and watches, and might require
> its own non-glusterd solution to issues like failure detection and
> self-heal, so it doesn't exactly meet the "make it somebody else's
> problem" criterion.
>
> In conclusion, I think our best (long term) way forward would be to
> prototype a doozer-based version of glusterd.  I could possibly be
> persuaded to try a "gluster on gluster" approach instead, but at this
> moment it wouldn't be my first choice.  Are there any other suggestions
> or objections before I forge ahead?
>
> [1] http://zookeeper.apache.org/
> [2] https://github.com/ha/doozerd
> [3] http://arakoon.org/
> [4] http://openreplica.org/doc/
>