[Gluster-devel] solutions for split brain situation
Mark Mielke
mark at mark.mielke.cc
Wed Sep 16 21:18:06 UTC 2009
On 09/16/2009 05:45 AM, Gordan Bobic wrote:
> It's not my project (I'm just a user of it), but having done my research, my conclusion is that there is nothing else available that is similar to GlusterFS. The world has waited a long time for this, and imperfect as it may be, I don't see anything else similar on the horizon.
>
> GlusterFS is an implementation of something that has only been academically discussed elsewhere. And I haven't seen any evidence of any other similar things being implemented any time soon. But if you think you can do better, go for it. :-)
>
I came to a slightly different conclusion, but with a similar effect. Of the
projects available, GlusterFS is the closest to production *today*. The
world has waited a long time for this. It is imperfect, but right now
it's still high on the list of solutions that can be used today and have
potential for tomorrow.
In case it is of any use to others, here is the list I had worked out
when doing my analysis:
- GlusterFS (http://gluster.com/community/index.php) - Very
promising shared nothing architecture, production ready software
supported commercially, based on FUSE (provides insulation from the
kernel at a small performance cost). Simple configuration. Very cute
implementation where each "brick" for a "cluster/replication" setup is
just a regular file system that can be accessed natively, so the data is
always safe and can be inspected using UNIX commands or backed up using
rsync. Most of the logic is client side, including replication, and extended
file system attributes are used to journal changes and "self-heal" (see the
xattr sketch after this list). But very recently there have been some
problems, possibly with how GlusterFS calls Linux, triggering a kernel issue
that causes the system to freeze up. My own first test froze things up. The
GlusterFS support people want to find the problem, and I will be working with
them to see whether it can be resolved.
     - Ceph (http://ceph.newdream.net/) - Very promising shared nothing
architecture, with kernel module support instead of FUSE (better
performance), but not ready for production. They say they will stabilize
it by the end of 2009, but do not recommend using it for production even
at that time.
- PVFS (http://www.pvfs.org/) - Very promising architecture. Widely
used in production. V1 has a shared metadata server; in V2 they are
changing to a shared nothing architecture. Has kernel module support
instead of FUSE (better performance). However, PVFS does not provide
full POSIX guarantees. In particular, they do not implement advisory
locking through flock()/fcntl(). This means that using this system would
probably require an architecture that does master/slave fail over as
opposed to master/master fail over. Most file system accesses do not
need this level of locking, but dovecot in particular probably does.
The dovecot locking through .lock files might work (see the locking
sketch after this list), but I need to look a little closer.
     - Grid Datafarm (http://datafarm.apgrid.org/) - Designed as a user
space data sharing mechanism, but a FUSE module is available to
provide POSIX functionality on top.
- Lustre (http://www.lustre.org/) - Seems to be the focus of the
Commercial world. Currently based on ext3/ext4, to be based on ZFS in
2010. Its weakness seems to be a single shared metadata server that
must be made highly available using a shared disk solution such as GFS or
OCFS. Due to this architecture, I do not consider this solution to meet
our requirement of a shared nothing architecture where any server can
completely die and the other servers take over the load without
intervention.
- MooseFS (http://www.moosefs.com/) - Alternative to Lustre. Still
uses a shared metadata server, and therefore does not meet requirements.
- XtreemFS (http://en.wikipedia.org/wiki/XtreemFS) - Very promising
architecture. However, current version uses single metadata server and
will only replicate content that is specifically marked as read only.
Replicated metadata scheduled for 2010Q1. Read/write replication
scheduled for some time later.
- CRFS (http://oss.oracle.com/projects/crfs/) - Btrfs based - Btrfs
is Oracle's answer to ZFS, and CRFS is Oracle's answer to Lustre,
although development of this solution seems slow and this system is not
ready for production. Development of both has effectively stalled
since 2008. If these are ever released, I think they will be great
solutions, but they are apparently having design problems (either the
developers are not good enough, or the design is too complicated,
probably both).
- TahoeFS (http://allmydata.org/trac/tahoe) - POSIX interface (via
FUSE) not ready for production.
- Coda (http://www.coda.cs.cmu.edu/) and Inter-Mezzo
(http://en.wikipedia.org/wiki/InterMezzo_%28file_system%29) - Older
"experimental" distributed file systems, still maintained, but with no
development beyond bugfixes that I can see. They say the developers have
moved on to Lustre.
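
To illustrate the GlusterFS point above: the replication bookkeeping lives in
extended attributes on the files inside each brick, so it can be inspected
with ordinary xattr tools right on the brick. Here is a minimal C sketch; the
brick path and file name are made up, and I am not assuming anything about
the attribute names GlusterFS itself uses.

/* xattr_dump.c - list the extended attributes on a file inside a brick.
 * Sketch only: /data/brick1/somefile is an invented path.
 * Build: gcc -o xattr_dump xattr_dump.c
 * Run as root, since "trusted.*" attributes need CAP_SYS_ADMIN to read. */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/data/brick1/somefile";
    char names[4096];
    unsigned char value[256];
    char *name;
    ssize_t len, vlen, i;

    /* listxattr() returns every attribute name, NUL-separated. */
    len = listxattr(path, names, sizeof(names));
    if (len < 0) {
        perror("listxattr");
        return 1;
    }

    for (name = names; name < names + len; name += strlen(name) + 1) {
        vlen = getxattr(path, name, value, sizeof(value));
        printf("%s = ", name);
        for (i = 0; i < vlen; i++)
            printf("%02x", (unsigned)value[i]); /* values are opaque binary */
        printf("\n");
    }
    return 0;
}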
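
And to make the PVFS locking point concrete, this is roughly the difference
between fcntl() advisory locking, which PVFS reportedly does not implement,
and dovecot-style .lock files, which only need an atomic O_CREAT|O_EXCL on
the underlying file system. Generic POSIX sketch; the file names are invented
and this is not dovecot's actual code.

/* lock_sketch.c - the two locking styles mentioned above. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Advisory record lock on an open file descriptor (fcntl-style). */
static int lock_with_fcntl(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;             /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;                  /* lock the whole file */
    fl.l_len = 0;
    return fcntl(fd, F_SETLKW, &fl); /* blocks until the lock is granted */
}

/* Dot-lock: create "<file>.lock" atomically; succeeds only if absent. */
static int lock_with_dotfile(const char *lockpath)
{
    int fd = open(lockpath, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0)
        return -1;                   /* someone else holds the lock */
    close(fd);
    return 0;                        /* caller removes lockpath to unlock */
}

int main(void)
{
    int fd = open("mailbox", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (lock_with_fcntl(fd) < 0)
        perror("fcntl lock");        /* the call PVFS does not honor */

    if (lock_with_dotfile("mailbox.lock") < 0)
        perror("dot-lock");
    else
        unlink("mailbox.lock");

    close(fd);
    return 0;
}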
I am still having some problems with GlusterFS - I rebooted my machines
at the exact same time and all three came up frozen in the mount call.
Now that I know how to clear the problem - ssh in from another window
and kill -9 the mount process - it isn't so bad, but I can't take this to
production unless this issue is resolved. I'll try to come up with
better details.
Cheers,
mark
--
Mark Mielke <mark at mielke.cc>