[Gluster-devel] solutions for split brain situation

Wed Sep 16 21:18:06 UTC 2009

On 09/16/2009 05:45 AM, Gordan Bobic wrote:
> It's not my project (I'm just a user of it), but having done my research, my conclusion is that there is nothing else available that is similar to GlusterFS. The world has waited a long time for this, and imperfect as it may be, I don't see anything else similar on the horizon.
>
> GlusterFS is an implementation of something that has only been academically discussed elsewhere. And I haven't seen any evidence of any other similar things being implemented any time soon. But if you think you can do better, go for it. :-)
>    

I came to a slightly different conclusion, but with similar effect. Of the 
projects available, GlusterFS is the closest to production *today*. The 
world has waited a long time for this. It is imperfect, but right now 
it's still high on the list of solutions that can be used today and have 
potential for tomorrow.

In case it is of any use to others, here is the list I had worked out 
before when doing my analysis:

     - GlusterFS (http://gluster.com/community/index.php) - Very 
promising shared nothing architecture, production ready software 
supported commercially, based on FUSE (provides insulation from the 
kernel at a small performance cost). Simple configuration. Very cute 
implementation where each "brick" for a "cluster/replication" setup is 
just a regular file system that can be accessed natively, so the data is 
always safe and can be inspected using UNIX commands or backed up using 
rsync. Most logic is client side, including replication, and they use 
file system extended attributes to journal changes and "self-heal". 
However, very recently there have been some problems, possibly with how 
GlusterFS calls into Linux, triggering a kernel problem that causes the 
system to freeze up a bit. My own first test froze things up. The 
GlusterFS support people 
want to find the problem and I will be working with them to see whether 
this can be resolved or not.

     - Ceph (http://ceph.newdream.net/) - Very promising shared nothing 
architecture, that has kernel module support instead of FUSE (better 
performance) but not ready for production. They say they will stabilize 
it by the end of 2009, but do not recommend using it for production even 
at that time.

     - PVFS (http://www.pvfs.org/) - Very promising architecture. Widely 
used in production. V1 has a shared metadata server; in V2 they are 
changing to a shared nothing architecture. Has kernel module support 
instead of FUSE (better performance). However, PVFS does not provide 
POSIX guarantees. In particular, they do not implement advisory locking 
through flock()/fcntl(). This means that use of this system would 
probably require an architecture that does master/slave fail over as 
opposed to master/master fail over. Most file system accesses do not 
care for this level of locking, but dovecot in particular probably does. 
The dovecot locking through .lock files might work, but I need to look a 
little closer.

     - Grid Datafarm (http://datafarm.apgrid.org/) - Designed as a user 
space data sharing mechanism, however a FUSE module is available to 
provide POSIX functionality on top.

     - Lustre (http://www.lustre.org/) - Seems to be the focus of the 
commercial world. Currently based on ext3/ext4, to be based on ZFS in 
2010. The weakness seems to be having a single shared metadata server 
that must be highly available using a shared disk solution such as GFS 
or OCFS. Due to this architecture, I do not consider this solution to 
meet our requirements of a shared nothing architecture where any server 
can completely die, and the other server takes over the load without 
intervention.

     - MooseFS (http://www.moosefs.com/) - Alternative to Lustre. Still 
uses a shared metadata server, and therefore does not meet requirements.

     - XtreemFS (http://en.wikipedia.org/wiki/XtreemFS) - Very promising 
architecture. However, the current version uses a single metadata server and 
will only replicate content that is specifically marked as read only. 
Replicated metadata scheduled for 2010Q1. Read/write replication 
scheduled for some time later.

     - CRFS (http://oss.oracle.com/projects/crfs/) - Btrfs based - Btrfs 
is Oracle's answer to ZFS, and CRFS is Oracle's answer to Lustre, 
although development of this solution seems slow and this system is not 
ready for production. Development of both has effectively stalled 
since 2008. If these are ever released, I think they will be great 
solutions, but they are apparently having design problems (either 
developers who are not good enough, or the design is too complicated, 
probably both).

     - TahoeFS (http://allmydata.org/trac/tahoe) - POSIX interface (via 
FUSE) not ready for production.

     - Coda (http://www.coda.cs.cmu.edu/) and InterMezzo 
(http://en.wikipedia.org/wiki/InterMezzo_%28file_system%29) - Older 
"experimental" distributed file systems still being maintained, but no 
development beyond bugfixes that I can see. They say the developers have 
moved on to Lustre.
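
Since advisory locking (flock()/fcntl()) and .lock files came up above 
as a deciding factor, here is a rough sketch of how one might probe a 
mounted directory for both. It is not from any of the projects listed; 
the function names are made up for illustration. It only shows whether 
the calls succeed on one client, not whether locks are honoured 
coherently between nodes, which would need a second client to test.

```python
import fcntl
import os
import tempfile


def supports_fcntl_locks(directory):
    """Return True if an fcntl()-style advisory lock can be taken in directory."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            # Non-blocking exclusive lock; OSError means unsupported or refused.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
        os.unlink(path)


def supports_flock(directory):
    """Return True if a flock()-style advisory lock can be taken in directory."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
        os.unlink(path)


def try_dotlock(path):
    """Try to create path + '.lock' atomically; True means we hold the lock."""
    try:
        # O_CREAT | O_EXCL fails if the lock file already exists.
        fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_dotlock(path):
    """Remove the lock file created by try_dotlock()."""
    os.unlink(path + ".lock")
```

Pointing supports_fcntl_locks() and supports_flock() at a mount point of 
the file system under test gives a quick first answer before trying a 
real application on it.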

I am still having some problems with GlusterFS - I rebooted my machines 
at the exact same time and all three came up frozen in the mount call. 
Now that I know how to clear the problem - ssh in from another window 
and kill -9 the mount - it isn't so bad, but I can't take this to 
production unless this issue is resolved. I'll try to come up with 
better details.
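
The manual workaround above (kill -9 on a stuck mount) could be 
automated along these lines. This is only a sketch under assumptions: 
the helper name is invented, the command to run is whatever you 
normally use to mount, and it would not catch a process that detaches 
into its own session or one stuck in uninterruptible sleep.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; SIGKILL its whole process group if it exceeds timeout_s.

    Returns the exit code, or None if the command had to be killed.
    """
    # start_new_session=True puts the child in its own process group,
    # so helpers it spawns can be killed together with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited just as the timeout fired
        proc.wait()
        return None


if __name__ == "__main__":
    # Hypothetical mount point; assumes it has an /etc/fstab entry.
    status = run_with_timeout(["mount", "/mnt/glusterfs"], 60)
    print("mount was killed" if status is None else "mount exit code %d" % status)
```

A boot script could retry the mount after a kill, which hides the 
symptom but does not fix the underlying freeze.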

Cheers,
mark

-- 
Mark Mielke<mark at mielke.cc>