[Gluster-devel] solutions for split brain situation

Wed Sep 16 22:49:24 UTC 2009

Some comments as a user of the open source version, and as a reseller of 
the commercial version, including having provided emergency support to 
users ...  take these with a grain of salt if you wish.

Mark Mielke wrote:
> On 09/16/2009 05:45 AM, Gordan Bobic wrote:
>> It's not my project (I'm just a user of it), but having done my 

[...]

> I came to a slightly different conclusion, but similar effect. Of the 
> projects available, GlusterFS is the closest to production *today*. The 

As a user of many file systems over quite a span of time, I have yet to 
see "the one true file system that is really bug free, always 
works, and never fails."  All software is buggy.  Some more so than 
others, but all software is buggy.  Anyone telling you otherwise is 
trying to sell something to you.

> world has waited a long time for this. It is imperfect, but right now 
> it's still high on the list of solutions that can be used today and have 
> potential for tomorrow.

For every storage design and implementation you do, you need to ask 
yourself "if this went away, what would be the impact upon me and my 
work?"  You then need to design to this.  Failure to do so ... well ...

> In case it is of any use to others, here is the list I had worked out 
> before when doing my analysis:
> 
>     - GlusterFS (http://gluster.com/community/index.php) - Very 
> promising shared nothing architecture, production ready software 
> supported commercially, based on FUSE (provides insulation from the 
> kernel at a small performance cost). Simple configuration. Very cute 
> implementation where each "brick" for a "cluster/replication" setup is 
> just a regular file system that can be accessed natively, so the data is 
> always safe and can be inspected using UNIX commands or backed up using 
> rsync. Most logic is client side, including replication, and they use 
> file system attributes to journal changes and "self-heal". But, very 
> recently there have been some problems, possibly with how GlusterFS calls 
> Linux, triggering a Linux problem that causes the system to freeze up a 
> bit. My own first test froze things up. The GlusterFS support people 
> want to find the problem and I will be working with them to see whether 
> this can be resolved or not.
> 
>     - Ceph (http://ceph.newdream.net/) - Very promising shared nothing 
> architecture, that has kernel module support instead of FUSE (better 
> performance) but not ready for production. They say they will stabilize 
> it by the end of 2009, but do not recommend using it for production even 
> at that time.

Ceph is very interesting, and should be one to watch over time.  Sage 
and group seem to have fewer resources at their disposal than 
Z-Research, so evolution may take longer.

> 
>     - PVFS (http://www.pvfs.org/) - Very promising architecture. Widely 
> used in production. V1 has a shared metadata server. V2 they are 
> changing to a shared nothing architecture. Has kernel module support 
> instead of FUSE (better performance). However, PVFS does not provide 
> POSIX guarantees. In particular, they do not implement advisory locking 
> through flock()/fcntl(). This means that use of this system would 
> probably require an architecture that does master/slave fail over as 
> opposed to master/master fail over. Most file system accesses do not 
> care for this level of locking, but dovecot in particular probably does. 
> The dovecot locking through .lock files might work, but I need to look a 
> little closer.

PVFS is not a POSIX file system.  You shouldn't try to use it as one. 
PVFS2 is the current release, and as Dan from Synthetic Genomics might 
note, it has some issues with codes that want to use it as a parallel 
POSIX file system.  PVFS2 is purpose built for MPI-IO and related codes. 
There is nothing wrong with this, and in fact, this is a good thing, 
as MPI-IO capabilities are very important in HPC sectors.

Probably not so important for Dovecot.
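
For anyone unsure what the locking discussion above amounts to in 
practice, here is a minimal sketch in Python of the two approaches 
mentioned in the quoted text: advisory locking through fcntl(), and a 
".lock" file created atomically next to the mailbox.  The function names 
and the "mbox" path are made up for illustration, and this is not how 
Dovecot itself implements either one.  On a file system without advisory 
locking support, I would expect the first function to fail or return 
None, while the second only needs an atomic exclusive create.

```python
import errno
import fcntl
import os
import tempfile


def try_fcntl_lock(path):
    """Try to take an exclusive, non-blocking advisory lock via fcntl().

    Returns the open file object on success (keep it open to hold the
    lock), or None if the lock is held by another process or the file
    system does not support locking.
    """
    f = open(path, "a+")
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        f.close()
        unsupported_or_busy = (errno.EACCES, errno.EAGAIN, errno.ENOLCK,
                               errno.ENOSYS, errno.EOPNOTSUPP)
        if e.errno in unsupported_or_busy:
            return None
        raise
    return f


def try_dotlock(path):
    """Create path + '.lock' atomically; return True if we now own it."""
    try:
        # O_CREAT | O_EXCL fails if the lock file already exists.
        fd = os.open(path + ".lock",
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return True


def release_dotlock(path):
    """Remove the lock file created by try_dotlock()."""
    os.unlink(path + ".lock")


if __name__ == "__main__":
    # Hypothetical mailbox path in a scratch directory.
    mailbox = os.path.join(tempfile.mkdtemp(), "mbox")
    if try_dotlock(mailbox):
        print("dotlock acquired")
        release_dotlock(mailbox)
    held = try_fcntl_lock(mailbox)
    print("fcntl lock acquired" if held else "fcntl lock unavailable")
    if held:
        held.close()
```

Note that fcntl() locks are per process, so a second attempt from the 
same process will succeed; testing contention needs two processes.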

[...]

>     - Lustre (http://www.lustre.org/) - Seems to be the focus of the 
> Commercial world. Currently based on ext3/ext4, to be based on ZFS in 
> 2010. Weakness seems to be in having a single shared metadata server that 
> must be highly available using a shared disk solution such as GFS or 
> OCFS. Due to this architecture, I do not consider this solution to meet 
> our requirements of a shared nothing architecture where any server can 
> completely die, and the other server takes over the load without 
> intervention.

Lustre is dependent upon Sun, and there are, to put it mildly, concerns 
over its future within Oracle.  Oracle isn't really in the high 
performance computing market, which is where Lustre plays.  I won't go 
into more depth here on its future.

Lustre is predominantly an object based storage system.  It depends 
critically upon features that require very specific kernels and kernel 
patches, which tends to make it incompatible with a requirement to keep 
the distro-specific kernels.

The migration to ZFS has been seen in some circles (people have 
mentioned this to us) as a migration over to Solaris, which has caused 
numerous users to start to look at transition plans off of Lustre. 
Which is hard when you have petabytes of data ... moving it ain't easy.

>     - CRFS (http://oss.oracle.com/projects/crfs/) - Btrfs based - Btrfs 
> is Oracle's answer to ZFS, and CRFS is Oracle's answer to Lustre, 
> although development of this solution seems slow and this system is not 
> ready for production. Development for both has effectively stalled 
> since 2008. If these are ever released, I think they will be great 
> solutions, but they are apparently having design problems (either 
> developers who are not good enough, or the design is too complicated, 
> probably both).

BTRFS has most definitely not stalled.  It is now in the Linux kernel as 
of 2.6.29, and is the target file system for a number of well known 
distros going forward.  Ext4 is simply not viable for the storage sizes 
people are contemplating.  XFS, a venerable file system, has most of its 
developers at SGI, which carries obvious risks.  JFS may not be actively 
developed anymore.  Chris Mason has been very actively doing BTRFS work 
as far as I can tell from the various sources: 
http://btrfs.wiki.kernel.org/index.php/Main_Page#News

CRFS is dependent upon BTRFS, so CRFS is more of a placeholder.

With Sun owning ZFS and Oracle BTRFS, given that the latter is GPL 
compliant and the former is not (and is patent encumbered), I expect 
more work on BTRFS going forward for Linux, an important platform for 
Oracle.  Solaris is not increasing in installed base, rather it is 
rapidly doing the opposite, and this trend isn't likely lost on Oracle.

Of course, we could be wrong, and our biases are in part due to what we 
sell, resell, and support, so take what I say with a grain of salt if 
you wish.

I do expect GlusterFS to work well atop BTRFS in the not so distant future.

You did neglect pNFS in your notes.  It's sort of the "pink elephant" in 
the room.  There are good things about it, and some ... er ... 
challenging things about it.  I expect the Kerberos requirements (and 
all that this implies) aren't going to help its adoption.  If you 
haven't dealt with installing and managing Kerberos, you might not get 
this.

Also, POHMELFS was included in Linux in 2.6.29.  This is an interesting 
parallel file system, but we haven't played with it much yet.

Finally, among the other file systems you should pay attention to, 
nilfs2 looks to be quite interesting.  Continuous snapshotting is an 
appealing feature, though how this could be used from within GlusterFS 
(GlusterFS atop nilfs2) isn't completely apparent yet.  It could make 
for some very powerful capability in GlusterFS if the developers go this 
route.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615