[Gluster-devel] solutions for split brain situation

Wed Sep 16 09:45:00 UTC 2009

>> You always need to copy in data from the mountpoint. glusterfs does
>> not support working over existing data. Because of the additional
>> restrictions glusterfs imposes on the backend maintenance, it is just
>> not feasible to support concurrent access to the backend as well as
>> from the mountpoint.
>
> Can you elaborate what these additional restrictions are like?

From the GlusterFS POV, you have to make sure the journal data is correct on all the files. If it isn't 100% correct, you will see cases where it isn't obvious which file/directory should "win", and you'll see split-brains, erroneous clobbering or data corruption. It's roughly equivalent to having some applications use a fs as ext2 (unjournalled) and others as ext3 on the same files, without cache coherency - you cannot expect that to yield sensible and consistent results.

>> The user is having a wrong expectation. glusterfs was never meant to
>> be a seamless replacement for NFS. The purpose and goals of NFS are
>> very different from glusterfs. It so happens that glusterfs uses an
>> existing disk filesystem as a backend, and so does NFS. But NFS is
>> _supposed_ to work fine with concurrent server side access. glusterfs
>> on the other hand has never claimed such a support, nor do we intend
>> to support this use case anytime soon. As an analogy, accessing the
>> backend export of glusterfs directly is to be considered as risky as
>> accessing the block layer of a disk filesystem.

> That isn't really a valid analogy, because you imply that one wants to write
> to random blocks be they used or unused in a disk-layer situation. But the
> fact is that the only difference between a file fed over glusterfs-mountpoint
> and fed directly are the xattribs. The files' content is exactly the same.

Not necessarily - there are also buffering issues and write ordering. I'd say the analogy is pretty valid.

> This probably means that more than 99% of the whole fs content is the same.
> So all you have to do inside glusterfs is to notice if a file has been fed by
> mountpoint or by local feed.

And how would you handle lock management and write ordering across servers, then? You expect this to happen magically? Or would you be happy with random file corruption arising from concurrent access?
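To make the write-ordering point concrete, here is a toy sketch (plain Python, nothing GlusterFS-specific; the buffer size and the two writes are made up for illustration). Two overlapping writes that reach two replicas in different orders, with no locking or ordering between servers, leave the replicas with different contents:

```python
def apply_writes(size, writes):
    """Apply (offset, data) writes, in the given order, to a zero-filled buffer."""
    buf = bytearray(size)
    for offset, data in writes:
        buf[offset:offset + len(data)] = data
    return bytes(buf)

# Two overlapping writes issued without any cross-server coordination.
w1 = (0, b"AAAA")
w2 = (2, b"BBBB")

# Each replica happens to see the writes in a different order.
replica_a = apply_writes(8, [w1, w2])
replica_b = apply_writes(8, [w2, w1])

# The replicas no longer hold the same bytes.
diverged = replica_a != replica_b
```

Neither replica is "wrong" by itself; without agreed ordering there is simply no way to say which result is the intended one.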

> In the standard case (mountpoint) you just proceed. In case of a recognised
> local feed you only have to give it your standard xattribs (just as the file
> had been freshly fed over a mountpoint) and start self-heal to distribute it
> to other subvolumes (we only look at local feed to primary subvolume, because
> other feeds are only a special case of this).
> You even have the choice to not open the file over a mountpoint-access as long
> as the state is not restored to what you see as glusterfs-like - just as you
> say now for a split-brain case.

How would you distinguish the inconsistency between a locally created new file and a stale old file that should have been deleted? Journalling and versioning on the parent directory tells you that, and this would be unavailable (unupdated) when writing locally.
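A toy model of that problem, as a sketch only: the `DirReplica` class, the single integer `version` and the `classify` rule are assumptions made up for illustration, not GlusterFS's actual on-disk format or healing algorithm. The idea is that an entry found on only one replica is judged by comparing the parent directory's version on both sides, and a direct backend write never updates that version:

```python
from dataclasses import dataclass, field

@dataclass
class DirReplica:
    """One replica's view of a directory (hypothetical, for illustration)."""
    version: int = 0
    entries: set = field(default_factory=set)

def classify(holder, other):
    """Judge an entry that exists on `holder` but is missing on `other`."""
    if holder.version > other.version:
        return "new: copy to other"
    if holder.version < other.version:
        return "stale: delete from holder"
    return "ambiguous: versions equal, cannot tell"

# Case 1: B was offline while a file was deleted through the mountpoint.
a = DirReplica(version=6, entries=set())
b = DirReplica(version=5, entries={"old.txt"})
case_stale = classify(b, a)

# Case 2: a file is written directly into B's backend; no version is bumped.
a = DirReplica(version=6, entries=set())
b = DirReplica(version=6, entries={"new.txt"})
case_local = classify(b, a)

# Case 3: same direct write, but B also happens to be one version behind.
b = DirReplica(version=5, entries={"new.txt"})
case_clobber = classify(b, a)
```

In case 3 the freshly written file looks exactly like the stale file from case 1, so it would be deleted.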

> I really would love to hear some detailed explanation why a local feed should
> be impossible to manage.

Nothing is impossible, but this use-case is fraught with race conditions, so the difficulty of adding such a feature would vastly exceed its usefulness.

> The question is very important because there will be lots of potential users
> that are simply unable to copy their data because of the sheer size and time
> this would take - just like us.

If you are only looking for a one-off import/migration from one master server, that is much more tractable. Suspend write access to the data and write a program to add the xattrs to the gluster backing store on the primary server, as they would be after the initial copy. Then bring the other servers online and do ls -laR on them to initiate the healing. That would be workable.

But why? The time to get the data replicated wouldn't change unless your network i/o considerably exceeds your disk i/o performance.
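For what it's worth, the "program to add the xattrs" for such a one-off import could be sketched roughly as below. This is only an outline under loud assumptions: `stamp_tree` and `PLACEHOLDER_XATTRS` are hypothetical names, and the attribute name and value shown are placeholders. The real ones depend on the GlusterFS version and volume configuration and would have to be read off a file that was genuinely copied in through the mountpoint. `os.setxattr` is Linux-only, and writing `trusted.*` attributes needs root.

```python
import os

# PLACEHOLDER ONLY: not a real GlusterFS attribute. Replace with the names
# and values observed on a file copied in through the mountpoint.
PLACEHOLDER_XATTRS = {"user.example.replicate": b"\x00" * 8}

def stamp_tree(root, xattrs=PLACEHOLDER_XATTRS, setter=None, dry_run=True):
    """Walk `root` and apply `xattrs` to every directory and file.

    Returns the list of paths visited. With dry_run=True nothing is modified.
    """
    if not dry_run and setter is None:
        setter = os.setxattr  # resolved lazily; only exists on Linux
    visited = []
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in paths:
            visited.append(path)
            if not dry_run:
                for name, value in xattrs.items():
                    setter(path, name, value)
    return visited
```

Run it with write access suspended, check the dry-run output first, and only then bring the other servers online and do the ls -laR as described above.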

> And there is no good reason for them to even
> think of migration knowing that all you have to do is wait for an upcoming
> pNFS implementation that surely allows soft migration and parallel use of
> other NFS versions. I don't say that pNFS is a better solution for the basic
> problem, but it is a possible solution that allows soft migration which is a
> really important factor.

Nice utopian dream, but this promise on NFS4 (pNFS) has been around for at least a decade, with thus far no production quality implementations. I don't expect to see any, either, in the next couple of years, probably longer.

> More than 30 years in application- and driver-programming have shown one thing
> for me: you will not be successful if you don't focus on the users with
> maximum expectations. All others are only subsets.

While I agree that it is nice to aim for the stars, there is overwhelming evidence that this is not an economically sound way to run one's business. It comes down to effort/benefit.

> Failing only one important
> expectation (judged by user, not by programmer) will make your whole project
> fail. The simple reason for this is: the world does not wait for your project.

It's not my project (I'm just a user of it), but having done my research, my conclusion is that there is nothing else available that is similar to GlusterFS. The world has waited a long time for this, and imperfect as it may be, I don't see anything else similar on the horizon.

GlusterFS is an implementation of something that has only been academically discussed elsewhere. And I haven't seen any evidence of any other similar things being implemented any time soon. But if you think you can do better, go for it. :-)

Gordan