[Gluster-devel] solutions for split brain situation

Mon Sep 14 14:18:13 UTC 2009

On Mon, 14 Sep 2009 14:19:00 +0200
Daniel Maher <dma+gluster at witbe.net> wrote:

> Stephan von Krawczynski wrote:
> > Hello all,
> > 
> > we have seen several split brain situations and think that the most common
> > option for the situation is simply missing. You can define a favourite child,
> > but you cannot define to use the latest file copy as definitive. Why not?
> > Isn't it a logical approach to say that the latest copy of a file based on
> > mtime must be the most up-to-date and therefore being used in split brain
> > recovery?
> 
> Are you sure about that ?  If you have one file which, due to split 
> brain, now exists in two different states, and each of them are modified 
> during the event, which one is actually correct ?
> 
> Imagine a log file ; the split brain event occurs, and now this log file 
> exists in two different forms, both of which are being updated on either 
> side of the split.  Each of these files contains data which, while 
> incomplete, is still valid.  Clobbering the one with the older mtime 
> (even, as in the case of log files, by perhaps seconds) means that 
> you'll lose all of the log data that happened to be in the slightly 
> older file.
> 
> -- 
> Daniel Maher <dma+gluster at witbe.net>

I do not state that my suggestion is _the_ solution. Obviously there is no
single solution for all split brain situations. This is why I suggest adding
several options.
Our "split brain" is not a real split brain and looks like this: logfiles are
written every 5 minutes. If you add a secondary server that holds 14-day-old
logfiles, about half of your data vanishes and no successful self-heal is
performed, because the old logfiles read from the secondary server overwrite
the new logfiles on your primary while new data is being added to them. This
is a very simple and solvable situation. All you have to do to get it 100%
right is compare the files' mtimes.
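
To illustrate the mtime comparison, here is a minimal sketch in Python. It is
not glusterfs code: the helper name newest_copy and the idea of handing it the
backend paths of each replica are assumptions made purely for illustration.

```python
import os


def newest_copy(replica_paths):
    """Return the replica path with the most recent mtime.

    Hypothetical helper: replica_paths are the backend paths of the same
    file on each server. Replicas that lack the file are skipped.
    """
    existing = [p for p in replica_paths if os.path.exists(p)]
    if not existing:
        raise FileNotFoundError("no replica holds this file")
    # The copy written to last wins; all other copies would be healed from it.
    return max(existing, key=lambda p: os.stat(p).st_mtime)
```

With a 14-day-old copy on the re-added server and a copy written minutes ago
on the primary, this picks the primary's copy regardless of the order in
which the servers are listed. It relies on the servers' clocks being
reasonably in sync.
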
Whereas a true split brain is rare, our situation arises every time you re-add
a server, for example after a kernel update or a reboot for some other
reason. Your secondary comes back and kicks your ass. Even better, it is
completely irrelevant which server gets re-added: as soon as it has old data
on it, you are busted.
You might argue that this can be prevented by simply deleting everything on a
newly added server. But if you deal with TBs of data, you really do not want
to spend the time and network bandwidth to heal all of it when most of it is
actually in good shape and only some MBs or GBs are outdated.
Btw I know this is not what you call "split brain", but glusterfs thinks it
is, and that is part of the problem. It cannot distinguish the cases.
Your argument is broken anyway, because in your situation you will lose the
data no matter whether you keep the current implementation or create a new
"option favorite-child mtime". In the current implementation you will lose
about every other file's content, summing up to 100% of the files being
damaged, if in a true split brain both servers get new data for their
respective filesets and these are mixed together later on. If the file comes
from server A, you have lost all data added on server B during the split
brain, and vice versa.
Thinking about it, it sounds as if the current implementation is the worst
possible. There is really no good reason for distributing file access once a
split brain has been detected. At the very least it should then choose the
same child for all subsequent file access, to prevent the 100% loss.
Another idea would be switching split-brain files to read-only access. This
would be the conservative approach of not losing already written data - only
new writes get lost this way.
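
The last two ideas (always use the same child, and refuse writes) can be
sketched together in a few lines of Python. Again, this is not glusterfs
code: pinned_child, open_split_brain_file and the choice of a CRC32 hash of
the path to pin the child are all assumptions for illustration only.

```python
import errno
import zlib


def pinned_child(path, children):
    """Map a path to one child, the same one on every call.

    Sorting makes the result independent of the order in which the
    children happen to be listed.
    """
    ordered = sorted(children)
    return ordered[zlib.crc32(path.encode("utf-8")) % len(ordered)]


def open_split_brain_file(path, children, mode="r"):
    """Return the child to read from; reject any mode that could write."""
    if any(flag in mode for flag in "wax+"):
        raise OSError(errno.EROFS,
                      "split brain detected, file is read-only", path)
    return pinned_child(path, children)
```

Reads of a given file then always come from one child, so the two versions
are never mixed, and attempts to write or append fail with EROFS until
someone resolves the conflict.
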

-- 
Regards,
Stephan