[Gluster-devel] solutions for split brain situation

Stephan von Krawczynski skraw at ithnet.com
Fri Sep 18 10:44:16 UTC 2009


On Thu, 17 Sep 2009 20:27:13 -0400
Mark Mielke <mark at mark.mielke.cc> wrote:

> On 09/17/2009 06:47 PM, Stephan von Krawczynski wrote:
> > Way above in this discussion I said that we are only talking about the
> > first/primary subvolume/backend for simplicity. It makes no sense to check a
> > journal if I can stat the real file, which I have to do anyway when an
> > open/create arrives - and we are talking about exactly that. So please
> > explain where your assumed race is. Really, only a braindead implementation
> > can race on an open. You can delay a flush on close (like writebehind), but
> > you obviously cannot delay an open - neither r, rw, nor create - because you
> > have to know whether the file a) exists and b) can be created if not. As
> > long as you don't touch the backend you will not find out whether a create
> > may fail for disk-full or the like. It may as well fail because of access
> > privileges. Whatever it is, you will not find a trusted answer without
> > asking the backend; no journal will save you.
> >    
> 
> Like most backend storage, the backend includes the data pages, the 
> metadata, AND the journal. "Without asking the backend" and "no journal 
> will save you" fail to recognize that the backend *includes* the journal.
> 
> A scenario which should make this clear: Let's say the file a.c is 
> removed from a 2-node replication cluster. Something like the 
> following should occur: Step 1 is to lock the resource. Step 2 is to 
> record the intent to remove on each node. Step 3 is to remove on each 
> node. Step 4 is to clear the intent from each node. Step 5 is to unlock 
> the resource. Now, let's say that one node is not accessible during this 
> process and it comes back up later. After it comes back up, a process 
> happens to see that the file does not exist on node 1, but does 
> exist on node 2. Should the file exist or not? I don't know if GlusterFS 
> even does this correctly - but if it does, the file should NOT exist. 
> There should be sufficient information, probably in the journal, to show 
> that the file was *removed*, and therefore, even if one node still has 
> the file, the journal tells us that the file was removed. The self-heal 
> operation should remove the file from the node that was down as soon as 
> the discrepancy is detected.
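
For illustration, here is a minimal sketch of that five-step sequence,
simulating two nodes in memory. Every name in it is invented for the
example - none of this is GlusterFS API:

    #include <stdio.h>
    #include <stdbool.h>

    struct node {
        bool up;            /* node reachable?                  */
        bool has_file;      /* data: does a.c exist here?       */
        bool intent_remove; /* journal: remove intent recorded? */
    };

    static void replicated_remove(struct node *n, int count)
    {
        int i;
        bool all_up = true;
        /* Step 1: lock the resource (elided in this sketch).      */
        /* Step 2: record the intent to remove on every live node. */
        for (i = 0; i < count; i++) {
            if (n[i].up)
                n[i].intent_remove = true;
            else
                all_up = false;
        }
        /* Step 3: remove the file on every live node.             */
        for (i = 0; i < count; i++)
            if (n[i].up)
                n[i].has_file = false;
        /* Step 4: clear the intent ONLY if every node took part;
         * otherwise the surviving journal keeps the entry so that
         * self-heal knows the remove must still be propagated.    */
        if (all_up)
            for (i = 0; i < count; i++)
                n[i].intent_remove = false;
        /* Step 5: unlock the resource (elided).                   */
    }

    /* Self-heal: a surviving remove intent wins over a stale copy. */
    static void self_heal(struct node *n, int count)
    {
        int i;
        bool removed = false;
        for (i = 0; i < count; i++)
            if (n[i].intent_remove)
                removed = true;
        for (i = 0; i < count && removed; i++) {
            n[i].has_file = false;
            n[i].intent_remove = false;
        }
    }

    int main(void)
    {
        /* node 1 (index 1) is down while a.c is removed */
        struct node n[2] = { { true, true, false }, { false, true, false } };
        replicated_remove(n, 2);
        n[1].up = true;   /* node 1 comes back, still holding a.c */
        self_heal(n, 2);
        printf("a.c exists: node0=%d node1=%d\n",
               n[0].has_file, n[1].has_file);
        return 0;   /* prints: a.c exists: node0=0 node1=0 */
    }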

Only, we talked about OPEN and not REMOVE. Your example constructs a broken
replicate situation for a remove, and you correctly say that after the second
subvolume comes alive again the remove should be completed.
You may well ask the journal for that, and it tells you that the file on the
second should be removed. Now, if you enter the same situation with a locally
fed file on the secondary, I would simply suggest - since there is no journal
entry telling you to remove - that the file should be valid and _not_
removed, but replicated to node 1.
Since this decision can be taken based on the journal, both setups have a
valid answer. Still there is no race. The open in the first setup fails, the
open in the second setup succeeds. Nevertheless both open attempts need a
stat to check for the file's existence. The first stat finds that the file
should be gone; the second stat replicates the file to node 1 and the open
can succeed. And guess what: exactly that happens on glusterfs. If you stat a
file that is only available on the secondary node, it gets replicated.
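
That decision rule can be stated compactly. A sketch with invented names,
assuming the journal records an intent-to-remove entry as above:

    #include <stdbool.h>

    enum heal_action {
        HEAL_NOTHING,        /* replicas agree                            */
        HEAL_FINISH_REMOVE,  /* journal has the intent: finish the remove */
        HEAL_REPLICATE_TO_1  /* no journal entry: the copy is valid       */
    };

    /* Decide what to do when a stat finds the file on node 2 only. */
    static enum heal_action on_stat(bool on_node1, bool on_node2,
                                    bool journal_says_removed)
    {
        if (on_node1 == on_node2)
            return HEAL_NOTHING;       /* both have it, or neither */
        if (!on_node1 && on_node2)
            return journal_says_removed ? HEAL_FINISH_REMOVE
                                        : HEAL_REPLICATE_TO_1;
        return HEAL_NOTHING;  /* only on node 1: the opposite direction,
                                 not the case discussed here             */
    }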
 
> The point here, is that the journal SHOULD be consulted.

You omitted the most important word: "too". The journal should be consulted
too. Nevertheless it cannot be the only basis for the decision.

> If you think 
> otherwise, I think you are not looking for a reliable replication 
> cluster that implements POSIX guarantees.
> 
> I think GlusterFS doesn't provide all of these guarantees as well as it 
> should, but I have not done the full testing to expose how correct or 
> incorrect it is in various cases. As it is, I just received a problem 
> where a Java program trying to use file locking failed in a GlusterFS 
> mount point, but succeeded in /var/tmp, so although I still think 
> GlusterFS has potential - I'm slowly backing down on what production 
> data I am willing to store in it. It's unfortunate that this solution 
> space seems so immature. I'm still switching back and forth between 
> wondering if I should push / help GlusterFS into solving all of the 
> problems, or just write my own solution.
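
A quick way to narrow such a locking failure down is a minimal fcntl(2)
test run against both paths - Java's FileChannel locks presumably map to
POSIX record locks on Linux. A sketch (the default path is a placeholder):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/var/tmp/locktest";
        struct flock fl = { 0 };
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }
        fl.l_type = F_WRLCK;        /* whole-file write lock (l_len = 0) */
        fl.l_whence = SEEK_SET;
        if (fcntl(fd, F_SETLK, &fl) < 0)
            perror("fcntl(F_SETLK)"); /* fails here on a broken mount */
        else
            puts("lock acquired");
        close(fd);
        return 0;
    }
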
> 
> My favourite solution is a mostly asynchronous master-master approach, 
> where each node can fall out of date relative to the other, as long as 
> they touch different data, but where changes that do touch the same 
> data become serialized. Unfortunately, this also requires the most clever 
> implementation strategy as well, and clever can take time or exceptional 
> talent.
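
A sketch of that idea, with invented names: disjoint paths need no
ordering at all, while conflicting changes get a deterministic total
order from a (logical counter, node id) pair:

    #include <stdio.h>
    #include <string.h>

    struct change {
        const char *path;
        unsigned long counter;  /* logical clock of the writing node */
        int node_id;            /* tie breaker                       */
    };

    /* Nonzero when a and b touch the same data and b must be applied
     * after a; changes to different paths need no ordering at all.  */
    static int serialized_after(const struct change *a,
                                const struct change *b)
    {
        if (strcmp(a->path, b->path) != 0)
            return 0;
        if (b->counter != a->counter)
            return b->counter > a->counter;
        return b->node_id > a->node_id;
    }

    int main(void)
    {
        struct change a = { "/etc/motd", 7, 1 };
        struct change b = { "/etc/motd", 7, 2 };
        printf("b after a: %d\n", serialized_after(&a, &b)); /* 1 */
        return 0;
    }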
> 
> >>> Read again: I said "and not going over glusterfs for some unknown reason."
> >>> "unknown reason" means that I can think of some for myself but tend to
> >>> believe there may be lots of others. My personal reason no. 1 is the
> >>> soft migration situation.
> >>>        
> >> See my comment about writing a program to set up the xattr metadata for you
> >>      
> > How about using the code that is there - inside glusterfsd.
> > It must be there, else you would not be able to mount an already populated
> > backend for the first time. Did you try? I did.
> >    
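
A sketch of that xattr-stamping idea: walk a pre-populated backend and
stamp each file before the first mount. The xattr key and the zeroed
12-byte value are placeholders - the real AFR changelog keys are
volume-specific - and writing trusted.* xattrs requires root:

    /* Linux only: setxattr(2). */
    #include <stdio.h>
    #include <sys/xattr.h>

    static int stamp(const char *path)
    {
        char zeroes[12] = { 0 };  /* "nothing pending" placeholder value */
        if (setxattr(path, "trusted.afr.example-client-0",
                     zeroes, sizeof(zeroes), 0) != 0) {
            perror(path);
            return -1;
        }
        return 0;
    }

    int main(int argc, char **argv)
    {
        int i, rc = 0;
        for (i = 1; i < argc; i++)
            rc |= stamp(argv[i]);
        return rc;   /* e.g.: find /backend -type f -print0 | xargs -0 ./stamp */
    }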
> 
> 
> This could mean that GlusterFS is too lax with regard to consistency 
> guarantees. If files can appear in the background, and magically be 
> shown - this indicates that GlusterFS is not enforcing use through the 
> mount point, which introduces the potential for inconsistent or faulty 
> results.

If that were true, NFS would never have worked.

> You are asking for it to guess what you want, without seeing 
> that what you are asking for is incompatible with provisions for any 
> guarantee of a consistent view.

No, read above. There is no guessing involved. You can take every decision
based on the data provided in the situation. Of course you have to look at
all the data provided, not only half of it.

> That "it works" is actually more a concern to me than a justification of 
> your position. To me it says it's one more potential problem that I might 
> hit in the future. A file that should be removed magically re-appears - 
> how is this a good thing?

Read above: a clear decision is possible for your example as to whether the
file should be gone or not. There is not the slightest hint that your example
has any chance to fail. If you want to make the question really interesting,
try to modify the situation to a failing node 1 and ask where the journal
really is located (and the locks, btw). Your example contains no hint where
the journal lives or how it is designed to be redundant. If there is a race
at all, it is likely in the journal creation. If the journal lies only on
node 1, it is obvious that every situation bringing node 1 down leaves an
inconsistency in the journal. Even if the journal lies on several nodes, you
can construct inconsistencies through an intelligently chosen breakdown of
selected nodes. That should tell you that a journal is a hint, not more - and
certainly not qualified as the single basis for any trusted decision.
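
To make the "hint, not more" point concrete, a sketch with invented names:
an answer read from a minority of journal replicas may miss newer entries
hidden by exactly the node failures under discussion, so it can only be a
hint that the backend stat still has to confirm:

    enum journal_verdict {
        J_UNKNOWN,    /* no journal replica reachable              */
        J_HINT_ONLY,  /* minority read: usable, but verify against
                         the backend before deciding               */
        J_TRUSTED     /* majority read: safe basis for a decision  */
    };

    static enum journal_verdict consult_journal(int replicas_read,
                                                int replicas_total)
    {
        if (replicas_read == 0)
            return J_UNKNOWN;
        if (2 * replicas_read > replicas_total)
            return J_TRUSTED;
        return J_HINT_ONLY;
    }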

> Cheers,
> mark

-- 
Regards,
Stephan




