[Gluster-devel] solutions for split brain situation
Mark Mielke
mark at mark.mielke.cc
Fri Sep 18 00:27:13 UTC 2009
On 09/17/2009 06:47 PM, Stephan von Krawczynski wrote:
> Way above in this discussion I told that we only talk about the first/primary
> subvolume/backend for simplicity. It makes no sense to check a journal if I
> can stat the real file which I have to do anyway if an open/create arrives -
> and we are talking exactly about that. So please explain where is your assumed
> race? Really only a braindead implementation can race on an open. You can
> delay a flush on close (like writebehind), but you can obviously not delay an
> open neither r,rw nor create because you have to know if the file is a)
> existing and b) can be created if not. As long as you don't touch the backend
> you will not find out if a create may fail for disk-full or the like. It may
> as well fail because of access-privileges. whatever it is, you will not find a
> trusted answer without asking the backend, no journal will save you.
>
Like most backend storage, the backend includes the data pages, the
metadata, AND the journal. "Without asking the backend" and "no
journal will save you" miss the point that the backend *includes*
the journal.
A scenario which should make this clear: Let's say the file a.c is
removed from a 2-node replication cluster. Something like the
following should occur: Step 1 is to lock the resource. Step 2 is to
record the intent to remove on each node. Step 3 is to remove on each
node. Step 4 is to clear the intent from each node. Step 5 is to unlock
the resource. Now, let's say that one node is not accessible during this
process and it comes back up later. After it comes back up, a
process happens to see that the file does not exist on node 1, but does
exist on node 2. Should the file exist or not? I don't know if GlusterFS
even does this correctly - but if it does, the file should NOT exist.
There should be sufficient information, probably in the journal, to show
that the file was *removed*, and therefore, even if one node still has
the file, the journal tells us that the file was removed. The self-heal
operation should remove the file from the node that was down as soon as
the discrepancy is detected.
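The five steps above, and the self-heal that follows, can be sketched
roughly as below. This is only an illustration of the idea, not how
GlusterFS implements it: the Node class, the in-memory journal dict, the
single threading.Lock standing in for a cluster-wide lock, and the
function names replicated_remove and self_heal are all assumptions made
up for the example.

```python
import threading


class Node:
    """Hypothetical replica: a set of files plus a journal of intents."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.files = set()
        self.journal = {}  # path -> pending intent, e.g. "remove"


# Stand-in for a cluster-wide lock on the resource (an assumption).
_resource_lock = threading.Lock()


def replicated_remove(path, nodes):
    with _resource_lock:                      # step 1 (lock), step 5 (unlock)
        reachable = [n for n in nodes if n.up]
        for n in reachable:                   # step 2: record the intent
            n.journal[path] = "remove"
        for n in reachable:                   # step 3: remove the file
            n.files.discard(path)
        if len(reachable) == len(nodes):      # step 4: clear the intent,
            for n in reachable:               # but only if every node
                del n.journal[path]           # completed the remove
        # If a node was down, the intent stays in the surviving journals
        # as the record that the file was removed.


def self_heal(path, nodes):
    """Reconcile replicas; return True if the file should exist."""
    with _resource_lock:
        live = [n for n in nodes if n.up]
        if any(n.journal.get(path) == "remove" for n in live):
            # The journal says the file was removed, so a leftover copy
            # on a node that was down is stale and gets deleted.
            for n in live:
                n.files.discard(path)
            if len(live) == len(nodes):
                for n in live:
                    n.journal.pop(path, None)
            return False
        return any(path in n.files for n in live)


node1, node2 = Node("node1"), Node("node2")
for n in (node1, node2):
    n.files.add("a.c")

node2.up = False                     # node 2 is unreachable during the remove
replicated_remove("a.c", [node1, node2])
node2.up = True                      # it comes back still holding a.c
file_should_exist = self_heal("a.c", [node1, node2])
```

Without the journal entry left on node 1, self_heal would have no way
to tell "removed while node 2 was down" apart from "created on node 2
and not yet copied to node 1".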
The point here is that the journal SHOULD be consulted. If you think
otherwise, I think you are not looking for a reliable replication
cluster that implements POSIX guarantees.
I think GlusterFS doesn't provide all of these guarantees as well as it
should, but I have not done the full testing to expose how correct or
incorrect it is in various cases. As it is, I just received a problem
where a Java program trying to use file locking failed in a GlusterFS
mount point, but succeeded in /var/tmp, so although I still think
GlusterFS has potential, I'm slowly backing down from what production
data I am willing to store in it. It's unfortunate that this solution
space seems so immature. I'm still switching back and forth between
wondering if I should push / help GlusterFS into solving all of the
problems, or just write my own solution.
My favourite solution is a mostly asynchronous master-master approach,
where each node can fall out of date from the other as long as they
touch different data, but changes that do touch the same data
become serialized. Unfortunately, this also requires the most clever
implementation strategy, and clever can take time or exceptional
talent.
>>> Read again: I said "and not going over glusterfs for some unknown reason."
>>> "unknown reason" means that I can think of some for myself but tend to believe
>>> there may be lots of others. My personal reason nr 1 is the soft migration
>>> situation.
>>>
>> See my comment about writing a program to set up the xattr metadata for you
>>
> How about using the code that is there - inside glusterfsd.
> It must be there, else you would not be able to mount an already populated
> backend for the first time. Did you try? I did.
>
This could mean that GlusterFS is too lax with regard to consistency
guarantees. If files can appear in the background, and magically be
shown - this indicates that GlusterFS is not enforcing use through the
mount point, which introduces the potential for inconsistent or faulty
results. You are asking for it to guess what you want, without seeing
that what you are asking for is incompatible with provisions for any
guarantee of a consistent view. That "it works" is actually more
concerning to me than supportive of your position. To me it says it's
one more potential problem that I might hit in the future. A file that
should be removed magically re-appears - how is this a good thing?
Cheers,
mark
--
Mark Mielke<mark at mielke.cc>