[Gluster-devel] solutions for split brain situation

Fri Sep 18 03:52:05 UTC 2009

2009/9/18 Mark Mielke <mark at mark.mielke.cc>

> On 09/17/2009 06:47 PM, Stephan von Krawczynski wrote:
>
>> Way above in this discussion I said that we only talk about the
>> first/primary subvolume/backend for simplicity. It makes no sense to check
>> a journal if I can stat the real file, which I have to do anyway if an
>> open/create arrives - and we are talking exactly about that. So please
>> explain where is your assumed race? Really only a braindead implementation
>> can race on an open. You can delay a flush on close (like writebehind), but
>> you can obviously not delay an open, neither r, rw nor create, because you
>> have to know if the file is a) existing and b) can be created if not. As
>> long as you don't touch the backend you will not find out if a create may
>> fail for disk-full or the like. It may as well fail because of
>> access-privileges. Whatever it is, you will not find a trusted answer
>> without asking the backend, no journal will save you.
>>
>>
>
> Like most backend storages, the backend storage includes the data pages,
> the metadata, AND the journal. "Without asking the backend" and "no journal
> will save you" ignore the fact that the backend *includes* the
> journal.
>
> A scenario which should make this clear: Let's say the file a.c is removed
> from a 2-node replication cluster. Something like the following should
> occur: Step 1 is to lock the resource. Step 2 is to record the intent to
> remove on each node. Step 3 is to remove on each node. Step 4 is to clear
> the intent from each node. Step 5 is to unlock the resource. Now, let's say
> that one node is not accessible during this process and it comes back up
> later. After it comes back up, suppose a process happens to see that the file
> does not exist on node 1, but does exist on node 2. Should the file exist or
> not? I don't know if GlusterFS even does this correctly - but if it does,
> the file should NOT exist. There should be sufficient information, probably
> in the journal, to show that the file was *removed*, and therefore, even if
> one node still has the file, the journal tells us that the file was removed.
> The self-heal operation should remove the file from the node that was down
> as soon as the discrepancy is detected.
>
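To make the five steps above concrete, here is a rough Python sketch of the
scenario as I read it. It is a toy in-memory model, not GlusterFS code: Node,
replicated_remove, self_heal and exists are names I made up for illustration,
locking (steps 1 and 5) is left out, and I am assuming the intent is only
cleared once every node has taken part in the removal.

```python
class Node:
    """Toy replica: a set of file names plus a journal of pending removals."""

    def __init__(self, name):
        self.name = name
        self.files = set()
        self.intents = set()  # paths with a recorded, not yet cleared, remove intent
        self.up = True


def replicated_remove(nodes, path):
    """Steps 2-4 of the scenario; the lock/unlock steps are omitted."""
    live = [n for n in nodes if n.up]
    for n in live:  # step 2: record the intent to remove
        n.intents.add(path)
    for n in live:  # step 3: remove
        n.files.discard(path)
    if len(live) == len(nodes):  # step 4: clear only if nobody was missing
        for n in live:
            n.intents.discard(path)


def exists(nodes, path):
    """A pending remove intent on any reachable node means the file is gone."""
    live = [n for n in nodes if n.up]
    if any(path in n.intents for n in live):
        return False
    return any(path in n.files for n in live)


def self_heal(nodes, path):
    """Replay a pending removal onto nodes that missed it."""
    live = [n for n in nodes if n.up]
    if any(path in n.intents for n in live):
        for n in live:
            n.files.discard(path)
        if len(live) == len(nodes):
            for n in live:
                n.intents.discard(path)
```

With node 2 down during the remove, node 1 keeps the intent, so exists()
answers "no" as soon as node 2 returns, and self_heal() then drops the stale
copy. Without the intent, the stale copy on node 2 would be indistinguishable
from a freshly created file.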
Correct me if I am wrong, but GlusterFS uses extended attributes on the
directory to note if direct children of the directory have been updated. For
instance, if you remove a file and one node is down, self-heal will find
that the last directory change on the down node is older than that of the
other nodes, bringing any create/unlink operations into line with the other
nodes.
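
A minimal sketch of what I mean, in Python. This is my guess at the idea, not
the actual GlusterFS algorithm: I am assuming each replica of a directory
carries a single change counter (called "version" here, a made-up name) that
stands in for whatever the extended attributes really record, and
heal_directory and SplitBrain are hypothetical names.

```python
class SplitBrain(Exception):
    """Raised when replicas disagree but none is provably newer."""


def heal_directory(replicas):
    """Bring every replica's entry list into line with the newest one.

    replicas: list of dicts with 'version' (int change counter) and
    'entries' (set of child names).
    """
    best = max(r["version"] for r in replicas)
    newest = [r for r in replicas if r["version"] == best]
    source = newest[0]
    # Same counter but different contents: nothing tells us who is right.
    if any(r["entries"] != source["entries"] for r in newest):
        raise SplitBrain("replicas at the same version have different entries")
    for r in replicas:
        if r["version"] < best:
            r["entries"] = set(source["entries"])  # replays creates and unlinks
            r["version"] = best
    return source
```

A node that was down during an unlink comes back with the older counter, so
the file it still holds is removed rather than copied back to the others. The
tie case is the split brain this thread is about, and the sketch just refuses
to guess.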


> The point here, is that the journal SHOULD be consulted. If you think
> otherwise, I think you are not looking for a reliable replication cluster
> that implements POSIX guarantees.
>
> I think GlusterFS doesn't provide all of these guarantees as well as it
> should, but I have not done the full testing to expose how correct or
> incorrect it is in various cases. As it is, I just received a problem where
> a Java program trying to use file locking failed in a GlusterFS mount point,
> but succeeded in /var/tmp, so although I still think GlusterFS has
> potential - I'm slowly backing down from what production data I am willing
> to store in it. It's unfortunate that this solution space seems so immature.
> I'm still switching back and forth between wondering if I should push / help
> GlusterFS into solving all of the problems, or just write my own solution.
>
> My favourite solution is a mostly asynchronous master-master approach,
> where each node can fall out of date from the other, as long as they touch
> different data, but that changes that do touch the same data become
> serialized. Unfortunately, this also requires the most clever implementation
> strategy, and clever can take time or exceptional talent.
>
>>>> Read again: I said "and not going over glusterfs for some unknown reason."
>>>> "unknown reason" means that I can think of some for myself but tend to
>>>> believe
>>>> there may be lots of others. My personal reason nr 1 is the soft
>>>> migration
>>>> situation.
>>>>
>>>>
>>> See my comment about writing a program to set up the xattr metadata for
>>> you
>>>
>>>
>> How about using the code that is there - inside glusterfsd.
>> It must be there, else you would not be able to mount an already populated
>> backend for the first time. Did you try? I did.
>>
>>
>
>
> This could mean that GlusterFS is too lax with regard to consistency
> guarantees. If files can appear in the background, and magically be shown -
> this indicates that GlusterFS is not enforcing use through the mount point,
> which introduces the potential for inconsistent or faulty results. You are
> asking for it to guess what you want, without seeing that what you are
> asking for is incompatible with provisions for any guarantee of a consistent
> view. That "it works" is actually more concerning to me than a justification
> of your position. To me it says it's one more potential problem that I might
> hit in the future. A file that should be removed magically re-appears - how
> is this a good thing?
>
> Cheers,
> mark
>

I guess the last question is a good one for the developers. If the required
extended attributes do not exist on the backend, should the
files/directories (excluding the root directory) show up in a stat() call?
That may be a blessing or a curse for new users, especially since this thread
has been going on about automatic creation of extended attributes for
pre-existing files in the backend.