[Gluster-devel] solutions for split brain situation

Fri Sep 18 04:47:59 UTC 2009

> Correct me if I am wrong, but GlusterFS uses extended attributes on the
> directory to note if direct children of the directory have been updated. For
> instance, if you remove a file and one node is down, self-heal will find
> that the last directory change on the down node is older than that of the
> other nodes, bringing any create/unlink operations into line with the other
> nodes.

That is correct. This is exactly how things happen today.
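
As a rough illustration of the idea in the quoted paragraph, here is a minimal Python sketch. It assumes a simplified model in which each replica keeps, for a directory, a counter of the entry operations it completed while another replica was unreachable. The function name `find_stale_replicas` and the dict layout are hypothetical and are not the replicate module's actual code or on-disk xattr format.

```python
def find_stale_replicas(pending):
    """Return the replicas that some other replica accuses of having
    missed directory operations (create/unlink).

    `pending[a][b]` is a hypothetical counter: the number of entry
    operations replica `a` completed while replica `b` was unreachable.
    """
    stale = set()
    for accuser, counters in pending.items():
        for accused, count in counters.items():
            if accused != accuser and count > 0:
                stale.add(accused)
    return stale


# server2 was down while a file was unlinked through the mount point,
# so server1 recorded one pending entry operation against it.
pending = {
    "server1": {"server1": 0, "server2": 1},
    "server2": {"server1": 0, "server2": 0},
}
stale = find_stale_replicas(pending)
sources = set(pending) - stale  # replicas nobody accuses
print("stale:", sorted(stale), "sources:", sorted(sources))
```

In this model, a replica that nobody accuses can act as the source for bringing the directory on the stale replica into line. If both replicas accuse each other, no source is left, which is the split-brain case in the subject line.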

> This could mean that GlusterFS is too lax with regard to consistency
> guarantees. If files can appear in the background, and magically be shown -
> this indicates that GlusterFS is not enforcing use through the mount point,
> which introduces the potential for inconsistent or faulty results. You are
> asking for it to guess what you want, without seeing that what you are
> asking for is incompatible with provisions for any guarantee of a consistent
> view. That "it works" is actually more concerning to me that justifying over
> your position. To me it says it's one more potential problem that I might
> hit in the future. A file that should be removed magically re-appears - how
> is this a good thing?
>
> I guess the last question is a good one for the developers. If the required
> extended attributes do not exist on the backend, should the
> files/directories (excluding the root directory) show in a stat() call? That
> may be a blessing or curse for new users, especially when this post has been
> going on about automatic creation of extended attributes for pre-existing
> files in the backend.

We look at the situation from a different point of view. When a file
appears in the backend without extended attributes, the benefit of
the doubt is given to the user: we assume the file was previously
created from the mountpoint and that its extended attributes were lost
in an fsck, because we ourselves have seen some filesystems simply
prune extended attributes when running fsck. In fact, we have seen
exactly this situation, where power to an entire rack tripped and both
servers ended up with files without extended attributes.

One way to look at this is that the replicate module is being very lax
about things without considering various scenarios. On the other hand,
we have carefully analysed various scenarios of network outages,
server reboots and disk fsck results, with the transaction aborted at
various stages, and arrived at the decisions behind what self-heal
does today. Every behavior of self-heal (including healing a file
which got written to the backend directly) is intentional, but the
rationale might be different from supporting soft-migration or adding
files into the backend directly. Adding files directly to the backend
is basically misusing self-heal. If you feel any of the self-heal
behavior is not the way it should be, we welcome you to bring it up
for discussion.

The self-healing approach follows a best-effort strategy of repairing
things wherever it can. Whenever in doubt about a decision which could
result in data loss, it takes the conservative approach of preserving
data. For example, when one of your disks is fsck'ed and some file is
orphan-inode'd and disappears from the disk mount, this is equivalent
to rm'ing the file directly from the backend with no trace in the
parent directory xattrs. The next time the file is stat()'ed, the
parent directory journals say everything is consistent, but the file
exists on only some of the servers. Should the file be deleted or
recreated? Whenever any such doubt exists (not just in this specific
case, but any kind of doubt), glusterfs follows the conservative
approach of healing the content back onto all the servers. Self-heal
deletions happen only when the extended attributes of the directory
unambiguously show that the file was supposed to be deleted.
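
The rule described above can be sketched roughly as follows. This is a simplified model of the decision only, not the actual self-heal code: the names `Action` and `decide_heal`, and the boolean `unlink_recorded` standing in for "the directory xattrs unambiguously show a deletion", are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    NOTHING = "nothing to heal"
    RECREATE = "recreate the file on the servers where it is missing"
    DELETE = "delete the file from the servers where it still exists"


def decide_heal(present, unlink_recorded):
    """Decide what to do with a file seen during lookup.

    present:         dict mapping server name -> True if the file exists
                     on that server's backend.
    unlink_recorded: True only if the parent directory's extended
                     attributes unambiguously record that the file was
                     deleted (hypothetical flag for this sketch).
    """
    if all(present.values()) or not any(present.values()):
        return Action.NOTHING   # servers agree, nothing to do
    if unlink_recorded:
        return Action.DELETE    # clear evidence of an rm
    return Action.RECREATE      # any doubt: preserve the data


# fsck orphaned the file on server2; the directory xattrs look consistent.
print(decide_heal({"server1": True, "server2": False}, unlink_recorded=False))

# rm was issued from the mount point while server2 was down.
print(decide_heal({"server1": False, "server2": True}, unlink_recorded=True))
```

Note that a file copied directly into one backend looks, in this model, exactly like the fsck case: it is present on one server, absent on the other, and nothing records a deletion.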

The side-effect of this 'conservative' approach is that glusterfs
appears to support soft-migration, although this was not an intended
feature. We neither QA nor document this "feature". If you understand
the implications completely after reading the source, and think that
this works for you, feel free to migrate like this. Just be aware that
this is an undocumented feature and any issues/races which might show
up in the process will be unsupported.

So if you come up with a scenario where a file which was supposed to
be gone gets recreated, then it is very likely that there is another
scenario which reaches the same backend state and in which the file
was supposed to exist. In that case glusterfs self-heal takes the
conservative approach of not destroying data when there is any
doubt. In simple cases where rm happens while a server is down,
self-heal does indeed delete the file when the down server comes
back.

Avati