[Gluster-devel] solutions for split brain situation

Tue Sep 15 04:15:43 UTC 2009

> I have problems in understanding exactly the heart of your question. Since the
>  main reason for rsyncing the data was to take a backup of the primary server,
>  so that self heal would have less to do it is obvious (to me) that it has been
>  a subvolume of the replicate. In fact is was a backup of the _only_ subvolume
>  (remember we configured a replicate with two servers, where one of them was
>  actually not there until we offline fed it with the active servers' data and
>  then tried to switch it online in glusterfs.
>  Which means: of course the "old backup" was a subvolume and of course it was
>  populated with some data when the other subvolume was down. Let me again
>  describe step by step:
>
>  1) setup design: one client, two servers with replicate subvolumes
>  2) switch on server 1 and client
>  3) copy data from client to glusterfs exported server dir
>  4) rsync exported server dir from server 1 to server 2 (server 2 is not
>  running glusterfsd at that time)
>  5) copy some more data from client to glusterfs exported server dir (server 2
>  still offline). "more data" means files with same names as in step 3) but
>  other content.
>  6) bring server 2 with partially old data online by starting glusterfsd
>  7) client sees server2 for the first time
>  8) read in the "more data" from step 5 and therefore get split brain error
>  messages in clients log
>  9) write back the "more data" again and then watch the content =>
>  10) based on where the data was read (server1 or server2) from glusterfs
>  client the file content is from step 5) (server 1 was read) or step 3) (server
>  2 was read).
>
>  result:the data from step 3 is outdated because glusterfs failed to notice
>  that the same files (filenames) existing on server 1 and 2 are indeed new
>  (server1) and old (server2) and therefore _only_ files from server1 should
>  have been favorite copies. glusterfs could have notices this by simply
>  comparing the file copies mtimes. But it thinks this was split brain - which it
>  was not, it was simply a secondary server being brought up for the very first
>  time with some backup'ed fileset - and damages the data by distributing the
>  reads between server 1 and 2.

I have a few questions, three specifically. Your log clearly contains
split-brain messages, and in the paragraph above you give your own
analysis of what is happening internally. I need some more answers
from you before I can understand exactly what is happening.

0. Users are not supposed to read/write from the backend directly.
Using rsync the way you have is not a supported mode of operation. The
reason is that both the presence and absence of glusterfs' extended
attributes influence the working of replicate. The absence of extended
attributes on one subvol is acceptable only in certain cases. By
filling up the backup export directories (while other subvolume data
was generated by glusterfs) you might be unknowingly hand-crafting a
split brain situation. All backups should be done via the mountpoint.
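
To make point 0 concrete, here is a minimal sketch for looking at the
extended attributes on a backend file, for inspection only. It assumes
a Linux host, root privileges (the trusted.* namespace is hidden from
ordinary users), and that replicate keeps its bookkeeping under a
"trusted.afr." prefix. The prefix and both helper names are
assumptions made for this illustration, so please check them against
your own backend before relying on the output.

```python
import os

# Assumed prefix for replicate's bookkeeping attributes; verify on your backend.
AFR_PREFIX = "trusted.afr."


def replicate_attrs(names):
    """Return only the attribute names that carry the assumed replicate prefix."""
    return sorted(n for n in names if n.startswith(AFR_PREFIX))


def inspect_backend_file(path):
    """Map each replicate-looking xattr on `path` to its value as a hex string.

    Linux only. Without root, trusted.* attributes are not listed at all,
    so an empty result does not prove the attributes are absent.
    """
    names = os.listxattr(path)
    return {n: os.getxattr(path, n).hex() for n in replicate_attrs(names)}
```

Running inspect_backend_file() on the same file in both export
directories would show whether one copy carries attributes that the
other lacks, which is exactly the situation an out-of-band copy can
create.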

1. Are you sure the steps you describe above were all that was done?
Was the original data (which was getting copied onto the mountpoint
while the second server was down) itself one of the backends of a
previous glusterfs setup?

2. Do the log entries in your logfile correspond directly to the
files which you claim were wrongly healed/misread? I ask this for two
important reasons:

2a. When a file is diagnosed as split-brained, open() is forced to
fail explicitly. There is no way you could have read data from either
the wrong server or the right server. But you seem to have been able
to read data (apparently) from the wrong server.

2b. You describe that glusterfs "healed it wrongly" by not considering
the mtime. The very meaning of split brain is that glusterfs could
"NOT" heal. It neither healed it the right way, nor the wrong way. It
could not even figure out which of them was right or wrong and just
gave up.
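
The distinction in 2a and 2b can be sketched as a small decision
table. This is a deliberately simplified model, not the actual
replicate code: it assumes each copy only records how many operations
it believes are still pending on the other copy, and the function and
parameter names are made up for this illustration.

```python
def classify(pending_on_b_seen_by_a, pending_on_a_seen_by_b):
    """Decide what a two-way replicate could do with one file.

    Each argument is a hypothetical counter: how many operations one copy
    believes the other copy has missed. A non-zero counter means "I accuse
    the other copy of being stale".
    """
    a_accuses_b = pending_on_b_seen_by_a > 0
    b_accuses_a = pending_on_a_seen_by_b > 0

    if a_accuses_b and b_accuses_a:
        # Both copies claim to be newer: no safe source, so no heal at all.
        return "split-brain"
    if a_accuses_b:
        return "heal B from A"
    if b_accuses_a:
        return "heal A from B"
    return "in sync"
```

In this model classify(2, 0) gives "heal B from A", while
classify(2, 1) gives "split-brain": once both sides accuse each other
there is no heal in either direction, which is why "healed wrongly"
and "split brain" cannot both describe the same file.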

As you can see, there seems to be an inconsistency in what you
describe. Can you please reproduce it with a minimal set of steps and
a fresh data set (without possibly stale extended attributes taken
directly off a previous glusterfs backend) and tell us if the problem
still persists?
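
While reproducing, a small probe like the following could record
whether open() on the mountpoint really succeeds for the files named
in the log. It is only a sketch: probe_open is a hypothetical helper,
and the expectation that a split-brained file shows up as "EIO" is my
assumption about how the forced failure surfaces, so treat any other
error code as equally worth reporting.

```python
import errno
import os


def probe_open(path):
    """Try a read-only open and return "ok" or the symbolic errno name."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # Translate e.g. 5 -> "EIO"; fall back to the raw number if unknown.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"
```

Running it against the mountpoint paths (never the backend
directories) for each file mentioned in the client log, and sending
the results together with the log, would let us match the two
directly.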

Avati