[Gluster-devel] solutions for split brain situation

Tue Sep 15 10:16:24 UTC 2009

On Tue, 15 Sep 2009 09:45:43 +0530
Anand Avati <avati at gluster.com> wrote:

> > I have trouble understanding the heart of your question. Since the
> >  main reason for rsyncing the data was to take a backup of the primary server,
> >  so that self heal would have less to do, it is obvious (to me) that it has been
> >  a subvolume of the replicate. In fact it was a backup of the _only_ subvolume
> >  (remember we configured a replicate with two servers, where one of them was
> >  actually not there until we offline fed it with the active server's data and
> >  then tried to switch it online in glusterfs).
> >  Which means: of course the "old backup" was a subvolume and of course it was
> >  populated with some data when the other subvolume was down. Let me again
> >  describe step by step:
> >
> >  1) setup design: one client, two servers with replicate subvolumes
> >  2) switch on server 1 and client
> >  3) copy data from client to glusterfs exported server dir
> >  4) rsync exported server dir from server 1 to server 2 (server 2 is not
> >  running glusterfsd at that time)
> >  5) copy some more data from client to glusterfs exported server dir (server 2
> >  still offline). "more data" means files with same names as in step 3) but
> >  other content.
> >  6) bring server 2 with partially old data online by starting glusterfsd
> >  7) client sees server2 for the first time
> >  8) read in the "more data" from step 5 and therefore get split brain error
> >  messages in clients log
> >  9) write back the "more data" again and then watch the content =>
> >  10) depending on which server the glusterfs client read the data from, the
> >  file content is from step 5) (server 1 was read) or step 3) (server
> >  2 was read).
> >
> >  result: the data from step 3 is outdated because glusterfs failed to notice
> >  that the same files (filenames) existing on server 1 and 2 are indeed new
> >  (server1) and old (server2) and therefore _only_ files from server1 should
> >  have been favorite copies. glusterfs could have noticed this by simply
> >  comparing the file copies' mtimes. But it thinks this was split brain - which it
> >  was not, it was simply a secondary server being brought up for the very first
> >  time with some backup'ed fileset - and damages the data by distributing the
> >  reads between server 1 and 2.
> 
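To make the mtime comparison I have in mind concrete, here is a minimal Python sketch. It only illustrates the idea of picking the most recently modified copy; it is not what glusterfs does internally, and the function name `newest_copy` is made up for this example.

```python
import os

def newest_copy(paths):
    """Return the path whose modification time is the most recent.

    Hypothetical helper for illustration only: it compares plain mtimes of
    the given file copies and knows nothing about glusterfs' own metadata.
    """
    return max(paths, key=lambda p: os.stat(p).st_mtime)
```

Called with the two backend copies of one file (one path per server export dir), it would return the copy from server 1 in the scenario above, provided the rsync preserved the older mtime on server 2.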
> 
> There are a few questions in my mind, three specifically. There are
> split-brain log messages in your log clearly. And there is the
> analysis you have done in the above paragraph trying to explain what
> is happening internally. I need some more answers from you before I
> understand exactly what is happening.
> 
> 0. Users are not supposed to read/write from the backend directly.
> Using rsync the way you have is not a supported mode of operation. The
> reason is that both the presence and absence of glusterfs' extended
> attributes influence the working of replicate. The absence of extended
> attributes on one subvol is acceptable only in certain cases. By
> filling up the backup export directories (while other subvolume data
> was generated by glusterfs) you might be unknowingly hand-crafting a
> split brain situation. All backups should be done via the mountpoint.
> 
> 1. Are you sure the steps you describe above were all that was done?
> Was the original data (which was getting copied onto the mountpoint
> while the second server was down) itself one of the backends of a
> previous glusterfs setup?
> 
> 2. Do the log entries in your logfile correspond directly to those
> files which you claim to have been wrongly healed/misread? I ask
> this for two important reasons -
> 
> 2a. When a file is diagnosed to be split brain'ed, open() is forced to
> explicitly fail. There is no way you could have read data from either
> the wrong server or the right server. But you seem to have been able
> to read data (apparently) from the wrong server.
> 
> 2b. You describe that glusterfs "healed it wrongly" by not considering
> the mtime. The very meaning of split brain is that glusterfs could
> "NOT" heal. It neither healed it the right way, nor the wrong way. It
> could not even figure out which of them was right or wrong and just
> gave up.
> 
> As you can see, there seems to be an inconsistency in what you describe.
> Can you please reproduce it with minimal steps and a fresh data set
> (without possible stale extended attributes taken directly off a
> previous glusterfs backend) and tell us if the problem still persists?
> 
> Avati

Hello Avati,

please allow me to ask a general question regarding data feed for
glusterfs-exported files: Is it valid to bring new files to a subvolume over
the local fs and not the glusterfs mounting client?
The question targets the very start of a glusterfs setup. Does a user have to
start with completely empty subvolumes (i.e. all of them), or can he (just like
with nfs) simply export an already existing bunch of files and dirs?
Obviously we are only talking about replicate setups for the moment.
Does glusterfs handle a situation correctly where some new file simply shows
up on the first subvolume (maybe because the user copied it on the local
server's fs)? Does it need extended attribs set somehow from the very
beginning, or can it simply accept a file that has currently none set and just
use it? Keep in mind that the user expects glusterfs to work in a
straightforward way, just like nfs. If you have to begin with exported subvolumes
all being empty you will have a lot more trouble migrating to glusterfs.
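
To rule out stale extended attribs on my side, something like the following Python sketch could be run on a backend export dir before bringing the server online. The `trusted.afr` prefix is my assumption about how replicate names its attributes, and the function names are made up; `os.listxattr` is Linux-only and normally shows `trusted.*` names only to root.

```python
import os

# Assumption: replicate's bookkeeping attributes start with this prefix.
# Check this against the glusterfs version actually in use.
AFR_PREFIX = "trusted.afr"

def replicate_attrs(names, prefix=AFR_PREFIX):
    """Keep only the xattr names that look like they belong to replicate."""
    return [n for n in names if n.startswith(prefix)]

def scan_export(export_dir):
    """Map every file under export_dir to its replicate-looking xattr names."""
    found = {}
    for root, _dirs, files in os.walk(export_dir):
        for name in files:
            path = os.path.join(root, name)
            # Linux-only call; run as root to see trusted.* attributes.
            found[path] = replicate_attrs(os.listxattr(path))
    return found
```

A file that maps to an empty list carries no such attribute, which is the case my question is about.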

-- 
Regards,
Stephan