[Gluster-devel] solutions for split brain situation

Thu Sep 17 14:36:47 UTC 2009

On Wed, 16 Sep 2009 16:04:18 +0100
Gordan Bobic <gordan at bobich.net> wrote:

> 
> > Let's make a trivial setup, lots of data for webservers and some ftp servers
> > for feeding in and deleting old. The first thing in sight: compared to the
> > reads there are very few writes, mostly sequential logfiles. And another
> > thing: most of the data does not get read nor written the whole day long.
> > This is a pretty common example I would say. Since really very few changes are
> > going on compared to the total amount of stored data you may call the
> > situation pseudo-static.
> > What would you expect in that setup? Let's say the bad boys (ftp servers) are
> > local feeds and not going over glusterfs for some unknown reason.
> > What do they really do to the data? They delete (the data is gone afterwards,
> > so there is no problem at all), they write new files. It should be very simple
> > for glusterfs to detect a locally fed new file, because it has no xattribs at
> > all (assuming every glusterfs-fed file has some (*)). So basically all you
> > have to do is try to write-lock the file on the backend store, create its
> > default xattribs, unlock and do a stat for self-healing the other subvolumes -
> > let's call such a thing "import".
> > Does that really sound unsolvable? (For simplicity we assume such local feeds
> > only on the first subvolume, and the cluster being replicate)
> 
> 1) There is a race condition in what you describe. Since you mentioned 30 years in development, I assume you know what that means. Consider this:
> You are "locally feeding" file "x" on server1. During this, the same file gets created via the mountpoint on server2. What would you expect to happen, in a fs that aims for full posix compliance on atomic operations?

Sorry, I took some simple logic for granted: if I create a file on a fs and
the fs lets me do that, I am fine. If glusterfsd creates a file on a fs
and the fs tells it that is ok, it should be fine too. We cannot collide at
the same time, because there is no "same time" for fs requests. The first
request will win.
My hope is that glusterfs is not "creating" files on the client side without
having checked the backend storage for their existence or absence.
Is that assumption incorrect?
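
To illustrate what I mean by "first request will win", here is a minimal
sketch in Python. It only shows how a local POSIX filesystem serializes two
competing creates of the same name; it says nothing about what glusterfs
itself does internally. The helper name create_exclusive and the file name
"x" are made up for the example.

```python
import os
import tempfile


def create_exclusive(path):
    """Return True if this call created the file, False if it already existed."""
    try:
        # O_CREAT | O_EXCL makes "check for absence" and "create" one atomic step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


with tempfile.TemporaryDirectory() as backend:
    path = os.path.join(backend, "x")
    first = create_exclusive(path)   # e.g. the local feed
    second = create_exclusive(path)  # e.g. a create arriving via glusterfsd
    print(first, second)  # prints: True False
```

Whoever reaches the backend fs first gets the file; the second request is
told that it already exists and has to deal with that.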

> 2) The example you give doesn't, in any way, provide justification for not copying the file in via the mountpoint in the first place.

Read again: I said "and not going over glusterfs for some unknown reason."
"Unknown reason" means that I can think of some myself but tend to believe
there may be lots of others. My personal reason no. 1 is the soft migration
situation. If you do not want to run into downtime you cannot copy large
amounts of data that are continuously changing. This means there is a period
during which both the old and the new server exports must work concurrently
on the backend storage without negative interaction.

> > (*) IF not every glusterfs file has xattribs then "import" is even simpler and
> > can be done by just stat'ing. In that case it sounds like it would happen
> > pretty much automagically on the first touch of the new file over the
> > glusterfs mountpoint.
> 
> Not quite - you are forgetting the directory metadata, which is necessary to keep track of created/deleted files.

I am really not familiar with what glusterfs does with the metadata,
but I do assume that there cannot be a real difference between creating a
default xattr set for files and for dirs. If a directory is on the storage and
glusterfs thinks it is unknown (i.e. a local feed), why should it then be unable
to import it just like standard files?
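
As a rough sketch of the detection step, treating files and directories
alike: walk the backend store and report every entry that carries no
glusterfs bookkeeping xattr. The "trusted." prefix is my assumption about
how those attributes are named, and needs_import and find_import_candidates
are hypothetical helper names, not anything glusterfs provides. Reading
trusted.* attributes normally needs root, and os.listxattr is Linux-only,
so the lister can be swapped out.

```python
import os

# Assumption: glusterfs bookkeeping xattrs on the backend share this prefix.
GLUSTER_PREFIX = "trusted."


def needs_import(path, list_xattrs=None, prefix=GLUSTER_PREFIX):
    """True if path has no xattr starting with prefix (looks locally fed)."""
    if list_xattrs is None:
        list_xattrs = os.listxattr  # Linux-only; trusted.* needs root to be seen
    return not any(name.startswith(prefix) for name in list_xattrs(path))


def find_import_candidates(root, list_xattrs=None):
    """Return files AND directories below root that lack bookkeeping xattrs."""
    candidates = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Directories are checked exactly like regular files.
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if needs_import(full, list_xattrs):
                candidates.append(full)
    return sorted(candidates)
```

Running find_import_candidates on the export directory as root would list
what an "import" pass had to touch; the import itself (creating the default
xattrs) is deliberately left out, because I do not know what defaults
glusterfs would want there.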

> > Another story: the backup
> > I am pretty astonished that you all talk about backing up the xattribs. But
> > according to your own clean philosophy there should be no problem for backups
> > without xattribs as long as they are read in from the glusterfs mountpoint.
> 
> Yes, so far nothing astonishing there - you need either the snapshot of backing store incl. xattrs, or the mount point sourced data.
> 
> > Since other applications do not honor the xattribs either that can only mean
> > that a backup must be a complete snapshot without them.
> 
> No more than in any other setting, if you are reading from the mount point. From the backend incl. xattrs, it should be a snapshot to ensure consistent metadata state.
> 
> > Backup with xattribs in this sense can only be useful at all if read local
> > from the backend store to be able to recover that backend later on - including
> > the information hidden in the xattribs. But since you would not want to deal
> > with local data at all this should be no backup method at all.
> 
> You are extrapolating, and incorrectly. This scenario (backup of a snapshot including xattrs) would work fine. It is equivalent to a server rejoining the cluster after an outage.

Did the glusterfs team state that somewhere? Just to make sure that your point
is a point. Is it a valid approach to back up a subvolume with xattribs by
local access? Remember, the question was not "does it work?", the question is
"is it supported by design?".
I can tell you for sure that feeding local files does work with glusterfs
though the team says that this is definitely not supported, which means it
is not supported by design. My goal is academic: I would like to argue for
"support by design".

> > Even from my bad boy position I would not back up xattribs via local feed. The
> > reason for me lies in restore. If I local-restore a file without xattribs
> > I give glusterfs a realistic chance to notice that this is a locally fed file and
> > should probably be handled like discussed above ("import").
> 
> You missed the point somewhere. If you are backing up the snapshot of the backing store you SHOULD backup/restore the xattrs. The important thing is for data and metadata to be in a consistent state. As long as that is the case, the files will self-heal correctly when the restored server rejoins the cluster.

See above: you can only argue it is consistent if the team states that the
local backup with xattribs is supported by design of glusterfs. Else you are
just like me saying "it works (for me)".

> > But if I
> > local-restore a file with xattribs it is likely that these contain a currently
> > invalid state.
> 
> Sure, thus you have to either snapshot before the backup, or better, unmount the server process on the server you are using for backing up.
> 
> > My guess is that this will harm glusterfs more than not having
> > xattribs for the file at all because there is possibly no good way to find out
> > the invalid state.
> 
> Sounds like the problem is that you are expecting (hoping for?) correct results when following incorrect procedures. If you stick with the approaches I outlined above for the use-cases you mentioned, it will do what you want.
> 
> Gordan

It won't allow soft migration (from the design point of view) though in the
real world it does accept local feeds. Everybody can try that: take two servers,
feed one (primary) with data of your choice, then start glusterfsd on both,
mount it on some client as replicate and do an ls. You will see self-healing
going on and the data being replicated to the second server.
This means glusterfs was able to read the locally fed files and replicate them
as well as access them without them having xattribs at mount time. So the
basic functionality is there per se. All I try to achieve is the acceptance
that this is in fact a feature that is worth including in the global design.
I can accept that the corner cases of the question are omitted. I would not
expect that two applications - one glusterfs based, the other local fs based -
can access the same file at the same time for writing.
-- 
Regards,
Stephan