[Gluster-devel] Some FAQs ...

Thu Apr 26 10:49:11 UTC 2007

> Hi A (which one is your first name???)

First name is Anand, which is ambiguous since there are two Anands
here :) I'm usually called avati while the other is called AB.

> 
> This is simply great!!! Not sure what to do about ~20GB of data some weeks 
> ago, I split them up to 3 7TB volumes (regular NFS now). This gives me the
> opportunity of easy expansion at low level, while the cluster nodes see
> the bigger image!
> 
> I'm pondering the idea to have two FSs now: one for regular operation, and
> one for backups. rsnapshot comes to mind, but needs "hard links" - does
> GluFS support them 

> (and BTW, what's the official acronym, since GFS is already
> taken and GlFS cannot be pronounced without twisting one's tongue?) ?

we just call it glusterfs. if you have a more creative short form,
please suggest one!

> If that works, I could slowly grow the backup FS (use one brick until it's
> nearly full, add another one ...)

yes, glusterfs supports hardlinks. but there is a small glitch: the
two hardlinks are listed with separate inode numbers. this will be
fixed in the 1.4 release. for now hardlinks work as long as the
application doesn't cross-verify the inode numbers.
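
To judge whether a given tool is affected, here is a minimal Python sketch of
the kind of cross-check meant above. It is not glusterfs code, and the names
`is_same_file` and `demo` are made up for illustration. Tools that detect or
preserve hard links typically compare the device and inode numbers reported
by stat(); a filesystem that reports different inode numbers for two links to
the same file makes such a comparison fail even though the data is shared.

```python
import os
import tempfile


def is_same_file(path_a, path_b):
    """Return True if both paths report the same device and inode number."""
    st_a = os.stat(path_a)
    st_b = os.stat(path_b)
    return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)


def demo():
    """Create a file, a hard link to it, and an independent copy, then compare."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original")
        link = os.path.join(tmp, "link")
        copy = os.path.join(tmp, "copy")
        with open(original, "w") as f:
            f.write("data\n")
        os.link(original, link)          # second name for the same inode
        with open(copy, "w") as f:       # separate file with equal content
            f.write("data\n")
        return (
            is_same_file(original, link),
            is_same_file(original, copy),
            os.stat(original).st_nlink,  # link count of the original
        )
```

Running `demo()` inside a directory on the filesystem under test (instead of
the default temporary directory) shows whether link and original are reported
as the same file there.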

> This brings up another wishlist item: it should be possible to migrate all data
> off one server (set server "read-only", then read every file and re-write to 
> another location)...

the snapshot translator seems to be what you are looking for.
unfortunately there is currently no design document for it.

> > > - What would happen if files are added to the underlying filesystem on one
> > > 	of the bricks? Since there's no synchronization mechanism this should
> > > 	look the same as if the file entered through GluFS?
> > 
> > it would look as if the file entered through glusterfs. but the unify
> > translator expects a file to reside on only one server. if you are
> > careful enough to add it by avoiding race conditions (where another
> > glusterfs client creates the same file at the same time on another
> > server) it would work. but it may not work once the name-space-cache
> > translator (comes in the next release) is in use.
> 
> It's almost guaranteed that no one else would be able to write to this
> particular area (in the past, I emulated this behaviour by NFS exporting read-only
> and writing using rcp/scp as a special user). No races to be expected :-)
> 
> > > - What's the recommended way to backup such a file system? Snapshots?
> > 
> > the snapshot translator is in the roadmap as well. for now the
> > recommended way to backup is to take the filesystem offline (umount
> > all clients) and rsync all the servers.
> 
> With a compute cluster under heavy load (600+ clients, 1000+ processes) this
> is close to impossible :-( Don't you think rsync/rsnapshot could do their job
> even on an active filesystem?

you would have to wait for the snapshot translator. it works in an
'incremental fashion' and can run on an active filesystem.

> With "relaxed RAID-1" I meant to be able to store multiple copies of a single 
> file (AFR) but without explicitly setting up a master-mirror pair. This way,
> with e.g. 4 bricks, selected files (option replicate) would end up in multiple
> copies on independent servers. (Did I mis-read the docs, and AFR would be 
> available also with a regular multi-brick setup, without explicit mirror setup?)

If I understand your requirement correctly, AFR does the "relaxed
RAID-1" you are talking about. A file can be replicated to
multiple independent servers. You can use AFR+unify to achieve what
you ask for (as far as I have understood it, at least).

> Remark: if one of those servers was destroyed beyond repair the additional 
> copy would be lost - so another wishlist item would be to check for replica
> counts, and re-establishing redundancy in a background process.

the self-heal/fsck would check for the count as well.
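
Until that exists, a replica count can be approximated from outside. The
following Python sketch is only an illustration and says nothing about how
the self-heal/fsck will actually work. It assumes each brick's export
directory is reachable as a local path (for example on the servers, or via
temporary mounts) and that a replicated file has the same relative path on
every brick holding a copy. `replica_counts` and `under_replicated` are
hypothetical helper names.

```python
import os
from collections import defaultdict


def replica_counts(brick_roots):
    """Map each relative file path to the list of bricks that hold a copy."""
    holders = defaultdict(list)
    for root in brick_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                holders[rel].append(root)
    return dict(holders)


def under_replicated(brick_roots, wanted):
    """Return, sorted, the relative paths found on fewer than `wanted` bricks."""
    counts = replica_counts(brick_roots)
    return sorted(rel for rel, bricks in counts.items() if len(bricks) < wanted)
```

Calling `under_replicated(list_of_brick_paths, 2)` lists the files that
currently exist in only one copy and would need to be copied again to restore
redundancy.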

> > > - I couldn't find any indication of metadata being kept somewhere - how do
> > > 	I find out which files were affected if a brick fails and cannot
> > > 	be repaired? (How does AFR handle such situations?) I suppose there
> > > 	are no tools to re-establish redundancy when slipping in a fresh
> > > 	brick - what's the roadmap for this feature?
> > 
> > in the current release there is no way to know. but the distributed
> > name space cache translator will not only keep the namespace alive,
> > but will also provide mechanisms to know which entries are missing
> > from the namespace (hence which files are gone with the dead server).
> > 
> > the AFR in the 1.4 release will have ways to replay the changes to a
> > dead server since it went down. 
> 
> Which requires the server to come up at basically the state just before its
> death. OK, but see above: reestablishing redundancy requires more than that.

Agreed. accepting a completely empty server will definitely be
supported to bring back the redundancy. the self-heal/fsck would be of
little use without that.

> That's really nice. If all developers were that nice, this world could be
> a place to live in :-)

You are yet to see the dark side ;)

avati

-- 
ultimate_answer_t
deep_thought (void)
{ 
  sleep (years2secs (7500000)); 
  return 42;
}