[Gluster-devel] Some FAQs ...

Thu Apr 26 10:13:04 UTC 2007

Hi A (which one is your first name???)

On Wed, Apr 25, 2007 at 08:17:18AM -0700, Anand Avati wrote:
> Hi Steffen,
>   answers inline.

Thanks for your almost exhaustive answer, see my comments in-line as well :)

> > - The two example configs are a bit confusing. In particular, I suppose I
> > 	don't have to assign different names to all 15 volumes? Different
> > 	ports are only used to address a certain sub-server?
> 
> are you referring to different protocol/client volume names in the
> client spec file? if so, yes, each volume for a server should have a
> separate name. there can only be one volume with a given name in a
> graph (read: spec file)
> 
> > - This would mean I could use the same glusterfs-server.vol for all 
> > 	storage bricks?
> 
> 
> yes, the same glusterfs-server.vol can be used for all the servers.

... so all volumes on the servers can have the same name since they are
referred to by the *client-side* volume definitions 
(volume brick${serverid}; option remote-host ${serverip}; option remote-subvolume brick)

OK, that's becoming clear now.
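
To make the naming rule concrete, here is a minimal sketch that writes one uniquely named client-side volume per server, all pointing at the same remote subvolume name. The helper name and the IP addresses are hypothetical; the `type protocol/client` and `end-volume` lines are my assumption about the spec file layout, and a real spec file will likely need further options that are left out here.

```python
# Sketch: generate one protocol/client volume per server for the client spec.
# The helper name and the server addresses are hypothetical.

def client_volumes(server_ips, remote_subvolume="brick"):
    """Return client-spec text with one uniquely named volume per server."""
    stanzas = []
    for server_id, ip in enumerate(server_ips, start=1):
        stanzas.append(
            f"volume brick{server_id}\n"          # unique name on the client side
            f"  type protocol/client\n"
            f"  option remote-host {ip}\n"
            f"  option remote-subvolume {remote_subvolume}\n"  # same name on every server
            f"end-volume\n"
        )
    return "\n".join(stanzas)


if __name__ == "__main__":
    print(client_volumes(["192.168.0.1", "192.168.0.2", "192.168.0.3"]))
```

Since only the client-side names have to differ, the server-side file can stay identical on all bricks.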

> > - The "all-in-one" configuration suggests that servers can be clients at the
> > 	same time? (meaning, there's no real need to separately build
> > 	server and client)
> 
> the same machine can run the glusterfs server and the client.

Which is indeed very nice, and allows for uniform namespaces across the whole
cluster (and even for abusing compute nodes as storage bricks if necessary).

> > - The instructions to add a new brick (reproduce the directory tree with 
> > 	cpio) suggest that it would be possible to form a GluFS from 
> > 	already existing separate file servers, each holding part of the
> > 	"greater truth", by building a unified directory tree (only
> > 	partly populated) on each of them, then unifying them using
> > 	GluFS. Am I right?
> 
> you are right!

This is simply great!!! Some weeks ago, not sure what to do about ~20GB of
data, I split them up into three 7TB volumes (regular NFS now). This gives me
the opportunity of easy expansion at the low level, while the cluster nodes
see the bigger picture!
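
The "unified directory tree" step from the question above can be sketched roughly as follows: recreate every directory of one file server on another, without copying any files. This is only an illustration with a hypothetical helper name; unlike the cpio recipe in the instructions, it does not preserve ownership or permissions.

```python
import os


def mirror_directory_tree(source_root, target_root):
    """Create every directory found under source_root beneath target_root.

    Files are deliberately not copied; only the directory skeleton is built.
    Returns the list of directories that had to be created.
    """
    created = []
    for dirpath, _dirnames, _filenames in os.walk(source_root):
        rel = os.path.relpath(dirpath, source_root)
        target = os.path.normpath(os.path.join(target_root, rel))
        if not os.path.isdir(target):
            os.makedirs(target)
            created.append(target)
    return created
```

Running it once per pair of servers, in both directions, would leave each of them with the full (partly populated) tree.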

I'm pondering the idea of having two FSs now: one for regular operation, and
one for backups. rsnapshot comes to mind, but needs hard links - does
GluFS support them (and BTW, what's the official acronym, since GFS is already
taken and GlFS cannot be pronounced without twisting one's tongue)?
If that works, I could slowly grow the backup FS (use one brick until it's
nearly full, add another one ...)
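
Until someone confirms, the hard link question could be probed empirically on a mounted file system. A minimal sketch, with a hypothetical helper name and using only the Python standard library: create a temporary file, try to hard-link it, and check that both names refer to the same file.

```python
import os
import tempfile


def supports_hard_links(directory):
    """Return True if a hard link can be created inside the given directory."""
    fd, original = tempfile.mkstemp(dir=directory)
    os.close(fd)
    link = original + ".hardlink"
    try:
        os.link(original, link)
        # Both names must refer to the same file, with a link count of at least 2.
        return os.path.samefile(original, link) and os.stat(original).st_nlink >= 2
    except OSError:
        return False
    finally:
        # Clean up whatever was created, whether or not the link succeeded.
        for path in (link, original):
            if os.path.exists(path):
                os.remove(path)
```

Pointing it at a directory on the mount in question would tell whether rsnapshot has a chance there.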

This brings up another wishlist item: it should be possible to migrate all data
off one server (set server "read-only", then read every file and re-write to 
another location)...
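
Done by hand at the brick level, such a migration might look like the sketch below. It assumes the source brick is no longer being written to and that both bricks are plain directories; the function name is hypothetical. It only copies, so removing the originals remains a separate, deliberate step.

```python
import os
import shutil


def drain_brick(source_root, target_root):
    """Copy every file under source_root to the same relative path under
    target_root. Existing target files are never overwritten.

    Returns the relative paths of the files that were copied.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel_dir = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(target_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if os.path.exists(dst):
                continue  # the file already lives on the target; leave it alone
            shutil.copy2(src, dst)  # copies data plus timestamps and mode bits
            copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return copied
```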

> > - Would it still be possible to access the underlying filesystems, using
> > 	NFS with read-only export?
> 
> will be possible.

This is great in particular for the transition phase.

> > - What would happen if files are added to the underlying filesystem on one
> > 	of the bricks? Since there's no synchronization mechanism this should
> > 	look the same as if the file entered through GluFS?
> 
> it would look as if the file entered through glusterfs. but the unify
> translator expects a file to reside on only one server. if you are
> careful enough to add it by avoiding race conditions (where another
> glusterfs client creates the same file at the same time on another
> server) it would work. but it may not work once the name-space-cache
> translator (comes in the next release) is in use.

It's almost guaranteed that no one else would be able to write to this
particular area (in the past, I emulated this behaviour by NFS exporting read-only
and writing using rcp/scp as a special user). No races to be expected :-)

> > - What's the recommended way to backup such a file system? Snapshots?
> 
> the snapshot translator is in the roadmap as well. for now the
> recommended way to backup is to take the filesystem offline (umount
> all clients) and rsync all the servers.

With a compute cluster under heavy load (600+ clients, 1000+ processes) this
is close to impossible :-( Don't you think rsync/rsnapshot could do their job
even on an active filesystem?

> > - Is there a Debian/GNU version already available, or someone working on it?
> I recently saw a post about someone working on it -
> http://people.debian.org/~terpstra/message/20070418.192436.787e9c06.en.html

I'm in touch with Christian...

> > - Are there plans to implement "relaxed" RAID-1 by writing identical copies
> > 	of the same file (the same way AFR does) to different servers?
> 
> I do not quite understand what difference you are asking from the
> current AFR? do you mean relaxed as in, make the copy after the file
> is closed? please explain more clearly.

With "relaxed RAID-1" I meant to be able to store multiple copies of a single 
file (AFR) but without explicitly setting up a master-mirror pair. This way,
with e.g. 4 bricks, selected files (option replicate) would end up in multiple
copies on independent servers. (Did I mis-read the docs, and AFR would be 
available also with a regular multi-brick setup, without explicit mirror setup?)

Remark: if one of those servers was destroyed beyond repair the additional
copy would be lost - so another wishlist item would be to check replica
counts and re-establish redundancy in a background process.
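
The checking half of that wish could be approximated from outside, assuming each brick is a plain directory and replicas of a file sit under the same relative path on different bricks (my assumption, not something the docs state). A sketch with hypothetical function names:

```python
import os
from collections import Counter


def replica_counts(brick_roots):
    """Count on how many bricks each relative file path is present."""
    counts = Counter()
    for root in brick_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            rel_dir = os.path.relpath(dirpath, root)
            for name in filenames:
                counts[os.path.normpath(os.path.join(rel_dir, name))] += 1
    return counts


def under_replicated(brick_roots, wanted=2):
    """Return the sorted relative paths found on fewer than `wanted` bricks."""
    counts = replica_counts(brick_roots)
    return sorted(path for path, n in counts.items() if n < wanted)
```

This treats every file as if it should be replicated; restricting it to the files selected for replication would need the same pattern list the replicate option uses.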

> > _ I couldn't find any indication of metadata being kept somewhere - how do
> > 	I find out which files were affected if a brick fails and cannot
> > 	be repaired? (How does AFR handle such situations?) I suppose there
> > 	are no tools to re-establish redundancy when slipping in a fresh
> > 	brick - what's the roadmap for this feature?
> 
> in the current release there is no way to know. but the distributed
> name space cache translator will not only keep the namespace alive,
> but also will provide mechanisms to know which are the missing
> entries in the namespace (hence files gone with the dead server).
> 
> the AFR in the 1.4 release will have ways to replay the changes to a
> dead server since it went down. 

Which requires the server to come up at basically the state just before its
death. OK, but see above: reestablishing redundancy requires more than that.

> > More to come...
> 
> awaiting :)

That's really nice. If all developers were that nice, this world could be
a place to live in :-)

Don't worry, I will show up again as soon as I have read all the other stuff,
and finished my regular work...

Cheers,
 Steffen

-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html