[Gluster-users] some thoughts please on setting up a software archive based on glusterfs

Mon Jul 28 17:36:57 UTC 2008

At 09:54 AM 7/28/2008, webmaster at securitywonks.org wrote:
>hopefully, if we define number of copies in AFR, will it take care of
>things and do replications?.

See the AFR examples on the wiki, but basically there will be one copy
for each subvolume listed in the cluster/afr translator.
So if you list 3 servers, there will be 3 copies; if you list 8,
there will be 8 copies.
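
To make the "one copy per subvolume" point concrete, here is a small
Python sketch that prints what such a volume block could look like. The
names afr0 and server1..server3 are made up for illustration and are
assumed to be client volumes defined earlier in the same vol file; check
the wiki examples for the exact layout your release expects.

```python
def render_afr_volume(name, subvolumes):
    """Render a cluster/afr volume block with one replica per subvolume.

    The volume and subvolume names are hypothetical placeholders.
    """
    lines = [
        f"volume {name}",
        "  type cluster/afr",
        # Every name listed here holds a full copy of the data.
        "  subvolumes " + " ".join(subvolumes),
        "end-volume",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Three subvolumes listed, so three copies of each file.
    print(render_afr_volume("afr0", ["server1", "server2", "server3"]))
```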
However, be aware that AFR does NOT do ACTIVE repairing. This means
that if server 3 is down for a period of time and files change on
servers 1 and 2, server 3 will be out of sync until those files are
accessed. At that point, the AFR translator will notice server 3 is
out of sync and will update the files on it.
Here's the downside. Let's assume you have only a 2-server AFR setup.
Server 1 goes down, and files are updated on server 2. Then server 1
comes back up. Those files are not accessed, so server 1 doesn't get
fresh copies. Now server 2 goes down.
When you go to access those files, they'll be read from server 1 and
will be the older version.
This is where multiple mirrors come in handy: if you have 3 copies,
the likelihood of hitting this situation goes down.
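
The scenario can be walked through with a toy model. This is only a
sketch of the heal-on-access behaviour as I described it, not how
GlusterFS is implemented; the ToyAfr class and the version counters are
invented for the illustration.

```python
class ToyAfr:
    """Toy model of mirrors that only heal when a file is accessed."""

    def __init__(self, n_servers):
        # Each server maps filename -> version number.
        self.data = [dict() for _ in range(n_servers)]
        self.up = [True] * n_servers

    def _live(self):
        live = [i for i, ok in enumerate(self.up) if ok]
        if not live:
            raise RuntimeError("no server is up")
        return live

    def write(self, name):
        # A write lands only on the servers that are currently up.
        live = self._live()
        new_version = max(self.data[i].get(name, 0) for i in live) + 1
        for i in live:
            self.data[i][name] = new_version
        return new_version

    def read(self, name):
        # A read sees the newest version among live servers and, as a
        # side effect, updates any live server holding an older copy.
        live = self._live()
        newest = max(self.data[i].get(name, 0) for i in live)
        for i in live:
            self.data[i][name] = newest
        return newest


def stale_read_scenario(n_servers):
    afr = ToyAfr(n_servers)
    afr.write("f")       # version 1 on every server
    afr.up[0] = False    # server 1 goes down
    afr.write("f")       # version 2 on the remaining servers
    afr.up[0] = True     # server 1 returns; "f" is not accessed
    afr.up[1] = False    # server 2 goes down
    return afr.read("f")


if __name__ == "__main__":
    print(stale_read_scenario(2))  # 1: only the stale copy is reachable
    print(stale_read_scenario(3))  # 2: server 3 still has the new copy
```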
Also, one of the AFR wiki articles discusses a find command which
will trigger the self-heal feature to bring the replicas back in sync.
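
I don't have the exact find command in front of me, so here is a rough
Python equivalent of the idea instead: walk the mounted volume and read
one byte from every file so that each file gets accessed. It assumes
that opening and reading a file through the mount is enough to make AFR
check it, which is what the behaviour described above suggests. The
function name and the example mount point are hypothetical.

```python
import os
import sys


def touch_all_files(mount_point):
    """Read the first byte of every regular file under mount_point.

    Returns (files_read, failures), where failures is a list of
    (path, exception) pairs for files that could not be read.
    """
    files_read = 0
    failures = []
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            # Skip FIFOs, sockets, and broken symlinks.
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    fh.read(1)
                files_read += 1
            except OSError as exc:
                failures.append((path, exc))
    return files_read, failures


if __name__ == "__main__":
    # Example: python heal_walk.py /mnt/glusterfs  (path is made up)
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    count, errors = touch_all_files(target)
    print(f"read {count} files, {len(errors)} failures")
```

On a large volume this reads through every file once, so expect it to
take a while and to generate network traffic while replicas catch up.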

>one more thing is, I find RAID5 or RAID6 or RAID10 or RAID60 is required.
>I also read a statement that, either AFR needs to be enabled or we need to
>use RAID levels to have data redundancy.

I don't think the gluster devs plan to bring this level of RAID into a
single translator.
You can sort of simulate RAID 0+1, but not any higher RAID levels.

I believe what you'd do to get RAID 0+1 is to set up the stripe
translator before the AFR translator.
So you might stripe across servers 1, 2, 3 and create another stripe
across servers 4, 5, 6,
then AFR stripe123 and stripe456.
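
As a sketch of that layering, the Python below prints the three volume
blocks in the order described: two stripes first, then an AFR volume on
top of them. The names server1..server6 and afr0 are placeholders for
client volumes you would define yourself, and the stripe blocks leave
out any tuning options, so treat this as an outline rather than a
working vol file.

```python
def render_volume(name, vol_type, subvolumes):
    """Render one volume block for a vol file (names are hypothetical)."""
    return "\n".join([
        f"volume {name}",
        f"  type {vol_type}",
        "  subvolumes " + " ".join(subvolumes),
        "end-volume",
    ])


def render_stripe_then_afr():
    """Two stripes of three servers each, mirrored by one AFR volume."""
    blocks = [
        render_volume("stripe123", "cluster/stripe",
                      ["server1", "server2", "server3"]),
        render_volume("stripe456", "cluster/stripe",
                      ["server4", "server5", "server6"]),
        # The mirror is defined last, on top of the two stripes.
        render_volume("afr0", "cluster/afr", ["stripe123", "stripe456"]),
    ]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(render_stripe_then_afr())
```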

Honestly, I wouldn't risk this. Unless your files are HUGE, the
performance gain won't be worth the risk, in my opinion.

>which one you recommend?
>
>what is the minimum number of copies we can make using AFR for added
>redundancy? (I read Google stores 3 copies of it's data for added
>redundancy, can we follow that rule and keep 3 copies using AFR?) or keep
>some more copies?

More is always better. If you can afford it, store 10. It comes down
to how many servers you want to manage, how much disk space you
want to buy, etc.

>then, what are your thoughts about RAID levels sir?
>
>is RAID1 ok inthe above situation or, alternatively, keeping economics is
>mind, if we go with multiple AFR copies, can we proceed. Please share your
>thoughts on this, thank you

Daniel may have a different opinion, but those are my thoughts for 
you to consider.

> >> I read in some document that FTP, SSH can be used for uploading files
> >> to GlusterFS based system.
> >
> > A Gluster client process simply uses Fuse to create the mountpoint.
> > Once the mountpoint exists, it can be accessed just like any other
> > directory in the filesystem, thus any normal way of creating,
> > modifying, or deleting files is usable.  Basically anything that can
> > interact with filesystem objects can interact with a Gluster mountpoint
> > (FTP and SCP included).
>
>I read about fuse before, from whose website, I came to know about these
>Userspace file systems. please tell me, if we have to use one Gluster
>CLient per server? or how do you count that?

As far as I know, the gluster server can serve volumes from a
single .vol file.
I think it's not recommended to access multiple different volumes
from a single server, but my guess is that it might work. I presume
Daniel will correct me if I'm wrong.

You can have a separate client process running on a system, OR you
can have a single client/server process.

When a client vol file is used to mount a filesystem, the last
configured volume is used as the source of the mount.
In this regard, it seems the primary use is that any given
machine can be a single server and/or a single client, but you can't
have one machine which acts as multiple gluster servers.
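
To illustrate the "last configured volume" rule, here is a small Python
helper that scans vol file text and reports the last volume it defines.
The sample text is a stripped-down, made-up vol file (options omitted),
and the helper is my own illustration, not part of gluster.

```python
def last_volume_name(volfile_text):
    """Return the name of the last 'volume NAME' block, or None."""
    names = []
    for line in volfile_text.splitlines():
        parts = line.split()
        # A block opens with exactly two words: "volume" and its name.
        if len(parts) == 2 and parts[0] == "volume":
            names.append(parts[1])
    return names[-1] if names else None


# Hypothetical client vol file; real ones also carry option lines.
SAMPLE = """\
volume server1
  type protocol/client
end-volume

volume server2
  type protocol/client
end-volume

volume afr0
  type cluster/afr
  subvolumes server1 server2
end-volume
"""

if __name__ == "__main__":
    # afr0 is defined last, so it is what the mount would expose.
    print(last_volume_name(SAMPLE))
```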

So, in your situation, if you want 3 mirrors, you need 3
machines running as gluster servers.

>my initial plan to start this website is to use one dedicated server (for
>web server, mysql server purpose), I wish to use the same as gluster
>client as well from which I will initiate http file download requests.
>Likewise, I think, I need to use the same server to upload files to the
>glusterfs based storage servers.

This is similar to the configuration I have.
I have 2 machines, and I'm using the AFR translator to mirror the data
across them.
The AFR volume is mounted as /home, and my Apache virtual hosts all
live in /home.
For MySQL, you would not want to put your database files on top of
gluster.
Use MySQL replication instead. It does require some attention, but you
really, really, really do NOT want to try to run multiple MySQL
instances on top of shared db files.

>I wish to use 2 glusterfs storage servers initially and grow them as along
>the site growth.

This is my plan also.
Once I get a pair working and stabilized, adding a third server 
should be fairly trivial.

Since AFR self-heals files when they are accessed, it's possible to
set up a server with an empty filesystem, add it to the AFR volumes,
and it will copy data over from the other server(s) as it's requested.

>please share your thoughts sir.
>
> >
> >> I am currently trying to find if there is any other documentation that
> >> clarifies this situation. Also, more info on how we will construct a
> >> url to the hosted files using http protocol, also will they be
> >> accessible directly or with a password etc lot of questions.
> >
> > http://httpd.apache.org/docs/2.2/
>
>thanks for confirmation for this as well, so, you mean, we construct file
>headers etc as normal as before in the same way. I read in a doc that,
>clients authentication occur either with pre-defined list of IP addresses
>(glusterfs clients) or by using pre-defined list of username/password
>combinations. hopefully, we have a better way of using it, thank you

I think you're asking 2 different questions.
Configure Apache as you normally would. Just make sure the
directories the Apache virtual hosts are using are within the gluster
mount point.

Your follow-up question is about the gluster server configuration,
and there's lots of info in the wiki about that.
I use the IP-based auth, mostly because this is webserver data: if
someone spoofs the IP and somehow grabs the gluster stream, for the
most part they're only going to get data they could get with a web
browser, so I'm not overly concerned about that level of security.

> >
> >> can we use php file system functions directly to deal with files
> >> hosted on glusterfs based system?
> >
> > Yes.
>
>  I am relaxed a bit better after your confirmation in writing that I can
>use php file system functions, ftp and scp, ssh functions the same way as
>before even with glusterfs file system.

Once mounted, a gluster filesystem is the same as any other
filesystem, so think of it as you would any other filesystem.

Your applications (Apache, PHP, etc.) will be none the wiser.
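
A quick way to convince yourself is to run ordinary file operations
against a directory and note that nothing in them is gluster-specific.
The Python sketch below writes, reads back, and deletes a file; the
roundtrip function is just an illustration, and the demo runs it in a
temporary directory. Pointing it at a directory inside your mount point
should behave the same way.

```python
import os
import tempfile


def roundtrip(directory, name="hello.txt", text="plain file I/O\n"):
    """Write a file, read it back, delete it; True if the data matched."""
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    with open(path, encoding="utf-8") as fh:
        data = fh.read()
    os.remove(path)
    return data == text


if __name__ == "__main__":
    # Swap the temporary directory for a path under your gluster mount
    # to try the same calls against the mounted volume.
    with tempfile.TemporaryDirectory() as scratch:
        print(roundtrip(scratch))
```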

My advice would be to contact the zresearch folks (you can find them
via www.gluster.com) and find out what their rates are for
professional services.
Given your knowledge level, it would probably be helpful to hire
someone to help you get past your first configuration, after which
you should be able to plug along just fine.
(You can contact me for implementation consulting also, but since
your issues are mostly gluster related, it's sometimes best to go
straight to the source.)

Keith
P.S. If it wasn't clear, I'm just a gluster user, not a developer,
so my opinions are from an operational perspective.