[Gluster-users] some thoughts please on setting up a software archive based on glusterfs

Tue Jul 29 15:31:43 UTC 2008

Dear Keith

> At 08:51 PM 7/28/2008, webmaster at securitywonks.org wrote:
>>Dear Keith
>>
>>I am thinking of starting with 2 gluster servers. If possible, I will
>>consider 3 gluster servers in the first step for more redundancy; I
>>have to see how effectively I can get this implemented the first time.
>>
>>Just another request: please also tell me whether we can run the FIND
>>command more regularly to keep all glusterfs servers in sync almost all
>>the time.
>
> Well, I'm not sure it's necessary. You technically "could" run it
> via cron; however, realize it's pretty IO intensive. Each file
> access causes gluster to look at the underlying filesystem AND ask
> each of the other servers for the xattrs (versions) of the
> file. So if you do this often over a very large filesystem, I'm
> guessing it'll have a negative impact on performance.
>
> You really only need to stimulate auto-healing if something's gone
> wrong. If it's this important to you, perhaps you could write a
> script to watch the gluster log for server disconnects. If a server
> experiences one, then run the find; otherwise, no need.
> And then if a server is down, after it's back up, run the find.
>
> Otherwise, there shouldn't be a need to do this.

That's a nice idea: read the logs and run the FIND command only when the
situation requires it.
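
A minimal sketch of that log-watching idea in Python. The disconnect
pattern and the function names are assumptions made up for illustration
(the real gluster log format should be checked first), and the find
command simply reads one byte of every file so that each file gets opened:

```python
import re
import subprocess

# ASSUMPTION: placeholder pattern, not the real GlusterFS log format.
DISCONNECT_PATTERN = re.compile(r"disconnect", re.IGNORECASE)


def saw_disconnect(log_lines):
    """Return True if any log line looks like a server disconnect."""
    return any(DISCONNECT_PATTERN.search(line) for line in log_lines)


def build_find_command(mount_point):
    """Build a find command that reads the first byte of every file."""
    return ["find", mount_point, "-type", "f",
            "-exec", "head", "-c", "1", "{}", ";"]


def heal_if_needed(log_lines, mount_point, run=subprocess.run):
    """Run the find only when a disconnect was seen; report whether it ran."""
    if not saw_disconnect(log_lines):
        return False
    run(build_find_command(mount_point), stdout=subprocess.DEVNULL, check=False)
    return True
```

A cron job could feed the newest log lines to heal_if_needed() instead of
running the find unconditionally.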

>
>> > Honestly, I wouldn't risk this..  Unless your files are HUGE the
>> > performance gain won't be worth the risk in my opinion.
>> >
>>
>>The files that I am going to host on our website range from a few
>>kilobytes to a few hundred megabytes (even up to 700 MB, and sometimes
>>DVD files as well). Now, what do you suggest, sir?
>
> You may want to experiment with striping and the .. I forgot
> what it's called, but the volume which allows you to specify what
> filetypes go on what volume. You can create an AFR'ed stripe volume
> for the .mpeg/.vob files and have the rest of the files use normal
> non-striped AFR. However, if you're mostly reading these files, you
> might be better off with just a normal AFR and maybe add a caching volume.
> Hopefully someone else reading can give you better advice on this subject.
>

I am interested in hosting software files, so almost all of them will have
.exe, .zip, .tar.gz, etc. extensions. In future I plan to offer a streaming
service, but currently hosting software files is my priority and
immediate requirement, sir.

>>I am thinking of using a single hard drive, like a 500GB SATA, per
>>glusterFS server, with 2 to 4GB RAM each. I am thinking about virtual
>>systems as well; I mean, how would it be if we hosted the gluster servers
>>as individual virtual containers on gogrid.com, Amazon utility hosting,
>>or some other service? That is one thought I am considering, even though
>>for now I am leaning mainly towards dedicated servers.
>
> I'm not familiar with the gogrid.com offering, so I can't speak to that.
> You could likely use any utility hosting; you just have to figure out what
> works best for your situation. Honestly, I think once you get
> things configured it's very low maintenance, and in the long run it will
> be cheaper to run your own setup.
>

You are right, hosting our own dedicated servers will definitely be cost
effective in the long term.

>>Can we use CPANEL/Direct Admin as the control panel on glusterfs servers?
>>I mean, will glusterfs work on control panel based servers?
>
> I use CPanel.
> I'm working on building a multi-server cpanel package.
> Right now, I have the user home directories on a gluster filesystem, and I
> use unison to sync certain cpanel files.
> Presently, I have to copy the httpd.conf config (changing IPs) to
> the other server, along with new password, group, and shadow entries when
> new accounts are created.
> I then have to add the other server's IP address to their DNS record.
>
> This works pretty well for me and I have a load-balanced (via round
> robin DNS) cpanel setup.
>
> The goal is to automate all the processes I do manually, and then
> I'll have a situation where I can have one cpanel installation and
> scale it across an infinite number of servers.
>
> I assume it will work similarly with plesk or any other control panel.
>
> I've also set php.ini to use a temp folder on the gluster filesystem
> instead of /tmp so that user sessions are shared amongst the
> machines. This way, if the browser bounces to the other server, the
> user's session doesn't disappear.
>

It's great to hear that you are using cpanel on gluster. I read the
"WHO's USING GLUSTER" page; is it you who is trying to offer cpanel
hosting based on the gluster file system?

>>I hope to hear more thoughts on this (single Gluster client accessing
>>multiple Glusterfs servers)
>
> In my cpanel configuration, I have 2 servers, each AFR'ed to the
> other. I set the local read volume to the local disk.
>
> It would work just as well to have multiple servers and a single (or
> multiple) cpanel client(s).
>

If it works fine with a single gluster client and multiple gluster servers,
that will be helpful, as I can start with one gluster client. If it later
demands more gluster clients, I can use the round robin method across them,
but running multiple clients on multiple dedicated servers is not cost
effective at this time.

>>I have looked at different HA solutions like mysql replication, a drbd
>>setup, a cluster, and other commercial mysql High Availability options
>>too, and I am not able to decide which way to go. At one point I was
>>interested in trying HYPERTABLE (http://www.hypertable.org ) hosted on
>>glusterFS, but as it is young and I do not have further info about its
>>PHP API, and for similar reasons, I am sticking to MySQL only for now.
>>Since I wish to use Memcache, I am starting with one dedicated server
>>for the webserver and database server together, along with the gluster
>>client.
>
> I would shy away from any database using shared storage. I'm not
> sure they're mature enough, and I think there may be unpleasant
> performance issues related to the speed of the locking mechanism.
>
> If you're really worried, you could run mysql cluster; however, as
> far as I know, this is still an in-memory database, which won't give
> you much space for your database.
>

Agreed. I am actually in a dilemma between DRBD and a mysql replication
setup. Anyhow, to keep things small at the start, I have decided to
start with one server for mysql. As I use memcache, I think it will
absorb much of the load of mysql requests, at least for the next few
months, I hope.

> I can't say my mysql replication setup is trouble free, but it's
> pretty dependable.
> I have some scripts which monitor the slave status and notify me if
> the replication breaks. I check, and often it's just a matter of
> skipping one statement and restarting the slave process.
> There have been cases where I had to copy certain database tables
> from one machine to the other.
>
> I'm not sure what effect it'll have on performance, but you may be
> able to run mysql over gluster only to have a kind of live/hot backup
> of the database. But I'm not sure how it'll work in practice, and
> it won't protect against data corruption.
>
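
The slave-status monitoring mentioned above could look roughly like the
Python sketch below. It is not Keith's script. It assumes the caller has
already fetched one row of SHOW SLAVE STATUS into a dict (the function
name and the dict shape are made up for illustration) and only checks the
Slave_IO_Running, Slave_SQL_Running and Last_Error columns:

```python
def replication_problems(status):
    """Return a list of problems found in one SHOW SLAVE STATUS row.

    `status` is a dict mapping column names to values, as fetched by
    whatever MySQL client library is in use (not shown here).
    """
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("IO thread not running")
    if status.get("Slave_SQL_Running") != "Yes":
        # Last_Error usually names the statement that broke replication.
        problems.append("SQL thread not running: " + str(status.get("Last_Error", "")))
    return problems
```

An empty list means both replication threads report "Yes"; anything else
is worth a notification.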
>>Please tell me, can we use the ALU (least connections method) and round
>>robin translators together, or do we need to use only one translator?
>
> I'm not sure.. hopefully one of the devs can shed light.
> I *believe* you can intermix the translators almost any way you
> want... but I'm not sure.
>

The reason I asked is that Round Robin simply distributes the requests
in turn, while ALU (least connections method) sends each request to the
server with the least number of connections. So I was wondering about
combining them.
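
To make the difference concrete, here is a small Python sketch of the two
selection methods as described in this message. It is only an
illustration: the class names are made up, and it is not taken from the
GlusterFS schedulers, which may weigh other things than connection counts.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation, ignoring their load."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Hand out the server that currently has the fewest open connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a download on `server` finishes."""
        self.active[server] -= 1
```

With equal-sized downloads the two behave almost the same; they diverge
when some connections stay open much longer than others.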

>>Which translators do you generally use?
>
> In my configuration, I use posix locks, io-threads, and AFR (with local
> read volume).
> I'm not using any of the other performance related translators, as I
> just don't understand them well enough to know if I'll get benefit
> from them in my configuration.
>
Thanks for the info about your setup.

>>I currently use round robin method for routing dns requests shared by my
>>download servers.
>
> this is how my cpanel servers are set up.
>
>>I wish to know which one will be more effective when we go with
>>GlusterFS servers.
>
> Round robin is easiest. As for effective, probably springing for a
> real load balancer which has load monitoring daemons on the
> webservers will get you the best results, but it really depends on
> your situation.
> If you're processing data which can be cpu intensive, then this is
> the best option; if you're just serving normal web pages, then
> round-robin is fine.
> If you're streaming mpegs, then you'll want to be able to balance
> over network load or disk i/o.
>
> However, in any of those cases, I'd start with round robin, and if you
> find it's insufficient, then spend the money and insert a load
> balancer of some sort.

My current project is purely a software download website.

>
>>I would also like to know about a geographical replication setup using
>>this AFR method. For example, if I place two GlusterFS servers in one
>>datacenter and two glusterfs servers in another datacenter, can we use
>>the same AFR setup for content replication effectively, and use a
>>geographical check in php to try to route each user download request to
>>the nearest datacenter (having our glusterfs servers) using my single
>>gluster client?
>
> Currently, for each pair of servers, each server is in a different
> datacenter. Both datacenters are currently in the same city;
> however, my ultimate plan is to move the servers to different geographic
> zones.
> The only concern I have here would be how network latency affects
> gluster.
> My suspicion is that it's going to be just fine in my case, since
> there aren't a lot of file updates, so gluster won't have to do much
> more chattering than the AFR auto-heal checking it normally does.
>

So, you feel comfortable handling regular shared hosting kinds of
requirements in your current experimentation, right?

>>Just some more thoughts: here, which translator will be used (either alu
>>or round robin or both) depends on the configuration setup, and our main
>>http download request will go to the Gluster Client, which selects a
>>particular GlusterFS server (based on the backend configuration) and
>>delivers the software file, right?
>>
>>How would it be if we hosted multiple gluster clients on multiple servers?
>>In that situation, we would use the round robin method to hand each "file
>>download request" to one gluster client from the list of gluster clients,
>>and from there, based on the default selected translator (ALU: least
>>connections method, for example), a glusterfs server is selected and the
>>file delivered accordingly. What do you say, sir, will this method work?
>
> I'm not sure how to answer.. again, I'd recommend you hire the
> gluster.com folks to help with your implementation design. However,
> I would think you could just AFR 2-3 servers; on the clients, round
> robin is fine. Add some disk to the clients so you can use the
> caching translator to speed up subsequent requests, and you should be
> doing alright.
>
>

This is a really nice input: adding more disk space on the client and
using it to cache and serve future requests. When I read this, I find it
similar to memcache (in which the memcache server holds cached results
from the mysql database in RAM).
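
The caching idea can be sketched in Python as a small read-through cache
with least-recently-used eviction. This is only an illustration of the
concept, not how the gluster caching translator or memcache is
implemented; the class name and the fetch callback are made up for the
example:

```python
from collections import OrderedDict


class ReadThroughCache:
    """Keep recently requested items locally; fetch from the backend on a miss."""

    def __init__(self, fetch, max_items):
        self._fetch = fetch          # callable that reads from the slow backend
        self._max_items = max_items  # how many items fit in the local cache
        self._items = OrderedDict()

    def get(self, key):
        if key in self._items:
            # Cache hit: mark as most recently used and skip the backend.
            self._items.move_to_end(key)
            return self._items[key]
        value = self._fetch(key)
        self._items[key] = value
        if len(self._items) > self._max_items:
            # Evict the least recently used item to stay within the limit.
            self._items.popitem(last=False)
        return value
```

Repeated requests for a popular download are then served from the client
side, and only the first request (or a request after eviction) goes back
to the storage servers.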

>>Maybe I need to get this point done correctly (apache virtual hosts
>>using the gluster mount point correctly).
>
> all my apache virtual hosts point to /home/USER/public_html
> /home is my gluster mountpoint.
>
> the other files which cpanel uses for user info I sync periodically
> through UNISON via cron.
>

try csync2, it can sync to any number of hosts:

http://oss.linbit.com/csync2/

>>I am ok with "IP based Auth"; the only worry I have is about
>>"hotlinking". Other than that, I am fine.
>
> hotlink protection is handled by apache.  I wouldn't worry about
> someone trying to "mount" your gluster filesystem by spoofing the IP.
>
> a future solution would be to add in an encryption translator later
> when one is available.
>
> (I'd love to see a compression translator, but new filesystems do
> this for you (zfs), so maybe it doesn't need to happen at the gluster
> level)
>
>>What I mean is to discuss the different doubts and, once finalised, try
>>to write them up together and ask for a quote for the initial
>>implementation. I am just trying to get answers to my newbie questions,
>>so that I have a clear picture of the linkup and of how communication
>>occurs between the web server, gluster client, glusterfs servers, etc.
>>Then my request to a consultant can be meaningful; I mean, they can
>>understand more precisely what I require, I hope.
>
> good plan
>
>>Thank you guys, both Daniel and you, Keith, for your valuable thoughts. I
>>wish to get some more clarity on the other points that I mentioned above.
>>
>>thank you guys :)
>>
>>With Best Regards
>>Raghu Veer
>
>
> very welcome.
>
>
Would it be all right if I refer the gluster support team to this email
on the mailing list when explaining my requirement?

thank you for your inputs

With Best Regards
Raghu Veer