[Gluster-devel] Re; Load balancing ...

Gareth Bult gareth at encryptec.net
Fri Apr 25 19:00:56 UTC 2008


>The impression I got from the site is that it isn't yet very mature, but 
>is usable. IMO, it stops way short of the "Gluster is wonderful and does 
>everything" claim.

Ok, can I check we're on the same page ? ;
http://zresearch.com/

Maybe I'm reading more into the FAQ/documentation than is actually there ... but there is plenty of documentation about what apparently works, and no warnings about the pitfalls / bugs that mean if you try to use this in practice you'll fall on your head.

There is a distinct implication, for example, that Gluster will wipe the floor with NFS, which is simply not true. You can pick a benchmark which under certain circumstances will make Gluster look quick; if on the other hand you try some alternative benchmarks, say loading a real-world application like Zope .. NFS flattens Gluster.

[how do I know this? because I tried to move my zope apps from NFS to Gluster]

.. and I did tune Gluster and did get really good throughput performance .. 
.. it's apparently metadata access that's letting it down ..
.. and even if there's more tuning I could do, it's not even close (!)

>How many nodes have you got? Have you tried running it with RHCS+GFS in an 
>otherwise similar setup? If so, how did the performance compare?

I have tried GFS; I'm not happy with its stability under Xen/Ubuntu.
I've also tried OCFS2 - same issues.

>The fact that the file may have been deleted or modified when you try to 
>open it. File's content is a feature of the file. Whether the file is 
>there and/or up to date is a feature of the metadata of the file and its 
>parent directory. If you start loosening this, you might as well 
>disconnect the nodes and run them in a deliberate split-brain case and 
>resync periodically with all the conflict and data loss that entails.

Whoa, hold up .. I'm specifically looking at metadata information, not trying to open the file. Even when you come to open the file, you may not be worried about the file being 1ms out of sync with the writer ... for example if I'm doing a;
find /gluster -type f -exec grep -Hi "some string" {} \;

I'd far rather it complete in 10 seconds and risk the 1ms .. than wait a couple of minutes.

Let's say I have my root filesystem running on Gluster (yes, I know it's not bootable atm, but when they fix fuse mmap it should be) .. then I will have lots and lots of files that I want to scan / open / run but will hardly ever change. Even better, let's say I want to keep a hot-swap mirror of my filesystem - technically Gluster can easily do this - so I use one machine as RW and another to keep a copy .. then if my machine blows up, I can just switch to the second machine with no data loss. Except I can't, because with metadata working the way it does, it's simply too slow.
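For clarity, the sort of client-side AFR spec I have in mind is nothing exotic - roughly this (hostnames, addresses and brick names invented);

  # plain two-brick mirror, client-side AFR (glusterfs 1.3-style spec)
  volume server1
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.0.1
    option remote-subvolume brick
  end-volume

  volume server2
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.0.2
    option remote-subvolume brick
  end-volume

  volume mirror
    type cluster/afr
    subvolumes server1 server2
  end-volume

.. one box used read-write, the other just keeping a copy in sync.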

The documentation says when configuring Gluster to "think like a programmer". Yet when I want to configure Gluster for a particular purpose, knowing full well the risks and that I can live with them, all of a sudden I'm breaking the rules (!)

[..]

I've tried pretty much every combination .. my current test setup is two data servers and one client running client-side AFR. I must admit, however, that I've not tried -a or -e on the client, but the issue relates to the fact that it's querying both servers, not to how long the client caches the information for ...
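(If I do try them, it'd be something along these lines - spec file path and timeout values invented, so treat it as a sketch;

  # mount with longer attribute/entry timeouts
  glusterfs -f /etc/glusterfs/client.vol -a 3 -e 3 /mnt/glusterfs

.. but as I say, I don't expect it to change the fact that both servers get queried.)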

I run DRBD to replicate my filesystem data .. currently on about 8 machines and about 20 DRBD volumes. This is all live and high volume, and after trying other approaches recently, I simply can't find anything that will do the job reliably, so using it is a bit of a no-brainer.
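(For reference, each volume is a bog-standard DRBD resource, something like this - hostnames, devices and addresses invented;

  resource r0 {
    protocol C;                  # synchronous replication
    on nodeA {
      device    /dev/drbd0;
      disk      /dev/sda7;
      address   10.0.0.1:7788;
      meta-disk internal;
    }
    on nodeB {
      device    /dev/drbd0;
      disk      /dev/sdb7;
      address   10.0.0.2:7788;
      meta-disk internal;
    }
  }

.. nothing clever - primary on one box, mirror kept in sync on the other.)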

That said, I do not use any fencing or clustering; I simply do a manual switch in the event of a problem. I've been running this for many years and have never had an issue re: split brain or exploding servers - hence it's a usable solution; despite the warnings and theories, it works where nothing else appears to.
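(The "manual switch" is literally a couple of commands on the surviving box - resource name and mount point invented;

  drbdadm primary r0
  mount /dev/drbd0 /srv/data

.. plus repointing whatever was using the old primary.)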

The gluster / fuse base looks to be great ... what I'm probably going to do, once I get a little time, is try to build an alternative to AFR for use by us mugs who are prepared to risk the wrath of the great God Posixlock ... ;-)

Gareth.


----- Original Message -----
From: gordan at bobich.net
To: gluster-devel at nongnu.org
Sent: Friday, April 25, 2008 4:29:15 PM GMT +00:00 GMT Britain, Ireland, Portugal
Subject: Re: [Gluster-devel] Re; Load balancing ...

On Fri, 25 Apr 2008, Gareth Bult wrote:

> Well here's the thing. I've tried to apply Gluster in 8 different "real 
> world" scenario's, and each time I've failed either because of bugs or 
> because "this simply isn't what GlusterFS is designed for".

[...]

> Suggesting that I'm either not tuning it properly or should be using an 
> alternative filesystem I'm afraid is a bit of a cop-out. There are real 
> problems here and saying "yes but Gluster is only designed to work in 
> specific instances" is frankly a bit daft, and if this were the case, 
> instead of a heavy sales pitch on the website along the lines of 
> "Gluster is wonderful and does everything", it should be saying "Gluster 
> will do x, y and z, only."

The impression I got from the site is that it isn't yet very mature, but 
is usable. IMO, it stops way short of the "Gluster is wonderful and does 
everything" claim.

> Now, Zope is a long-standing web based application server that I've been 
> using for nearly 10 years, telling me it's "excessive" really doesn't 
> fly. Trying to back up a gluster AFR with rsync runs into similar 
> problems when you have lots of small files - it takes way longer than it 
> should do.

How many nodes have you got? Have you tried running it with RHCS+GFS in an 
otherwise similar setup? If so, how did the performance compare?

> Moving to the other end of the scale, AFR can't cope with large files 
> either .. handling of sparse files doesn't work properly and self-heal 
> has no concept of repairing part of a file .. so sticking a 20Gb file on 
> a GlusterFS is just asking for trouble as every time you restart a 
> gluster server (or every time one crashes) it'll crucify your network.

I thought about this, and there isn't really a way to do anything about 
this, unless you relax the constraints. You could do an rsync-type rolling 
checksum block-sync, but this would both take up more CPU time and result 
in theoretical scope for the file not to be the same on both ends. Whether 
this minute possibility of corruption that the hashing algorithm doesn't 
pick up is a reasonable trade-off, I don't know. Perhaps if such a thing 
were implemented it should be made optional.

> Now, a couple of points;
>
> a. With regards to metadata, given two volumes mirrored via AFR, please can you
>   explain to me why it's ok to do a data read operation against one
>   node only, but not a metadata read operation .. and what would break
>   if you read metadata from only one volume?

The fact that the file may have been deleted or modified when you try to 
open it. File's content is a feature of the file. Whether the file is 
there and/or up to date is a feature of the metadata of the file and its 
parent directory. If you start loosening this, you might as well 
disconnect the nodes and run them in a deliberate split-brain case and 
resync periodically with all the conflict and data loss that entails.

> b. Looking back through the list, Gluster's non-caching mechanism for
>    acquiring file-system information seems to be at the root of many of
>    its performance issues. Is there no mileage in trying to address
>    this issue ?

How would you propose to obtain the full posix locking/consistency without 
this? Look at the similar alternatives like DRBD + [GFS | OCFS2]. They 
either require shared storage (SAN) or block level replicated FS (DRBD). 
Split-braining in those cases is a non-option, and you need 100% 
functional fencing to forcefully disable the failed node or risk extensive 
corruption. GlusterFS being file-based works around the risk of trashing 
the entire FS on the block device. Having shared/replicated storage block 
device works around a part of the problem because all the underlying data 
is replicated, but you'll find that GFS and OCFS2 also suffer similar 
performance penalties with lots of small files due to locking, especially 
on directory level. If anything, the design of GlusterFS is better for 
that scenario.

Since in GFS there is no scope for split-brain operation, you can 
guarantee that everything that was written is what is accessible. Thus the 
main source of contention is the write-locks. In GlusterFS the split-brain 
requirement is relaxed, but to compensate for this in order to maintain FS 
consistency, the metadata has to be checked each time. If you need this 
relaxed further, then you have to move away from the posix locking 
requirements, which puts you out of the realm of GlusterFS use-cases and 
into a more WAN-directed FS like Coda.

> c. If I stop one of my two servers, AFR suddenly speeds up "a lot" !
>   Would it be so bad if there were an additional option "subvolume-read-meta" ?
>   This would probably involve only a handful of additional lines of code, if that .. ?

How are your clients and servers organized? Are you using server-server 
based AFR? Or do you have clients doing the AFR-ing? Do you have more 
clients than servers? Have you tried adjusting the timeout options to 
glusterfs (-a, -e)?

Gordan


_______________________________________________
Gluster-devel mailing list
Gluster-devel at nongnu.org
http://lists.nongnu.org/mailman/listinfo/gluster-devel