[Gluster-devel] Re: Load balancing ...
gordan at bobich.net
Fri Apr 25 15:29:15 UTC 2008
On Fri, 25 Apr 2008, Gareth Bult wrote:
> Well here's the thing. I've tried to apply Gluster in 8 different "real
> world" scenario's, and each time I've failed either because of bugs or
> because "this simply isn't what GlusterFS is designed for".
[...]
> Suggesting that I'm either not tuning it properly or should be using an
> alternative filesystem I'm afraid is a bit of a cop-out. There are real
> problems here and saying "yes but Gluster is only designed to work in
> specific instances" is frankly a bit daft, and if this were the case,
> instead of a heavy sales pitch on the website along the lines of
> "Gluster is wonderful and does everything", it should be saying "Gluster
> will do x, y and z, only."
The impression I got from the site is that it isn't yet very mature, but
is usable. IMO, it stops way short of the "Gluster is wonderful and does
everything" claim.
> Now, Zope is a long-standing web based application server that I've been
> using for nearly 10 years, telling me it's "excessive" really doesn't
> fly. Trying to back up a gluster AFR with rsync runs into similar
> problems when you have lots of small files - it takes way longer than it
> should do.
How many nodes have you got? Have you tried running it with RHCS+GFS in an
otherwise similar setup? If so, how did the performance compare?
> Moving to the other end of the scale, AFR can't cope with large files
> either .. handling of sparse files doesn't work properly and self-heal
> has no concept of repairing part of a file .. so sticking a 20Gb file on
> a GlusterFS is just asking for trouble as every time you restart a
> gluster server (or every time one crashes) it'll crucify your network.
I thought about this, and there isn't really a way to do anything about it
unless you relax the constraints. You could do an rsync-type rolling-checksum
block sync, but this would both take up more CPU time and leave theoretical
scope for the file not being identical on both ends. Whether this minute
possibility of corruption (a difference the hashing algorithm doesn't pick
up) is a reasonable trade-off, I don't know. Perhaps if such a thing were
implemented it should be made optional.
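To be clear about the sort of thing I mean, here is a rough sketch in plain
Python (untested, fixed aligned blocks rather than a true rolling match, and
nothing to do with the actual GlusterFS code): checksum the file block by
block on both sides and only re-transfer the blocks that differ.

import hashlib
import os

BLOCK = 128 * 1024  # 128 KiB blocks; an arbitrary size for this sketch


def block_sums(path):
    """Return one strong checksum per fixed-size block of the file."""
    sums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK)
            if not block:
                break
            sums.append(hashlib.sha1(block).hexdigest())
    return sums


def partial_heal(good, stale):
    """Copy across only the blocks of 'stale' whose checksum differs.

    A matching checksum is taken on trust, which is exactly the tiny
    corruption window mentioned above.
    """
    good_sums = block_sums(good)
    stale_sums = block_sums(stale)
    with open(good, "rb") as src, open(stale, "r+b") as dst:
        for i, s in enumerate(good_sums):
            if i >= len(stale_sums) or stale_sums[i] != s:
                src.seek(i * BLOCK)
                dst.seek(i * BLOCK)
                dst.write(src.read(BLOCK))
        dst.truncate(os.path.getsize(good))

For a 20GB file where only a few hundred MB changed, this would move a tiny
fraction of the data a full self-heal does, at the cost of reading and
hashing both copies end to end.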
> Now, a couple of points;
>
> a. With regards to metadata, given two volumes mirrored via AFR, please can you
> explain to me why it's ok to do a data read operation against one
> node only, but not a metadata read operation .. and what would break
> if you read metadata from only one volume?
The fact that the file may have been deleted or modified by the time you try
to open it. A file's content is a feature of the file itself; whether the
file is there and/or up to date is a feature of the metadata of the file and
its parent directory. If you start loosening this, you might as well
disconnect the nodes, run them in a deliberate split-brain configuration and
resync periodically, with all the conflict and data loss that entails.
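To illustrate the point (an untested Python sketch, not GlusterFS code, with
made-up backend paths): a lookup that consults every replica can notice a
deletion or a newer copy, whereas asking a single node can silently return
stale metadata.

import os

REPLICAS = ["/mnt/brick1", "/mnt/brick2"]  # hypothetical backend paths


def consistent_stat(relpath):
    """Consult every replica and keep the newest view of the file.

    If any replica says the file is gone, report it as gone; a lookup
    against a single node could happily return stale metadata instead.
    """
    newest = None
    for root in REPLICAS:
        try:
            st = os.stat(os.path.join(root, relpath))
        except FileNotFoundError:
            return None
        if newest is None or st.st_mtime > newest.st_mtime:
            newest = st
    return newest

Drop the loop and stat only one replica and you get your speed-up, but you
also get exactly the stale-read behaviour described above.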
> b. Looking back through the list, Gluster's non-caching mechanism for
> acquiring file-system information seems to be at the root of many of
> it's performance issues. Is there no mileage in trying to address
> this issue ?
How would you propose to obtain full POSIX locking/consistency without this?
Look at the similar alternatives like DRBD + [GFS | OCFS2]. They require
either shared storage (SAN) or a block-level replicated FS (DRBD).
Split-braining in those cases is a non-option, and you need 100% functional
fencing to forcefully disable the failed node or risk extensive corruption.
GlusterFS, being file-based, works around the risk of trashing the entire FS
on the block device. Having a shared/replicated block device works around
part of the problem because all the underlying data is replicated, but
you'll find that GFS and OCFS2 also suffer similar performance penalties
with lots of small files due to locking, especially at the directory level.
If anything, the design of GlusterFS is better for that scenario.
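As a back-of-envelope illustration (assumed numbers, not measurements),
per-file round trips swamp the actual data transfer once the files are small
enough, regardless of whether those round trips are DLM locks or metadata
lookups:

# Rough cost model with assumed numbers; tweak to taste.
N_FILES = 100_000               # e.g. a tree of small files being rsynced
ROUND_TRIPS_PER_FILE = 3        # lookup + lock + read/unlock (assumption)
RTT = 0.0003                    # 0.3 ms LAN round trip (assumption)
FILE_SIZE = 4 * 1024            # 4 KiB average file (assumption)
BANDWIDTH = 100e6 / 8           # 100 Mbit/s link, in bytes per second

latency_cost = N_FILES * ROUND_TRIPS_PER_FILE * RTT
transfer_cost = N_FILES * FILE_SIZE / BANDWIDTH

print(f"per-file round trips: {latency_cost:.0f} s")   # ~90 s
print(f"raw data transfer:    {transfer_cost:.0f} s")  # ~33 s

The ratio only gets worse as the files shrink or the latency grows, which is
why any FS that has to do per-file consistency work struggles with this kind
of workload.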
Since in GFS there is no scope for split-brain operation, you can guarantee
that everything that was written is what is accessible, so the main source
of contention is the write locks. In GlusterFS the split-brain requirement
is relaxed, but to compensate and maintain FS consistency, the metadata has
to be checked each time. If you need this relaxed further, then you have to
move away from the POSIX locking requirements, which puts you out of the
realm of GlusterFS use-cases and into a more WAN-directed FS like Coda.
> c. If I stop one of my two servers, AFR suddenly speeds up "a lot" !
> Would it be so bad if there were an additional option "subvolume-read-meta" ?
> This would probably involve only a handful of additional lines of code, if that .. ?
How are your clients and servers organized? Are you using server-server
based AFR? Or do you have clients doing the AFR-ing? Do you have more
clients than servers? Have you tried adjusting the timeout options to
glusterfs (-a, -e)?
Gordan