[Gluster-users] GlusterFS 3.0.2 small file read performance benchmark
John Feuerstein
john at feurix.com
Sat Feb 27 17:03:17 UTC 2010
Greetings,
in contrast to some performance tips regarding small file *read*
performance, I want to share these results. The test is rather simple,
but it yields a very remarkable result: roughly 4x (400%) better read
performance, gained by simply dropping some of the so-called
"performance translators"!
Please note that this test resembles a simplified version of our
workload, which is more or less sequential, read-only small-file serving
with an average of 100 concurrent clients. (We use GlusterFS as a
flat-file backend to a cluster of webservers, which is hit only after
missing some caches in a more sophisticated caching infrastructure on
top of it.)
The test setup is a 3-node AFR cluster, with server+client on each node
in the single-process model (one volfile; the local volume is attached
to within the same process to save overhead), connected via 1 Gbit
Ethernet. This way each node can continue to operate on its own, even
if the whole internal network for GlusterFS is down.
We used commodity hardware for the test. Each node is identical:
- Intel Core i7
- 12G RAM
- 500GB filesystem
- 1 Gbit NIC dedicated for GlusterFS
Software:
- Linux 2.6.32.8
- GlusterFS 3.0.2
- FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
- Filesystem / Storage Backend:
- LVM2 on top of software RAID 1
- ext4 with noatime
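For completeness, a storage backend like this could be created roughly
as follows. This is only a sketch: the device names and sizes are
hypothetical, only the VG/LV names (data/test) match the fstab below.
-------------------------------------------------------------------------
# RAID 1 mirror over two disks (hypothetical devices sda2/sdb2)
$ mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# LVM2 on top of the mirror; /dev/data/test matches the fstab below
$ pvcreate /dev/md0
$ vgcreate data /dev/md0
$ lvcreate -L 500G -n test data

# ext4, mounted with noatime via fstab
$ mkfs.ext4 /dev/data/test
-------------------------------------------------------------------------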
I will paste the configurations inline, so people can comment on them.
/etc/fstab:
-------------------------------------------------------------------------
/dev/data/test /mnt/brick/test ext4 noatime 0 2
/etc/glusterfs/test.vol /mnt/glusterfs/test glusterfs
noauto,noatime,log-level=NORMAL,log-file=/var/log/glusterfs/test.log 0 0
-------------------------------------------------------------------------
***
Please note: this is the final configuration with the best results. All
translators are numbered to make the explanation easier later on. Unused
translators are commented out...
The volume spec is identical on all nodes, except that the bind-address
option in the server volume [*4*] is adjusted.
***
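Since only the bind-address in [*4*] differs between nodes, one way to
keep the three specs in sync is to render them from a single template.
A minimal sketch; the template path and the @BINDADDR@ placeholder are
my own invention, not anything GlusterFS knows about:
-------------------------------------------------------------------------
# Hypothetical per-node volfile generation (run with this node's IP):
$ BINDADDR=10.1.0.1
$ sed "s/@BINDADDR@/$BINDADDR/" /etc/glusterfs/test.vol.in \
    > /etc/glusterfs/test.vol
-------------------------------------------------------------------------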
/etc/glusterfs/test.vol
-------------------------------------------------------------------------
# Sat Feb 27 16:53:00 CET 2010 John Feuerstein <john at feurix.com>
#
# Single Process Model with AFR (Automatic File Replication).
##
## Storage backend
##
#
# POSIX STORAGE [*1*]
#
volume posix
type storage/posix
option directory /mnt/brick/test/glusterfs
end-volume
#
# POSIX LOCKS [*2*]
#
#volume locks
volume brick
type features/locks
subvolumes posix
end-volume
##
## Performance translators (server side)
##
#
# IO-Threads [*3*]
#
#volume brick
# type performance/io-threads
# subvolumes locks
# option thread-count 8
#end-volume
### End of performance translators
#
# TCP/IP server [*4*]
#
volume server
type protocol/server
subvolumes brick
option transport-type tcp
option transport.socket.bind-address 10.1.0.1 # FIXME
option transport.socket.listen-port 820
option transport.socket.nodelay on
option auth.addr.brick.allow 127.0.0.1,10.1.0.1,10.1.0.2,10.1.0.3
end-volume
#
# TCP/IP clients [*5*]
#
volume node1
type protocol/client
option remote-subvolume brick
option transport-type tcp/client
option remote-host 10.1.0.1
option remote-port 820
option transport.socket.nodelay on
end-volume
volume node2
type protocol/client
option remote-subvolume brick
option transport-type tcp/client
option remote-host 10.1.0.2
option remote-port 820
option transport.socket.nodelay on
end-volume
volume node3
type protocol/client
option remote-subvolume brick
option transport-type tcp/client
option remote-host 10.1.0.3
option remote-port 820
option transport.socket.nodelay on
end-volume
#
# Automatic File Replication Translator (AFR) [*6*]
#
# NOTE: "node3" is the primary metadata node, so this one *must*
# be listed first in all volume specs! Also, node3 is the global
# favorite-child with the definite file version if any conflict
# arises while self-healing...
#
volume afr
type cluster/replicate
subvolumes node3 node1 node2
option read-subvolume node2
option favorite-child node3
end-volume
##
## Performance translators (client side)
##
#
# IO-Threads [*7*]
#
#volume client-threads-1
# type performance/io-threads
# subvolumes afr
# option thread-count 8
#end-volume
#
# Write-Behind [*8*]
#
volume wb
type performance/write-behind
subvolumes afr
option cache-size 4MB
end-volume
#
# Read-Ahead [*9*]
#
#volume ra
# type performance/read-ahead
# subvolumes wb
# option page-count 2
#end-volume
#
# IO-Cache [*10*]
#
volume cache
type performance/io-cache
subvolumes wb
option cache-size 1024MB
option cache-timeout 60
end-volume
#
# Quick-Read for small files [*11*]
#
#volume qr
# type performance/quick-read
# subvolumes cache
# option cache-timeout 60
#end-volume
#
# Metadata prefetch [*12*]
#
#volume sp
# type performance/stat-prefetch
# subvolumes qr
#end-volume
#
# IO-Threads [*13*]
#
#volume client-threads-2
# type performance/io-threads
# subvolumes sp
# option thread-count 16
#end-volume
### End of performance translators.
-------------------------------------------------------------------------
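If you prefer, the client can also be started directly against the
volfile instead of going through fstab while iterating on the spec;
this should be equivalent to the fstab-based mount:
-------------------------------------------------------------------------
$ glusterfs -f /etc/glusterfs/test.vol /mnt/glusterfs/test
-------------------------------------------------------------------------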
So let's start now. Unless explicitly stated otherwise, run the
following on all nodes:
# Prepare filesystem mountpoints
$ mkdir -p /mnt/brick/test
# Mount bricks
$ mount /mnt/brick/test
# Prepare brick roots (so lost+found won't end up in the volume)
$ mkdir -p /mnt/brick/test/glusterfs
# Load FUSE
$ modprobe fuse
# Prepare GlusterFS mountpoints
$ mkdir -p /mnt/glusterfs/test
# Mount GlusterFS
# (we start with Node 3 which should become the metadata master)
node3 $ mount /mnt/glusterfs/test
node1 $ mount /mnt/glusterfs/test
node2 $ mount /mnt/glusterfs/test
# While doing the tests, we watch the logs on all nodes for errors:
$ tail -f /var/log/glusterfs/test.log
For each volume spec change, you have to unmount GlusterFS, change the
vol file, and mount GlusterFS again. Before starting tests, make sure
everything is running and the volumes on all nodes are attached (watch
the log files!).
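To make that edit/remount cycle less error-prone, a small helper along
these lines can be used between runs (just a sketch; paths match the
fstab above):
-------------------------------------------------------------------------
#!/bin/bash
# Remount the GlusterFS volume after a volfile change and verify that
# the mount actually came back before benchmarking.
umount /mnt/glusterfs/test
mount /mnt/glusterfs/test || exit 1
grep -q /mnt/glusterfs/test /proc/mounts \
    && echo "volume mounted" \
    || echo "mount missing, check /var/log/glusterfs/test.log!"
-------------------------------------------------------------------------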
Write the test data for the read-only tests. These are lots of 20K
files, which resemble most of our css/js/php/python files. You should
adjust this to match your workload...
-------------------------------------------------------------------------
#!/bin/bash
# Create 100 top dirs x 10 subdirs x 10 files = 10,000 files of 20K each.
mkdir -p /mnt/glusterfs/test/data
cd /mnt/glusterfs/test/data
for topdir in x{1..100}
do
    mkdir -p $topdir
    cd $topdir
    for subdir in y{1..10}
    do
        mkdir $subdir
        cd $subdir
        for file in z{1..10}
        do
            # Include the loop variable in the name so that a $RANDOM
            # collision can't silently overwrite a file.
            dd if=/dev/zero of=20K-$file-$RANDOM \
                bs=4K count=5 &> /dev/null && echo -n .
        done
        cd ..
    done
    cd ..
done
-------------------------------------------------------------------------
OK, in our case /mnt/glusterfs/test/data is now populated with roughly
240M of data... enough for some simple tests.
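If you want to double-check what the generator produced before timing
anything:
-------------------------------------------------------------------------
$ find /mnt/glusterfs/test/data -type f | wc -l   # expect 10000 files
$ du -sh /mnt/glusterfs/test/data   # roughly 200-240M incl. overhead
-------------------------------------------------------------------------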
Each test run consists of this simplified simulation of sequentially
reading all files, listing dirs, and stat()ing the entries:
-------------------------------------------------------------------------
$ cd /mnt/glusterfs/test/data
# Always populate the io-cache first:
$ time tar cf - . > /dev/null
# Simulate and time 100 concurrent data consumers:
$ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait
-------------------------------------------------------------------------
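A single run can be noisy, so it may be worth repeating each
measurement a few times and comparing the wall-clock times; a simple
sketch of that (the run count of 3 is arbitrary):
-------------------------------------------------------------------------
#!/bin/bash
cd /mnt/glusterfs/test/data
tar cf - . > /dev/null          # populate the io-cache once
for run in 1 2 3
do
    echo "run $run:"
    # 100 concurrent readers; only "real" is interesting here
    time ( for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; wait )
done
-------------------------------------------------------------------------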
OK, so here are the results. As stated, take them with a grain of salt,
and make sure your test resembles your own workload. For example,
read-ahead is, as we can see, useless in this case, but it might improve
performance for files of a different size... :)
# All translators active except *7* (client io-threads after AFR)
real 2m27.555s
user 0m3.536s
sys 0m6.888s
# All translators active except *13* (client io-threads at the end)
real 2m23.779s
user 0m2.824s
sys 0m5.604s
# All translators active except *7* and *13* (no client io-threads!)
real 0m53.097s
user 0m3.512s
sys 0m6.436s
# All translators active except *7*, *13*, and with only 8 io-threads
# in *3* instead of the default of 16 (server-side io-threads)
real 0m45.942s
user 0m3.472s
sys 0m6.612s
# All translators active except *3*, *7*, *13* (no io-threads at all!)
real 0m40.332s
user 0m3.776s
sys 0m6.424s
# All translators active except *3*, *7*, *12*, *13* (no stat prefetch)
real 0m39.205s
user 0m3.672s
sys 0m6.084s
# All translators active except *3*, *7*, *11*, *12*, *13*
# (no quickread)
real 0m39.116s
user 0m3.652s
sys 0m5.816s
# All translators active except *3*, *7*, *11*, *12*, *13* and
# with page-count = 2 in *9* instead of 4
real 0m38.851s
user 0m3.492s
sys 0m5.796s
# All translators active except *3*, *7*, *9*, *11*, *12*, *13*
# (no read-ahead)
real 0m38.576s
user 0m3.356s
sys 0m6.076s
OK, that's it. Compare the result with nearly all performance
translators active against the final basic setup without any of the
magic:
with all performance translators: real 2m27.555s
without most performance translators: real 0m38.576s
This is a _HUGE_ improvement: 147.6s vs. 38.6s is roughly 3.8x faster,
which is the ~400% gain mentioned at the top!
(disregard user and sys, they were practically the same in all tests)
Some final words:
- don't add performance translators blindly (!)
- always test with a workload similar to the one you will use in production
- never go and copy+paste a volume spec, then moan about bad performance
- don't rely on "glusterfs-volgen", it gives you just a starting point!
- fewer translators == less overhead
- read documentation for all options of all translators and get an idea:
http://www.gluster.com/community/documentation/index.php/Translators
(some stuff is still undocumented, but this is open source... so have a
look)
Best regards,
John Feuerstein