[Gluster-users] GlusterFS 3.0.2 small file read performance benchmark
John Feuerstein
john at feurix.com
Sat Feb 27 17:03:17 UTC 2010
Greetings,
in contrast to some performance tips regarding small file *read*
performance, I want to share these results. The test is rather simple,
but it yields a very remarkable result: roughly 4x (400%) better read
performance, gained by simply dropping some of the so-called
"performance translators"!
Please note that this test resembles a simplified version of our
workload, which is more or less sequential, read-only small-file serving
with an average of 100 concurrent clients. (We use GlusterFS as a
flat-file backend to a cluster of webservers, which is hit only after
missing some caches in a more sophisticated caching infrastructure on
top of it.)
The test setup is a 3-node AFR cluster, with server+client on each node
in the single-process model (one volfile; the local volume is attached
to within the same process to save overhead), connected via 1 Gbit
Ethernet. This way each node can continue to operate on its own, even
if the whole internal network for GlusterFS is down.
We used commodity hardware for the test. Each node is identical:
- Intel Core i7
- 12G RAM
- 500GB filesystem
- 1 Gbit NIC dedicated for GlusterFS
Software:
- Linux 2.6.32.8
- GlusterFS 3.0.2
- FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
- Filesystem / Storage Backend:
- LVM2 on top of software RAID 1
- ext4 with noatime
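For completeness, a storage backend like this could be created roughly
as follows. This is only a sketch: the device names and sizes are
hypothetical, only the VG/LV names (data/test) match the fstab below.
-------------------------------------------------------------------------
# RAID 1 mirror over two disks (hypothetical devices sda2/sdb2)
$ mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# LVM2 on top of the mirror; /dev/data/test matches the fstab below
$ pvcreate /dev/md0
$ vgcreate data /dev/md0
$ lvcreate -L 500G -n test data

# ext4, mounted with noatime via fstab
$ mkfs.ext4 /dev/data/test
-------------------------------------------------------------------------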
I will paste the configurations inline, so people can comment on them.
/etc/fstab:
-------------------------------------------------------------------------
/dev/data/test /mnt/brick/test ext4 noatime 0 2
/etc/glusterfs/test.vol /mnt/glusterfs/test glusterfs
noauto,noatime,log-level=NORMAL,log-file=/var/log/glusterfs/test.log 0 0
-------------------------------------------------------------------------
***
Please note: this is the final configuration with the best results. All
translators are numbered to make the explanation easier later on. Unused
translators are commented out...
The volume spec is identical on all nodes, except that the bind-address
option in the server volume [*4*] is adjusted.
***
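Since only the bind-address in [*4*] differs between nodes, one way to
keep the three specs in sync is to render them from a single template.
A minimal sketch; the template path and the @BINDADDR@ placeholder are
my own invention, not anything GlusterFS knows about:
-------------------------------------------------------------------------
# Hypothetical per-node volfile generation (run with this node's IP):
$ BINDADDR=10.1.0.1
$ sed "s/@BINDADDR@/$BINDADDR/" /etc/glusterfs/test.vol.in \
    > /etc/glusterfs/test.vol
-------------------------------------------------------------------------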
/etc/glusterfs/test.vol
-------------------------------------------------------------------------
# Sat Feb 27 16:53:00 CET 2010 John Feuerstein <john at feurix.com>
#
# Single Process Model with AFR (Automatic File Replication).
##
## Storage backend
##
#
# POSIX STORAGE [*1*]
#
volume posix
type storage/posix
option directory /mnt/brick/test/glusterfs
end-volume
#
# POSIX LOCKS [*2*]
#
#volume locks
volume brick
type features/locks
subvolumes posix
end-volume
##
## Performance translators (server side)
##
#
# IO-Threads [*3*]
#
#volume brick
# type performance/io-threads
# subvolumes locks
# option thread-count 8
#end-volume
### End of performance translators
#
# TCP/IP server [*4*]
#
volume server
type protocol/server
subvolumes brick
option transport-type tcp
option transport.socket.bind-address 10.1.0.1 # FIXME
option transport.socket.listen-port 820
option transport.socket.nodelay on
option auth.addr.brick.allow 127.0.0.1,10.1.0.1,10.1.0.2,10.1.0.3
end-volume
#
# TCP/IP clients [*5*]
#
volume node1
type protocol/client
option remote-subvolume brick
option transport-type tcp/client
option remote-host 10.1.0.1
option remote-port 820
option transport.socket.nodelay on
end-volume
volume node2
type protocol/client
option remote-subvolume brick
option transport-type tcp/client
option remote-host 10.1.0.2
option remote-port 820
option transport.socket.nodelay on
end-volume
volume node3
type protocol/client
option remote-subvolume brick
option transport-type tcp/client
option remote-host 10.1.0.3
option remote-port 820
option transport.socket.nodelay on
end-volume
#
# Automatic File Replication Translator (AFR) [*6*]
#
# NOTE: "node3" is the primary metadata node, so this one *must*
# be listed first in all volume specs! Also, node3 is the global
# favorite-child with the definite file version if any conflict
# arises while self-healing...
#
volume afr
type cluster/replicate
subvolumes node3 node1 node2
option read-subvolume node2
option favorite-child node3
end-volume
##
## Performance translators (client side)
##
#
# IO-Threads [*7*]
#
#volume client-threads-1
# type performance/io-threads
# subvolumes afr
# option thread-count 8
#end-volume
#
# Write-Behind [*8*]
#
volume wb
type performance/write-behind
subvolumes afr
option cache-size 4MB
end-volume
#
# Read-Ahead [*9*]
#
#volume ra
# type performance/read-ahead
# subvolumes wb
# option page-count 2
#end-volume
#
# IO-Cache [*10*]
#
volume cache
type performance/io-cache
subvolumes wb
option cache-size 1024MB
option cache-timeout 60
end-volume
#
# Quick-Read for small files [*11*]
#
#volume qr
# type performance/quick-read
# subvolumes cache
# option cache-timeout 60
#end-volume
#
# Metadata prefetch [*12*]
#
#volume sp
# type performance/stat-prefetch
# subvolumes qr
#end-volume
#
# IO-Threads [*13*]
#
#volume client-threads-2
# type performance/io-threads
# subvolumes sp
# option thread-count 16
#end-volume
### End of performance translators.
-------------------------------------------------------------------------
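If you prefer, the client can also be started directly against the
volfile instead of going through fstab while iterating on the spec;
this should be equivalent to the fstab-based mount:
-------------------------------------------------------------------------
$ glusterfs -f /etc/glusterfs/test.vol /mnt/glusterfs/test
-------------------------------------------------------------------------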
So let's start now. Unless explicitly stated otherwise, run the
following on all nodes:
# Prepare filesystem mountpoints
$ mkdir -p /mnt/brick/test
# Mount bricks
$ mount /mnt/brick/test
# Prepare brick roots (so lost+found won't end up in the volume)
$ mkdir -p /mnt/brick/test/glusterfs
# Load FUSE
$ modprobe fuse
# Prepare GlusterFS mountpoints
$ mkdir -p /mnt/glusterfs/test
# Mount GlusterFS
# (we start with Node 3 which should become the metadata master)
node3 $ mount /mnt/glusterfs/test
node1 $ mount /mnt/glusterfs/test
node2 $ mount /mnt/glusterfs/test
# While doing the tests, we watch the logs on all nodes for errors:
$ tail -f /var/log/glusterfs/test.log
For each volume spec change, you have to unmount GlusterFS, change the
vol file, and mount GlusterFS again. Before starting tests, make sure
everything is running and the volumes on all nodes are attached (watch
the log files!).
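To make that edit/remount cycle less error-prone, a small helper along
these lines can be used between runs (just a sketch; paths match the
fstab above):
-------------------------------------------------------------------------
#!/bin/bash
# Remount the GlusterFS volume after a volfile change and verify that
# the mount actually came back before benchmarking.
umount /mnt/glusterfs/test
mount /mnt/glusterfs/test || exit 1
grep -q /mnt/glusterfs/test /proc/mounts \
    && echo "volume mounted" \
    || echo "mount missing, check /var/log/glusterfs/test.log!"
-------------------------------------------------------------------------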
Write the test data for the read-only tests. These are lots of 20K
files, which resemble most of our css/js/php/python files. You should
adjust this to match your workload...
-------------------------------------------------------------------------
#!/bin/bash
# Create 100 top dirs x 10 subdirs x 10 files = 10,000 files of 20K each.
mkdir -p /mnt/glusterfs/test/data
cd /mnt/glusterfs/test/data
for topdir in x{1..100}
do
    mkdir -p $topdir
    cd $topdir
    for subdir in y{1..10}
    do
        mkdir $subdir
        cd $subdir
        for file in z{1..10}
        do
            # Include the loop variable in the name so that a $RANDOM
            # collision can't silently overwrite a file.
            dd if=/dev/zero of=20K-$file-$RANDOM \
                bs=4K count=5 &> /dev/null && echo -n .
        done
        cd ..
    done
    cd ..
done
-------------------------------------------------------------------------
OK, in our case /mnt/glusterfs/test/data is now populated with roughly
240M of data... enough for some simple tests.
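If you want to double-check what the generator produced before timing
anything:
-------------------------------------------------------------------------
$ find /mnt/glusterfs/test/data -type f | wc -l   # expect 10000 files
$ du -sh /mnt/glusterfs/test/data   # roughly 200-240M incl. overhead
-------------------------------------------------------------------------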
Each test run consists of this simplified simulation of sequentially
reading all files, listing dirs, and stat()ing the entries:
-------------------------------------------------------------------------
$ cd /mnt/glusterfs/test/data
# Always populate the io-cache first:
$ time tar cf - . > /dev/null
# Simulate and time 100 concurrent data consumers:
$ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait
-------------------------------------------------------------------------
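A single run can be noisy, so it may be worth repeating each
measurement a few times and comparing the wall-clock times; a simple
sketch of that (the run count of 3 is arbitrary):
-------------------------------------------------------------------------
#!/bin/bash
cd /mnt/glusterfs/test/data
tar cf - . > /dev/null          # populate the io-cache once
for run in 1 2 3
do
    echo "run $run:"
    # 100 concurrent readers; only "real" is interesting here
    time ( for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; wait )
done
-------------------------------------------------------------------------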
OK, so here are the results. As stated, take them with a grain of salt,
and make sure your test resembles your own workload. For example,
read-ahead is, as we can see, useless in this case, but it might improve
performance for files of a different size... :)
# All translators active except *7* (client io-threads after AFR)
real 2m27.555s
user 0m3.536s
sys 0m6.888s
# All translators active except *13* (client io-threads at the end)
real 2m23.779s
user 0m2.824s
sys 0m5.604s
# All translators active except *7* and *13* (no client io-threads!)
real 0m53.097s
user 0m3.512s
sys 0m6.436s
# All translators active except *7*, *13*, and with only 8 io-threads
# in *3* instead of the default of 16 (server-side io-threads)
real 0m45.942s
user 0m3.472s
sys 0m6.612s
# All translators active except *3*, *7*, *13* (no io-threads at all!)
real 0m40.332s
user 0m3.776s
sys 0m6.424s
# All translators active except *3*, *7*, *12*, *13* (no stat prefetch)
real 0m39.205s
user 0m3.672s
sys 0m6.084s
# All translators active except *3*, *7*, *11*, *12*, *13*
# (no quickread)
real 0m39.116s
user 0m3.652s
sys 0m5.816s
# All translators active except *3*, *7*, *11*, *12*, *13* and
# with page-count = 2 in *9* instead of 4
real 0m38.851s
user 0m3.492s
sys 0m5.796s
# All translators active except *3*, *7*, *9*, *11*, *12*, *13*
# (no read-ahead)
real 0m38.576s
user 0m3.356s
sys 0m6.076s
OK, that's it. Compare the result with nearly all performance
translators active against the final basic setup without any of the
magic:
with all performance translators: real 2m27.555s
without most performance translators: real 0m38.576s
This is a _HUGE_ improvement: 147.6s vs. 38.6s is roughly 3.8x faster,
which is the ~400% gain mentioned at the top!
(disregard user and sys, they were practically the same in all tests)
Some final words:
- don't add performance translators blindly (!)
- always test with a workload similar to the one you will use in production
- never go and copy+paste a volume spec, then moan about bad performance
- don't rely on "glusterfs-volgen", it gives you just a starting point!
- fewer translators == less overhead
- read documentation for all options of all translators and get an idea:
http://www.gluster.com/community/documentation/index.php/Translators
(some stuff is still undocumented, but this is open source... so have a
look)
Best regards,
John Feuerstein