[Gluster-users] GlusterFS 3.0.2 small file read performance benchmark

Raghavendra G raghavendra at gluster.com
Wed Mar 17 12:40:39 UTC 2010


Hi John,

When stdout is redirected to /dev/null, tar on my laptop does not perform
any reads (tar cf - . > /dev/null). When the output is redirected to any
file other than /dev/null, tar does read the data. Can you confirm whether
tar behaves the same way on your test setup, and attach an strace of tar?
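
If it helps, here is one way to capture that (a sketch: it assumes strace
is installed on the test nodes, uses the mount point from your mail, and
the output file names are just placeholders):

  cd /mnt/glusterfs/test/data
  # trace only the read() syscall; line counts are approximate, but good
  # enough to compare the two cases
  strace -f -e trace=read -o /tmp/tar-devnull.trace tar cf - . > /dev/null
  strace -f -e trace=read -o /tmp/tar-file.trace    tar cf /tmp/test.tar .
  wc -l /tmp/tar-devnull.trace /tmp/tar-file.trace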

regards,
On Sat, Feb 27, 2010 at 9:03 PM, John Feuerstein <john at feurix.com> wrote:

> Greetings,
>
> in contrast to some performance tips regarding small file *read*
> performance, I want to share these results. The test is rather simple
> but yields a remarkable result: roughly a 4x improvement in read
> performance simply by dropping some of the so-called "performance
> translators"!
>
> Please note that this test models a simplified version of our workload,
> which is more or less sequential, read-only small-file serving with an
> average of 100 concurrent clients. (We use GlusterFS as a flat-file
> backend for a cluster of webservers; it is hit only after a miss in a
> more sophisticated caching infrastructure layered on top of it.)
>
> The test setup is a three-node AFR cluster with server+client on each
> node in a single-process model (one volfile; the local volume is attached
> within the same process to save overhead), connected via 1 Gbit Ethernet.
> This way each node can continue to operate on its own, even if the whole
> internal network for GlusterFS is down.
>
> We used commodity hardware for the test. Each node is identical:
> - Intel Core i7
> - 12G RAM
> - 500GB filesystem
> - 1 Gbit NIC dedicated for GlusterFS
>
> Software:
> - Linux 2.6.32.8
> - GlusterFS 3.0.2
> - FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
> - Filesystem / Storage Backend:
>  - LVM2 on top of software RAID 1
>  - ext4 with noatime
>
> I will paste the configurations inline, so people can comment on them.
>
>
> /etc/fstab:
> -------------------------------------------------------------------------
> /dev/data/test /mnt/brick/test  ext4    noatime   0 2
>
> /etc/glusterfs/test.vol  /mnt/glusterfs/test  glusterfs
> noauto,noatime,log-level=NORMAL,log-file=/var/log/glusterfs/test.log 0 0
> -------------------------------------------------------------------------
>
>
> ***
> Please note: this is the final configuration with the best results. All
> translators are numbered to make the explanation easier later on. Unused
> translators are commented out...
> The volume spec is identical on all nodes, except that the bind-address
> option in the server volume [*4*] is adjusted.
> ***
>
> /etc/glusterfs/test.vol
> -------------------------------------------------------------------------
> # Sat Feb 27 16:53:00 CET 2010 John Feuerstein <john at feurix.com>
> #
> # Single Process Model with AFR (Automatic File Replication).
>
>
> ##
> ## Storage backend
> ##
>
> #
> # POSIX STORAGE [*1*]
> #
> volume posix
>  type storage/posix
>  option directory /mnt/brick/test/glusterfs
> end-volume
>
> #
> # POSIX LOCKS [*2*]
> #
> #volume locks
> volume brick
>  type features/locks
>  subvolumes posix
> end-volume
>
>
> ##
> ## Performance translators (server side)
> ##
>
> #
> # IO-Threads [*3*]
> #
> #volume brick
> #  type performance/io-threads
> #  subvolumes locks
> #  option thread-count 8
> #end-volume
>
> ### End of performance translators
>
>
> #
> # TCP/IP server [*4*]
> #
> volume server
>  type protocol/server
>  subvolumes brick
>  option transport-type tcp
>  option transport.socket.bind-address 10.1.0.1   # FIXME
>  option transport.socket.listen-port 820
>  option transport.socket.nodelay on
>  option auth.addr.brick.allow 127.0.0.1,10.1.0.1,10.1.0.2,10.1.0.3
> end-volume
>
>
> #
> # TCP/IP clients [*5*]
> #
> volume node1
>  type protocol/client
>  option remote-subvolume brick
>  option transport-type tcp/client
>  option remote-host 10.1.0.1
>  option remote-port 820
>  option transport.socket.nodelay on
> end-volume
>
> volume node2
>  type protocol/client
>  option remote-subvolume brick
>  option transport-type tcp/client
>  option remote-host 10.1.0.2
>  option remote-port 820
>  option transport.socket.nodelay on
> end-volume
>
> volume node3
>  type protocol/client
>  option remote-subvolume brick
>  option transport-type tcp/client
>  option remote-host 10.1.0.3
>  option remote-port 820
>  option transport.socket.nodelay on
> end-volume
>
>
> #
> # Automatic File Replication Translator (AFR) [*6*]
> #
> # NOTE: "node3" is the primary metadata node, so this one *must*
> #       be listed first in all volume specs! Also, node3 is the global
> #       favorite-child, whose copy is taken as the definitive version
> #       if any conflict arises during self-healing...
> #
> volume afr
>  type cluster/replicate
>  subvolumes node3 node1 node2
>  option read-subvolume node2
>  option favorite-child node3
> end-volume
>
>
>
> ##
> ## Performance translators (client side)
> ##
>
> #
> # IO-Threads [*7*]
> #
> #volume client-threads-1
> #  type performance/io-threads
> #  subvolumes afr
> #  option thread-count 8
> #end-volume
>
> #
> # Write-Behind [*8*]
> #
> volume wb
>  type performance/write-behind
>  subvolumes afr
>  option cache-size 4MB
> end-volume
>
>
> #
> # Read-Ahead [*9*]
> #
> #volume ra
> #  type performance/read-ahead
> #  subvolumes wb
> #  option page-count 2
> #end-volume
>
>
> #
> # IO-Cache [*10*]
> #
> volume cache
>  type performance/io-cache
>  subvolumes wb
>  option cache-size 1024MB
>  option cache-timeout 60
> end-volume
>
> #
> # Quick-Read for small files [*11*]
> #
> #volume qr
> #  type performance/quick-read
> #  subvolumes cache
> #  option cache-timeout 60
> #end-volume
>
> #
> # Metadata prefetch [*12*]
> #
> #volume sp
> #  type performance/stat-prefetch
> #  subvolumes qr
> #end-volume
>
> #
> # IO-Threads [*13*]
> #
> #volume client-threads-2
> #  type performance/io-threads
> #  subvolumes sp
> #  option thread-count 16
> #end-volume
>
> ### End of performance translators.
> -------------------------------------------------------------------------
>
>
>
> So let's start now. Unless explicitly stated otherwise, perform each
> step on all nodes:
>
> # Prepare filesystem mountpoints
> $ mkdir -p /mnt/brick/test
>
> # Mount bricks
> $ mount /mnt/brick/test
>
> # Prepare brick roots (so lost+found won't end up in the volume)
> $ mkdir -p /mnt/brick/test/glusterfs
>
> # Load FUSE
> $ modprobe fuse
>
> # Prepare GlusterFS mountpoints
> $ mkdir -p /mnt/glusterfs/test
>
> # Mount GlusterFS
> # (we start with Node 3 which should become the metadata master)
> node3 $ mount /mnt/glusterfs/test
> node1 $ mount /mnt/glusterfs/test
> node2 $ mount /mnt/glusterfs/test
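>
> # A quick sanity check that the GlusterFS mounts actually came up
> # (a sketch; nothing GlusterFS-specific, adjust to taste)
> $ df -h /mnt/glusterfs/test
> $ grep glusterfs /proc/mounts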
>
> # While doing the tests, we watch the logs on all nodes for errors:
> $ tail -f /var/log/glusterfs/test.log
>
> For each volume spec change, you have to unmount GlusterFS, change the
> vol file, and mount GlusterFS again. Before starting tests, make sure
> everything is running and the volumes on all nodes are attached (watch
> the log files!).
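>
> For example, a full remount cycle on one node might look like this
> (a sketch; substitute your preferred editor and repeat on every node):
>
> -------------------------------------------------------------------------
> umount /mnt/glusterfs/test
> vi /etc/glusterfs/test.vol       # edit the volume spec
> mount /mnt/glusterfs/test
> # confirm the server volume is listening on its port and check for errors
> netstat -tlnp | grep ':820 '
> tail -n 20 /var/log/glusterfs/test.log
> -------------------------------------------------------------------------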
>
>
> Write the test data for the read-only tests. These are lots of 20K
> files, which resemble most of our css/js/php/python files. You should
> adjust this to match your workload...
> -------------------------------------------------------------------------
> #!/bin/bash
> mkdir -p /mnt/glusterfs/test/data
> cd /mnt/glusterfs/test/data
> for topdir in x{1..100}
> do
>    mkdir -p $topdir
>    cd $topdir
>    for subdir in y{1..10}
>    do
>        mkdir $subdir
>        cd $subdir
>        for file in z{1..10}
>        do
>            # use $file in the name so a repeated $RANDOM value cannot
>            # silently overwrite an earlier file in the same directory
>            dd if=/dev/zero of=20K-$file-$RANDOM \
>                bs=4K count=5 &> /dev/null && echo -n .
>        done
>        cd ..
>    done
>    cd ..
> done
> -------------------------------------------------------------------------
>
> OK, in our case /mnt/glusterfs/test/data is now populated with roughly
> 240M of data... enough for some simple tests.
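>
> A quick way to double-check the data set before the runs (a sketch; the
> counts assume the generator script above completed over the full tree):
>
> -------------------------------------------------------------------------
> cd /mnt/glusterfs/test/data
> find . -type f | wc -l    # expect roughly 100 * 10 * 10 = 10000 files
> du -sh .                  # roughly 200-250M including filesystem overhead
> -------------------------------------------------------------------------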
>
> Each test run consists of this simplified simulation of sequentially
> reading all files, listing directories, and stat()ing entries:
>
> -------------------------------------------------------------------------
> $ cd /mnt/glusterfs/test/data
>
> # Always populate the io-cache first:
> $ time tar cf - . > /dev/null
>
> # Simulate and time 100 concurrent data consumers:
> $ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait
> -------------------------------------------------------------------------
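>
> If you want to repeat a configuration a few times and keep the numbers, a
> small wrapper along these lines works (a sketch; /root/results.txt is just
> a placeholder for wherever you collect timings):
>
> -------------------------------------------------------------------------
> cd /mnt/glusterfs/test/data
> for run in 1 2 3
> do
>     ( for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done
>       time wait ) 2>> /root/results.txt
> done
> -------------------------------------------------------------------------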
>
>
> OK, so here are the results. As stated, take them with a grain of salt
> and make sure your test resembles your own workload. For example,
> read-ahead is, as we will see, useless in this case, but it might improve
> performance for files of a different size... :)
>
>
> # All translators active except *7* (client io-threads after AFR)
> real    2m27.555s
> user    0m3.536s
> sys     0m6.888s
>
> # All translators active except *13* (client io-threads at the end)
> real    2m23.779s
> user    0m2.824s
> sys     0m5.604s
>
> # All translators active except *7* and *13* (no client io-threads!)
> real    0m53.097s
> user    0m3.512s
> sys     0m6.436s
>
> # All translators active except *7*, *13*, and with only 8 io-threads
> # in *3* instead of the default of 16 (server-side io-threads)
> real    0m45.942s
> user    0m3.472s
> sys     0m6.612s
>
> # All translators active except *3*, *7*, *13* (no io-threads at all!)
> real    0m40.332s
> user    0m3.776s
> sys     0m6.424s
>
> # All translators active except *3*, *7*, *12*, *13* (no stat prefetch)
> real    0m39.205s
> user    0m3.672s
> sys     0m6.084s
>
> # All translators active except *3*, *7*, *11*, *12*, *13*
> #  (no quickread)
> real    0m39.116s
> user    0m3.652s
> sys     0m5.816s
>
> # All translators active except *3*, *7*, *11*, *12*, *13* and
> # with page-count = 2 in *9* instead of 4
> real    0m38.851s
> user    0m3.492s
> sys     0m5.796s
>
> # All translators active except *3*, *7*, *9*, *11*, *12*, *13*
> #  (no read-ahead)
> real    0m38.576s
> user    0m3.356s
> sys     0m6.076s
>
>
> OK, that's it. Compare the run with all performance translators to the
> final, basic setup without any of the magic:
>
> with all performance translators:       real    2m27.555s
> without most performance translators:   real    0m38.576s
>
> That is 147.6 s versus 38.6 s, roughly a 3.8x speedup: a _HUGE_
> improvement!
>
> (disregard user and sys, they were practically the same in all tests)
>
>
> Some final words:
>
> - don't add performance translators blindly (!)
> - always test with a workload similar to the one you will use in production
> - never copy+paste a volume spec and then moan about bad performance
> - don't rely on "glusterfs-volgen"; it only gives you a starting point!
> - fewer translators == less overhead
> - read the documentation for all options of all translators to get an idea:
> http://www.gluster.com/community/documentation/index.php/Translators
> (some of it is still undocumented, but this is open source... so have a
> look at the code)
>
>
> Best regards,
> John Feuerstein
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>



-- 
Raghavendra G

