[Gluster-users] AFR write performance

Fri Nov 20 22:28:18 UTC 2009

Greetings,

I am having what I perceive to be AFR performance problems.  Before I 
get to that, I will briefly describe the setup...

/** Setup **/

I have glusterfs up and running using the gluster optimized fuse module 
on the latest CentOS 5 kernel (2.6.18-164.6.1.el5 #1 SMP) on two 
machines (server and client volume configurations are below).  Both the 
server and client run on both machines.  The two machines are connected 
by a single CAT6 cable running directly between the Gigabit NICs 
dedicated to this task (no switch is used).  My goal is simply to mirror 
files across both servers.  The files themselves are mixed, but there 
are many of them and about 90% are under 50K.  Each server has a 
quad-core Q6600 processor with 8GB of RAM.  The disks are quite speedy: 
15K RPM SAS drives hooked to a 3ware controller (RAID 5 with a 
512MB cache).  The filesystem is ext3 mounted with noatime.  Writing 
directly to the ext3 partition with dd if=/dev/zero of=/sites/disktest 
bs=1M count=2048 yields 2147483648 bytes (2.1 GB) copied, 4.68686 
seconds, 458 MB/s.  Kernel optimizations on both servers outside of a 
stock CentOS 5 setup include:

3ware controller-specific settings to avoid iowait latency under load:

echo 64 > /sys/block/sda/queue/max_sectors_kb
/sbin/blockdev --setra 8192 /dev/sda
echo 128 > /sys/block/sda/queue/nr_requests
echo 64 > /sys/block/sda/device/queue_depth
echo 10 > /proc/sys/vm/swappiness
echo 16 > /proc/sys/vm/page-cluster
echo 2 > /proc/sys/vm/dirty_background_ratio
echo 40 > /proc/sys/vm/dirty_ratio

Tweaks for better network performance (sysctl.conf):

net/core/rmem_max = 8738000
net/core/wmem_max = 8738000

net/ipv4/tcp_rmem = 8192 873800 8738000
net/ipv4/tcp_wmem = 4096 873800 8738000

/** Gluster Results **/

It should be noted that for the below test results I did not see high 
CPU or IOwait times during the tests.  Also, there are no other active 
processes running on either server.  Doing a simple write test using "dd 
if=/dev/zero of=/sites/sites/glustertest bs=1M count=2048" I am seeing:

2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 27.8451 seconds, 77.1 MB/s

Which is acceptable for my purposes.  I expected around 80 MB/s with the 
gigabit NICs being the obvious bottleneck.  So for a more real-world 
test using the actual files to be clustered, I took a small subset of 
the files (22016 of them - 440M in total) and extracted them from a 
tarball onto the /sites/sites (mount point for the tests) replicated 
cluster.  It took 17m28.972s to extract all files.  By way of comparison 
it takes 0m5.102s when extracting just to the ext3 partition.  Here are 
some unlinking times (for the 22016 files) - gluster mount: 0m28.428s, 
ext3: 0m0.456s.  And here is the read time on the gluster mount: 
0m19.871s.  Note from my config that these times are with 
"option flush-behind on" for write-behind.  During the write test, I 
monitored NIC stats on the receiving server to see how much the link was 
utilized.  Its peak was 4.80Mb, so the NIC was not the bottleneck 
either.  I just cannot find the holdup: the network, disks, and CPU are 
not loaded during the write test.
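
To put those timings on a per-file basis, here is a quick sketch of the 
arithmetic.  The helper names (to_seconds, per_file_ms, avg_mb_per_s) are 
made up for illustration, and I am assuming the "440M" total means 
megabytes:

```shell
#!/bin/sh
# Hypothetical helpers; all numbers are taken from the timings quoted above.

# Convert a "time"-style value such as 17m28.972s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Average cost per file in milliseconds: elapsed seconds / file count * 1000.
per_file_ms() {
    awk -v t="$1" -v n="$2" 'BEGIN { printf "%.2f\n", t / n * 1000 }'
}

# Average throughput in MB/s: total megabytes / elapsed seconds.
avg_mb_per_s() {
    awk -v mb="$1" -v t="$2" 'BEGIN { printf "%.2f\n", mb / t }'
}

FILES=22016
gluster_s=$(to_seconds 17m28.972s)
ext3_s=$(to_seconds 0m5.102s)

echo "gluster extract: $(per_file_ms "$gluster_s" "$FILES") ms per file"
echo "ext3 extract:    $(per_file_ms "$ext3_s" "$FILES") ms per file"
echo "gluster average: $(avg_mb_per_s 440 "$gluster_s") MB/s"
```

That works out to roughly 47.65 ms per file on the gluster mount against 
about 0.23 ms on ext3, and an average of about 0.42 MB/s over the whole 
extract.  To me that looks more like a fixed cost per file than a 
bandwidth limit, but I may be misreading it.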

So the biggest issue seems to be AFR write performance.  Is this normal 
or is there something specific to my setup causing these problems?  
Obviously I am new to glusterfs so I do not know what to expect, but I 
think I must be doing something wrong.
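
In case it helps anyone reproduce this without my tarball, a loop along 
these lines should generate a similar small-file write pattern.  This is 
only a sketch: the make_small_files helper, the TARGET_DIR variable, and 
the 200 x 20K sizing are my own placeholders, not anything from gluster:

```shell
#!/bin/sh
# Hypothetical helper: write COUNT files of SIZE_KB kilobytes each into DIR.
make_small_files() {
    dir=$1
    count=$2
    size_kb=$3
    mkdir -p "$dir" || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        dd if=/dev/zero of="$dir/file_$i" bs=1024 count="$size_kb" 2>/dev/null || return 1
        i=$((i + 1))
    done
}

# TARGET_DIR is an assumed variable name; falls back to a throwaway temp dir.
target=${TARGET_DIR:-$(mktemp -d)}

# Whole-second resolution only, which is enough to see a large gap.
start=$(date +%s)
make_small_files "$target/smalltest" 200 20
end=$(date +%s)
echo "wrote 200 files of 20K to $target/smalltest in $((end - start)) s"
```

Running it once with TARGET_DIR=/sites/sites (the gluster mount) and once 
with TARGET_DIR=/sites (plain ext3) should show whether the gap follows 
the number of files rather than the amount of data.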

Any help/advice/direction is greatly appreciated.  I have googled and 
googled and found no advice that has yielded real results.  Sorry if I 
missed something obvious that was documented.

Michael

Volume files (same on each server) were first created using:

/usr/bin/glusterfs-volgen --raid 1 --cache-size 512MB --export-directory 
/sites_gfs --name sites1 172.16.0.1 172.16.0.2

/** Server - adapted from generated to add one other directory **/

volume posix_sites
  type storage/posix
  option directory /sites_gfs
end-volume

volume posix_phplib
  type storage/posix
  option directory /usr/local/lib/php_gfs
end-volume

volume locks_sites
    type features/locks
    subvolumes posix_sites
end-volume

volume locks_phplib
    type features/locks
    subvolumes posix_phplib
end-volume

volume brick_sites
    type performance/io-threads
    option thread-count 8
    subvolumes locks_sites
end-volume

volume brick_phplib
    type performance/io-threads
    option thread-count 8
    subvolumes locks_phplib
end-volume

volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.brick_sites.allow *
    option auth.addr.brick_phplib.allow *
    option listen-port 6996
    subvolumes brick_sites brick_phplib
end-volume

/** Client - adapted from generated to try and fix write issues - to no 
avail **/

volume 172.16.0.1
    type protocol/client
    option transport-type tcp
    option remote-host 172.16.0.1
    option remote-port 6996
    option remote-subvolume brick_sites
end-volume

volume 172.16.0.2
    type protocol/client
    option transport-type tcp
    option remote-host 172.16.0.2
    option remote-port 6996
    option remote-subvolume brick_sites
end-volume

volume mirror-0
    type cluster/replicate
    subvolumes 172.16.0.1 172.16.0.2
end-volume

volume writebehind
    type performance/write-behind
    option cache-size 1MB
    option flush-behind on
    subvolumes mirror-0
end-volume

volume io-cache
    type performance/io-cache
    option cache-size 64MB
    subvolumes writebehind
end-volume