[Gluster-users] Gluster 1.3.10 Performance Issues

Wed Aug 6 18:48:14 UTC 2008

OS: Debian Linux/4.1, 64bit build
Hardware: quad-core Xeon X3220, 8 GB RAM, dual 7200 RPM 1000 GB WD
hard drives, a 750 GB RAID 1 partition mounted as /gfsvol to be
exported, dual gigabit Ethernet, Juniper EX3200 switch

Fuse libraries: fuse-2.7.3glfs10
Gluster: glusterfs-1.3.10

Running bonnie++ on both machines gives almost identical numbers.
eth1 is reserved exclusively for server-to-server communication.
Right now, the only load on these machines comes from my testbed.
Four tests give a reasonable indication of performance:

* loading a wordpress blog and looking at the line:

* dd if=/dev/zero of=/gfs/test/out bs=1M count=512
* time tar xjf /gfs/test/linux-2.6.26.1.tar.bz2
* /usr/sbin/bonnie++ /gfs/test/
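To run the dd and tar tests the same way against every configuration, they can be wrapped in small functions that take the target directory as an argument. This is a minimal sketch, assuming bash, GNU coreutils dd, GNU tar (for `-C`) and GNU date (for `%N`); the function names `dd_test` and `untar_test` are hypothetical. The WordPress and bonnie++ tests are left as they are.

```shell
# Hypothetical wrappers around the dd and tar tests above. The target
# directory is an argument, so the same functions can be pointed at the
# local RAID 1 partition for a baseline or at the GlusterFS mount.

# dd_test DIR [COUNT]: write COUNT (default 512) 1 MiB blocks of zeros
# to DIR/out and print dd's final summary line.
dd_test() {
    local dir=$1 count=${2:-512}
    # dd prints its statistics on stderr; merge and keep the last line.
    dd if=/dev/zero of="$dir/out" bs=1M count="$count" 2>&1 | tail -n 1
}

# untar_test DIR TARBALL: unpack a bzip2 tarball into DIR and print the
# elapsed wall-clock time in seconds (GNU date is needed for %N).
untar_test() {
    local dir=$1 tarball=$2 start end
    start=$(date +%s.%N)
    tar xjf "$tarball" -C "$dir" || return 1
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
}
```

For example, `dd_test /gfs/test` and `untar_test /gfs/test /gfs/test/linux-2.6.26.1.tar.bz2` reproduce the second and third tests, and pointing the same calls at a directory on /gfsvol gives the baseline.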

On the WordPress test, 0.3 seconds is typical. Across various
GlusterFS configurations I've seen between 0.411 seconds (the
server-side AFR config below) and 1.2 seconds (some of the example
configurations). My current client-side AFR config consistently comes
in at 0.5xx seconds.

On the client-side AFR config, the second test reports 536870912
bytes (537 MB) copied, 4.65395 s, 115 MB/s.
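The rate dd prints is bytes copied divided by elapsed seconds, expressed in decimal megabytes (10^6 bytes). A small helper, hypothetically named `dd_rate`, reproduces both this figure and the 0.489522 s local baseline reported further down, which makes it easy to compare runs that used different counts.

```shell
# dd_rate BYTES SECONDS: throughput in decimal megabytes per second,
# the same unit GNU dd uses when it prints "MB/s".
dd_rate() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1e6 }'
}
```

`dd_rate 536870912 4.65395` prints 115 and `dd_rate 536870912 0.489522` prints 1097 (about 1.1 GB/s). For comparison, 1 Gbit/s is 125 MB/s in the same units. The 1.1 GB/s baseline is far above what a RAID 1 pair of 7200 RPM drives can sustain, so it most likely measures the page cache; GNU dd's `conv=fdatasync` option would include the final flush in the timing.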

The third test, unpacking a kernel, has ranged from 28 seconds with
server-side AFR to over 6 minutes on some configurations. The current
client-side AFR config takes about 17 minutes.

The fourth test, a bonnie++ run, takes anywhere from 36 minutes with
server-side AFR to 80 minutes with the client-side AFR config.

The current test environment uses both machines as clients and
servers. If I can get reasonable performance, the existing machines
will become clients and the servers will move to their own platform,
so I want every connection to go over TCP to stay as close to a
real-world deployment as possible. This means I cannot run a
client-only config.

Baseline WordPress: 0.311-0.399 seconds
Baseline dd: 536870912 bytes (537 MB) copied, 0.489522 s, 1.1 GB/s
Baseline tar xjf of the kernel: real 0m12.164s
Baseline bonnie++ run on the RAID 1 partition (pipe the line through
bon_csv2txt, e.g. echo data | bon_csv2txt, for the text report):

c1ws1,16G,66470,97,93198,16,42430,6,60253,86,97153,7,381.3,0,16,7534,37,+++++,+++,5957,23,7320,34,+++++,+++,4667,21
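These CSV lines are easier to read once each field is labeled. bon_csv2txt does this as a table; the awk sketch below instead prints one label=value pair per field, which is convenient for grep and diff. It assumes the 27-column CSV layout of bonnie++ 1.03, which matches the lines in this message; the function name `bonnie_fields` and the short labels are informal names chosen for illustration. A value of `+++++` means the test finished too quickly for bonnie++ to report a meaningful figure.

```shell
# bonnie_fields: read bonnie++ 1.03-style CSV lines on stdin and print
# one label=value pair per field, in the 1.03 column order.
bonnie_fields() {
    local labels="name,size,putc,putc_cpu,put_block,put_block_cpu"
    labels="$labels,rewrite,rewrite_cpu,getc,getc_cpu,get_block,get_block_cpu"
    labels="$labels,seeks,seeks_cpu,num_files"
    labels="$labels,seq_create,seq_create_cpu,seq_stat,seq_stat_cpu,seq_del,seq_del_cpu"
    labels="$labels,ran_create,ran_create_cpu,ran_stat,ran_stat_cpu,ran_del,ran_del_cpu"
    awk -F, -v labels="$labels" '
        BEGIN { n = split(labels, label, ",") }
        {
            gsub(/[ \t]/, "")   # drop stray spaces left by pasted lines
            for (i = 1; i <= NF && i <= n; i++) print label[i] "=" $i
        }'
}
```

For the baseline line above, this prints `seq_create=7534`, i.e. 7534 sequential file creates per second on the local RAID 1 partition.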

So far, the best performance I've managed was server-side AFR with
write-behind/read-ahead on the server, aggregate-size set to 0MB, and
write-behind/read-ahead on the client. That resulted in:

c1ws2,16G,37636,50,76855,3,17429,2,60376,76,87653,3,158.6,0,16,1741,3,9683,6,2591,3,2030,3,9790,5,2369,3

It was suggested on IRC that client-side AFR would be faster and more
reliable; however, this is the best result I've gotten from multiple
attempts:

c1ws1,16G,46041,58,76811,2,4603,0,59140,76,86103,3,132.4,0,16,1069,2,4795,2,1308,2,1045,2,5209,2,1246,2
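Since file creation is the main concern, it helps to express each run as a fraction of the local baseline. The sketch below, with the hypothetical name `pct_of_baseline`, takes a 1-based field number in the same bonnie++ 1.03 layout (field 7 is rewrite K/s, 16 is sequential creates/s, 22 is random creates/s) plus the baseline and test CSV lines, and prints the test value as a percentage of the baseline.

```shell
# pct_of_baseline FIELD BASELINE_CSV RUN_CSV: print the RUN value of a
# 1-based CSV field as a percentage of the BASELINE value, or "n/a" if
# either value is not a plain number (e.g. +++++) or the baseline is 0.
pct_of_baseline() {
    printf '%s\n%s\n' "$2" "$3" | awk -F, -v f="$1" '
        { gsub(/[ \t]/, "") }          # strip spaces; fields are re-split
        NR == 1 { base = $f; next }
        NR == 2 {
            if (base !~ /^[0-9.]+$/ || $f !~ /^[0-9.]+$/ || base + 0 == 0)
                print "n/a"
            else
                printf "%.1f\n", 100 * $f / base
        }'
}
```

With the three lines in this message, field 16 comes out at 23.1% of baseline for server-side AFR and 14.2% for client-side AFR, while field 7 (rewrite) is 41.1% and 10.8% respectively.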

The server-side AFR bonnie++ run that produced my best results to
date took 34 minutes; the latest client-side AFR run took 80 minutes.
Based on the website, I expected better performance than DRBD/GFS,
but so far that hasn't been the case.

It's been suggested that I use unify-rr-afr. In my current setup, it
seems that would require breaking up my RAID set, which is my next
debugging step. Rather than RAID 1 on each server, I would have two
bricks per server, which would allow the use of unify and the rr
scheduler.

glusterfs-1.4.0qa32 produces the following errors when Apache (not
mod_gluster) tries to serve files off the glusterfs partition:

[Wed Aug 06 02:01:44 2008] [notice] child pid 14025 exit signal Bus error (7)
[Wed Aug 06 02:01:44 2008] [notice] child pid 14037 exit signal Bus error (7)

The main issue right now is file creation speed. I realize that each
file created requires two network operations, but the kernel untar
results suggest that something is badly wrong in my configuration.
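One way to separate per-file overhead from bulk throughput is to time the creation of many empty files directly, since the untar test mixes file creation with data writes. This is a minimal sketch, assuming bash and GNU date; the function name `create_files` and the file-name pattern are hypothetical. Running it with the same count on /gfsvol and on the GlusterFS mount gives per-file latencies that can be compared directly.

```shell
# create_files DIR COUNT: create COUNT empty files in DIR, then print
# the total elapsed time and the average milliseconds per file.
create_files() {
    local dir=$1 count=$2 i start end
    mkdir -p "$dir" || return 1
    start=$(date +%s.%N)
    for ((i = 0; i < count; i++)); do
        : > "$dir/f$i" || return 1   # create (or truncate) an empty file
    done
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" -v n="$count" \
        'BEGIN { t = e - s; printf "%.3f s total, %.3f ms/file\n", t, 1000 * t / n }'
}
```

For example, `create_files /gfs/test/creates 1000` against the mount and `create_files /gfsvol/creates 1000` locally.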

I've tried moving the performance translators around, but some make
little difference on the server side, and the ones that seem to help
on the client side don't improve file creation.
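When translators are moved around, it is easy to lose track of the actual stacking order, because each volume only names the volume directly beneath it. The awk sketch below (hypothetical name `volfile_chain`) reads a volfile in the 1.3 syntax quoted at the end of this message and prints the stack from a given top volume down to storage. It follows only the first name on each `subvolumes` line, so for cluster/afr it shows a single branch.

```shell
# volfile_chain FILE TOP: starting at volume TOP in a GlusterFS 1.3
# volfile, follow each volume's first "subvolumes" entry and print the
# translator stack from top to bottom as "name (type)".
volfile_chain() {
    awk -v top="$2" '
        $1 == "volume"     { cur = $2 }
        $1 == "type"       { vtype[cur] = $2 }
        $1 == "subvolumes" { below[cur] = $2 }
        END {
            out = ""
            # stop at a volume with no subvolumes, or on a repeated name
            for (v = top; v != "" && !seen[v]++; v = below[v])
                out = out (out == "" ? "" : " -> ") v " (" vtype[v] ")"
            print out
        }' "$1"
}
```

For the server file, `volfile_chain /etc/gluster/gluster-server.vol server` should print `server (protocol/server) -> brick (performance/io-threads) -> readahead (performance/read-ahead) -> writebehind (performance/write-behind) -> plocks (features/posix-locks) -> posix (storage/posix)`.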

On a side note, to zresearch.com: I emailed you through your contact
form and haven't heard back. Please provide a quote for generating
the configuration and contact me off-list.

===/etc/gluster/gluster-server.vol
volume posix
     type storage/posix
     option directory /gfsvol/data
end-volume

volume plocks
   type features/posix-locks
   subvolumes posix
end-volume

volume writebehind
   type performance/write-behind
   option flush-behind off    # default is 'off'
   subvolumes plocks
end-volume

volume readahead
   type performance/read-ahead
   option page-size 128kB        # 256KB is the default option
   option page-count 4           # 2 is default option
   option force-atime-update off # default is off
   subvolumes writebehind
end-volume

volume brick
   type performance/io-threads
   option thread-count 4  # default is 1
   option cache-size 64MB #64MB
   subvolumes readahead
end-volume

volume server
     type protocol/server
     option transport-type tcp/server
     subvolumes brick
     option auth.ip.brick.allow 10.8.1.*,127.0.0.1
end-volume

===/etc/glusterfs/gluster-client.vol

volume brick1
     type protocol/client
     option transport-type tcp/client # for TCP/IP transport
     option remote-host 10.8.1.9   # IP address of server1
     option remote-subvolume brick    # name of the remote volume on server1
end-volume

volume brick2
     type protocol/client
     option transport-type tcp/client # for TCP/IP transport
     option remote-host 10.8.1.10   # IP address of server2
     option remote-subvolume brick    # name of the remote volume on server2
end-volume

volume afr
    type cluster/afr
    subvolumes brick1 brick2
end-volume

volume writebehind
   type performance/write-behind
   option aggregate-size 0MB
   option flush-behind off    # default is 'off'
   subvolumes afr
end-volume

volume readahead
   type performance/read-ahead
   option page-size 128kB        # 256KB is the default option
   option page-count 4           # 2 is default option
   option force-atime-update off # default is off
   subvolumes writebehind
end-volume