[Gluster-users] Gluster 1.3.10 Performance Issues

Chris Davies isp at daviesinc.com
Thu Aug 7 15:20:46 UTC 2008


On Aug 7, 2008, at 2:04 AM, Keith Freedman wrote:
> so server side afr takes 220% longer than client side AFR
>
>

> If all my assumptions are true, what might solve some of the
> problem (this would help both client side and server side) is to
> use additional network ports: either the server replicates over a
> different port or the client talks to the 2 servers over different ports.

It's not a complex test, and it's not a complex setup.  A glusterfs-mounted
partition set up with client side AFR, using the previously listed
hardware over a dedicated GigE port, was used to unpack:

$ ls -al linux-2.6.26.1.tar.bz2
-rw-r--r-- 1 daviesinc daviesinc 49459141 2008-08-01 19:04 linux-2.6.26.1.tar.bz2

The system was not I/O bound on the network, nor CPU bound.  Neither
server's CPU went above 3% for either gluster process, and the network
barely showed any activity, staying under 12 Mb/second.
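For anyone who wants to repeat that check from the host side rather than
on the switch, something like this works (sar from the sysstat package is
just one option here, any similar tool will do; 1200 seconds comfortably
covers the untar):

$ sar -u 1 1200 > cpu.log &                      # per-second CPU utilisation
$ sar -n DEV 1 1200 | grep eth1 > eth1.log &     # per-second rx/tx on the storage NIC
$ time tar xjf linux-2.6.26.1.tar.bz2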

> It would be interesting for you to rerun your tests with a multi-nic  
> configuration in both scenarios.
>
> It's safe to assume that at any speed, more is better :)

So you believe that to untar/unbzip2 a 49 MB file in under 17 minutes, I
need to bond 2 GigE connections?
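Back of the envelope (the tree size is approximate): the unpacked 2.6.26
source tree is a few hundred MB, so even written twice -- once per AFR
copy -- over 17+ minutes it averages out to a handful of Mb/s.  That lines
up with the ~4 Mb/s the switch shows and is a couple of orders of
magnitude below what a single GigE link carries.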

> Depending on your port speeds, which I don't recall, but I think you
> provided, your hardware disk configuration won't matter.  With 100BaseT
> you can probably do just as well with a single drive as with a RAID
> 0, 1, or 0+1.  With 1000BaseT or faster you will want a drive
> configuration that can sustain the data transfer you'll be needing.

This is done under clientside AFR; the file is written to both machines.
A 4.3 GB file almost hits wire speed between the nodes.

$ dd if=/dev/zero of=three bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 37.3989 s, 115 MB/s

$ time cp linux-2.6.26.1.tar.bz2 linux-2.6.26.1.tar.bz2.copy

real	0m0.573s
user	0m0.000s
sys	0m0.052s

I can copy the 49 MB file in 0.57 seconds.

During the tar xjf, switch stats show almost 500 pps, bandwidth almost
hits 4 Mb/sec (megabits), CPU shows glusterfs and glusterfsd at 1% each
and bzip2 at roughly 2%; tar rarely shows up in top, but when it does,
it's very close to the bottom of the page at 1%.

$ time tar xjf linux-2.6.26.1.tar.bz2

real	18m6.799s
user	0m12.877s
sys	0m1.416s
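For scale, the per-create cost is easy to estimate (the entry count below
is approximate; the exact figure comes from
tar tjf linux-2.6.26.1.tar.bz2 | wc -l): the 2.6.26 tree holds on the
order of 25,000 entries, and 1087 seconds spread across them is roughly
43 ms per create -- far more than the drives or a couple of LAN round
trips per file should cost.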

I'm not convinced that this is a network or hardware problem.


>
>
> Hope that wasn't confusing.
>
> At 10:05 PM 8/6/2008, Chris Davies wrote:
>> A continuation:
>>
>> I used XFS & MD RAID 1 on the partitions for the initial tests.
>> I tested reiser3 and reiser4 with no significant difference.
>> I re-raided to MD RAID 0 with XFS and received some improvement.
>>
>> I NFS mounted the partition and received bonnie++ numbers similar to
>> the best clientside AFR numbers I have been able to get, but
>> unpacking the kernel using NFSv4/UDP took 1 minute 47 seconds
>> compared with 12 seconds for the bare drive, 41 seconds for
>> serverside AFR and an average of 17 minutes for clientside AFR.
>>
>> If I turn off AFR, whether I mount the remote machine over the net or
>> use the local server's brick, tar xjf of a kernel takes roughly 29
>> seconds.
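Pulling the kernel-untar numbers from this thread together for comparison:

  bare drive           12 s
  glusterfs, no AFR    ~29 s
  serverside AFR       41 s
  NFSv4 over UDP       1 m 47 s
  clientside AFR       ~17-18 min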
>>
>> Large files replicate almost at wire speed.  rsync/cp -Rp of a large
>> directory takes considerable time.
>>
>> Both QA releases of 1.4.0 that I've attempted, 1.4.0qa32 and
>> 1.4.0qa33, have broken within minutes using my configurations.  I'll
>> turn debug logs on and post summaries of those.
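(For reference, this is roughly the invocation I have in mind for the
debug runs -- the log path and mountpoint are placeholders, and the flag
spelling is worth double-checking against glusterfs --help on your build:)

$ glusterfs -f /etc/glusterfs/gluster-client.vol -l /var/log/glusterfs/client-debug.log -L DEBUG /gfs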
>>
>> On Aug 6, 2008, at 2:48 PM, Chris Davies wrote:
>>
>> > OS: Debian Linux/4.1, 64-bit build
>> > Hardware: quad core Xeon X3220, 8 GB RAM, dual 7200 RPM 1000 GB WD
>> > hard drives, 750 GB RAID 1 partition set as /gfsvol to be exported,
>> > dual GigE, Juniper EX3200 switch
>> >
>> > Fuse libraries: fuse-2.7.3glfs10
>> > Gluster: glusterfs-1.3.10
>> >
>> > Running bonnie++ on both machines results in almost identical
>> > numbers; eth1 is reserved wholly for server-to-server communication.
>> > Right now, the only load on these machines comes from my testbed.
>> > There are four tests that give a reasonable indicator of performance.
>> >
>> > * loading a wordpress blog and looking at the line:
>> > <!-- 24 queries. 0.634 seconds. -->
>> > * dd if=/dev/zero of=/gfs/test/out bs=1M count=512
>> > * time tar xjf /gfs/test/linux-2.6.26.1.tar.bz2
>> > * /usr/sbin/bonnie++ /gfs/test/
>> >
>> > On the wordpress test, .3 seconds is typical.  On various gluster
>> > configurations I've received between .411 seconds (server side afr
>> > config below) and 1.2 seconds with some of the example
>> > configurations.  Currently, my clientside AFR config comes in at
>> > .5xx seconds rather consistently.
>> >
>> > The second test on the clientside AFR results in 536870912 bytes
>> > (537 MB) copied, 4.65395 s, 115 MB/s
>> >
>> > The third test is unpacking a kernel, which has ranged from 28
>> > seconds using the serverside AFR to 6+ minutes on some
>> > configurations.  Currently the clientside AFR config comes in at
>> > about 17 minutes.
>> >
>> > The fourth test is a run of bonnie++, which varies from 36 minutes
>> > on the serverside AFR to the 80 minute run on the clientside AFR
>> > config.
>> >
>> > Current test environment is using both servers as clients &
>> > servers -- if I can get reasonable performance, the existing
>> > machines will become clients and the servers will be split to their
>> > own platform, so I want to make sure I am using tcp for connections
>> > to give as close to a real world deployment as possible.  This
>> > means I cannot run a client-only config.
>> >
>> > Baseline Wordpress returns .311-.399 seconds
>> > Baseline dd 536870912 bytes (537 MB) copied, 0.489522 s, 1.1 GB/s
>> > Baseline tar xjf of the kernel, real  0m12.164s
>> > Baseline Config bonnie++ run on the raid 1 partition: (echo data | bon_csv2txt for the text reporting)
>> >
>> > c1ws1,16G,66470,97,93198,16,42430,6,60253,86,97153,7,381.3,0,16,7534,37,+++++,+++,5957,23,7320,34,+++++,+++,4667,21
>> >
>> > So far, the best performance I could manage was server side AFR
>> > with writebehind/readahead on the server, with aggregate-size set
>> > to 0MB, and the client side running writebehind/readahead.  That
>> > resulted in:
>> >
>> > c1ws2,16G,37636,50,76855,3,17429,2,60376,76,87653,3,158.6,0,16,1741,3,9683,6,2591,3,2030,3,9790,5,2369,3
>> >
>> > It was suggested in IRC that clientside AFR would be faster and
>> > more reliable; however, I've ended up with the following as the
>> > best results from multiple attempts:
>> >
>> > c1ws1,16G,46041,58,76811,2,4603,0,59140,76,86103,3,132.4,0,16,1069,2,4795,2,1308,2,1045,2,5209,2,1246,2
>> >
>> > The bonnie++ run from the serverside AFR that produced the best
>> > results I've received to date took 34 minutes.  The latest
>> > clientside AFR bonnie++ run took 80 minutes.  Based on the website,
>> > I would expect to see better performance than drbd/GFS, but so far
>> > that hasn't been the case.
>> >
>> > It's been suggested that I use unify-rr-afr.  In my current setup,
>> > it seems that to do that I would need to break my raid set, which
>> > is my next step in debugging this.  Rather than use RAID 1 on the
>> > server, I would have 2 bricks on each server, which would allow the
>> > use of unify and the rr scheduler.
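If I read that suggestion right, the layout would look roughly like the
sketch below -- volume names are placeholders, the brick volumes stand in
for the four bricks left after breaking the RAID sets, and as far as I
know unify also wants its own namespace volume:

volume afr-a
  type cluster/afr
  subvolumes server1-brickA server2-brickA
end-volume

volume afr-b
  type cluster/afr
  subvolumes server1-brickB server2-brickB
end-volume

volume unify0
  type cluster/unify
  option scheduler rr        # round-robin new files across the AFR pairs
  option namespace ns-brick  # dedicated namespace volume, not one of the data bricks
  subvolumes afr-a afr-b
end-volume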
>> >
>> > glusterfs-1.4.0qa32 results in
>> > [Wed Aug 06 02:01:44 2008] [notice] child pid 14025 exit signal Bus error (7)
>> > [Wed Aug 06 02:01:44 2008] [notice] child pid 14037 exit signal Bus error (7)
>> >
>> > when apache (not mod_gluster) tries to serve files off the
>> > glusterfs partition.
>> >
>> > The main issue I'm having right now is file creation speed.  I
>> > realize that to create a file I need to do two network ops for each
>> > file created, but it seems that something is horribly wrong in my
>> > configuration from the results in untarring the kernel.
>> >
>> > I've tried moving the performance translators around, but some
>> > don't seem to make much difference on the server side, and the ones
>> > that appear to make some difference client side don't seem to help
>> > the file creation issue.
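As a concrete example of the kind of shuffle I mean (a sketch only -- the
iothreads volume is something I'm adding here for illustration, reusing
the options already on the server-side brick volume), one variant stacks
io-threads on top of the client-side readahead volume from the client vol
file further down:

volume iothreads
  type performance/io-threads
  option thread-count 4   # same thread-count as the server-side brick volume
  subvolumes readahead    # readahead -> writebehind -> afr underneath
end-volume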
>> >
>> > On a side note, zresearch.com: I emailed through your contact form
>> > and haven't heard back -- please provide a quote for generating the
>> > configuration and contact me offlist.
>> >
>> > ===/etc/gluster/gluster-server.vol
>> > volume posix
>> >     type storage/posix
>> >     option directory /gfsvol/data
>> > end-volume
>> >
>> > volume plocks
>> >   type features/posix-locks
>> >   subvolumes posix
>> > end-volume
>> >
>> > volume writebehind
>> >   type performance/write-behind
>> >   option flush-behind off    # default is 'off'
>> >   subvolumes plocks
>> > end-volume
>> >
>> > volume readahead
>> >   type performance/read-ahead
>> >   option page-size 128kB        # 256KB is the default option
>> >   option page-count 4           # 2 is default option
>> >   option force-atime-update off # default is off
>> >   subvolumes writebehind
>> > end-volume
>> >
>> > volume brick
>> >   type performance/io-threads
>> >   option thread-count 4  # default is 1
>> >   option cache-size 64MB #64MB
>> >   subvolumes readahead
>> > end-volume
>> >
>> > volume server
>> >     type protocol/server
>> >     option transport-type tcp/server
>> >     subvolumes brick
>> >     option auth.ip.brick.allow 10.8.1.*,127.0.0.1
>> > end-volume
>> >
>> >
>> > ===/etc/glusterfs/gluster-client.vol
>> >
>> > volume brick1
>> >     type protocol/client
>> >     option transport-type tcp/client # for TCP/IP transport
>> >     option remote-host 10.8.1.9   # IP address of server1
>> >     option remote-subvolume brick    # name of the remote volume on server1
>> > end-volume
>> >
>> > volume brick2
>> >     type protocol/client
>> >     option transport-type tcp/client # for TCP/IP transport
>> >     option remote-host 10.8.1.10   # IP address of server2
>> >     option remote-subvolume brick    # name of the remote volume on server2
>> > end-volume
>> >
>> > volume afr
>> >    type cluster/afr
>> >    subvolumes brick1 brick2
>> > end-volume
>> >
>> > volume writebehind
>> >   type performance/write-behind
>> >   option aggregate-size 0MB
>> >   option flush-behind off    # default is 'off'
>> >   subvolumes afr
>> > end-volume
>> >
>> > volume readahead
>> >   type performance/read-ahead
>> >   option page-size 128kB        # 256KB is the default option
>> >   option page-count 4           # 2 is default option
>> >   option force-atime-update off # default is off
>> >   subvolumes writebehind
>> > end-volume
>> >




