[Gluster-users] Horrible performance with small files (DHT/AFR)

Fri Jun 5 06:23:29 UTC 2009

Hi Benjamin,
Can you provide us with the following information to narrow down the issue?
1. The network configuration of the GlusterFS deployment and the bandwidth that is available
   for GlusterFS to use
2. Is performance affected the same way with large files, or only with small files
   as you have mentioned? Let us know the performance of GlusterFS with large files.

There hasn't really been an issue with using a large number of small files as such in the past. Nevertheless,
we'll look into this once you give us the above details.
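
For the large-file comparison, something along these lines would do. This is only a rough sketch: the function names, the directory argument, and the file counts and sizes are placeholders chosen for illustration, not values taken from your setup.

```shell
#!/bin/sh
# Sketch: write many small files and one large file into the same directory
# so the two cases can be timed separately. All names, counts and sizes here
# are illustrative assumptions.

# write_small DIR COUNT SIZE_BYTES: create COUNT files of SIZE_BYTES each
write_small() {
    dir=$1; count=$2; size=$3
    mkdir -p "$dir/small" || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        dd if=/dev/zero of="$dir/small/f$i" bs="$size" count=1 2>/dev/null || return 1
        i=$((i + 1))
    done
}

# write_large DIR SIZE_MB: create a single file of SIZE_MB megabytes
write_large() {
    dir=$1; mb=$2
    mkdir -p "$dir/large" || return 1
    dd if=/dev/zero of="$dir/large/big" bs=1048576 count="$mb" 2>/dev/null
}
```

Under bash you could then run `time write_small /mnt/bench 1000 2048` followed by `time write_large /mnt/bench 2` (with `/mnt/bench` standing in for a scratch directory on the GlusterFS mount). Both write roughly 2MB, so a large gap between the two timings would point at per-file overhead rather than raw bandwidth.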

Regards,
Pavan

On 04/06/09 16:21 -0400, Benjamin Krein wrote:
> Here are some more details with different configs:
>
> * Only AFR between cfs1 & cfs2:
> root@dev1# time cp -rp * /mnt/
>
> real	16m45.995s
> user	0m1.104s
> sys	0m5.528s
>
> * Single server - cfs1:
> root@dev1# time cp -rp * /mnt/
>
> real	10m33.967s
> user	0m0.764s
> sys	0m5.516s
>
> * Stats via bmon on cfs1 during above copy:
>   #   Interface                RX Rate         RX #     TX Rate         TX #
> ──────────────────────────────────────────────────────────────────────────────
> cfs1 (source: local)
>   0   eth1                     951.25KiB       1892     254.00KiB       1633
>
> It gets progressively better, but that's still a *long* way from <2 min 
> times with scp & <1 min times with rsync!  And, I have no redundancy or 
> distributed hash whatsoever.
>
> * Client config for the last test:
> -----
> # Webform Flat-File Cache Volume client configuration
>
> volume srv1
> 	type protocol/client
> 	option transport-type tcp
> 	option remote-host cfs1
> 	option remote-subvolume webform_cache_brick
> end-volume
>
> volume writebehind
> 	type performance/write-behind
> 	option cache-size 4mb
> 	option flush-behind on
> 	subvolumes srv1
> end-volume
>
> volume cache
> 	type performance/io-cache
> 	option cache-size 512mb
> 	subvolumes writebehind
> end-volume
> -----
>
> Ben
>
> On Jun 3, 2009, at 4:33 PM, Vahriç Muhtaryan wrote:
>
>> To better understand the issue, did you try 4 servers with DHT only,
>> 2 servers with DHT only, or 2 servers with replication only, to find
>> out where the real problem is? Maybe replication or DHT has a bug.
>>
>> -----Original Message-----
>> From: gluster-users-bounces at gluster.org
>> [mailto:gluster-users-bounces at gluster.org] On Behalf Of Benjamin Krein
>> Sent: Wednesday, June 03, 2009 11:00 PM
>> To: Jasper van Wanrooy - Chatventure
>> Cc: gluster-users at gluster.org
>> Subject: Re: [Gluster-users] Horrible performance with small files  
>> (DHT/AFR)
>>
>> The current boxes I'm using for testing are as follows:
>>
>>  * 2x dual-core Opteron ~2GHz (x86_64)
>>  * 4GB RAM
>>  * 4x 7200 RPM 73GB SATA - RAID1+0 w/3ware hardware controllers
>>
>> The server storage directories live in /home/clusterfs where /home is
>> an ext3 partition mounted with noatime.
>>
>> These servers are not virtualized.  They are running Ubuntu 8.04 LTS
>> Server x86_64.
>>
>> The files I'm copying are all <2k javascript files (plain text) stored
>> in 100 hash directories in each of 3 parent directories:
>>
>> /home/clusterfs/
>>   + parentdir1/
>>   |   + 00/
>>   |   | ...
>>   |   + 99/
>>   + parentdir2/
>>   |   + 00/
>>   |   | ...
>>   |   + 99/
>>   + parentdir3/
>>       + 00/
>>       | ...
>>       + 99/
>>
>> There are ~10k of these <2k javascript files distributed throughout
>> the above directory structure totaling approximately 570MB.  My tests
>> have been copying that entire directory structure from a client
>> machine into the glusterfs mountpoint on the client.
>>
>> Observing IO on both the client box & all the server boxes via iostat
>> shows that the disks are doing *very* little work.  Observing the CPU/
>> memory load with top or htop shows that none of the boxes are CPU or
>> memory bound.  Observing the bandwidth in/out of the network interface
>> shows <1MB/s throughput (we have a fully gigabit LAN!) which usually
>> drops down to <150KB/s during the copy.
>>
>> As a comparison, scp'ing the same directory structure from the same
>> client to one of the same servers runs at ~40-50MB/s sustained.
>> Here are the results of copying the same directory structure using
>> rsync to the same partition:
>>
>> # time rsync -ap * benk@cfs1:~/cache/
>> benk@cfs1's password:
>>
>> real	0m23.566s
>> user	0m8.433s
>> sys	0m4.580s
>>
>> Ben
>>
>> On Jun 3, 2009, at 3:16 PM, Jasper van Wanrooy - Chatventure wrote:
>>
>>> Hi Benjamin,
>>>
>>> That's not good news. What kind of hardware do you use? Is it
>>> virtualised? Or do you use real boxes?
>>> What kind of files are you copying in your test? What performance do
>>> you have when copying it to a local dir?
>>>
>>> Best regards Jasper
>>>
>>> ----- Original Message -----
>>> From: "Benjamin Krein" <superbenk at superk.org>
>>> To: "Jasper van Wanrooy - Chatventure" <jvanwanrooy at chatventure.nl>
>>> Cc: "Vijay Bellur" <vijay at gluster.com>, gluster-users at gluster.org
>>> Sent: Wednesday, 3 June, 2009 19:23:51 GMT +01:00 Amsterdam /
>>> Berlin / Bern / Rome / Stockholm / Vienna
>>> Subject: Re: [Gluster-users] Horrible performance with small files
>>> (DHT/AFR)
>>>
>>> I reduced my config to only 2 servers (had to donate 2 of the 4 to
>>> another project).  I now have a single server using DHT (for future
>>> scaling) and AFR to a mirrored server.  Copy times are much better,
>>> but still pretty horrible:
>>>
>>> # time cp -rp * /mnt/
>>>
>>> real	21m11.505s
>>> user	0m1.000s
>>> sys	0m6.416s
>>>
>>> Ben
>>>
>>> On Jun 3, 2009, at 3:13 AM, Jasper van Wanrooy - Chatventure wrote:
>>>
>>>> Hi Benjamin,
>>>>
>>>> Did you also try with a lower thread-count? I'm actually using 3
>>>> threads.
>>>>
>>>> Best Regards Jasper
>>>>
>>>>
>>>> On 2 jun 2009, at 18:25, Benjamin Krein wrote:
>>>>
>>>>> I do not see any difference with autoscaling removed.  Current
>>>>> server config:
>>>>>
>>>>> # webform flat-file cache
>>>>>
>>>>> volume webform_cache
>>>>> type storage/posix
>>>>> option directory /home/clusterfs/webform/cache
>>>>> end-volume
>>>>>
>>>>> volume webform_cache_locks
>>>>> type features/locks
>>>>> subvolumes webform_cache
>>>>> end-volume
>>>>>
>>>>> volume webform_cache_brick
>>>>> type performance/io-threads
>>>>> option thread-count 32
>>>>> subvolumes webform_cache_locks
>>>>> end-volume
>>>>>
>>>>> <<snip>>
>>>>>
>>>>> # GlusterFS Server
>>>>> volume server
>>>>> type protocol/server
>>>>> option transport-type tcp
>>>>> subvolumes dns_public_brick dns_private_brick webform_usage_brick
>>>>> webform_cache_brick wordpress_uploads_brick subs_exports_brick
>>>>> option auth.addr.dns_public_brick.allow 10.1.1.*
>>>>> option auth.addr.dns_private_brick.allow 10.1.1.*
>>>>> option auth.addr.webform_usage_brick.allow 10.1.1.*
>>>>> option auth.addr.webform_cache_brick.allow 10.1.1.*
>>>>> option auth.addr.wordpress_uploads_brick.allow 10.1.1.*
>>>>> option auth.addr.subs_exports_brick.allow 10.1.1.*
>>>>> end-volume
>>>>>
>>>>> # time cp -rp * /mnt/
>>>>>
>>>>> real	70m13.672s
>>>>> user	0m1.168s
>>>>> sys	0m8.377s
>>>>>
>>>>> NOTE: the above test was also done during peak hours, when the LAN
>>>>> and dev server were in use, which would account for some of the
>>>>> extra time. This is still WAY too much, though.
>>>>>
>>>>> Ben
>>>>>
>>>>>
>>>>> On Jun 1, 2009, at 1:40 PM, Vijay Bellur wrote:
>>>>>
>>>>>> Hi Benjamin,
>>>>>>
>>>>>> Could you please try by turning autoscaling off?
>>>>>>
>>>>>> Thanks,
>>>>>> Vijay
>>>>>>
>>>>>> Benjamin Krein wrote:
>>>>>>> I'm seeing extremely poor performance writing small files to a
>>>>>>> glusterfs DHT/AFR mount point. Here are the stats I'm seeing:
>>>>>>>
>>>>>>> * Number of files:
>>>>>>> root@dev1|/home/aweber/cache|# find |wc -l
>>>>>>> 102440
>>>>>>>
>>>>>>> * Average file size (bytes):
>>>>>>> root@dev1|/home/aweber/cache|# ls -lR | awk '{sum += $5; n++;} END {print sum/n;}'
>>>>>>> 4776.47
>>>>>>>
>>>>>>> * Using scp:
>>>>>>> root@dev1|/home/aweber/cache|# time scp -rp * benk@cfs1:~/cache/
>>>>>>>
>>>>>>> real 1m38.726s
>>>>>>> user 0m12.173s
>>>>>>> sys 0m12.141s
>>>>>>>
>>>>>>> * Using cp to glusterfs mount point:
>>>>>>> root@dev1|/home/aweber/cache|# time cp -rp * /mnt
>>>>>>>
>>>>>>> real 30m59.101s
>>>>>>> user 0m1.296s
>>>>>>> sys 0m5.820s
>>>>>>>
>>>>>>> Here is my configuration (currently a single client writing to 4
>>>>>>> servers: 2 DHT subvolumes, each an AFR pair):
>>>>>>>
>>>>>>> SERVER:
>>>>>>>
>>>>>>> # webform flat-file cache
>>>>>>>
>>>>>>> volume webform_cache
>>>>>>> type storage/posix
>>>>>>> option directory /home/clusterfs/webform/cache
>>>>>>> end-volume
>>>>>>>
>>>>>>> volume webform_cache_locks
>>>>>>> type features/locks
>>>>>>> subvolumes webform_cache
>>>>>>> end-volume
>>>>>>>
>>>>>>> volume webform_cache_brick
>>>>>>> type performance/io-threads
>>>>>>> option thread-count 32
>>>>>>> option max-threads 128
>>>>>>> option autoscaling on
>>>>>>> subvolumes webform_cache_locks
>>>>>>> end-volume
>>>>>>>
>>>>>>> <<snip>>
>>>>>>>
>>>>>>> # GlusterFS Server
>>>>>>> volume server
>>>>>>> type protocol/server
>>>>>>> option transport-type tcp
>>>>>>> subvolumes dns_public_brick dns_private_brick webform_usage_brick
>>>>>>> webform_cache_brick wordpress_uploads_brick subs_exports_brick
>>>>>>> option auth.addr.dns_public_brick.allow 10.1.1.*
>>>>>>> option auth.addr.dns_private_brick.allow 10.1.1.*
>>>>>>> option auth.addr.webform_usage_brick.allow 10.1.1.*
>>>>>>> option auth.addr.webform_cache_brick.allow 10.1.1.*
>>>>>>> option auth.addr.wordpress_uploads_brick.allow 10.1.1.*
>>>>>>> option auth.addr.subs_exports_brick.allow 10.1.1.*
>>>>>>> end-volume
>>>>>>>
>>>>>>> CLIENT:
>>>>>>>
>>>>>>> # Webform Flat-File Cache Volume client configuration
>>>>>>>
>>>>>>> volume srv1
>>>>>>> type protocol/client
>>>>>>> option transport-type tcp
>>>>>>> option remote-host cfs1
>>>>>>> option remote-subvolume webform_cache_brick
>>>>>>> end-volume
>>>>>>>
>>>>>>> volume srv2
>>>>>>> type protocol/client
>>>>>>> option transport-type tcp
>>>>>>> option remote-host cfs2
>>>>>>> option remote-subvolume webform_cache_brick
>>>>>>> end-volume
>>>>>>>
>>>>>>> volume srv3
>>>>>>> type protocol/client
>>>>>>> option transport-type tcp
>>>>>>> option remote-host cfs3
>>>>>>> option remote-subvolume webform_cache_brick
>>>>>>> end-volume
>>>>>>>
>>>>>>> volume srv4
>>>>>>> type protocol/client
>>>>>>> option transport-type tcp
>>>>>>> option remote-host cfs4
>>>>>>> option remote-subvolume webform_cache_brick
>>>>>>> end-volume
>>>>>>>
>>>>>>> volume afr1
>>>>>>> type cluster/afr
>>>>>>> subvolumes srv1 srv3
>>>>>>> end-volume
>>>>>>>
>>>>>>> volume afr2
>>>>>>> type cluster/afr
>>>>>>> subvolumes srv2 srv4
>>>>>>> end-volume
>>>>>>>
>>>>>>> volume dist
>>>>>>> type cluster/distribute
>>>>>>> subvolumes afr1 afr2
>>>>>>> end-volume
>>>>>>>
>>>>>>> volume writebehind
>>>>>>> type performance/write-behind
>>>>>>> option cache-size 4mb
>>>>>>> option flush-behind on
>>>>>>> subvolumes dist
>>>>>>> end-volume
>>>>>>>
>>>>>>> volume cache
>>>>>>> type performance/io-cache
>>>>>>> option cache-size 512mb
>>>>>>> subvolumes writebehind
>>>>>>> end-volume
>>>>>>>
>>>>>>> Benjamin Krein
>>>>>>> www.superk.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gluster-users mailing list
>>>>>>> Gluster-users at gluster.org
>>>>>>> http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users