[Gluster-users] Quick performance tests
Joe Landman
landman at scalableinformatics.com
Sat Jan 15 18:31:32 UTC 2011
Given the discussion over the past few days, I did a quick-n-dirty test.
Long-ish post, with data, pointers, etc.
Gigabit-connected server and client, 941 Mb/s (according to iperf)
between the two.
Untar 2.6.37 kernel source, drop caches before each run. 485 MB total
untarred/uncompressed size.
all times measured in seconds, all mounts with default options (though
we used -o intr,tcp for NFS)
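Concretely, "drop caches before each run" means each measurement boils down to something like this (the cache drop is the same one used for the dd tests further down; a sync first doesn't hurt if you've just written to the mount):
echo 3 > /proc/sys/vm/drop_caches
/usr/bin/time --verbose tar -xzf ~/kernel-2.6.37.scalable.tar.gz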
server      client      client      client          client
local       local       NFS         Gluster-NFS     GlusterFS
--------------------------------------------------------------
3.9         9           85.97       143.5           132.3
So the Gluster-NFS translator (using NFS on the client to mount the file
system on the remote system) requires about 67% more time than straight
NFS, and the native GlusterFS mount on the client requires about 54% more
time than NFS.
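(That's 143.5/85.97 ≈ 1.67 and 132.3/85.97 ≈ 1.54, if you want to check the arithmetic.)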
Does this mean NFS is faster? In this simplified measurement, yes, but
not by a huge amount. And we don't recommend extrapolating to the
general case from this.
Moreover, we haven't tuned the GlusterFS implementation at all. For
laughs, I turned up the caching a bit:
[root@jr5-lab local]# gluster volume set nfs-test performance.cache-size 1GB
Set volume successful
[root@jr5-lab local]# gluster volume set nfs-test performance.write-behind-window-size 512MB
Set volume successful
[root@jr5-lab local]# gluster volume set nfs-test performance.stat-prefetch 1
Set volume successful
(note: even with 3.1.2, those last bits are still undocumented)
With this, I was able to get the Gluster-NFS client to be about the same
as the native Gluster client.
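A quick way to confirm the settings stuck is
gluster volume info nfs-test
which should list the reconfigured options for the volume.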
There is still tuning that could be done, but there is a significant
performance cost to doing many stat calls. There is little that you can
really do about that, other than to not do so many stat calls (which may
not be an option in and of itself).
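If you want to see just how many of those metadata calls tar generates, strace's summary mode is enough; something like
strace -c tar -xzf ~/kernel-2.6.37.scalable.tar.gz
prints per-syscall call counts and times (add -f if you want the gzip child counted too), and the stat/open/close rows are the ones that hurt over the wire.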
Our test looked like this btw:
/usr/bin/time --verbose tar -xzf ~/kernel-2.6.37.scalable.tar.gz
which provides a great deal of information to the end user:
Command being timed: "tar -xzf /root/kernel-2.6.37.scalable.tar.gz"
User time (seconds): 11.08
System time (seconds): 14.53
Percent of CPU this job got: 29%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:26.53
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4240
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3
Minor (reclaiming a frame) page faults: 566
Voluntary context switches: 297422
Involuntary context switches: 691
Swaps: 0
File system inputs: 187120
File system outputs: 951496
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
For laughs, I also wrote up a quick and dirty strace parsing tool, so
you can run an experiment like this:
strace tar -xzf ~/kernel-2.6.37.scalable.tar.gz > q 2>&1
cat q | ~/iohist.pl
and then you'll see this:
[root@virtual nfs]# cat q | ~/iohist.pl
read operations: 42563
write operations: 73798
meta operations: 159777
Total operations: 276138
read size (Bytes): 434322885
write size (Bytes): 403743188
Total IO size (Bytes) : 838066073
Average read size (Bytes): 10204.2
Average write size (Bytes): 5470.9
bin resolution = 512
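iohist.pl isn't doing anything magic, by the way; if you want to sanity-check its counts without the script, rough throwaway one-liners against the same capture file (crude, not the script itself, and they don't special-case error returns) would look something like:
grep -c '^read(' q
grep -c '^write(' q
awk '/^read\(/  { n++; b += $NF } END { printf "reads: %d, bytes: %d\n", n, b }' q
awk '/^write\(/ { n++; b += $NF } END { printf "writes: %d, bytes: %d\n", n, b }' q
(the return value is the last field on a successful strace line, which is why the awk bits work).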
With a little work, I can have it pull out timing information from
strace, and then we can construct an average time per operation.
We get about the same operation counts and sizes regardless of whether
the writes are local or remote. This should help elucidate why traversing
two network and storage stacks is so much more costly than traversing
one: it is the same number of operations, just at a higher cost per
operation, which strongly suggests you want to amortize each operation
over a larger read/write.
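The timing extension isn't hard to approximate by hand either: strace -T appends the time spent in each call, so a throwaway one-liner (not part of iohist.pl, just an illustration) gets you an average time per write:
strace -T tar -xzf ~/kernel-2.6.37.scalable.tar.gz > q 2>&1
awk '/^write\(/ { gsub(/[<>]/, "", $NF); t += $NF; n++ } END { printf "writes: %d, avg: %.6f s\n", n, t/n }' q
Run that against the local disk and the Gluster mount and the per-operation gap shows up directly.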
It also generates a read.hist and a write.hist which contain binned data
(currently with 512 byte resolution). As you can see, there isn't too
much in the way of reads (apart from the tarball at about 10kB), and
quite a bit in the way of writes (which actually follow a nice
distribution apart from the outliers at the end).
[root@virtual nfs]# cat read.hist
6
10
1
2
72
0
0
0
59
0
0
0
59
0
0
0
72
0
0
0
42282
[root@virtual nfs]# cat write.hist
5656
5596
4906
4387
3718
3187
2866
2588
2421
2197
1956
1830
1660
1600
1470
1336
1315
1198
1018
1046
21847
Basically, those reads, writes, and meta-ops are expensive over the
wire, for NFS and GlusterFS and any other network/cluster file system.
If you are deploying in a web server scenario, you might want to set up a
local cache (RAMdisk or local SSD based), fed from GlusterFS on start.
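A minimal sketch of that idea (the paths and size are made up, adjust for your layout): a tmpfs populated from the GlusterFS mount when the web server starts, with the server pointed at the local copy:
mount -t tmpfs -o size=2g tmpfs /var/www/cache
rsync -a /mnt/glusterfs/site/ /var/www/cache/
An SSD-backed directory works the same way, just without the RAM size limit.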
I should also point out that this is what we mean by large files (think
VM images), read and written 1 MB at a time:
[root@virtual nfs]# dd if=/dev/zero of=big.file bs=1M count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.6762 seconds, 101 MB/s
[root@virtual nfs]# echo 3 > /proc/sys/vm/drop_caches
[root@virtual nfs]# dd if=big.file of=/dev/null bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.12858 seconds, 118 MB/s
And this is what iohist has to say about this:
[root@virtual nfs]# echo 3 > /proc/sys/vm/drop_caches
[root@virtual nfs]# strace dd if=big.file of=/dev/null bs=1M > b 2>&1
[root@virtual nfs]# cat b | ~/iohist.pl
read operations: 1030
write operations: 1025
meta operations: 18
Total operations: 2073
read size (Bytes): 1073746848
write size (Bytes): 1073741856
Total IO size (Bytes) : 2147488704
Average read size (Bytes): 1042472.7
Average write size (Bytes): 1047553.0
bin resolution = 512
read.hist and write.hist show mostly the 1M reads/writes, apart from
minor metadata bits.
You can grab iohist.pl here:
http://download.scalableinformatics.com/iohist/iohist.pl
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615