[Gluster-users] This benchmark is OK?

Mon May 21 12:36:31 UTC 2012

Lawrence, here are my answers to your questions, based on the analysis that follows them.

Q1: Why the Read is better than write, is ok?
A less-than-optimal bonding mode is the likely reason for the difference.

Q2: Refer to experience, This benchmark is best , good or bad?
The results are consistent with past experience with bonding mode 0 (round-robin).

Q3: How to optimize the gluster to impove the benchmark?
I suspect the network configuration needs optimization, and you might not need the striping feature.

Q4: How can I get other friends' benchmark, which I use to compared?
I don't understand the question, but iozone is what I use for sequential I/O benchmarking.

configuration -- It appears that you have 2 servers and 6 clients, and your gluster volume is striped, with no replication.

servers:
each server has 4 cores
each server has 4-way bonding mode 0
each server has 12 1-TB drives configured as 2 6-drive RAID volumes (what kind of RAID?)

clients:
each has 2-way bonding mode 0
client disks are irrelevant since gluster does not use them 

results:
You ran a cluster iozone sequential write test followed by a sequential read test with 25 threads, a total of 50 GB of data, using a 1-MB transfer size, and including fsync and close in the throughput calculation.  Results are:

initial write: 414 MB/s   
re-write:  447 MB/s
initial read: 655 MB/s
re-read: 778 MB/s

Clients have 12 NICs and servers have 8 NICs in total, all 1 Gb.  The 8 server NICs are the narrower side, so at roughly 100 MB/s per NIC the cross-sectional bandwidth between clients and servers is ~800 MB/s.  For your volume type you would treat 800 MB/s as the theoretical limit of your iozone throughput.  It appears that you have enough threads and enough disk drives in your servers that your storage should not be the bottleneck.  With this I/O transfer size and volume type, a server CPU bottleneck is less likely, but you should still check.
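
To make the arithmetic above concrete, here is a small Python sketch that converts the iozone numbers from the original post to MB/s and compares them with the network limit.  The figure of 100 MB/s of usable payload per 1-Gb NIC is an assumption (it is what the ~800 MB/s estimate implies), and the names in the script are only for illustration.

```python
# Assumption: ~100 MB/s of usable payload per 1 Gb/s NIC.
MB_PER_NIC = 100

client_nics = 6 * 2   # 6 clients, 2-way bond each
server_nics = 2 * 4   # 2 servers, 4-way bond each

# The narrower side of the network sets the limit.
limit_mb_s = min(client_nics, server_nics) * MB_PER_NIC

# iozone throughput figures from the original post, in KB/s.
results_kb_s = {
    "initial write": 424258.41,
    "re-write": 458250.97,
    "initial read": 671079.30,
    "re-read": 797509.20,
}


def to_mb_s(kb_s):
    """Convert KB/s to MB/s using 1 MB = 1024 KB."""
    return kb_s / 1024


def summarize(results, limit):
    """Return {test name: (MB/s, fraction of the network limit)}."""
    return {name: (to_mb_s(v), to_mb_s(v) / limit) for name, v in results.items()}


summary = summarize(results_kb_s, limit_mb_s)
for name, (mb_s, frac) in summary.items():
    print(f"{name:14s} {mb_s:6.0f} MB/s  {frac:5.0%} of {limit_mb_s} MB/s")
```

The writes come out at a little over half of the limit and the re-read close to it, which is why I look at the receive side of the servers first.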

With Linux bonding, the biggest challenge is to load-balance INCOMING traffic to the servers (writes) -- almost any mode can load-balance outgoing traffic (reads) across the NICs in the bond.  For writes, you are transmitting from the many client NICs to the fewer server NICs.  The problem with bonding mode 0 is that the ARP protocol associates a single MAC address with a single IP address, and bonding mode 0 assigns the SAME MAC ADDRESS to all NICs, so the network switch will "learn" that the last port that transmitted with that MAC address is the port where all receives should take place for that MAC address.  This reduces the effectiveness of bonding at balancing receive load across the available server NICs.  For better load-balancing of receive traffic by the switch and servers, try:

bonding mode 6 (balance-alb) -- use if clients and servers are mostly on the same VLAN.  In this mode, the Linux bonding driver will use ARP to load-balance clients across the available server NICs (NICs retain unique MAC addresses), so the network switch can deliver IP packets from different clients to different server NICs.  This can result in optimal utilization of server NICs when the number of clients per server is larger than the number of server NICs, usually with no switch configuration necessary.

bonding mode 4 (802.3ad "trunking") -- if the switch supports it, you can configure the switch and the servers to treat all server NICs as a single "trunk".  Any incoming IP packet destined for that server can be passed by the switch to whichever server NIC is least busy (subject to constraints?).  This works even when clients are on a different subnet, and does not depend on the ARP protocol, but both servers and switch must be configured for this to work, and switch configuration in the past has been vendor-specific.
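
The MAC-learning problem described above for mode 0 can be illustrated with a toy Python model of a switch's MAC address table.  This is only a sketch under simplifying assumptions: a real switch and a real bonding driver are more complicated, the MAC addresses and port numbers are made up, and the "unique MAC" case assumes that senders end up spread evenly across the server NICs, which is the ideal that a mode such as balance-alb aims for.

```python
from collections import Counter


class LearningSwitch:
    """Toy MAC address table: each MAC maps to exactly one port."""

    def __init__(self):
        self.mac_table = {}

    def frame_sent_from(self, port, src_mac):
        # The switch learns (or re-learns) which port this MAC lives on.
        self.mac_table[src_mac] = port

    def port_for(self, dst_mac):
        return self.mac_table.get(dst_mac)


def delivered_per_port(nic_macs, incoming_frames=1200):
    """nic_macs maps switch port -> MAC used by the server NIC on that port.

    Every NIC transmits once (so the switch learns its MAC), then
    incoming_frames frames arrive, spread evenly over the distinct MACs
    that the server uses.  Returns frames delivered per switch port.
    """
    switch = LearningSwitch()
    for port, mac in nic_macs.items():
        switch.frame_sent_from(port, mac)

    dst_macs = sorted(set(nic_macs.values()))
    counts = Counter()
    for i in range(incoming_frames):
        dst = dst_macs[i % len(dst_macs)]
        counts[switch.port_for(dst)] += 1
    return dict(counts)


# Mode 0 style: all four NICs share one MAC (hypothetical address).
shared_mac = {port: "02:00:00:00:00:01" for port in (1, 2, 3, 4)}

# Each NIC keeps its own MAC (hypothetical addresses).
unique_macs = {port: f"02:00:00:00:00:0{port}" for port in (1, 2, 3, 4)}

print("shared MAC :", delivered_per_port(shared_mac))
print("unique MACs:", delivered_per_port(unique_macs))
```

In the shared-MAC case the table holds one entry, so every incoming frame goes to whichever port transmitted last; with unique MACs the same traffic can be delivered across all four ports.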

Also, I do not see how using the gluster striping feature will improve your performance with this workload.  Gluster striping could be expected to help under 2 conditions:
- Gluster client can read/write data much faster than any one server can
- Gluster client is only reading/writing one or two files at a time
Neither of these conditions is satisfied by your workload and configuration.
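
As a rough check of the two conditions above against this configuration, here is a short Python sketch.  It assumes ~100 MB/s usable per 1-Gb NIC, treats "much faster" as a plain greater-than comparison of network bandwidth, and assumes the 25 iozone threads (one file each) are spread evenly over the 6 clients; the function name is only for illustration.

```python
MB_PER_NIC = 100  # assumption: usable MB/s per 1 Gb/s NIC


def striping_conditions(client_nics, server_nics, files_per_client):
    """Evaluate the two conditions under which striping might help."""
    client_mb_s = client_nics * MB_PER_NIC
    server_mb_s = server_nics * MB_PER_NIC
    return {
        "client faster than one server": client_mb_s > server_mb_s,
        "only one or two files at a time": files_per_client <= 2,
    }


# 2-way bond per client, 4-way bond per server, 25 threads over 6 clients.
checks = striping_conditions(client_nics=2, server_nics=4,
                             files_per_client=25 // 6)
for name, met in checks.items():
    print(f"{name}: {'yes' if met else 'no'}")
```

A single client with 2 NICs cannot outrun a single server with 4, and each client is working on about 4 files at once, so neither condition holds.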

You achieved close to the network throughput limit for re-reads.  The difference between initial read and re-read result suggests that you might be able to improve your initial read result with better pre-fetching on the server block devices.
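
One place to start looking, assuming that pre-fetching here means the Linux block-layer read-ahead setting, is the read_ahead_kb value that each block device exposes under /sys/block.  The Python sketch below only reports the current values; which devices back your bricks and what value suits them are not something I can tell from your post.

```python
from pathlib import Path


def read_ahead_settings(sys_block="/sys/block"):
    """Return {device name: read-ahead size in KB} for each block device.

    Reads <sys_block>/<device>/queue/read_ahead_kb, which Linux exposes
    for block devices.  The sys_block parameter exists only so that the
    function can be pointed at a different directory for testing.
    """
    settings = {}
    for f in sorted(Path(sys_block).glob("*/queue/read_ahead_kb")):
        device = f.parent.parent.name
        settings[device] = int(f.read_text().strip())
    return settings
```

Calling read_ahead_settings() on each server and printing the result shows the current read-ahead for every device, which you can compare before and after any tuning.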

Ben England, Red Hat

----- Original Message -----
From: "Amar Tumballi" <amarts at redhat.com>
To: "Ben England" <bengland at redhat.com>
Sent: Saturday, May 19, 2012 2:22:03 AM
Subject: Fwd: [Gluster-users]  This benchmark is OK?

Ben,

When you have time, can you have a look on this thread and respond?

-Amar

-------- Original Message --------
Subject: 	[Gluster-users] This benchmark is OK?
Date: 	Thu, 17 May 2012 00:11:48 +0800
From: 	soft_lawrency at hotmail.com <soft_lawrency at hotmail.com>
Reply-To: 	soft_lawrency <soft_lawrency at hotmail.com>
To: 	gluster-users <gluster-users at gluster.org>
CC: 	wangleiyf <wangleiyf at initdream.com>

Hi Amar,
here is my benchmark, pls help me to evaluate it.
1. [Env - Storage servers]
Gluster version: 3.3 beta3
OS : CentOS 6.1
2* Server : CPU : E5506 @ 2.13GHz 4*core
MEM: 16G
Disk : 1T * 12
Net: bond0 = 4 * 1Gb
2. [Env - Testing clients]
OS: CentOS 6.1
6* Server: CPU: 5150 @ 2.66GHz
MEM: 8G
Disk: RAID0 = 1T * 3
Net: bond0 = 2 * 1Gb
3. [Env - Switch]
1 * H3C Switch
4. [Env - Testing Tools]
iozone for cluster, clients is rsh cluster.
5. Volume info
Type: distributed - stripe
bricks: 6 * 4 = 24
6. Iozone command:
./iozone -r 1m -s 50g -t 25 -i 0 -i 1 -+m /CZFS/client_list -R -b report.xls -c -C -+k -e
then my benchmark is :
" Initial write " 424258.41 KB
" Rewrite " 458250.97 KB
" Read " 671079.30 KB
" Re-read " 797509.20 KB
here I have 4 Questions:
Q1: Why the Read is better than write, is ok?
Q2: Refer to experience, This benchmark is best , good or bad?
Q3: How to optimize the gluster to impove the benchmark?
Q4: How can I get other friends' benchmark, which I use to compared?
Tks very much.
Regards,
Lawrence.