[Gluster-users] Throughput over Infiniband

Andrei Mikhailovsky andrei at arhont.com
Sun Sep 9 20:28:47 UTC 2012

Hi Corey, 

Let me share the results of the testing I've been doing for the past 5 weeks or so. As in your experience, the results are nowhere near what I was expecting. What a disappointment. Anyway, here we go. 

I am using CentOS 6.3 with the latest updates and patches and the latest QLogic OFED, version 1.5.3.x; the QLogic drivers in OFED 1.5.4.x do not compile on CentOS 6.3. I've also tried vanilla OFED and Mellanox OFED 1.5.3.x with pretty much similar results. I've been testing GlusterFS 3.2.7 and 3.3.0; there is no significant performance difference between the 3.2 and 3.3 branches. 

My hardware is one storage server with a Mellanox QDR dual-port card, two server nodes with QLogic dual-port mezzanine cards, and a QLogic QDR (HP) blade switch. The storage server uses a ZFS pool made of 4 striped 2-disk mirrors + a 240GB SSD for the ZIL + a 240GB SSD for the L2ARC cache. I also enabled compression + deduplication. Underlying ZFS performance using iozone tests (iozone -+u -t 2 -F f1 f2 -r 2048 -s 30G) is between 4GB/s and 10GB/s depending on the test levels. Infiniband fabric tests using RDMA were giving between 3 and 4 GB/s. Please note gigaBYTES, not gigaBITS, per second. 

So, I was expecting a throughput of around 2.5-3GB/s over GlusterFS RDMA, taking overheads into account. Yeah, right, wishful thinking it was! 

I built my PoC environment and started testing with just one client, and I was getting around 400-600MB/s tops. Writes were about 20% faster than reads. Following some performance tuning on the GlusterFS and ZFS side I managed to increase throughput to around 700-800MB/s, with writes still about 20% faster. Note that when adding the "-o" switch to the iozone command to use synchronous writes, write throughput was limited to the speed of the ZIL SSD. 
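As a quick cross-check that doesn't depend on iozone, a plain dd write with fdatasync gives a rough streaming-write number. A minimal sketch, assuming a hypothetical mount point of /mnt/gluster (substitute your own):

```shell
# Hypothetical mount point -- substitute your glusterfs-rdma mount.
MNT=${MNT:-/mnt/gluster}
# Write 1 GiB and force it to stable storage before dd reports a rate,
# so the page cache does not inflate the number.
dd if=/dev/zero of="$MNT/throughput-test" bs=1M count=1024 conv=fdatasync
# Remove the test file afterwards.
rm -f "$MNT/throughput-test"
```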

While trying to figure out the cause of the bottleneck I realised that it is coming from the client side, as running concurrent tests from two clients would give me about 650MB/s per client. Doing a bit more research, it seems that the cause of the problem is FUSE. Googling for this issue I found a number of people complaining about a FUSE throughput limit of around 600-700MB/s. There is a kernel patch to address this issue, but the results of testing from several people showed only a marginal improvement: they managed to raise their throughput from around 600MB/s to about 850MB/s or so. Thus, from what I've read, it does not currently seem possible to achieve speeds over 1GB/s with FUSE. This made me wonder about the reason behind choosing FUSE for the GlusterFS client in the first place. 
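The two-client observation can be approximated by running parallel writers; a minimal sketch using dd in place of iozone (mount point hypothetical):

```shell
# Hypothetical mount; each "client" here is just a parallel writer.
MNT=${MNT:-/mnt/gluster}
# Launch two streaming writers concurrently, one file each.
for i in 1 2; do
  dd if=/dev/zero of="$MNT/stream-$i" bs=1M count=512 conv=fdatasync &
done
# Wait for both, then compare the per-stream rates dd prints: if one
# stream alone hits ~700MB/s but two together still manage ~650MB/s
# each, the server has headroom and the ceiling is on the client side.
wait
rm -f "$MNT"/stream-1 "$MNT"/stream-2
```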

P.S. If you are looking to use GlusterFS as the backend storage for KVM virtualisation, I would warn you that it's a tricky business. I've managed to make things work, but the performance is far worse than any of my pessimistic expectations! An example: a mounted glusterfs-rdma file system on the server running KVM would give me around 700-850MB/s throughput, yet I was only getting 50MB/s max when running the test from a VM stored on that partition. In comparison, NFS would give me around 350-400MB/s. I never expected GlusterFS to perform worse than NFS. 
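For what it's worth, one knob that often matters for KVM guests on a FUSE mount is qemu's disk cache mode; this is an assumption on my part, not something established in this thread. A hypothetical invocation:

```shell
# Hypothetical guest whose image lives on the gluster mount; the knob
# being illustrated is cache=. cache=none uses O_DIRECT, which requires
# the FUSE mount to allow it (mount glusterfs with
# -o direct-io-mode=enable in that case), while cache=writeback tends
# to perform better on FUSE-backed files at some durability cost.
qemu-kvm -m 2048 -smp 2 \
  -drive file=/mnt/gluster/vm1.img,if=virtio,cache=writeback
```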

I would be grateful if anyone would share their experience with glusterfs over infiniband and their tips on improving performance. 



----- Original Message -----

From: "Corey Kovacs" <corey.kovacs at gmail.com> 
To: gluster-users at gluster.org 
Sent: Friday, 7 September, 2012 2:45:48 PM 
Subject: [Gluster-users] Throughput over Infiniband 


I finally got my hands on a 4x FDR (56Gb) Infiniband switch and 4 cards to do some testing of GlusterFS over that interface. 

So far, I am not getting the throughput I _think_ I should see. 

My config is made up of: 

4 DL360 G8s (three bricks and one client) 
4 4xFDR, dual port IB cards (one port configured in each card per host) 
1 4xFDR 36 port Mellanox Switch (managed and configured) 
GlusterFS 3.2.6 

I have tested the IB cards and get about 6GB/sec between hosts over raw IB. Using IPoIB, I can get about 22Gb/sec. Not too shabby for a first go, but I expected more (cards are in connected mode with an MTU of 64k). 
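For reference, connected mode and the large MTU are set per interface; a sketch assuming the interface is named ib0 (adjust to your setup):

```shell
# Switch the IPoIB interface to connected mode (assumed name: ib0).
echo connected > /sys/class/net/ib0/mode
# Connected mode allows the large IPoIB MTU (65520 is the usual max).
ip link set ib0 mtu 65520
# Confirm both settings took effect.
cat /sys/class/net/ib0/mode
ip link show ib0 | grep -o 'mtu [0-9]*'
```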

My raw speed to the disks (through the buffer cache... I just realized I've not tested direct-mode I/O, I'll do that later today) is about 800MB/sec. I expect to see on the order of 2GB/sec (a little less than 3x800). 

When I write a large stream using dd and watch the bricks' I/O, I see ~800MB/sec on each one, but at the end of the test the report from dd indicates only 800MB/sec. 
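Since direct-mode I/O hasn't been tested yet, here is one way to take the buffer cache out of the dd numbers on both the write and the read-back (path hypothetical):

```shell
# Bypass the page cache so dd reports filesystem/device speed rather
# than cache speed. Note O_DIRECT must be supported by the target
# filesystem (and, for a FUSE mount, enabled on the mount).
dd if=/dev/zero of=/mnt/gluster/dd-direct-test bs=1M count=1024 oflag=direct
# Read it back, again without the cache.
dd if=/mnt/gluster/dd-direct-test of=/dev/null bs=1M iflag=direct
rm -f /mnt/gluster/dd-direct-test
```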

Am I missing something fundamental? 

Any pointers would be appreciated, 



Gluster-users mailing list 
Gluster-users at gluster.org 

