[Gluster-users] gluster client performance

Sabuj Pattanayek sabujp at gmail.com
Mon Jul 25 23:15:50 UTC 2011


Hi,

Here's our QDR IB gluster setup:

http://piranha.structbio.vanderbilt.edu

We're still using gluster 3.0 on all our servers and clients, as well
as CentOS 5.6 kernels and OFED 1.4. To simulate a single stream, I use
this nfsSpeedTest script I wrote :

http://code.google.com/p/nfsspeedtest/

From a single QDR IB connected client to our /pirstripe directory,
which is a stripe of the gluster storage servers, this is the
performance I get (note: use a file size > the amount of RAM on the
client and server systems, 13GB in this case) :

4k block size :

111 pir4:/pirstripe% /sb/admin/scripts/nfsSpeedTest -s 13g -y
pir4: Write test (dd): 142.281 MB/s 1138.247 mbps 93.561 seconds
pir4: Read test (dd): 274.321 MB/s 2194.570 mbps 48.527 seconds

Testing block sizes from 8k to 128k with dd, the best performance was
achieved at a 64k block size:

114 pir4:/pirstripe% /sb/admin/scripts/nfsSpeedTest -s 13g -b 64k -y
pir4: Write test (dd): 213.344 MB/s 1706.750 mbps 62.397 seconds
pir4: Read test (dd): 955.328 MB/s 7642.620 mbps 13.934 seconds
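
In case anyone wants to reproduce this without the script, nfsSpeedTest
is basically timing a dd write followed by a dd read of a file bigger
than RAM. A rough sketch of equivalent commands (the exact flags the
script uses may differ, and the test file path is just an example) :

# write 13GB in 64k blocks, timing the write plus the final sync so
# buffered data actually gets flushed to the servers
time sh -c 'dd if=/dev/zero of=/pirstripe/ddtest.13g bs=64k count=212992 && sync'

# drop the client page cache (as root) so the read comes off the wire, not from RAM
echo 3 > /proc/sys/vm/drop_caches

# read it back
dd if=/pirstripe/ddtest.13g of=/dev/null bs=64k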

This is to the /pirdist directories, which are mounted in distribute
mode (each file is written whole to only one of the gluster servers) :

105 pir4:/pirdist% /sb/admin/scripts/nfsSpeedTest -s 13g -y
pir4: Write test (dd): 182.410 MB/s 1459.281 mbps 72.978 seconds
pir4: Read test (dd): 244.379 MB/s 1955.033 mbps 54.473 seconds

106 pir4:/pirdist% /sb/admin/scripts/nfsSpeedTest -s 13g -y -b 64k
pir4: Write test (dd): 204.297 MB/s 1634.375 mbps 65.160 seconds
pir4: Read test (dd): 340.427 MB/s 2723.419 mbps 39.104 seconds

For reference/control, here's the same test writing straight to the
XFS filesystem on one of the gluster storage nodes:

[sabujp at gluster1 tmp]$ /sb/admin/scripts/nfsSpeedTest -s 13g -y
gluster1: Write test (dd): 398.971 MB/s 3191.770 mbps 33.366 seconds
gluster1: Read test (dd): 234.563 MB/s 1876.501 mbps 56.752 seconds

[sabujp at gluster1 tmp]$ /sb/admin/scripts/nfsSpeedTest -s 13g -y -b 64k
gluster1: Write test (dd): 442.251 MB/s 3538.008 mbps 30.101 seconds
gluster1: Read test (dd): 219.708 MB/s 1757.660 mbps 60.590 seconds

The read test seems to scale linearly with the number of storage servers
(almost 1GB/s!). Interestingly, the /pirdist read test at 64k block
size was ~120MB/s faster than the read test straight from XFS; however,
gluster1 may simply have been busy, and the file read through /pirdist
may actually have been served by one of the other 4 less busy storage
nodes.
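
One way to rule that out would be to check which brick actually holds
the file, since in distribute mode the whole file lives on exactly one
backend export. A sketch, assuming each node exports
/export/<hostname>/distribute as in the volfiles below (the test file
name is hypothetical) :

for h in gluster1 gluster2 gluster3 gluster4 gluster5; do
  # the file only exists on the one brick DHT hashed it to
  ssh $h "ls -lh /export/$h/distribute/ddtest.13g" 2>/dev/null && echo "found on $h"
done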

Here's our storage node setup (many of these settings may not apply to v3.2) :

####

volume posix-stripe
  type storage/posix
  option directory /export/gluster1/stripe
end-volume

volume posix-distribute
  type storage/posix
  option directory /export/gluster1/distribute
end-volume

volume locks
  type features/locks
  subvolumes posix-stripe
end-volume

volume locks-dist
  type features/locks
  subvolumes posix-distribute
end-volume

volume iothreads
  type performance/io-threads
  option thread-count 16
  subvolumes locks
end-volume

volume iothreads-dist
  type performance/io-threads
  option thread-count 16
  subvolumes locks-dist
end-volume

volume server
  type protocol/server
  option transport-type ib-verbs
  option auth.addr.iothreads.allow 10.2.178.*
  option auth.addr.iothreads-dist.allow 10.2.178.*
  option auth.addr.locks.allow 10.2.178.*
  option auth.addr.posix-stripe.allow 10.2.178.*
  subvolumes iothreads iothreads-dist locks posix-stripe
end-volume

####
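
For completeness, with hand-written volfiles like these each storage
node runs glusterfsd against its own server volfile, along these lines
(the volfile and log paths are examples, not necessarily what we use) :

# start the server/brick process with the volfile above
glusterfsd -f /etc/glusterfs/glusterfsd.vol -l /var/log/glusterfs/glusterfsd.log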

Here's our stripe client setup :

####

volume client-stripe-1
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster1
  option remote-subvolume iothreads
end-volume

volume client-stripe-2
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster2
  option remote-subvolume iothreads
end-volume

volume client-stripe-3
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster3
  option remote-subvolume iothreads
end-volume

volume client-stripe-4
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster4
  option remote-subvolume iothreads
end-volume

volume client-stripe-5
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster5
  option remote-subvolume iothreads
end-volume

volume readahead-gluster1
  type performance/read-ahead
  option page-count 4           # 2 is default
  option force-atime-update off # default is off
  subvolumes client-stripe-1
end-volume

volume readahead-gluster2
  type performance/read-ahead
  option page-count 4           # 2 is default
  option force-atime-update off # default is off
  subvolumes client-stripe-2
end-volume

volume readahead-gluster3
  type performance/read-ahead
  option page-count 4           # 2 is default
  option force-atime-update off # default is off
  subvolumes client-stripe-3
end-volume

volume readahead-gluster4
  type performance/read-ahead
  option page-count 4           # 2 is default
  option force-atime-update off # default is off
  subvolumes client-stripe-4
end-volume
	
volume readahead-gluster5
  type performance/read-ahead
  option page-count 4           # 2 is default
  option force-atime-update off # default is off
  subvolumes client-stripe-5
end-volume

volume writebehind-gluster1
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster1
end-volume

volume writebehind-gluster2
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster2
end-volume

volume writebehind-gluster3
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster3
end-volume

volume writebehind-gluster4
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster4
end-volume

volume writebehind-gluster5
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster5
end-volume

volume quick-read-gluster1
  type performance/quick-read
  subvolumes writebehind-gluster1
end-volume

volume quick-read-gluster2
  type performance/quick-read
  subvolumes writebehind-gluster2
end-volume

volume quick-read-gluster3
  type performance/quick-read
  subvolumes writebehind-gluster3
end-volume

volume quick-read-gluster4
  type performance/quick-read
  subvolumes writebehind-gluster4
end-volume

volume quick-read-gluster5
  type performance/quick-read
  subvolumes writebehind-gluster5
end-volume

volume stat-prefetch-gluster1
  type performance/stat-prefetch
  #subvolumes quick-read-gluster1
  subvolumes writebehind-gluster1
end-volume

volume stat-prefetch-gluster2
  type performance/stat-prefetch
  #subvolumes quick-read-gluster2
  subvolumes writebehind-gluster2
end-volume

volume stat-prefetch-gluster3
  type performance/stat-prefetch
  #subvolumes quick-read-gluster3
  subvolumes writebehind-gluster3
end-volume

volume stat-prefetch-gluster4
  type performance/stat-prefetch
  #subvolumes quick-read-gluster4
  subvolumes writebehind-gluster4
end-volume

volume stat-prefetch-gluster5
  type performance/stat-prefetch
  #subvolumes quick-read-gluster5
  subvolumes writebehind-gluster5
end-volume

volume stripe
  type cluster/stripe
  option block-size 2MB
  #subvolumes client-stripe-1 client-stripe-2 client-stripe-3 client-stripe-4 client-stripe-5
  #subvolumes readahead-gluster1 readahead-gluster2 readahead-gluster3 readahead-gluster4 readahead-gluster5
  #subvolumes writebehind-gluster1 writebehind-gluster2 writebehind-gluster3 writebehind-gluster4 writebehind-gluster5
  #subvolumes quick-read-gluster1 quick-read-gluster2 quick-read-gluster3 quick-read-gluster4 quick-read-gluster5
  subvolumes stat-prefetch-gluster1 stat-prefetch-gluster2 stat-prefetch-gluster3 stat-prefetch-gluster4 stat-prefetch-gluster5
end-volume

####
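
Clients mount this by pointing the glusterfs client at the stripe
volfile, e.g. (the volfile path here is an example) :

# mount the striped volume; each file is split into 2MB blocks spread across the 5 servers
glusterfs -f /etc/glusterfs/pirstripe-client.vol /pirstripe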

Quick-read was disabled because of a bug that caused a crash when it
was enabled. This has been fixed in more recent versions, but I haven't
upgraded. Here's our client distribute setup :

####

volume client-distribute-1
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster1
  option remote-subvolume iothreads-dist
end-volume

volume client-distribute-2
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster2
  option remote-subvolume iothreads-dist
end-volume

volume client-distribute-3
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster3
  option remote-subvolume iothreads-dist
end-volume

volume client-distribute-4
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster4
  option remote-subvolume iothreads-dist
end-volume

volume client-distribute-5
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster5
  option remote-subvolume iothreads-dist
end-volume

volume readahead-gluster1
  type performance/read-ahead
  option page-count 4           # 2 is default
  option force-atime-update off # default is off
  subvolumes client-distribute-1
end-volume

volume readahead-gluster2
  type performance/read-ahead
  option page-count 4           # 2 is default
  option force-atime-update off # default is off
  subvolumes client-distribute-2
end-volume

volume readahead-gluster3
  type performance/read-ahead
  option page-count 4           # 2 is default
  option force-atime-update off # default is off
  subvolumes client-distribute-3
end-volume

volume readahead-gluster4
  type performance/read-ahead
  option page-count 4           # 2 is default
  option force-atime-update off # default is off
  subvolumes client-distribute-4
end-volume
	
volume readahead-gluster5
  type performance/read-ahead
  option page-count 4           # 2 is default
  option force-atime-update off # default is off
  subvolumes client-distribute-5
end-volume

volume writebehind-gluster1
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster1
end-volume

volume writebehind-gluster2
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster2
end-volume

volume writebehind-gluster3
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster3
end-volume

volume writebehind-gluster4
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster4
end-volume

volume writebehind-gluster5
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster5
end-volume

volume quick-read-gluster1
  type performance/quick-read
  subvolumes writebehind-gluster1
end-volume

volume quick-read-gluster2
  type performance/quick-read
  subvolumes writebehind-gluster2
end-volume

volume quick-read-gluster3
  type performance/quick-read
  subvolumes writebehind-gluster3
end-volume

volume quick-read-gluster4
  type performance/quick-read
  subvolumes writebehind-gluster4
end-volume

volume quick-read-gluster5
  type performance/quick-read
  subvolumes writebehind-gluster5
end-volume

volume stat-prefetch-gluster1
  type performance/stat-prefetch
  subvolumes quick-read-gluster1
end-volume

volume stat-prefetch-gluster2
  type performance/stat-prefetch
  subvolumes quick-read-gluster2
end-volume

volume stat-prefetch-gluster3
  type performance/stat-prefetch
  subvolumes quick-read-gluster3
end-volume

volume stat-prefetch-gluster4
  type performance/stat-prefetch
  subvolumes quick-read-gluster4
end-volume

volume stat-prefetch-gluster5
  type performance/stat-prefetch
  subvolumes quick-read-gluster5
end-volume

volume distribute
  type cluster/distribute
  #option block-size 2MB
  #subvolumes client-distribute-1 client-distribute-2 client-distribute-3 client-distribute-4 client-distribute-5
  option min-free-disk 1%
  #subvolumes writebehind-gluster1 writebehind-gluster2 writebehind-gluster3 writebehind-gluster4 writebehind-gluster5
  subvolumes stat-prefetch-gluster1 stat-prefetch-gluster2 stat-prefetch-gluster3 stat-prefetch-gluster4 stat-prefetch-gluster5
end-volume

####
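
The distribute volfile is mounted the same way, e.g. (again, the
volfile path is an example) :

# mount the distribute volume; whole files are hashed across the 5 servers
glusterfs -f /etc/glusterfs/pirdist-client.vol /pirdist

# sanity check: the mount should report the aggregate capacity of all 5 bricks
df -h /pirdist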

I don't know why my writes are so slow compared to reads. Let me know
if you're able to get better write speeds with the newer version of
gluster using any of the configurations (where they still apply) that
I've posted; it might compel me to upgrade.

HTH,
Sabuj Pattanayek

> For some background, our compute cluster has 64 compute nodes. The gluster
> storage pool has 10 Dell PowerEdge R515 servers, each with 12 x 2 TB disks.
> We have another 16 Dell PowerEdge R515s used as Lustre storage servers. The
> compute and storage nodes are all connected via QDR Infiniband. Both Gluster
> and Lustre are set to use RDMA over Infiniband. We are using OFED version
> 1.5.2-20101219, Gluster 3.2.2 and CentOS 5.5 on both the compute and storage
> nodes.
>
> Oddly, it seems like there's some sort of bottleneck on the client side --
> for example, we're only seeing about 50 MB/s write throughput from a single
> compute node when writing a 10GB file. But, if we run multiple simultaneous
> writes from multiple compute nodes to the same Gluster volume, we get 50
> MB/s from each compute node. However, running multiple writes from the same
> compute node does not increase throughput. The compute nodes have 48 cores
> and 128 GB RAM, so I don't think the issue is with the compute node
> hardware.
>
> With Lustre, on the same hardware, with the same version of OFED, we're
> seeing write throughput on that same 10 GB file as follows: 476 MB/s single
> stream write from a single compute node and aggregate performance of more
> like 2.4 GB/s if we run simultaneous writes. That leads me to believe that
> we don't have a problem with RDMA, otherwise Lustre, which is also using
> RDMA, should be similarly affected.
>
> We have tried both xfs and ext4 for the backend file system on the Gluster
> storage nodes (we're currently using ext4). We went with distributed (not
> distributed striped) for the Gluster volume -- the thought was that if there
> was a catastrophic failure of one of the storage nodes, we'd only lose the
> data on that node; presumably with distributed striped you'd lose any data
> striped across that volume, unless I have misinterpreted the documentation.
>
> So ... what's expected/normal throughput for Gluster over QDR IB to a
> relatively large storage pool (10 servers / 120 disks)? Does anyone have
> suggested tuning tips for improving performance?
>
> Thanks!
>
> John
>
> --
>
> ________________________________________________________
>
> John Lalande
> University of Wisconsin-Madison
> Space Science & Engineering Center
> 1225 W. Dayton Street, Room 439, Madison, WI 53706
> 608-263-2268 / john.lalande at ssec.wisc.edu


