[Gluster-users] glusterfs, striped volume x8, poor sequential read performance, good write performance

Sergey Koposov koposov at ast.cam.ac.uk
Mon Jul 21 20:35:15 UTC 2014


Hi,

I have an HPC installation with 8 nodes. Each node has a software
RAID1 built from two NL-SAS disks, and the disks from the 8 nodes are combined
into a large shared striped 20 TB glusterfs volume, which shows
abnormally slow sequential read performance while the write performance is good.

Basically, what I see is that the write performance is very decent, ~500 MB/s
(tested using dd):

[root@XXXX bigstor]# dd if=/dev/zero of=test2 bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 186.393 s, 563 MB/s

And all of this is not just sitting in the page cache of each node, as I can
see the data being flushed to the disks at approximately the right speed.
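
(For reference, this is roughly how I watched the flush on the brick nodes; a
minimal sketch, the interval and device selection are arbitrary:)

watch -n 2 'grep -E "^(Dirty|Writeback):" /proc/meminfo'   # dirty pages draining
iostat -x 2                                                # per-disk write rate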

At the same time, the read performance (tested using dd, with the caches
dropped beforehand) is really bad:

[root@XXXX bigstor]# dd if=/data/bigstor/test of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 309.821 s, 33.8 MB/s
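
(The caches were dropped before each read test with the usual command; a
sketch, run as root on the client and, to be thorough, on every brick node:)

sync
echo 3 > /proc/sys/vm/drop_caches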

While this is running, the glusterfs processes take at most ~10-15% of a CPU,
so it isn't CPU-starved.

The underlying devices do not seem to be loaded at all; iostat shows:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00   73.00    0.00  9344.00     0.00   256.00     0.11    1.48   1.47  10.70

To check that the disks themselves are not the problem, I did a separate test
of the read speed of the RAID devices on all machines: they read at ~180 MB/s
(uncached), so they aren't the bottleneck.

I also tried to increase the readahead on the RAID devices:
echo 2048 > /sys/block/md126/queue/read_ahead_kb
but that doesn't seem to help at all.
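
(For completeness, I applied it on every node with something like the
following; a sketch, assuming root ssh access and the same md device name on
each machine:)

for n in node{1..8}; do
    ssh "$n" 'echo 2048 > /sys/block/md126/queue/read_ahead_kb'
done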

Does anyone have any advice on what to do here? Which knobs should I adjust?
To be honest, it looks like a bug to me, but I would be happy if there is a
magic switch I forgot to turn on :)
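
(If it helps anyone reply, I can also collect per-brick latency statistics
with volume profiling; a sketch of the commands as I understand them:)

gluster volume profile glvol start
# ... re-run the dd read test ...
gluster volume profile glvol info
gluster volume profile glvol stop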

Here are more details about my system:

OS: CentOS 6.5
glusterfs: 3.4.4
Kernel: 2.6.32-431.20.3.el6.x86_64
mount options and df output:

[root@XXXX bigstor]# cat /etc/mtab

/dev/md126p4 /data/glvol/brick1 xfs rw 0 0
node1:/glvol /data/bigstor fuse.glusterfs  rw,default_permissions,allow_other,max_read=131072 0 0

[root@XXXX bigstor]# df
Filesystem       1K-blocks        Used  Available Use% Mounted on
/dev/md126p4    2516284988  2356820844  159464144  94% /data/glvol/brick1
node1:/glvol   20130279808 18824658688 1305621120  94% /data/bigstor

brick info:
xfs_info /data/glvol/brick1
meta-data=/dev/md126p4           isize=512    agcount=4, agsize=157344640 blks
          =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=629378560, imaxpct=5
          =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=307313, version=2
          =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


Here is the gluster info:
[root@XXXXXX bigstor]# gluster
gluster> volume info glvol

Volume Name: glvol
Type: Stripe
Volume ID: 53b2f6ad-46a6-4359-acad-dc5b6687d535
Status: Started
Number of Bricks: 1 x 8 = 8
Transport-type: tcp
Bricks:
Brick1: node1:/data/glvol/brick1/brick
Brick2: node2:/data/glvol/brick1/brick
Brick3: node3:/data/glvol/brick1/brick
Brick4: node4:/data/glvol/brick1/brick
Brick5: node5:/data/glvol/brick1/brick
Brick6: node6:/data/glvol/brick1/brick
Brick7: node7:/data/glvol/brick1/brick
Brick8: node8:/data/glvol/brick1/brick
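
(For reference, the volume was created roughly like this; reconstructed from
the info above, so the exact command line is an assumption:)

gluster volume create glvol stripe 8 transport tcp \
    node1:/data/glvol/brick1/brick node2:/data/glvol/brick1/brick \
    node3:/data/glvol/brick1/brick node4:/data/glvol/brick1/brick \
    node5:/data/glvol/brick1/brick node6:/data/glvol/brick1/brick \
    node7:/data/glvol/brick1/brick node8:/data/glvol/brick1/brick
gluster volume start glvol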

The network I use is IP over InfiniBand, which has very high throughput.
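
(The raw bandwidth between any pair of nodes can be sanity-checked with iperf;
a sketch, node names assumed:)

# on node2
iperf -s
# on node1
iperf -c node2 -t 30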

I also saw a discussion of a similar issue here:
http://supercolony.gluster.org/pipermail/gluster-users/2013-February/035560.html
but there it was blamed on ext4.

Thanks in advance,
 	Sergey

PS: I also looked at the contents of /var/lib/glusterd/vols/glvol/glvol-fuse.vol
and saw the following; I don't know whether it is relevant or not:
volume glvol-client-0
     type protocol/client
     option transport-type tcp
     option remote-subvolume /data/glvol/brick1/brick
     option remote-host node1
end-volume
...... 
volume glvol-stripe-0
     type cluster/stripe
     subvolumes glvol-client-0 glvol-client-1 glvol-client-2 glvol-client-3 glvol-client-4 glvol-client-5 glvol-client-6 glvol-client-7
end-volume

volume glvol-dht
     type cluster/distribute
     subvolumes glvol-stripe-0
end-volume

volume glvol-write-behind
     type performance/write-behind
     subvolumes glvol-dht
end-volume

volume glvol-read-ahead
     type performance/read-ahead
     subvolumes glvol-write-behind
end-volume

volume glvol-io-cache
     type performance/io-cache
     subvolumes glvol-read-ahead
end-volume

volume glvol-quick-read
     type performance/quick-read
     subvolumes glvol-io-cache
end-volume

volume glvol-open-behind
     type performance/open-behind
     subvolumes glvol-quick-read
end-volume

volume glvol-md-cache
     type performance/md-cache
     subvolumes glvol-open-behind
end-volume

volume glvol
     type debug/io-stats
     option count-fop-hits off
     option latency-measurement off
     subvolumes glvol-md-cache
end-volume
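
(If the client-side translators are suspected, I understand they can be tuned
from the CLI with gluster volume set; a sketch, where the option names and
values are my assumptions for 3.4.x rather than something I have verified to
help:)

gluster volume set glvol performance.read-ahead-page-count 16
gluster volume set glvol performance.cache-size 256MB
# or disable individual translators to isolate the problem, e.g.:
gluster volume set glvol performance.quick-read off
gluster volume set glvol performance.io-cache off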


*****************************************************
Sergey E. Koposov, PhD, Senior Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


