[Gluster-users] Gluster-users Digest, Vol 75, Issue 25 - striped volume x8, poor sequential read performance

Ben England bengland at redhat.com
Wed Jul 30 01:53:58 UTC 2014


Sergey, comments inline...

Is your intended workload really single-client single-thread?    Or is it more MPI-like?  For example, do you have many clients reading from different parts of the same large file?  If the latter, perhaps IOR would be a better benchmark for you.
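
If it's MPI-like, a run along these lines with IOR would exercise parallel reads of a single shared file; the task count, sizes, and path are just placeholders:

  # 8 MPI tasks write (with fsync, -e) and then read one shared 32 GB file using 1 MB transfers
  mpirun -np 8 ior -a POSIX -t 1m -b 4g -o /data/bigstor/ior_testfile -w -r -e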

Sorry, I'm not very familiar with the stripe translator myself.

----- Original Message -----
> From: gluster-users-request at gluster.org
> To: gluster-users at gluster.org
> Sent: Tuesday, July 22, 2014 7:21:56 AM
> Subject: Gluster-users Digest, Vol 75, Issue 25
> 
> ------------------------------
> 
> Message: 9
> Date: Mon, 21 Jul 2014 21:35:15 +0100 (BST)
> From: Sergey Koposov <koposov at ast.cam.ac.uk>
> To: gluster-users at gluster.org
> Subject: [Gluster-users] glusterfs, striped volume x8, poor sequential
> 	read performance, good write performance
> Message-ID:
> 	<alpine.LRH.2.11.1407212046110.17942 at calx115.ast.cam.ac.uk>
> Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII
> 
> Hi,
> 
> I have an HPC installation with 8 nodes. Each node has a software
> RAID1 built from two NL-SAS disks, and the RAID devices from the 8 nodes
> are combined into a large shared striped 20 TB glusterfs volume, which
> seems to show abnormally slow sequential read performance despite good
> write performance.
> 
> Basically, what I see is that the write performance is very decent, ~
> 500 MB/s (tested using dd):
> 
> [root at XXXX bigstor]# dd if=/dev/zero of=test2 bs=1M count=100000
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 186.393 s, 563 MB/s
> 
> And all this is not just sitting in the cache of each node, as I see the
> data being flushed to the disks at approximately the right speed.
> 
> At the same time, the read performance (tested using dd after dropping
> the caches beforehand) is really bad:
> 
> [root at XXXX bigstor]# dd if=/data/bigstor/test of=/dev/null bs=1M
> count=10000
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 309.821 s, 33.8 MB/s
> 
> While doing this, the glusterfs processes take only ~10-15% of the CPU at
> most, so it isn't CPU starvation.
> 
> The underlying devices do not seem to be loaded at all:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.00   73.00    0.00  9344.00     0.00   256.00     0.11    1.48   1.47  10.70
> 
> To check that the disks themselves are not the problem, I did a separate
> test of the read speed of the RAID devices on all machines, and they have
> read speeds of ~180 MB/s (uncached). So they aren't the bottleneck.
> 

Gluster has a read-ahead-page-count setting; I'd try raising it to 16 (as high as it will go), the default is 4.  Writes are different because a write to a brick can complete before the data hits the disk (in other words, as soon as the data reaches server memory), but for reads, if the data is not cached in memory, your only option is to get all the bricks reading at the same time.
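
A minimal example, assuming your 3.4.4 build exposes the option under its usual name:

  # raise client-side read-ahead from the default of 4 pages to the maximum of 16
  gluster volume set glvol performance.read-ahead-page-count 16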

Contrast this with a single-brick 12-disk RAID6 volume (with 32 MB readahead) that can hit 800 MB/s on read.  Clearly it isn't the rest of Gluster that's holding you back; it's probably the stripe translator's behavior.  Does the stripe translator support parallel reads to different subvolumes in the stripe?  Can you post a protocol trace that shows the on-the-wire behavior (collect with tcpdump, display with wireshark)?
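
Something like this on the client should capture the read traffic; the interface name and brick port range are guesses for your setup:

  # capture Gluster traffic on the IPoIB interface (24007 is glusterd; bricks usually listen on 49152 and up in 3.4)
  tcpdump -i ib0 -s 0 -w /tmp/gluster-read.pcap port 24007 or portrange 49152-49251
  # then open /tmp/gluster-read.pcap in wireshark on a workstation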

You could also try running a re-read test without the stripe translator; based on my own experience, I suspect it will perform better.
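
One rough way to do that, assuming you have spare space on the brick filesystems for a throwaway non-striped volume (all names below are hypothetical):

  # create a plain distributed (non-striped) scratch volume across two of the nodes
  gluster volume create testvol node1:/data/glvol/brick1/testbrick node2:/data/glvol/brick1/testbrick
  gluster volume start testvol
  mkdir -p /mnt/testvol
  mount -t glusterfs node1:/testvol /mnt/testvol
  # write a large file, drop caches (client and servers), then time the re-read
  dd if=/dev/zero of=/mnt/testvol/bigfile bs=1M count=10000
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/testvol/bigfile of=/dev/null bs=1M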

> I also tried to increase the readahead on the raid disks
> echo 2048 > /sys/block/md126/queue/read_ahead_kb
> but that doesn't seem to help at all.
> 

To take block-device readahead out of the picture, try re-reading a file that fits in the Linux buffer cache on the servers -- readahead is then irrelevant, since there is no disk I/O at all, and you are effectively doing a pure network test with Gluster.
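
For example, reusing the 10 GB test file you already have (assuming the servers have enough RAM between them to hold it):

  # first pass warms the servers' page cache; dropping caches on the client only
  # makes the second pass a Gluster/network test with no disk I/O on the servers
  dd if=/data/bigstor/test of=/dev/null bs=1M
  echo 3 > /proc/sys/vm/drop_caches    # on the client only, not on the servers
  dd if=/data/bigstor/test of=/dev/null bs=1M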

Also, try just doing a dd read from the "brick" (subvolume) directly.
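
For example, on one of the servers (note that with stripe the file Gluster stores on each brick is sparse, so for a clean disk test it's easier to read a file written straight onto the brick filesystem; the file name here is hypothetical):

  # write a plain file onto the brick's XFS filesystem, drop caches, read it back
  dd if=/dev/zero of=/data/glvol/brick1/ddtest bs=1M count=10000
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/data/glvol/brick1/ddtest of=/dev/null bs=1M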

> Does anyone have any advice on what to do here?  What knobs should I adjust?
> To be honest, this looks like a bug to me, but I would be happy if there is
> a magic switch I forgot to turn on )
> 

Also, if you are using IPoIB, try the jumbo-frame setting MTU=65520 together with connected mode (in ifcfg-ib0) to reduce InfiniBand interrupts on the client side.
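
On RHEL/CentOS 6 that usually means something like the following in /etc/sysconfig/network-scripts/ifcfg-ib0; the exact key for connected mode varies between initscripts/rdma versions, so double-check yours:

  # IPoIB connected mode plus the large MTU it allows
  CONNECTED_MODE=yes
  MTU=65520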

Also try the FUSE mount option "-o gid-timeout=2".
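
That is, something along these lines when mounting:

  # cache resolved group IDs for 2 seconds instead of looking them up on every operation
  mount -t glusterfs -o gid-timeout=2 node1:/glvol /data/bigstor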

What is the stripe block size of the Gluster volume in KB?  It looks like it's the default (I forget what that is), but you probably want something like 128 KB across the 8 bricks.  A very large stripe block size will prevent Gluster from utilizing more than one brick at a time for a given read.
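
You can check and, if needed, change it with something like the following; I'm assuming the option is spelled cluster.stripe-block-size on 3.4, and as far as I recall a new value only applies to files written after the change:

  # non-default options show up under "Options Reconfigured"
  gluster volume info glvol
  # set a 128 KB stripe block size so a big sequential read fans out across all 8 bricks quickly
  gluster volume set glvol cluster.stripe-block-size 128KB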


> Here is more details about my system
> 
> OS: Centos 6.5
> glusterfs : 3.4.4
> Kernel 2.6.32-431.20.3.el6.x86_64
> mount options and df output:
> 
> [root at XXXX bigstor]# cat /etc/mtab
> 
> /dev/md126p4 /data/glvol/brick1 xfs rw 0 0
> node1:/glvol /data/bigstor fuse.glusterfs
> rw,default_permissions,allow_other,max_read=131072 0 0
> 
> [root at XXXX bigstor]# df
> Filesystem       1K-blocks        Used  Available Use% Mounted on
> /dev/md126p4    2516284988  2356820844  159464144  94% /data/glvol/brick1
> node1:/glvol   20130279808 18824658688 1305621120  94% /data/bigstor
> 
> brick info:
> xfs_info  /data/glvol/brick1
> meta-data=/dev/md126p4           isize=512    agcount=4, agsize=157344640 blks
>           =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=629378560, imaxpct=5
>           =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=307313, version=2
>           =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> 
> Here is the gluster info:
> [root at XXXXXX bigstor]# gluster
> gluster> volume info glvol
> 
> Volume Name: glvol
> Type: Stripe
> Volume ID: 53b2f6ad-46a6-4359-acad-dc5b6687d535
> Status: Started
> Number of Bricks: 1 x 8 = 8
> Transport-type: tcp
> Bricks:
> Brick1: node1:/data/glvol/brick1/brick
> Brick2: node2:/data/glvol/brick1/brick
> Brick3: node3:/data/glvol/brick1/brick
> Brick4: node4:/data/glvol/brick1/brick
> Brick5: node5:/data/glvol/brick1/brick
> Brick6: node6:/data/glvol/brick1/brick
> Brick7: node7:/data/glvol/brick1/brick
> Brick8: node8:/data/glvol/brick1/brick
> 
> The network I use is IP over InfiniBand, with very high throughput.
> 
> I also saw the discussion of a similar issue here:
> http://supercolony.gluster.org/pipermail/gluster-users/2013-February/035560.html
> but there it was blamed on ext4.
> 
> Thanks in advance,
>  	Sergey
> 
> PS: I also looked at the contents of
> /var/lib/glusterd/vols/glvol/glvol-fuse.vol
> and saw this; I don't know whether it's relevant or not:
> volume glvol-client-0
>      type protocol/client
>      option transport-type tcp
>      option remote-subvolume /data/glvol/brick1/brick
>      option remote-host node1
> end-volume
> ......
> volume glvol-stripe-0
>      type cluster/stripe
>      subvolumes glvol-client-0 glvol-client-1 glvol-client-2 glvol-client-3
> glvol-client-4 glvol-client-5 glvol-client-6 glvol-client-7
> end-volume
> 
> volume glvol-dht
>      type cluster/distribute
>      subvolumes glvol-stripe-0
> end-volume
> 
> volume glvol-write-behind
>      type performance/write-behind
>      subvolumes glvol-dht
> end-volume
> 
> volume glvol-read-ahead
>      type performance/read-ahead
>      subvolumes glvol-write-behind
> end-volume
> 
> volume glvol-io-cache
>      type performance/io-cache
>      subvolumes glvol-read-ahead
> end-volume
> 
> volume glvol-quick-read
>      type performance/quick-read
>      subvolumes glvol-io-cache
> end-volume
> 
> volume glvol-open-behind
>      type performance/open-behind
>      subvolumes glvol-quick-read
> end-volume
> 
> volume glvol-md-cache
>      type performance/md-cache
>      subvolumes glvol-open-behind
> end-volume
> 
> volume glvol
>      type debug/io-stats
>      option count-fop-hits off
>      option latency-measurement off
>      subvolumes glvol-md-cache
> end-volume
> 

To isolate the problem, you could also get rid of the quick-read, open-behind, and io-cache translators, which are not helping you here and could possibly interfere.  That's a great thing about Gluster: you can chop out portions of the translator stack that easily.
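
With the volume set interface you don't even need to edit the volfiles by hand; something like this should drop them out of the client graph (option names as I remember them in 3.4):

  # turn off client-side caching translators that don't help large sequential reads
  gluster volume set glvol performance.quick-read off
  gluster volume set glvol performance.open-behind off
  gluster volume set glvol performance.io-cache off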


> 
> *****************************************************
> Sergey E. Koposov, PhD, Senior Research Associate
> Institute of Astronomy, University of Cambridge
> Madingley road, CB3 0HA, Cambridge, UK
> Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
> 


