[Gluster-users] Gluster linear scale-out performance

Fri Jul 31 03:20:32 UTC 2020

On 25/07/20 4:35 am, Artem Russakovskii wrote:
> Speaking of fio, could the gluster team please help me understand 
> something?
>
> We've been having lots of performance issues related to gluster using 
> attached block storage on Linode. At some point, I figured out that 
> Linode has a cap of 500 IOPS on their block storage 
> <https://www.linode.com/community/questions/19437/does-a-dedicated-cpu-or-high-memory-plan-improve-disk-io-performance#answer-72142> 
> (with spikes to 1500 IOPS). The block storage we use is formatted xfs 
> with 4KB bsize (block size).
>
> I then ran a bunch of fio tests on the block storage itself (not the 
> gluster fuse mount), which performed horribly when the bs parameter 
> was set to 4k:
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --ramp_time=4
> During these tests, fio ETA crawled to over an hour, at some point 
> dropped to 45min and I did see 500-1500 IOPS flash by briefly, then it 
> went back down to 0. I/O seems majorly choked for some reason, likely 
> because gluster is using some of it. Transfer speed with such 4k block 
> size is 2 MB/s with spikes to 6MB/s. This causes the load on the 
> server to spike up to 100+ and brings down all our servers.
> Jobs: 1 (f=1): [w(1)][20.3%][r=0KiB/s,w=5908KiB/s][r=0,w=1477 IOPS][eta 43m:00s]
> Jobs: 1 (f=1): [w(1)][21.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 44m:54s]
>
> xfs_info /mnt/citadel_block1
> meta-data=/dev/sdc               isize=512    agcount=103, agsize=26214400 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=2684354560, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=51200, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> When I increase the --bs param to fio from 4k to, say, 64k, transfer 
> speed goes up significantly and is more like 50MB/s, and at 256k, it's 
> 200MB/s.
>
> So what I'm trying to understand is:
>
>  1. How does the xfs block size (4KB) relate to the block size in fio
>     tests? If we're limited by IOPS, and xfs block size is 4KB, how
>     can fio produce better results with varying --bs param?
>  2. Would increasing the xfs data block size to something like
>     64-256KB help with our issue of choking IO and skyrocketing load?
>
I have experienced similar behavior when running fio tests with bs=4k on 
a gluster volume backed by XFS under high load (numjobs=32). When I 
traced the brick processes (strace -f -T -p $PID), I saw fsync system 
calls taking around 2500 seconds, which is insane. I'm not sure whether 
this is specific to fio's I/O pattern and the way XFS handles it. When I 
used 64k block sizes, the fio tests completed just fine.
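
On question 1: as far as I understand it, the XFS bsize is the
filesystem's allocation unit, while fio's --bs is the size of each I/O
request, so the two are independent. If the cap is on requests per
second, throughput is roughly IOPS times request size. A minimal sketch
of that arithmetic, assuming the cap counts each request once regardless
of its size (I have not checked how Linode accounts for large requests);
throughput_kib is just an illustrative helper name:

```shell
#!/bin/sh
# Rough model: throughput (KiB/s) = IOPS * request size (KiB).
# Assumption: the provider's cap counts each request once, whatever its size.
throughput_kib() {
    iops=$1
    bs_kib=$2
    echo $((iops * bs_kib))
}

# 500 is the sustained cap and 1500 the burst figure quoted above.
for bs_kib in 4 64 256; do
    echo "bs=${bs_kib}k: $(throughput_kib 500 "$bs_kib") KiB/s sustained, $(throughput_kib 1500 "$bs_kib") KiB/s burst"
done
```

For bs=4k this gives 2000 KiB/s sustained and 6000 KiB/s burst, which
matches the 2 MB/s and 6 MB/s you saw, and 1477 IOPS x 4 KiB is exactly
the 5908 KiB/s in your fio status line. Under the same assumption, I
would not expect a larger XFS block size to change this, since it does
not change the size of the requests the application issues.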
>
>  3. The worst hangs and load spikes happen when we reboot one of the
>     gluster servers, but not when it's down - when it comes back
>     online. Even with gluster not showing anything pending heal, my
>     guess is it's still trying to do lots of IO between the 4 nodes
>     for some reason, but I don't understand why.
>
Do you kill all gluster processes (not just glusterd but even the brick 
processes) before issuing reboot? This is necessary to prevent I/O 
stalls. There is stop-all-gluster-processes.sh which should be available 
as a part of the gluster installation (maybe in 
/usr/share/glusterfs/scripts/) which you can use.  Can you check if this 
helps?
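
Something along these lines could be run on the node before the reboot.
It is only a sketch: find_stop_script is a hypothetical helper, and the
directory is the one I guessed above, so adjust it to wherever your
packages actually install the script.

```shell
#!/bin/sh
# Print the first executable stop-all-gluster-processes.sh found in the
# given directories; return non-zero if none is found.
find_stop_script() {
    for dir in "$@"; do
        if [ -x "$dir/stop-all-gluster-processes.sh" ]; then
            echo "$dir/stop-all-gluster-processes.sh"
            return 0
        fi
    done
    return 1
}

# Intended use on the node about to be rebooted (as root); the path is
# an assumption, not a guaranteed install location:
#   script=$(find_stop_script /usr/share/glusterfs/scripts) && "$script" && reboot
```

The reboot is chained with && so that it only happens if the script was
found and exited successfully.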

Regards,

Ravi

> I've been banging my head on the wall with this problem for months. 
> Appreciate any feedback here.
>
> Thank you.
>
> gluster volume info below
> Volume Name: SNIP_data1
> Type: Replicate
> Volume ID: SNIP
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 4 = 4
> Transport-type: tcp
> Bricks:
> Brick1: nexus2:/mnt/SNIP_block1/SNIP_data1
> Brick2: forge:/mnt/SNIP_block1/SNIP_data1
> Brick3: hive:/mnt/SNIP_block1/SNIP_data1
> Brick4: citadel:/mnt/SNIP_block1/SNIP_data1
> Options Reconfigured:
> cluster.quorum-count: 1
> cluster.quorum-type: fixed
> network.ping-timeout: 5
> network.remote-dio: enable
> performance.rda-cache-limit: 256MB
> performance.readdir-ahead: on
> performance.parallel-readdir: on
> network.inode-lru-limit: 500000
> performance.md-cache-timeout: 600
> performance.cache-invalidation: on
> performance.stat-prefetch: on
> features.cache-invalidation-timeout: 600
> features.cache-invalidation: on
> cluster.readdir-optimize: on
> performance.io-thread-count: 32
> server.event-threads: 4
> client.event-threads: 4
> performance.read-ahead: off
> cluster.lookup-optimize: on
> performance.cache-size: 1GB
> cluster.self-heal-daemon: enable
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: on
> cluster.granular-entry-heal: enable
> cluster.data-self-heal-algorithm: full
>
> Sincerely,
> Artem
>
> --
> Founder, Android Police <http://www.androidpolice.com>, APK Mirror 
> <http://www.apkmirror.com/>, Illogical Robot LLC
> beerpla.net <http://beerpla.net/> | @ArtemR <http://twitter.com/ArtemR>
>
>
> On Thu, Jul 23, 2020 at 12:08 AM Qing Wang <qw at g.clemson.edu 
> <mailto:qw at g.clemson.edu>> wrote:
>
>     Hi,
>
>     I have one more question about Gluster linear scale-out
>     performance, specifically the "write-behind off" case. With
>     "write-behind" off, and the same stripe volumes and other
>     settings as posted earlier in this thread, the storage I/O does
>     not seem to depend on the number of storage nodes. In my
>     experiment, whether I have 2 brick server nodes or 8, the
>     aggregate gluster I/O performance is ~100MB/sec, and the fio
>     benchmark gives the same result. With "write-behind" on, the
>     storage performance scales out linearly as the number of brick
>     server nodes increases.
>
>     Whether the write-behind option is on or off, I thought the
>     gluster I/O performance of the bricks should be pooled and
>     aggregated as a whole. If that is the case, why do I get a
>     constant gluster performance (~100MB/sec) when "write-behind" is
>     off? Please advise if I have misunderstood something.
>
>     Thanks,
>     Qing
>
>
>
>
>     On Tue, Jul 21, 2020 at 7:29 PM Qing Wang <qw at g.clemson.edu
>     <mailto:qw at g.clemson.edu>> wrote:
>
>         fio gives me the correct linear scale-out results, and you're
>         right, the storage cache is the root cause that makes the dd
>         measurement results not accurate at all.
>
>         Thanks,
>         Qing
>
>
>         On Tue, Jul 21, 2020 at 2:53 PM Yaniv Kaul <ykaul at redhat.com
>         <mailto:ykaul at redhat.com>> wrote:
>
>
>
>             On Tue, 21 Jul 2020, 21:43 Qing Wang <qw at g.clemson.edu
>             <mailto:qw at g.clemson.edu>> wrote:
>
>                 Hi Yaniv,
>
>                 Thanks for the quick response. I forgot to mention
>                 that I am testing write performance, not read
>                 performance. In this case, would the client cache hit
>                 rate still be a big issue?
>
>
>             It's not hitting the storage directly. Since it's also
>             single threaded, it may also not saturate it. I highly
>             recommend testing properly.
>             Y.
>
>
>                 I'll use fio to run my test once again, thanks for the
>                 suggestion.
>
>                 Thanks,
>                 Qing
>
>                 On Tue, Jul 21, 2020 at 2:38 PM Yaniv Kaul
>                 <ykaul at redhat.com <mailto:ykaul at redhat.com>> wrote:
>
>
>
>                     On Tue, 21 Jul 2020, 21:30 Qing Wang
>                     <qw at g.clemson.edu <mailto:qw at g.clemson.edu>> wrote:
>
>                         Hi,
>
>                         I am trying to test Gluster linear scale-out
>                         performance by adding more storage
>                         servers/bricks and measuring the storage I/O
>                         performance. To vary the number of storage
>                         servers, I create several "stripe" volumes that
>                         contain 2 brick servers, 3 brick servers, 4
>                         brick servers, and so on. On the gluster client
>                         side, I used "dd if=/dev/zero
>                         of=/mnt/glusterfs/dns_test_data_26g bs=1M
>                         count=26000" to create 26G of data (or more).
>                         The data is distributed to the corresponding
>                         gluster servers (each has a gluster brick on
>                         it), and "dd" reports the final I/O
>                         throughput. The network is 40G InfiniBand,
>                         although I didn't do any specific
>                         configuration to use advanced features.
>
>
>                     Your dd command is inaccurate, as it'll hit the
>                     client cache. It is also single threaded. I
>                     suggest switching to fio.
>                     Y.
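
For reference, a fio run along the lines Yaniv suggests might look like
the sketch below. The function only builds and prints the command; the
job name, size and job count are illustrative choices of mine, not
values from this thread, and /mnt/glusterfs is the mount point from the
dd command above. If the mount rejects O_DIRECT, dropping --direct=1 and
relying on --end_fsync=1 would be one alternative to try.

```shell
#!/bin/sh
# Build (but do not run) a fio command for a multi-job sequential write.
# $1 = directory on the gluster mount, $2 = number of parallel jobs.
# --direct=1 bypasses the client page cache; --numjobs runs several
# writers in parallel, unlike the single-threaded dd.
fio_write_cmd() {
    echo "fio --name=seqwrite --directory=$1 --rw=write --bs=1M --size=4G --numjobs=$2 --ioengine=libaio --direct=1 --end_fsync=1 --group_reporting"
}

# Print the command for inspection; pipe the output to sh to run it.
fio_write_cmd /mnt/glusterfs 8
```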
>
>
>                         What confuses me is that the storage I/O does
>                         not seem to depend on the number of storage
>                         nodes, although the Gluster documentation says
>                         it should scale linearly. For example, when
>                         "write-behind" is on and InfiniBand
>                         "jumbo frame" (connected mode) is on, I
>                         get ~800 MB/sec reported by "dd" whether I
>                         have 2 brick servers or 8 brick servers -- in
>                         the 2-server case, each server handles ~400
>                         MB/sec; in the 4-server case, each server
>                         handles ~200 MB/sec. So the per-server I/O
>                         does add up to the final storage I/O (800
>                         MB/sec), but this is not "linear scale-out".
>
>                         Can somebody help me understand why this is
>                         the case? I may well have some
>                         misunderstanding or misconfiguration here.
>                         Please correct me if I do, thanks!
>
>                         Best,
>                         Qing
>                         ________
>
>
>
>                         Community Meeting Calendar:
>
>                         Schedule -
>                         Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>                         Bridge: https://bluejeans.com/441850968
>
>                         Gluster-users mailing list
>                         Gluster-users at gluster.org
>                         <mailto:Gluster-users at gluster.org>
>                         https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
>