[Gluster-users] Gluster linear scale-out performance
Ravishankar N
ravishankar at redhat.com
Fri Jul 31 03:20:32 UTC 2020
On 25/07/20 4:35 am, Artem Russakovskii wrote:
> Speaking of fio, could the gluster team please help me understand
> something?
>
> We've been having lots of performance issues related to gluster using
> attached block storage on Linode. At some point, I figured out that
> Linode has a cap of 500 IOPS on their block storage
> <https://www.linode.com/community/questions/19437/does-a-dedicated-cpu-or-high-memory-plan-improve-disk-io-performance#answer-72142>
> (with spikes to 1500 IOPS). The block storage we use is formatted xfs
> with 4KB bsize (block size).
>
> I then ran a bunch of fio tests on the block storage itself (not the
> gluster fuse mount), which performed horribly when the bs parameter
> was set to 4k:
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --ramp_time=4
> During these tests, fio ETA crawled to over an hour, at some point
> dropped to 45min and I did see 500-1500 IOPS flash by briefly, then it
> went back down to 0. I/O seems majorly choked for some reason, likely
> because gluster is using some of it. Transfer speed with such 4k block
> size is 2 MB/s with spikes to 6MB/s. This causes the load on the
> server to spike up to 100+ and brings down all our servers.
> Jobs: 1 (f=1): [w(1)][20.3%][r=0KiB/s,w=5908KiB/s][r=0,w=1477 IOPS][eta 43m:00s]
> Jobs: 1 (f=1): [w(1)][21.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 44m:54s]
> xfs_info /mnt/citadel_block1
> meta-data=/dev/sdc               isize=512    agcount=103, agsize=26214400 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=2684354560, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=51200, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> When I increase the --bs param to fio from 4k to, say, 64k, transfer
> speed goes up significantly and is more like 50MB/s, and at 256k, it's
> 200MB/s.
>
> So what I'm trying to understand is:
>
> 1. How does the xfs block size (4KB) relate to the block size in fio
> tests? If we're limited by IOPS, and xfs block size is 4KB, how
> can fio produce better results with varying --bs param?
> 2. Would increasing the xfs data block size to something like
> 64-256KB help with our issue of choking IO and skyrocketing load?
>
I have experienced similar behavior when running fio tests with bs=4k on
a gluster volume backed by XFS under high load (numjobs=32). When I
traced the brick processes (strace -f -T -p $PID), I saw
fsync system calls taking around 2500 seconds, which is insane. I'm not
sure whether this is specific to fio's I/O pattern and the way
XFS handles it. When I used 64k block sizes, the fio tests completed
just fine.
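
On question 1: fio's --bs is the size of each I/O request, while the XFS
bsize is only the filesystem's allocation unit, so the two are independent.
An IOPS cap limits requests per second, which means larger requests move
more bytes per operation. A rough sketch of the arithmetic using the figures
from your mail, assuming the MB/s values are really MiB/s and that the
provider counts one request as one I/O regardless of size (the function
names are only for illustration):

```python
KIB = 1024
MIB = 1024 * KIB


def throughput_mib_s(iops, bs_kib):
    """Throughput in MiB/s for a given IOPS rate and request size in KiB."""
    return iops * bs_kib * KIB / MIB


def implied_iops(mib_per_s, bs_kib):
    """IOPS implied by an observed throughput and request size in KiB."""
    return mib_per_s * MIB / (bs_kib * KIB)


# Cap quoted in the mail: 500 IOPS sustained, bursts to 1500 IOPS, bs=4k.
print(throughput_mib_s(500, 4))    # 1.953125
print(throughput_mib_s(1500, 4))   # 5.859375

# Observed throughput at larger request sizes, converted back to IOPS.
print(implied_iops(50, 64))        # 800.0
print(implied_iops(200, 256))      # 800.0
```

The 4k results (about 2 MiB/s, spiking to about 6 MiB/s) line up with the
500/1500 IOPS figures, and the 64k and 256k runs both work out to roughly
800 requests per second. That pattern looks like a per-request limit rather
than a bandwidth limit.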
>
> 1. The worst hangs and load spikes happen when we reboot one of the
> gluster servers, but not when it's down - when it comes back
> online. Even with gluster not showing anything pending heal, my
> guess is it's still trying to do lots of IO between the 4 nodes
> for some reason, but I don't understand why.
>
Do you kill all gluster processes (not just glusterd but even the brick
processes) before issuing reboot? This is necessary to prevent I/O
stalls. There is stop-all-gluster-processes.sh which should be available
as a part of the gluster installation (maybe in
/usr/share/glusterfs/scripts/) which you can use. Can you check if this
helps?
Regards,
Ravi
> I've been banging my head on the wall with this problem for months.
> Appreciate any feedback here.
>
> Thank you.
>
> gluster volume info below
> Volume Name: SNIP_data1
> Type: Replicate
> Volume ID: SNIP
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 4 = 4
> Transport-type: tcp
> Bricks:
> Brick1: nexus2:/mnt/SNIP_block1/SNIP_data1
> Brick2: forge:/mnt/SNIP_block1/SNIP_data1
> Brick3: hive:/mnt/SNIP_block1/SNIP_data1
> Brick4: citadel:/mnt/SNIP_block1/SNIP_data1
> Options Reconfigured:
> cluster.quorum-count: 1
> cluster.quorum-type: fixed
> network.ping-timeout: 5
> network.remote-dio: enable
> performance.rda-cache-limit: 256MB
> performance.readdir-ahead: on
> performance.parallel-readdir: on
> network.inode-lru-limit: 500000
> performance.md-cache-timeout: 600
> performance.cache-invalidation: on
> performance.stat-prefetch: on
> features.cache-invalidation-timeout: 600
> features.cache-invalidation: on
> cluster.readdir-optimize: on
> performance.io-thread-count: 32
> server.event-threads: 4
> client.event-threads: 4
> performance.read-ahead: off
> cluster.lookup-optimize: on
> performance.cache-size: 1GB
> cluster.self-heal-daemon: enable
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: on
> cluster.granular-entry-heal: enable
> cluster.data-self-heal-algorithm: full
>
> Sincerely,
> Artem
>
> --
> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
> <http://www.apkmirror.com/>, Illogical Robot LLC
> beerpla.net <http://beerpla.net/> | @ArtemR <http://twitter.com/ArtemR>
>
>
> On Thu, Jul 23, 2020 at 12:08 AM Qing Wang <qw at g.clemson.edu
> <mailto:qw at g.clemson.edu>> wrote:
>
> Hi,
>
> I have one more question about Gluster linear scale-out
> performance, specifically the "write-behind off" case. With
> "write-behind" off, and the same stripe volumes and other
> settings as posted earlier in this thread, the storage I/O does
> not seem to depend on the number of storage nodes. In my
> experiment, whether I have 2 brick server nodes or 8, the
> aggregate gluster I/O performance is ~100MB/sec, and the fio
> benchmark gives the same result. If "write-behind" is
> on, storage performance scales out linearly with
> the number of brick server nodes.
>
> Whether the write-behind option is on or off, I thought gluster
> I/O performance should be pooled and aggregated as a
> whole. If that is the case, why do I get constant gluster
> performance (~100MB/sec) when "write-behind" is off? Please advise
> me if I misunderstood something.
>
> Thanks,
> Qing
>
>
>
>
> On Tue, Jul 21, 2020 at 7:29 PM Qing Wang <qw at g.clemson.edu
> <mailto:qw at g.clemson.edu>> wrote:
>
> fio gives me the correct linear scale-out results, and you're
> right, the storage cache is the root cause that makes the dd
> measurement results not accurate at all.
>
> Thanks,
> Qing
>
>
> On Tue, Jul 21, 2020 at 2:53 PM Yaniv Kaul <ykaul at redhat.com
> <mailto:ykaul at redhat.com>> wrote:
>
>
>
> On Tue, 21 Jul 2020, 21:43 Qing Wang <qw at g.clemson.edu
> <mailto:qw at g.clemson.edu>> wrote:
>
> Hi Yaniv,
>
> Thanks for the quick response. I forgot to mention that I
> am testing write performance, not read performance. In
> this case, would the client cache hit rate still be a
> big issue?
>
>
> It's not hitting the storage directly. Since it's also
> single threaded, it may also not saturate it. I highly
> recommend testing properly.
> Y.
>
>
> I'll use fio to run my test once again, thanks for the
> suggestion.
>
> Thanks,
> Qing
>
> On Tue, Jul 21, 2020 at 2:38 PM Yaniv Kaul
> <ykaul at redhat.com <mailto:ykaul at redhat.com>> wrote:
>
>
>
> On Tue, 21 Jul 2020, 21:30 Qing Wang
> <qw at g.clemson.edu <mailto:qw at g.clemson.edu>> wrote:
>
> Hi,
>
> I am trying to test Gluster linear scale-out
> performance by adding more storage
> servers/bricks and measuring the storage I/O
> performance. To vary the number of storage
> servers, I create several "stripe" volumes that
> contain 2 brick servers, 3 brick servers, 4
> brick servers, and so on. On the gluster client
> side, I used "dd if=/dev/zero
> of=/mnt/glusterfs/dns_test_data_26g bs=1M
> count=26000" to create 26G of data (or a larger
> size). The data is distributed to
> the corresponding gluster servers (each has a
> gluster brick on it), and "dd" reports the
> final I/O throughput. The network is 40G
> InfiniBand, although I didn't do any specific
> configuration to use advanced features.
>
>
> Your dd command is inaccurate, as it'll hit the
> client cache. It is also single threaded. I
> suggest switching to fio.
> Y.
>
>
> What confuses me is that the storage I/O does
> not seem to depend on the number of storage
> nodes, but the Gluster documentation says it should
> scale linearly. For example, when
> "write-behind" is on, and when InfiniBand
> "jumbo frame" (connected mode) is on, I
> get ~800 MB/sec reported by "dd", whether I
> have 2 brick servers or 8 brick servers: in the
> 2-server case, each server gets ~400
> MB/sec; in the 4-server case, each server
> gets ~200 MB/sec. That is, the per-server I/O
> does add up to the final storage I/O (800
> MB/sec), but this is not "linear scale-out".
>
> Can somebody help me to understand why this is
> the case? I certainly can have some
> misunderstanding/misconfiguration here. Please
> correct me if I do, thanks!
>
> Best,
> Qing
> ________
>
>
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968
>
> Gluster-users mailing list
> Gluster-users at gluster.org
> <mailto:Gluster-users at gluster.org>
> https://lists.gluster.org/mailman/listinfo/gluster-users