[Gluster-users] Gluster linear scale-out performance

Artem Russakovskii archon810 at gmail.com
Mon Aug 3 17:54:24 UTC 2020


>
> Do you kill all gluster processes (not just glusterd but even the brick
> processes) before issuing reboot? This is necessary to prevent I/O stalls.
> There is stop-all-gluster-processes.sh which should be available as a part
> of the gluster installation (maybe in /usr/share/glusterfs/scripts/) which
> you can use.  Can you check if this helps?
>
A reboot shuts down gracefully, so those processes are shut down before the
reboot begins.
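
(For anyone who finds this thread later: the script Ravi mentions can also be
run by hand before a planned reboot. A minimal sketch, assuming the path from
a typical package install (check where your distro actually puts it):

/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
reboot
)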

We've moved on to discussing this matter in the gluster Slack; there's a
lot more info there now about the above. The gist is that heavy xfs
fragmentation when bricks are almost full (95-96%) made healing, as well as
ordinary disk accesses, a lot more expensive, slower, and prone to hanging.

What's still not clear is why a slowdown of one brick/gluster instance
similarly affects all bricks/gluster instances on the other servers, and
how that can be optimized/mitigated.

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>


On Thu, Jul 30, 2020 at 8:21 PM Ravishankar N <ravishankar at redhat.com>
wrote:

>
> On 25/07/20 4:35 am, Artem Russakovskii wrote:
>
> Speaking of fio, could the gluster team please help me understand
> something?
>
> We've been having lots of performance issues related to gluster using
> attached block storage on Linode. At some point, I figured out that Linode
> has a cap of 500 IOPS on their block storage
> <https://www.linode.com/community/questions/19437/does-a-dedicated-cpu-or-high-memory-plan-improve-disk-io-performance#answer-72142>
> (with spikes to 1500 IOPS). The block storage we use is formatted xfs with
> 4KB bsize (block size).
>
> I then ran a bunch of fio tests on the block storage itself (not the
> gluster fuse mount), which performed horribly when the bs parameter was set
> to 4k:
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> --name=test --filename=test --bs=4k --iodepth=64 --size=4G
> --readwrite=randwrite --ramp_time=4
> During these tests, the fio ETA crawled to over an hour, at some point dropped
> to 45min, and I did see 500-1500 IOPS flash by briefly, then it went back
> down to 0. I/O seems majorly choked for some reason, likely because gluster
> is consuming part of the available IOPS. Transfer speed with such a 4k block
> size is 2 MB/s with spikes to 6 MB/s. This causes the load on the server to
> spike to 100+ and brings down all our servers.
>
> Jobs: 1 (f=1): [w(1)][20.3%][r=0KiB/s,w=5908KiB/s][r=0,w=1477 IOPS][eta 43m:00s]
> Jobs: 1 (f=1): [w(1)][21.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 44m:54s]
>
> xfs_info /mnt/citadel_block1
> meta-data=/dev/sdc               isize=512    agcount=103, agsize=26214400 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=2684354560, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=51200, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> When I increase the --bs param to fio from 4k to, say, 64k, transfer speed
> goes up significantly and is more like 50MB/s, and at 256k, it's 200MB/s.
>
> So what I'm trying to understand is:
>
>    1. How does the xfs block size (4KB) relate to the block size in fio
>    tests? If we're limited by IOPS, and the xfs block size is 4KB, how can fio
>    produce better results with varying --bs param? (see the rough math below)
>    2. Would increasing the xfs data block size to something like 64-256KB
>    help with our issue of choking IO and skyrocketing load?
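>
> (Back-of-the-envelope, assuming the 500-1500 IOPS cap is the binding limit
> and each fio request becomes one I/O against the block device:
>    500 IOPS x 4 KiB   = ~2 MB/s   (matches the ~2 MB/s seen at --bs=4k)
>    500 IOPS x 64 KiB  = ~32 MB/s
>    500 IOPS x 256 KiB = ~128 MB/s
> so a larger --bs moves more data per I/O without needing more IOPS, and the
> observed 50 MB/s and 200 MB/s fit if the device bursts above 500 IOPS.)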
>
> I have experienced similar behavior when running fio tests with bs=4k on a
> gluster volume backed by XFS under high load (numjobs=32). When I
> observed the strace of the brick processes (strace -f -T -p $PID), I saw
> fsync system calls taking around 2500 seconds, which is insane. I'm not sure
> if this is specific to the way fio does its I/O pattern and the way XFS
> handles it. When I used 64k block sizes, the fio tests completed just fine.
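>
> (For anyone wanting to repeat that check, the invocation is along the lines of:
>    strace -f -T -p <brick process PID> 2>&1 | grep fsync
> where -T appends the time spent in each syscall in angle brackets, so slow
> fsync calls stand out. <brick process PID> is just a placeholder here.)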
>
>
>    1. The worst hangs and load spikes happen when we reboot one of the
>    gluster servers - not while it's down, but when it comes back online. Even
>    with gluster not showing anything pending heal, my guess is that it's still
>    doing lots of IO between the 4 nodes for some reason, but I don't
>    understand why.
>
> Do you kill all gluster processes (not just glusterd but even the brick
> processes) before issuing reboot? This is necessary to prevent I/O stalls.
> There is stop-all-gluster-processes.sh which should be available as a part
> of the gluster installation (maybe in /usr/share/glusterfs/scripts/) which
> you can use.  Can you check if this helps?
>
> Regards,
>
> Ravi
>
> I've been banging my head on the wall with this problem for months.
> Appreciate any feedback here.
>
> Thank you.
>
> gluster volume info below
>
> Volume Name: SNIP_data1
> Type: Replicate
> Volume ID: SNIP
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 4 = 4
> Transport-type: tcp
> Bricks:
> Brick1: nexus2:/mnt/SNIP_block1/SNIP_data1
> Brick2: forge:/mnt/SNIP_block1/SNIP_data1
> Brick3: hive:/mnt/SNIP_block1/SNIP_data1
> Brick4: citadel:/mnt/SNIP_block1/SNIP_data1
> Options Reconfigured:
> cluster.quorum-count: 1
> cluster.quorum-type: fixed
> network.ping-timeout: 5
> network.remote-dio: enable
> performance.rda-cache-limit: 256MB
> performance.readdir-ahead: on
> performance.parallel-readdir: on
> network.inode-lru-limit: 500000
> performance.md-cache-timeout: 600
> performance.cache-invalidation: on
> performance.stat-prefetch: on
> features.cache-invalidation-timeout: 600
> features.cache-invalidation: on
> cluster.readdir-optimize: on
> performance.io-thread-count: 32
> server.event-threads: 4
> client.event-threads: 4
> performance.read-ahead: off
> cluster.lookup-optimize: on
> performance.cache-size: 1GB
> cluster.self-heal-daemon: enable
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: on
> cluster.granular-entry-heal: enable
> cluster.data-self-heal-algorithm: full
>
>
> Sincerely,
> Artem
>
> --
> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
> <http://www.apkmirror.com/>, Illogical Robot LLC
> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>
>
> On Thu, Jul 23, 2020 at 12:08 AM Qing Wang <qw at g.clemson.edu> wrote:
>
>> Hi,
>>
>> I have one more question about Gluster linear scale-out performance,
>> specifically regarding the "write-behind off" case -- when "write-behind"
>> is off, with the same stripe volumes and other settings as posted earlier
>> in the thread, the storage I/O seems unrelated to the number of storage
>> nodes. In my experiment, whether I have 2 brick server nodes or 8 brick
>> server nodes, the aggregated gluster I/O performance is ~100MB/sec, and the
>> fio benchmark gives the same result. If "write-behind" is on, then
>> the storage performance does scale out linearly as the number of brick
>> server nodes increases.
>>
>> Whether the write-behind option is on or off, I thought the gluster I/O
>> performance should be pooled and aggregated together as a whole. If that is
>> the case, why do I get the same gluster performance (~100MB/sec) when
>> "write-behind" is off? Please advise me if I misunderstood something.
>>
>> Thanks,
>> Qing
>>
>>
>>
>>
>> On Tue, Jul 21, 2020 at 7:29 PM Qing Wang <qw at g.clemson.edu> wrote:
>>
>>> fio gives me the correct linear scale-out results, and you're right, the
>>> storage cache is the root cause of the inaccurate dd measurements.
>>>
>>> Thanks,
>>> Qing
>>>
>>>
>>> On Tue, Jul 21, 2020 at 2:53 PM Yaniv Kaul <ykaul at redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, 21 Jul 2020, 21:43 Qing Wang <qw at g.clemson.edu> wrote:
>>>>
>>>>> Hi Yaniv,
>>>>>
>>>>> Thanks for the quick response. I forgot to mention that I am testing the
>>>>> write performance, not reading. In this case, would the client cache hit
>>>>> rate still be a big issue?
>>>>>
>>>>
>>>> It's not hitting the storage directly. Since it's also single-threaded,
>>>> it may not saturate it either. I highly recommend testing properly.
>>>> Y.
>>>>
>>>>
>>>>> I'll use fio to run my test once again, thanks for the suggestion.
>>>>>
>>>>> Thanks,
>>>>> Qing
>>>>>
>>>>> On Tue, Jul 21, 2020 at 2:38 PM Yaniv Kaul <ykaul at redhat.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 21 Jul 2020, 21:30 Qing Wang <qw at g.clemson.edu> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to test Gluster linear scale-out performance by adding
>>>>>>> more storage servers/bricks and measuring the storage I/O performance. To
>>>>>>> vary the number of storage servers, I create several "stripe" volumes that
>>>>>>> contain 2 brick servers, 3 brick servers, 4 brick servers, and so on. On
>>>>>>> the gluster client side, I used "dd if=/dev/zero
>>>>>>> of=/mnt/glusterfs/dns_test_data_26g bs=1M count=26000" to create 26G of data
>>>>>>> (or larger), which is distributed across the corresponding
>>>>>>> gluster servers (each hosting a brick), and "dd" reports the final
>>>>>>> I/O throughput. The interconnect is 40G InfiniBand, although I didn't do any
>>>>>>> specific configuration to use its advanced features.
>>>>>>>
>>>>>>
>>>>>> Your dd command is inaccurate, as it'll hit the client cache. It is
>>>>>> also single threaded. I suggest switching to fio.
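>>>>>>
>>>>>> A minimal fio write test along those lines (illustrative values; adjust the
>>>>>> path, size, and job count) might look like:
>>>>>>
>>>>>> fio --name=seqwrite --directory=/mnt/glusterfs --ioengine=libaio --direct=1
>>>>>> --rw=write --bs=1M --size=4G --numjobs=4 --group_reporting
>>>>>>
>>>>>> --direct=1 bypasses the page cache (if the mount rejects O_DIRECT, --end_fsync=1
>>>>>> is a fallback that at least pulls the flush into the timing), and --numjobs
>>>>>> adds the parallelism that dd lacks.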
>>>>>> Y.
>>>>>>
>>>>>>
>>>>>>> What confuses me is that the storage I/O seems unrelated to the
>>>>>>> number of storage nodes, while the Gluster documentation says it should scale
>>>>>>> linearly. For example, when "write-behind" is on and InfiniBand "jumbo
>>>>>>> frames" (connected mode) are on, I get ~800 MB/sec reported by "dd" whether
>>>>>>> I have 2 brick servers or 8 brick servers -- in the 2-server case, each
>>>>>>> server handles ~400 MB/sec; in the 4-server case, each server handles
>>>>>>> ~200MB/sec. That said, the per-server I/O does aggregate to the final storage
>>>>>>> I/O (800 MB/sec), but this is not "linear scale-out".
>>>>>>>
>>>>>>> Can somebody help me understand why this is the case? I may well have
>>>>>>> some misunderstanding/misconfiguration here. Please correct me if
>>>>>>> I do, thanks!
>>>>>>>
>>>>>>> Best,
>>>>>>> Qing
>>>>>>> ________
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Community Meeting Calendar:
>>>>>>>
>>>>>>> Schedule -
>>>>>>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>>>>>>> Bridge: https://bluejeans.com/441850968
>>>>>>>
>>>>>>> Gluster-users mailing list
>>>>>>> Gluster-users at gluster.org
>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>
>
>