<div dir="ltr">Speaking of fio, could the gluster team please help me understand something?<div><br></div><div>We've been having lots of performance issues related to gluster using attached block storage on Linode. At some point, I figured out that Linode has a <a href="https://www.linode.com/community/questions/19437/does-a-dedicated-cpu-or-high-memory-plan-improve-disk-io-performance#answer-72142">cap of 500 IOPS on their block storage</a> (with spikes to 1500 IOPS). The block storage we use is formatted xfs with 4KB bsize (block size). </div><div><br></div><div>I then ran a bunch of fio tests on the block storage itself (not the gluster fuse mount), which performed horribly when the bs parameter was set to 4k: </div><div><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">fio</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--randrepeat=1</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--ioengine=libaio</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--direct=1</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span 
style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--gtod_reduce=1</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--name=test</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--filename=test</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--bs=4k</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--iodepth=64</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--size=4G</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--readwrite=randwrite</span><span 
style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)"> </span><span style="box-sizing:inherit;color:rgb(66,134,244);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">--ramp_time=4</span><span style="color:rgb(68,68,68);font-family:monospace;font-size:13px;white-space:pre-wrap;background-color:rgb(251,251,251)">
</span><br>During these tests, fio's ETA crawled to over an hour, at one point dropped to 45 min, and I briefly saw 500-1500 IOPS flash by before it fell back to 0. I/O appears heavily choked, likely because gluster is consuming part of it. At a 4k block size, transfer speed is 2 MB/s with spikes to 6 MB/s. This drives the server load up to 100+ and brings down all our servers.<pre style="box-sizing:inherit;background-color:rgb(251,251,251);font-size:1rem;line-height:1.3;overflow-x:auto;max-width:100%;color:rgb(96,100,105)"><code style="box-sizing:inherit;display:block;overflow-x:auto;padding:0.5em;color:rgb(68,68,68)"><span style="box-sizing:inherit">Jobs:</span> <span style="box-sizing:inherit;color:rgb(66,134,244)">1</span> <span style="box-sizing:inherit;color:rgb(66,134,244)">(f=1):</span> <span style="box-sizing:inherit;color:rgb(66,134,244)">[w(1)][20.3%][r=0KiB/s,w=5908KiB/s][r=0,w=1477</span> <span style="box-sizing:inherit;color:rgb(66,134,244)">IOPS][eta</span> <span style="box-sizing:inherit;color:rgb(66,134,244)">43m:00s]</span>
<span style="box-sizing:inherit">Jobs:</span> <span style="box-sizing:inherit;color:rgb(66,134,244)">1</span> <span style="box-sizing:inherit;color:rgb(66,134,244)">(f=1):</span> <span style="box-sizing:inherit;color:rgb(66,134,244)">[w(1)][21.5%][r=0KiB/s,w=0KiB/s][r=0,w=0</span> <span style="box-sizing:inherit;color:rgb(66,134,244)">IOPS][eta</span> <span style="box-sizing:inherit;color:rgb(66,134,244)">44m:54s]</span>
</code></pre><pre style="box-sizing:inherit;background-color:rgb(251,251,251);font-size:1rem;line-height:1.3;overflow-x:auto;max-width:100%;color:rgb(96,100,105)"><code style="box-sizing:inherit;display:block;overflow-x:auto;padding:0.5em;color:rgb(68,68,68)">xfs_info /mnt/citadel_block1
meta-data=/dev/sdc isize=512 agcount=103, agsize=26214400 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0, rmapbt=0
= reflink=0
data = bsize=4096 blocks=2684354560, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
<span style="box-sizing:inherit;color:rgb(37,198,198)">log</span> =internal <span style="box-sizing:inherit;color:rgb(37,198,198)">log</span> bsize=4096 blocks=51200, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0</code></pre><div><div dir="ltr" data-smartmail="gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>When I increase fio's --bs param from 4k to, say, 64k, transfer speed goes up significantly, to more like 50MB/s, and at 256k it's 200MB/s.</div><div dir="ltr"><br></div><div>So what I'm trying to understand is:</div><div><ol><li>How does the xfs block size (4KB) relate to the --bs block size in fio tests? If we're limited by IOPS and the xfs block size is 4KB, how can fio produce better results as the --bs param varies?</li><li>Would increasing the xfs data block size to something like 64-256KB help with our choked I/O and skyrocketing load?</li><li>The worst hangs and load spikes happen when we reboot one of the gluster servers: not while it's down, but when it comes back online. Even when gluster shows nothing pending heal, my guess is it's still doing lots of I/O between the 4 nodes for some reason, but I don't understand why.</li></ol></div><div>I've been banging my head against the wall on this problem for months. I'd appreciate any feedback here.</div><div dir="ltr"><br></div><div>Thank you.</div><div><br></div><div>gluster volume info is below.</div><div><pre style="box-sizing:border-box;font-family:SFMono-Regular,Consolas,"Liberation Mono",Menlo,monospace;font-size:11.9px;margin-top:0px;margin-bottom:16px;max-height:none;overflow:auto;padding:16px;line-height:1.45;background-color:rgb(246,248,250);border-radius:6px;color:rgb(36,41,46)"><code style="box-sizing:border-box;font-family:SFMono-Regular,Consolas,"Liberation Mono",Menlo,monospace;padding:0px;margin:0px;background:initial;border-radius:6px;word-break:normal;border:0px;display:inline;overflow:visible;line-height:inherit">Volume Name: SNIP_data1
Type: Replicate
Volume ID: SNIP
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: nexus2:/mnt/SNIP_block1/SNIP_data1
Brick2: forge:/mnt/SNIP_block1/SNIP_data1
Brick3: hive:/mnt/SNIP_block1/SNIP_data1
Brick4: citadel:/mnt/SNIP_block1/SNIP_data1
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
network.ping-timeout: 5
network.remote-dio: enable
performance.rda-cache-limit: 256MB
performance.readdir-ahead: on
performance.parallel-readdir: on
network.inode-lru-limit: 500000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
cluster.readdir-optimize: on
performance.io-thread-count: 32
server.event-threads: 4
client.event-threads: 4
performance.read-ahead: off
cluster.lookup-optimize: on
performance.cache-size: 1GB
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full</code></pre></div><div dir="ltr"><br>Sincerely,<br>Artem<br><br>--<br>Founder, <a href="http://www.androidpolice.com" target="_blank">Android Police</a>, <a href="http://www.apkmirror.com/" style="font-size:12.8px" target="_blank">APK Mirror</a><span style="font-size:12.8px">, Illogical Robot LLC</span></div><div dir="ltr"><a href="http://beerpla.net/" target="_blank">beerpla.net</a> | <a href="http://twitter.com/ArtemR" target="_blank">@ArtemR</a><br></div></div></div></div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jul 23, 2020 at 12:08 AM Qing Wang <<a href="mailto:qw@g.clemson.edu" target="_blank">qw@g.clemson.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div></div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif;font-size:13px">Hi, </div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif;font-size:13px"><br></div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif;font-size:13px">I have one more question about Gluster linear scale-out performance, specifically the "write-behind off" case -- with "write-behind" off, and still using the stripe volumes and other settings posted earlier in the thread, the <span style="font-size:small">storage I/O seems unrelated to the number of storage nodes. In my experiment, whether I have 2 brick server nodes or 8, the aggregated gluster I/O performance is ~100MB/sec, and fio benchmark measurements give the same result. 
If "write-behind" is on, the storage performance does scale out linearly as the number of brick server nodes increases. </span></div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif;font-size:13px"><span style="font-size:small"><br></span></div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif">Whether the write-behind option is on or off, I thought the gluster I/O performance should be pooled and aggregated as a whole. If that is the case, why do I get the same gluster performance (~100MB/sec) when "write-behind" is off? Please advise if I've misunderstood something. </div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif"><br></div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif">Thanks,</div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif">Qing </div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif"><br></div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif"><br></div></div></div></div></div></div></div></div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 21, 2020 at 7:29 PM Qing Wang <<a href="mailto:qw@g.clemson.edu" target="_blank">qw@g.clemson.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif;font-size:13px">fio gives me the correct linear scale-out results, and you're right: the storage cache is why the dd measurements were not accurate at all. 
</div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif;font-size:13px"><br></div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif;font-size:13px">Thanks,</div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif;font-size:13px">Qing </div></div></div></div></div></div></div></div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 21, 2020 at 2:53 PM Yaniv Kaul <<a href="mailto:ykaul@redhat.com" target="_blank">ykaul@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 21 Jul 2020, 21:43 Qing Wang <<a href="mailto:qw@g.clemson.edu" target="_blank">qw@g.clemson.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Yaniv,<div><br></div><div>Thanks for the quick response. I forgot to mention that I am testing write performance, not read performance. In this case, would the client cache hit rate still be a big issue? </div></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">It's not hitting the storage directly. Since it's also single threaded, it may also not saturate it. I highly recommend testing properly. </div><div dir="auto">Y. </div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br></div><div>I'll run my test again with fio, thanks for the suggestion. 
</div><div><br></div><div>Thanks,</div><div>Qing </div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 21, 2020 at 2:38 PM Yaniv Kaul <<a href="mailto:ykaul@redhat.com" rel="noreferrer" target="_blank">ykaul@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 21 Jul 2020, 21:30 Qing Wang <<a href="mailto:qw@g.clemson.edu" rel="noreferrer" target="_blank">qw@g.clemson.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif;font-size:13px">Hi, </div><div style="margin:0px;padding:0px;border:0px;font-family:Arial,Helvetica,sans-serif;font-size:13px"><br></div><div style="margin:0px;padding:0px;border:0px">I am trying to test Gluster linear scale-out performance by adding more storage servers/bricks and measuring the storage I/O performance. To vary the number of storage servers, I create several "stripe" volumes that contain 2 brick servers, 3 brick servers, 4 brick servers, and so on. On the gluster client side, I used "dd if=/dev/zero of=/mnt/glusterfs/dns_test_data_26g bs=1M count=26000" to create 26G of data (or larger); the data is distributed to the corresponding gluster servers (each with a gluster brick on it), and "dd" reports the final I/O throughput. The network is 40G InfiniBand, although I didn't do any specific configuration to use its advanced features. 
<br></div></div></div></div></div></div></div></div></div></div></div></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Your dd command is inaccurate, as it'll hit the client cache. It is also single threaded. I suggest switching to fio. </div><div dir="auto">Y. </div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div style="margin:0px;padding:0px;border:0px"></div><div style="margin:0px;padding:0px;border:0px"><br></div><div style="margin:0px;padding:0px;border:0px">What confuses me is that the storage I/O seems unrelated to the number of storage nodes, even though the Gluster documentation says it should scale linearly. For example, with "write-behind" on and InfiniBand "jumbo frames" (connected mode) on, I can get ~800 MB/sec reported by "dd" whether I have 2 brick servers or 8 -- with 2 servers, each serves ~400 MB/sec; with 4 servers, each serves ~200MB/sec. That said, each server's I/O does aggregate into the final storage I/O (800 MB/sec), but this is not "linear scale-out". </div><div style="margin:0px;padding:0px;border:0px"><br></div><div style="margin:0px;padding:0px;border:0px">Can somebody help me understand why this is the case? I may well have some misunderstanding/misconfiguration here. Please correct me if I do, thanks! </div><div style="margin:0px;padding:0px;border:0px"><br></div><div style="margin:0px;padding:0px;border:0px">Best,</div><div style="margin:0px;padding:0px;border:0px">Qing</div></div></div></div></div></div></div></div></div></div></div></div>
________<br>
<br>
<br>
<br>
Community Meeting Calendar:<br>
<br>
Schedule -<br>
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC<br>
Bridge: <a href="https://bluejeans.com/441850968" rel="noreferrer noreferrer noreferrer" target="_blank">https://bluejeans.com/441850968</a><br>
<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org" rel="noreferrer noreferrer" target="_blank">Gluster-users@gluster.org</a><br>
<a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer noreferrer noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a><br>
</blockquote></div></div></div>
</blockquote></div>
</blockquote></div></div></div>
</blockquote></div>
</blockquote></div>
</blockquote></div>
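The block-size question at the top of the thread largely comes down to arithmetic: a provider-side IOPS cap limits requests per second, so the throughput ceiling is roughly request size times IOPS, independent of the 4KB xfs bsize (the xfs bsize is only the filesystem's allocation unit; fio's --bs sets the size of each submitted request). A minimal sketch of that model follows; the 500/1500 IOPS figures come from the Linode answer linked above, while the assumption that the cap is accounted per request regardless of request size is mine, not something Linode documents.

```python
# Back-of-the-envelope model: with a fixed IOPS budget, the throughput
# ceiling is (request size) x (IOPS). Cap values are from the Linode
# answer linked at the top of the thread; per-request accounting is an
# assumption, not a documented guarantee.

IOPS_SUSTAINED = 500
IOPS_BURST = 1500

def ceiling_mb_s(request_bytes: int, iops: int) -> float:
    """Throughput ceiling in MB/s for a given I/O request size and IOPS budget."""
    return request_bytes * iops / 1e6

if __name__ == "__main__":
    for bs_kb in (4, 64, 256):
        bs = bs_kb * 1024
        print(f"--bs={bs_kb:>3}k  sustained ~{ceiling_mb_s(bs, IOPS_SUSTAINED):7.1f} MB/s"
              f"  burst ~{ceiling_mb_s(bs, IOPS_BURST):7.1f} MB/s")
```

On these numbers, --bs=4k tops out at ~2 MB/s sustained and ~6 MB/s burst, matching the 2-6 MB/s seen in the fio run above, while 64k gives ~33-98 MB/s and 256k ~131-393 MB/s, bracketing the observed 50 and 200 MB/s. With a fixed request budget, bigger requests simply move more bytes per second.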