<div dir="ltr">Hi all!<div><br></div><div>I'm experimenting with GFS for the first time have built a simple three-node cluster using AWS 'i3en' type instances. These instances provide raw nvme devices that are incredibly fast. </div><div><br></div><div>What I'm finding in these tests is that gluster is offering only a fraction of the raw nvme performance in a 3 replica set (ie, 3 nodes with 1 brick each). I'm wondering if there is anything I can do to squeeze more performance out. </div><div><br></div><div>For testing, I'm running fio using a 16GB test file with a 75/25 read/write split. Basically I'm trying to replicate a MySQL database which is what I'd ideally like to host here (which I realise is probably not practical). </div><div><br></div><div>My fio test command is: </div><div>$ fio --name=fio-test2 --filename=fio-test \<br>--randrepeat=1 \<br>--ioengine=libaio \<br>--direct=1 \<br>--runtime=300 \<br>--bs=16k \<br>--iodepth=64 \<br>--size=16G \<br>--readwrite=randrw \<br>--rwmixread=75 \<br>--group_reporting \<br>--numjobs=4<br></div><div><br></div><div>When I test this command directly on the nvme disk, I get: </div><div><pre><code> READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=47.0GiB (51.5GB), run=156806-156806msec
WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=16.0GiB (17.2GB), run=156806-156806msec
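For reference, the 3-replica volume mentioned above was created and mounted roughly like this; the hostnames, volume name, and brick paths are placeholders rather than the exact ones I used:

# create a 3-way replica volume with one nvme-backed brick per node
$ gluster volume create testvol replica 3 \
    node1:/bricks/nvme0/brick node2:/bricks/nvme0/brick node3:/bricks/nvme0/brick
$ gluster volume start testvol

# mount it locally on the storage node I run fio from
$ mount -t glusterfs localhost:/testvol /mnt/testvol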
When I install the disk into a gluster 3-replica volume, I get:

 READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s (90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s (30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec
If I do the same but with only 2 replicas, I get the same performance results. I also get roughly the same values when doing 'read', 'randread', 'write', and 'randwrite' tests.

I'm testing directly on one of the storage nodes, so there are no variables like client/server network performance in the mix.

I ran the same test with EBS volumes and saw similar performance drops when offering up the volume through gluster. A "Provisioned IOPS" EBS volume that could deliver 10,000 IOPS directly was getting only about 3,500 IOPS when running as part of a gluster volume.

We're using TLS on the management and volume connections, but I'm not seeing any CPU or memory constraint when using these volumes, so I don't believe that is the bottleneck. Similarly, when I try with SSL turned off (toggled as shown below), I see no change in performance.
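In case it matters, this is roughly how I'm turning SSL off on the data path for that comparison (volume name is the placeholder from above; the management connection is controlled separately via the /var/lib/glusterd/secure-access file):

$ gluster volume stop testvol
$ gluster volume set testvol client.ssl off
$ gluster volume set testvol server.ssl off
$ gluster volume start testvol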
Does anyone have any suggestions on things I might try to increase performance when using these very fast disks as part of a gluster volume, or is this to be expected when factoring in all the extra work that gluster needs to do to replicate data across the volume?

Thanks very much for your time!! I'll put the two full fio outputs below if anyone wants more details.

Mike

- First full fio test, nvme device without gluster

fio-test: (groupid=0, jobs=4): err= 0: pid=5636: Sat Jan 4 23:09:18 2020
read: IOPS=20.0k, BW=313MiB/s (328MB/s)(47.0GiB/156806msec)
slat (usec): min=3, max=6476, avg=88.44, stdev=326.96
clat (usec): min=218, max=89292, avg=11141.58, stdev=1871.14
lat (usec): min=226, max=89311, avg=11230.16, stdev=1883.88
clat percentiles (usec):
| 1.00th=[ 3654], 5.00th=[ 8455], 10.00th=[ 9372], 20.00th=[10159],
| 30.00th=[10552], 40.00th=[10814], 50.00th=[11076], 60.00th=[11338],
| 70.00th=[11731], 80.00th=[12256], 90.00th=[13042], 95.00th=[13960],
| 99.00th=[15795], 99.50th=[16581], 99.90th=[19268], 99.95th=[23200],
| 99.99th=[36439]
bw ( KiB/s): min=75904, max=257120, per=25.00%, avg=80178.59, stdev=9421.58, samples=1252
iops : min= 4744, max=16070, avg=5011.15, stdev=588.85, samples=1252
write: IOPS=6702, BW=105MiB/s (110MB/s)(16.0GiB/156806msec); 0 zone resets
slat (usec): min=4, max=5587, avg=88.52, stdev=325.86
clat (usec): min=54, max=29847, avg=4491.18, stdev=1481.06
lat (usec): min=63, max=29859, avg=4579.83, stdev=1508.50
clat percentiles (usec):
| 1.00th=[ 947], 5.00th=[ 1975], 10.00th=[ 2737], 20.00th=[ 3458],
| 30.00th=[ 3916], 40.00th=[ 4178], 50.00th=[ 4424], 60.00th=[ 4686],
| 70.00th=[ 5014], 80.00th=[ 5473], 90.00th=[ 6259], 95.00th=[ 6980],
| 99.00th=[ 8717], 99.50th=[ 9503], 99.90th=[10945], 99.95th=[11600],
| 99.99th=[13698]
bw ( KiB/s): min=23296, max=86432, per=25.00%, avg=26812.24, stdev=3375.69, samples=1252
iops : min= 1456, max= 5402, avg=1675.75, stdev=210.98, samples=1252
lat (usec) : 100=0.01%, 250=0.01%, 500=0.06%, 750=0.11%, 1000=0.10%
lat (msec) : 2=1.12%, 4=7.69%, 10=28.88%, 20=61.95%, 50=0.06%
lat (msec) : 100=0.01%
cpu : usr=1.56%, sys=7.85%, ctx=1905114, majf=0, minf=56
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=3143262,1051042,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=47.0GiB (51.5GB), run=156806-156806msec
WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=16.0GiB (17.2GB), run=156806-156806msec
Disk stats (read/write):
dm-4: ios=3455484/1154933, merge=0/0, ticks=35815316/4420412, in_queue=40257384, util=100.00%, aggrios=3456894/1155354, aggrmerge=0/0, aggrticks=35806896/4414972, aggrin_queue=40309192, aggrutil=99.99%
dm-2: ios=3456894/1155354, merge=0/0, ticks=35806896/4414972, in_queue=40309192, util=99.99%, aggrios=1728447/577677, aggrmerge=0/0, aggrticks=17902352/2207092, aggrin_queue=20122108, aggrutil=100.00%
dm-1: ios=3456894/1155354, merge=0/0, ticks=35804704/4414184, in_queue=40244216, util=100.00%, aggrios=3143273/1051086, aggrmerge=313621/104268, aggrticks=32277972/3937619, aggrin_queue=36289488, aggrutil=100.00%
nvme0n1: ios=3143273/1051086, merge=313621/104268, ticks=32277972/3937619, in_queue=36289488, util=100.00%
dm-0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

- Second full fio test, nvme device as part of a gluster volume

fio-test2: (groupid=0, jobs=4): err= 0: pid=5537: Sat Jan 4 23:30:28 2020
read: IOPS=5525, BW=86.3MiB/s (90.5MB/s)(25.3GiB/300002msec)
slat (nsec): min=1159, max=894687k, avg=9822.60, stdev=990825.87
clat (usec): min=963, max=3141.5k, avg=37455.28, stdev=123109.88
lat (usec): min=968, max=3141.5k, avg=37465.21, stdev=123121.94
clat percentiles (msec):
| 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 8], 20.00th=[ 9],
| 30.00th=[ 9], 40.00th=[ 9], 50.00th=[ 10], 60.00th=[ 10],
| 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 48], 95.00th=[ 180],
| 99.00th=[ 642], 99.50th=[ 860], 99.90th=[ 1435], 99.95th=[ 1687],
| 99.99th=[ 2022]
bw ( KiB/s): min= 31, max=93248, per=26.30%, avg=23247.24, stdev=20716.86, samples=2280
iops : min= 1, max= 5828, avg=1452.92, stdev=1294.81, samples=2280
write: IOPS=1850, BW=28.9MiB/s (30.3MB/s)(8676MiB/300002msec); 0 zone resets
slat (usec): min=21, max=1586.3k, avg=2117.71, stdev=23082.86
clat (usec): min=20, max=2614.0k, avg=23888.03, stdev=99651.34
lat (usec): min=225, max=3141.2k, avg=26006.49, stdev=104758.57
clat percentiles (usec):
| 1.00th=[ 889], 5.00th=[ 2343], 10.00th=[ 3654],
| 20.00th=[ 5276], 30.00th=[ 5997], 40.00th=[ 6456],
| 50.00th=[ 6849], 60.00th=[ 7177], 70.00th=[ 7504],
| 80.00th=[ 7963], 90.00th=[ 8979], 95.00th=[ 74974],
| 99.00th=[ 513803], 99.50th=[ 717226], 99.90th=[1333789],
| 99.95th=[1518339], 99.99th=[1803551]
bw ( KiB/s): min= 31, max=30240, per=27.05%, avg=8009.39, stdev=6912.26, samples=2217
iops : min= 1, max= 1890, avg=500.56, stdev=432.02, samples=2217
lat (usec) : 50=0.03%, 100=0.02%, 250=0.01%, 500=0.06%, 750=0.08%
lat (usec) : 1000=0.11%
lat (msec) : 2=0.66%, 4=1.97%, 10=71.07%, 20=14.47%, 50=2.69%
lat (msec) : 100=2.23%, 250=3.21%, 500=1.94%, 750=0.82%, 1000=0.31%
cpu : usr=0.59%, sys=1.19%, ctx=1172180, majf=0, minf=56
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=1657579,555275,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s (90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s (30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec