[Gluster-users] Performance tuning suggestions for nvme on aws

Michael Richardson hello at mikerichardson.com.au
Sun Jan 5 01:05:26 UTC 2020


Hi all!

I'm experimenting with GlusterFS for the first time and have built a simple
three-node cluster using AWS 'i3en' type instances. These instances provide
raw nvme devices that are incredibly fast.

What I'm finding in these tests is that gluster delivers only a fraction of
the raw nvme performance in a replica-3 set (i.e. three nodes with one brick
each). I'm wondering if there is anything I can do to squeeze more
performance out.
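
For reference, the volume was set up along these lines (hostnames, volume
name and brick paths are placeholders rather than my exact layout; each
brick sits on a local filesystem created on the instance's nvme device):

# from one node, after peering the other two
$ gluster peer probe server2
$ gluster peer probe server3
$ gluster volume create gv0 replica 3 \
    server1:/data/brick1/gv0 \
    server2:/data/brick1/gv0 \
    server3:/data/brick1/gv0
$ gluster volume start gv0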

For testing, I'm running fio against a 16GB test file with a 75/25
read/write split. Basically I'm trying to approximate the I/O pattern of a
MySQL database, which is what I'd ideally like to host here (and which I
realise is probably not practical).

My fio test command is:
$ fio --name=fio-test2 --filename=fio-test \
--randrepeat=1 \
--ioengine=libaio \
--direct=1 \
--runtime=300 \
--bs=16k \
--iodepth=64 \
--size=16G \
--readwrite=randrw \
--rwmixread=75 \
--group_reporting \
--numjobs=4
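
The same command is used for both runs below; roughly speaking, only the
working directory changes (the mount points here are placeholders for
wherever the raw-nvme filesystem and the gluster FUSE mount live):

# directly on the filesystem backed by the raw nvme device
$ cd /mnt/nvme && fio --name=fio-test2 --filename=fio-test ... (options as above)

# on the FUSE mount of the gluster volume
$ cd /mnt/gv0 && fio --name=fio-test2 --filename=fio-test ... (options as above)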

When I test this command directly on the nvme disk, I get:

   READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s),
io=47.0GiB (51.5GB), run=156806-156806msec
  WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s),
io=16.0GiB (17.2GB), run=156806-156806msec

When I use the same disk as a brick in a gluster replica-3 volume, I get:

   READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s
(90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
  WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s
(30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec

If I do the same but with only 2 replicas, I get the same performance
results. I also get roughly the same values when running pure 'read',
'randread', 'write', and 'randwrite' tests.

I'm testing directly on one of the storage nodes, so there are no
variables like client/server network performance in the mix.
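
For clarity, the gluster runs go through a local FUSE mount of the volume
on that same node, i.e. something like this (volume name and mount point
are placeholders):

$ mount -t glusterfs localhost:/gv0 /mnt/gv0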

I ran the same test with EBS volumes and saw similar performance drops
when serving the volume through gluster. A "Provisioned IOPS" EBS volume
that could deliver 10,000 IOPS directly was getting only about 3,500 IOPS
when used as part of a gluster volume.

We're using TLS on both the management and data (volume) connections, but
I'm not seeing any CPU or memory pressure while running these tests, so I
don't believe encryption is the bottleneck. When I repeat the test with
SSL turned off, I see no change in performance either.
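
For what it's worth, SSL on the I/O path was toggled with the standard
volume options, along these lines (volume name is a placeholder):

$ gluster volume set gv0 client.ssl off
$ gluster volume set gv0 server.ssl off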

Does anyone have any suggestions on things I might try to increase
performance when using these very fast disks as part of a gluster volume,
or is this to be expected given all the extra work gluster has to do to
replicate data between bricks?

Thanks very much for your time!! I've put the two full fio outputs
below in case anyone wants more details.

Mike


- First full fio test, nvme device without gluster

fio-test: (groupid=0, jobs=4): err= 0: pid=5636: Sat Jan  4 23:09:18 2020
  read: IOPS=20.0k, BW=313MiB/s (328MB/s)(47.0GiB/156806msec)
    slat (usec): min=3, max=6476, avg=88.44, stdev=326.96
    clat (usec): min=218, max=89292, avg=11141.58, stdev=1871.14
     lat (usec): min=226, max=89311, avg=11230.16, stdev=1883.88
    clat percentiles (usec):
     |  1.00th=[ 3654],  5.00th=[ 8455], 10.00th=[ 9372], 20.00th=[10159],
     | 30.00th=[10552], 40.00th=[10814], 50.00th=[11076], 60.00th=[11338],
     | 70.00th=[11731], 80.00th=[12256], 90.00th=[13042], 95.00th=[13960],
     | 99.00th=[15795], 99.50th=[16581], 99.90th=[19268], 99.95th=[23200],
     | 99.99th=[36439]
   bw (  KiB/s): min=75904, max=257120, per=25.00%, avg=80178.59,
stdev=9421.58, samples=1252
   iops        : min= 4744, max=16070, avg=5011.15, stdev=588.85, samples=1252
  write: IOPS=6702, BW=105MiB/s (110MB/s)(16.0GiB/156806msec); 0 zone resets
    slat (usec): min=4, max=5587, avg=88.52, stdev=325.86
    clat (usec): min=54, max=29847, avg=4491.18, stdev=1481.06
     lat (usec): min=63, max=29859, avg=4579.83, stdev=1508.50
    clat percentiles (usec):
     |  1.00th=[  947],  5.00th=[ 1975], 10.00th=[ 2737], 20.00th=[ 3458],
     | 30.00th=[ 3916], 40.00th=[ 4178], 50.00th=[ 4424], 60.00th=[ 4686],
     | 70.00th=[ 5014], 80.00th=[ 5473], 90.00th=[ 6259], 95.00th=[ 6980],
     | 99.00th=[ 8717], 99.50th=[ 9503], 99.90th=[10945], 99.95th=[11600],
     | 99.99th=[13698]
   bw (  KiB/s): min=23296, max=86432, per=25.00%, avg=26812.24,
stdev=3375.69, samples=1252
   iops        : min= 1456, max= 5402, avg=1675.75, stdev=210.98, samples=1252
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.06%, 750=0.11%, 1000=0.10%
  lat (msec)   : 2=1.12%, 4=7.69%, 10=28.88%, 20=61.95%, 50=0.06%
  lat (msec)   : 100=0.01%
  cpu          : usr=1.56%, sys=7.85%, ctx=1905114, majf=0, minf=56
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=3143262,1051042,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s),
io=47.0GiB (51.5GB), run=156806-156806msec
  WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s),
io=16.0GiB (17.2GB), run=156806-156806msec

Disk stats (read/write):
    dm-4: ios=3455484/1154933, merge=0/0, ticks=35815316/4420412,
in_queue=40257384, util=100.00%, aggrios=3456894/1155354,
aggrmerge=0/0, aggrticks=35806896/4414972, aggrin_queue=40309192,
aggrutil=99.99%
    dm-2: ios=3456894/1155354, merge=0/0, ticks=35806896/4414972,
in_queue=40309192, util=99.99%, aggrios=1728447/577677, aggrmerge=0/0,
aggrticks=17902352/2207092, aggrin_queue=20122108, aggrutil=100.00%
    dm-1: ios=3456894/1155354, merge=0/0, ticks=35804704/4414184,
in_queue=40244216, util=100.00%, aggrios=3143273/1051086,
aggrmerge=313621/104268, aggrticks=32277972/3937619,
aggrin_queue=36289488, aggrutil=100.00%
  nvme0n1: ios=3143273/1051086, merge=313621/104268,
ticks=32277972/3937619, in_queue=36289488, util=100.00%
  dm-0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

- Second full fio test, nvme device as part of a gluster volume

fio-test2: (groupid=0, jobs=4): err= 0: pid=5537: Sat Jan  4 23:30:28 2020
  read: IOPS=5525, BW=86.3MiB/s (90.5MB/s)(25.3GiB/300002msec)
    slat (nsec): min=1159, max=894687k, avg=9822.60, stdev=990825.87
    clat (usec): min=963, max=3141.5k, avg=37455.28, stdev=123109.88
     lat (usec): min=968, max=3141.5k, avg=37465.21, stdev=123121.94
    clat percentiles (msec):
     |  1.00th=[    7],  5.00th=[    8], 10.00th=[    8], 20.00th=[    9],
     | 30.00th=[    9], 40.00th=[    9], 50.00th=[   10], 60.00th=[   10],
     | 70.00th=[   11], 80.00th=[   12], 90.00th=[   48], 95.00th=[  180],
     | 99.00th=[  642], 99.50th=[  860], 99.90th=[ 1435], 99.95th=[ 1687],
     | 99.99th=[ 2022]
   bw (  KiB/s): min=   31, max=93248, per=26.30%, avg=23247.24,
stdev=20716.86, samples=2280
   iops        : min=    1, max= 5828, avg=1452.92, stdev=1294.81, samples=2280
  write: IOPS=1850, BW=28.9MiB/s (30.3MB/s)(8676MiB/300002msec); 0 zone resets
    slat (usec): min=21, max=1586.3k, avg=2117.71, stdev=23082.86
    clat (usec): min=20, max=2614.0k, avg=23888.03, stdev=99651.34
     lat (usec): min=225, max=3141.2k, avg=26006.49, stdev=104758.57
    clat percentiles (usec):
     |  1.00th=[    889],  5.00th=[   2343], 10.00th=[   3654],
     | 20.00th=[   5276], 30.00th=[   5997], 40.00th=[   6456],
     | 50.00th=[   6849], 60.00th=[   7177], 70.00th=[   7504],
     | 80.00th=[   7963], 90.00th=[   8979], 95.00th=[  74974],
     | 99.00th=[ 513803], 99.50th=[ 717226], 99.90th=[1333789],
     | 99.95th=[1518339], 99.99th=[1803551]
   bw (  KiB/s): min=   31, max=30240, per=27.05%, avg=8009.39,
stdev=6912.26, samples=2217
   iops        : min=    1, max= 1890, avg=500.56, stdev=432.02, samples=2217
  lat (usec)   : 50=0.03%, 100=0.02%, 250=0.01%, 500=0.06%, 750=0.08%
  lat (usec)   : 1000=0.11%
  lat (msec)   : 2=0.66%, 4=1.97%, 10=71.07%, 20=14.47%, 50=2.69%
  lat (msec)   : 100=2.23%, 250=3.21%, 500=1.94%, 750=0.82%, 1000=0.31%
  cpu          : usr=0.59%, sys=1.19%, ctx=1172180, majf=0, minf=56
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=1657579,555275,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s
(90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
  WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s
(30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec