[Bugs] [Bug 1467614] Gluster read/write performance improvements on NVMe backend

bugzilla at redhat.com bugzilla at redhat.com
Thu Oct 26 11:35:37 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1467614



--- Comment #35 from Manoj Pillai <mpillai at redhat.com> ---
I ran some tests to understand the IOPS limits on the client side and on the
brick side. Updating the bz with the results.

Results are based on: glusterfs-3.12.1-2.el7.x86_64

First, given that NVMe-equipped servers are not that easy to get in our lab, I
wanted to see if I could use a ram disk instead as a fast brick device to
identify bottlenecks in the gluster software stack. That works well, IMO. Runs
here were done with the brick on a ramdisk, and the results are similar to what
Krutika has been reporting on the NVMe drive.

Server-side setup:
# create the ramdisk of size 16g:
modprobe brd rd_nr=1 rd_size=16777216 max_part=0
# check with "ls -l /dev/ram*"

# create a single-brick volume on ramdisk
mkfs.xfs -f -i size=512 /dev/ram0
mount -t xfs /dev/ram0 /mnt/rhs_brick1

gluster v create perfvol ${server}:/mnt/rhs_brick1 force
gluster v start perfvol

Since this is a random I/O test, the volume is tuned with these settings
(applied with "gluster v set"; see the sketch after the list):
performance.strict-o-direct: on
network.remote-dio: disable
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.io-cache: off
performance.write-behind: off
performance.client-io-threads: on
performance.io-thread-count: 4
performance.read-ahead: off
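
For reference, a sketch of how these options would be applied, using "gluster v
set" on the volume created above:

gluster v set perfvol performance.strict-o-direct on
gluster v set perfvol network.remote-dio disable
gluster v set perfvol cluster.lookup-optimize on
gluster v set perfvol server.event-threads 4
gluster v set perfvol client.event-threads 4
gluster v set perfvol performance.io-cache off
gluster v set perfvol performance.write-behind off
gluster v set perfvol performance.client-io-threads on
gluster v set perfvol performance.io-thread-count 4
gluster v set perfvol performance.read-ahead off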

I have up to 4 clients that connect to the server over a 10GbE link.

Test steps (a command sketch follows the list):
* mount the volume on each client on /mnt/glustervol
* create a separate directory for each client: /mnt/glustervol/<hostname>
* prepare for fio dist test: "fio --server --daemonize=/var/run/fio-svr.pid"
* create data set using fio seq. write test
* sync and drop caches all around
* perform fio randread test
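
A sketch of the corresponding commands on each client (mount options are the
defaults; the cache drop is the usual one and is done on the server as well):

mount -t glusterfs ${server}:/perfvol /mnt/glustervol
mkdir -p /mnt/glustervol/$HOSTNAME
fio --server --daemonize=/var/run/fio-svr.pid
# between the seq. write and the randread:
sync
echo 3 > /proc/sys/vm/drop_caches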

Single-client results:

command and job file for fio seq. write test to create a 12g data set:

fio --output=out.fio.write --client=<hosts file> job.fio.write

[global]
rw=write
create_on_open=1
fsync_on_close=1
bs=1024k
startdelay=0
ioengine=sync

[seqwrite]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=1
numjobs=24
nrfiles=1
openfiles=1
filesize=512m
size=512m
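
The <hosts file> passed to --client is just a list of the client hostnames, one
per line (the names below are placeholders); for the multi-client runs it
simply has more lines:

client1.lab.example.com
client2.lab.example.com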

job file for fio randread test (reads 25% of previously written data):
[global]
rw=randread
startdelay=0
ioengine=sync
direct=1
bs=4k

[randread]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=1
numjobs=24
nrfiles=1
openfiles=1
filesize=512m
size=512m
io_size=128m
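
The randread job is launched the same way as the write job; the file names
below are just illustrative:

fio --output=out.fio.randread --client=<hosts file> job.fio.randread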

Result for above case (24 concurrent jobs):
read: IOPS=22.9k, BW=89.5Mi (93.9M)(3072MiB/34309msec)

The IOPS number is in the ballpark of what was reported in comment #27 for the
single-client, single-brick case.

Result for 48 concurrent jobs:
[changed job file so numjobs=48, but filesize is proportionately reduced so
that total data set size is 12g. Note: the brick is a 16g ram disk]
read: IOPS=23.1k, BW=90.4Mi (94.8M)(3072MiB/33985msec)

So we are getting only a fraction of the IOPS that the brick device is capable
of, and increasing the number of jobs doesn't help -- IOPS stays close to 23k.
Tuning event-threads and io-thread-count doesn't help either -- IOPS stays
stuck at around 23k.

The question is: is the bottleneck on the brick side or the client side?
Probably the client side. To test that, we'll run the test with 4 clients. If
the brick is the bottleneck, IOPS should stay the same. If not, IOPS should
increase, ideally to 4x.

Result with 4 clients, single brick:
[fio filesize, size and io_size are adjusted to keep the data set at 12g]
read: IOPS=66.4k, BW=259Mi (272M)(3072MiB/11845msec)
For this test, I did change io-thread-count to 8.

So IOPS increases, but not 4x. This tells me that the bottleneck responsible
for the 23k IOPS limit in the single-client, single-brick test is on the client
side. There is a bottleneck at the brick as well, but we hit it at a higher
IOPS. Had there been no bottleneck on the brick, the IOPS in the 4-client test
would have been close to 4x23k, i.e. around 90k. To confirm that, I ran the
test again, this time with 2 bricks (each on a separate ramdisk).
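
For reference, a sketch of the 2-brick setup, along the same lines as the
single-brick one (assuming both ramdisks are on the same server; brick paths
are illustrative):

# reload brd with two ramdisks (rmmod brd first if it is already loaded)
modprobe brd rd_nr=2 rd_size=16777216 max_part=0
mkfs.xfs -f -i size=512 /dev/ram0
mkfs.xfs -f -i size=512 /dev/ram1
mount -t xfs /dev/ram0 /mnt/rhs_brick1
mount -t xfs /dev/ram1 /mnt/rhs_brick2
gluster v create perfvol ${server}:/mnt/rhs_brick1 ${server}:/mnt/rhs_brick2 force
gluster v start perfvol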

Results with 4 clients, 2-brick distribute volume:
[io-thread-count is set to 8 in this test; with io-thread-count at 4, I was
getting only about 76k IOPS]
read: IOPS=87.8k, BW=343Mi (360M)(3072MiB/8953msec)

To summarize, from these tests it looks like the gluster stack has fairly low
limits on how many IOPS it can push through, on both the client and brick
sides:

fuse-client limit: ~23k IOPS
brick limit: ~66k IOPS
