[Bugs] [Bug 1467614] Gluster read/write performance improvements on NVMe backend

bugzilla at redhat.com
Thu Dec 21 06:19:50 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1467614



--- Comment #52 from Manoj Pillai <mpillai at redhat.com> ---

We [Raghavendra, Krutika, Mohit, Milind, me] met up and Raghavendra went over
the rpc layer details so we could brainstorm on potential inefficiencies that
could explain the current low limit on IOPS on the client side. We discussed
the possibility that the current IOPS limit may be due to a delay in adding the
socket back for polling, similar to https://review.gluster.org/#/c/17391/, but
with the delay now coming from socket_event_handler() in
rpc/rpc-transport/socket.

One suggestion from Milind was to have multiple connections between client and
brick. We can test this hypothesis with a plain distribute volume with multiple
bricks, each on its own NVMe SSD: if going from a single brick to multiple
bricks improves IOPS, having multiple connections per brick could be promising.
IIRC, earlier tests with multiple bricks were not effective in scaling IOPS,
but it would be good to check whether recent improvements have changed that.
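
A test volume along those lines can be created with the usual gluster CLI; a
minimal sketch (the hostname and brick paths below are placeholders, not my
actual setup):

# plain distribute volume, one brick per NVMe SSD
gluster volume create perfvol \
    server1:/bricks/nvme0/brick \
    server1:/bricks/nvme1/brick \
    server1:/bricks/nvme2/brick
gluster volume start perfvol
# FUSE mount on the client
mount -t glusterfs server1:/perfvol /mnt/glustervol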

The short of it is: yes, I saw IOPS go up from 34k with a single brick to 45.7k
with a 3-brick distribute volume. That is a good bump in IOPS, maybe better
than any single bump we have seen so far. At 45.7k, on my setup, the fuse
thread looks to be a bottleneck at 98% CPU utilization, so the gain could
potentially have been higher. [In order to get this higher throughput, I had to
bump up the number of concurrent requests to 48.]

Details:

version:
glusterfs-3.13.0-1.el7.x86_64

fio job file for random read:
[global]
rw=randread
exitall
startdelay=0
ioengine=sync
# direct=1 gives O_DIRECT I/O, in line with performance.strict-o-direct on the volume
direct=1
bs=4k

[randread]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
iodepth=1
# 48 jobs = the 48 concurrent requests mentioned above
numjobs=48
nrfiles=1
openfiles=1
filesize=5g
size=5g
io_size=512m
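
(Assuming the job file above is saved as, say, randread.fio, and the volume is
mounted at /mnt/glustervol on the client, the run is simply "fio randread.fio";
the file name is arbitrary.)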

results:
single-brick: read: IOPS=34.0k, BW=137MiB/s (143MB/s)(23.0GiB/179719msec)
3-brick distribute volume: read: IOPS=45.7k, BW=178MiB/s (187MB/s)(23.0GiB/132038msec)

volume info (options):
Options Reconfigured:
performance.strict-o-direct: on
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.io-cache: off
performance.client-io-threads: on
performance.io-thread-count: 8
performance.read-ahead: off
transport.address-family: inet
nfs.disable: on
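
For anyone reproducing this, the non-default options above are set with the
standard gluster volume set command, e.g. (the volume name here is a
placeholder):

gluster volume set perfvol performance.strict-o-direct on
gluster volume set perfvol client.event-threads 4
gluster volume set perfvol server.event-threads 4
gluster volume set perfvol performance.io-thread-count 8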

client top output (thread-level) for 3-brick case:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
24932 root      20   0 1284136  27576   3680 R 97.9  0.1   4:08.73 glusterfuse+
24925 root      20   0 1284136  27576   3680 R 72.5  0.1   2:26.06 glusterepol+
24928 root      20   0 1284136  27576   3680 S 72.5  0.1   2:26.01 glusterepol+
24927 root      20   0 1284136  27576   3680 R 72.4  0.1   2:25.79 glusterepol+
24929 root      20   0 1284136  27576   3680 S 72.4  0.1   2:26.17 glusterepol+
25093 root      20   0 1284136  27576   3680 R 41.4  0.1   1:07.35 glusteriotw+
25095 root      20   0 1284136  27576   3680 S 41.4  0.1   1:06.56 glusteriotw+
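
(The per-thread view above comes from running top in threads mode against the
client process, i.e. something like: top -H -p <client glusterfs pid>.)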

Conclusions:
1. The results indicate that we are on the right track in suspecting the
delayed add-back of the socket to be one of the culprits for the limited IOPS
per client.
2. We need to deal with the fuse-thread bottleneck in order to make more
progress here.


