[Gluster-devel] Performance improvements

Thu Jan 24 15:47:26 UTC 2019

Hi all,

I've just updated a patch [1] that implements a new thread pool based on a
wait-free queue provided by the userspace-rcu library. The patch also
includes an auto-scaling mechanism that keeps only as many threads running
as the current workload needs.

This new approach has some advantages:

   - It's provided globally inside libglusterfs instead of inside an xlator

This lets the fuse thread and the epoll threads hand off each received
request to another thread sooner, wasting less CPU and reacting faster to
other incoming requests.

   - Adding jobs to the queue used by the thread pool only requires an
   atomic operation

This makes the producer side of the queue really fast, almost with no delay.

   - Contention is reduced

The producer side has negligible contention thanks to the wait-free enqueue
operation based on an atomic access. The consumer side requires a mutex,
but it is held only very briefly, and the scaling mechanism makes sure that
there are no more threads than needed contending for the mutex.
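
As a rough illustration of the auto-scaling idea only (this is not the code
from the patch), here is a toy Python model. The class name ScalingPool, the
idle_timeout parameter and the spawn/retire rules are assumptions made for
the example. Python's queue.SimpleQueue stands in for the userspace-rcu
queue and is not wait-free, and the lock here only guards the worker
counters.

```python
import queue
import threading
import time


class ScalingPool:
    """Toy pool: grows when nobody is idle, shrinks after an idle timeout."""

    def __init__(self, max_threads=4, idle_timeout=0.05):
        self.jobs = queue.SimpleQueue()    # producers only ever call put()
        self.max_threads = max_threads
        self.idle_timeout = idle_timeout
        self.lock = threading.Lock()       # guards the two counters below
        self.workers = 0
        self.idle = 0

    def submit(self, fn, *args):
        self.jobs.put((fn, args))
        with self.lock:
            # Start a worker only if none is idle and we are under the cap.
            spawn = self.idle == 0 and self.workers < self.max_threads
            if spawn:
                self.workers += 1
        if spawn:
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            with self.lock:
                self.idle += 1
            try:
                job = self.jobs.get(timeout=self.idle_timeout)
            except queue.Empty:
                job = None
            with self.lock:
                self.idle -= 1
                # Retire only if the queue is still empty, so a job queued
                # while this worker was timing out is never stranded.
                if job is None and self.jobs.empty():
                    self.workers -= 1
                    return
            if job is not None:
                fn, args = job
                fn(*args)


pool = ScalingPool(max_threads=3, idle_timeout=0.05)
results = []
all_done = threading.Event()


def job(i):
    results.append(i)
    if len(results) == 20:
        all_done.set()


for i in range(20):
    pool.submit(job, i)
all_done.wait(timeout=5)
time.sleep(0.5)    # much longer than idle_timeout, so idle workers retire
```

After the burst, all 20 jobs have run and the worker count falls back to
zero, which is the behaviour the scaling mechanism is meant to provide.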

This change disables io-threads, since it replaces part of its
functionality. However there are two things that could be needed from
io-threads:

   - Prioritization of fops

Currently, io-threads assigns priorities to each fop, so that some fops are
handled before others.

   - Fair distribution of execution slots between clients

Currently, io-threads processes requests from each client in round-robin.

These features are not implemented right now. If they are needed, probably
the best thing to do would be to keep them inside io-threads, but change
its implementation so that it uses the global threads from the thread pool
instead of its own threads.
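
To make the two features concrete, here is a minimal Python sketch of a
scheduler that serves the highest priority level first and rotates between
clients inside a level. It does not reflect how io-threads is actually
implemented; the class FairPriorityQueue, the number of levels and the
client and fop names are all made up for the example.

```python
from collections import OrderedDict, deque


class FairPriorityQueue:
    """Serve level 0 first; within a level, rotate between clients."""

    def __init__(self, levels=3):
        # One ordered map of client -> pending fops per priority level.
        self.levels = [OrderedDict() for _ in range(levels)]

    def push(self, client, priority, fop):
        self.levels[priority].setdefault(client, deque()).append(fop)

    def pop(self):
        for clients in self.levels:
            if clients:
                client, fops = next(iter(clients.items()))
                fop = fops.popleft()
                if fops:
                    # The client just served goes to the back of the line.
                    clients.move_to_end(client)
                else:
                    del clients[client]
                return client, fop
        return None


q = FairPriorityQueue()
for client, priority, fop in [
    ("A", 1, "write-1"),
    ("A", 1, "write-2"),
    ("A", 1, "write-3"),
    ("B", 1, "write-4"),
    ("B", 0, "lookup"),
]:
    q.push(client, priority, fop)

order = []
while (item := q.pop()) is not None:
    order.append(item)
```

Client B's lookup is served first because of its priority, and B's write is
not stuck behind all three of A's writes.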

If this change proves to perform better and is merged, I have some more
ideas to improve other areas of gluster:

   - Integrate synctask threads into the new thread pool

I think there is some contention in these threads because during some tests
I've seen they were consuming most of the CPU. Probably they suffer from
the same problem as io-threads, so replacing them could improve things.

   - Integrate timers into the new thread pool

My idea is to create a per-thread timer where code executed in one thread
will create timer events in the same thread. This makes it possible to use
structures that don't require any mutex to be modified.

Since the thread pool is basically executing computing tasks, which are
fast, I think it's feasible to implement a timer in the main loop of each
worker thread with a resolution of a few milliseconds, which I think is
good enough for gluster's needs.
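
A minimal Python sketch of that per-thread timer idea, under assumptions:
the Worker class, its method names and the 5 ms default resolution are
invented for the example, and nothing like this exists in the patch yet.
The point is that the timer heap is only ever touched by the worker that
owns it, so it needs no mutex.

```python
import heapq
import itertools
import queue
import time


class Worker:
    def __init__(self, clock=time.monotonic):
        self.jobs = queue.SimpleQueue()
        self.timers = []               # heap owned by this worker: no mutex
        self.seq = itertools.count()   # tie-breaker for equal deadlines
        self.clock = clock

    def add_timer(self, delay, callback):
        """Only valid when called from code running on this worker."""
        deadline = self.clock() + delay
        heapq.heappush(self.timers, (deadline, next(self.seq), callback))

    def run_once(self, max_wait=0.005):
        """One main-loop iteration: fire due timers, then run one job."""
        now = self.clock()
        while self.timers and self.timers[0][0] <= now:
            _, _, callback = heapq.heappop(self.timers)
            callback()
        # Never sleep past the next deadline or the timer resolution.
        wait = max_wait
        if self.timers:
            wait = min(wait, max(0.0, self.timers[0][0] - self.clock()))
        try:
            job = self.jobs.get(timeout=wait)
        except queue.Empty:
            return
        job(self)


# Drive the loop by hand with a fake clock so the behaviour is repeatable.
now = [0.0]
worker = Worker(clock=lambda: now[0])
fired = []
worker.jobs.put(lambda w: w.add_timer(0.010, lambda: fired.append(now[0])))

worker.run_once(max_wait=0.001)   # runs the job, which arms a 10 ms timer
now[0] = 0.005
worker.run_once(max_wait=0.001)   # too early, nothing fires
early = list(fired)
now[0] = 0.010
worker.run_once(max_wait=0.001)   # deadline reached, callback runs
```

In a real worker, run_once would simply be called in a loop; the resolution
is bounded by max_wait as long as individual jobs stay short.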

   - Integrate with userspace-rcu library in QSBR mode

This will make it possible to use some RCU-based structures for anything
gluster uses (inodes, fd's, ...). These structures have very fast read
operations, which should reduce contention and improve performance in many
places.

   - Integrate I/O threads into the thread pool and reduce context switches

The idea here is a bit more complex. Basically I would like to have a
function that does an I/O on some device (for example reading fuse requests
or waiting for epoll events). We could send a request to the thread pool to
execute that function, so it would be executed inside one of the worker
threads. When the I/O completes (i.e. it has received a request), the idea
is that a call to the same function is added to the thread pool, so that
another thread could continue waiting for requests, but the current thread
will start processing the received request without a context switch.
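
The hand-off described above can be sketched in Python as follows. This is
only a model: queue.Queue stands in for the fuse device or epoll, a stock
ThreadPoolExecutor stands in for the new thread pool, and serve, handle and
STOP are hypothetical names.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

STOP = object()    # hypothetical shutdown marker


def serve(pool, source, handle):
    """Wait for one request, pass the waiting role on, then process it."""
    request = source.get()          # stands in for the blocking I/O
    if request is STOP:
        return
    # Queue another call to this same function so that some other thread
    # keeps waiting for requests...
    pool.submit(serve, pool, source, handle)
    # ...while this thread processes what it just received, with no
    # hand-off of the request to a different thread.
    handle(request)


source = queue.Queue()
handled = []
all_handled = threading.Event()


def handle(request):
    handled.append(request)
    if len(handled) == 5:
        all_handled.set()


pool = ThreadPoolExecutor(max_workers=2)
pool.submit(serve, pool, source, handle)
for i in range(5):
    source.put(i)
all_handled.wait(timeout=5)
source.put(STOP)                    # the last waiter exits without resubmitting
pool.shutdown(wait=True)
```

At any moment at most one thread is blocked in the I/O call, and the thread
that receives a request is the one that processes it.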

Note that with all these changes, all dedicated threads that we currently
have in gluster could be replaced by the features provided by this new
thread pool, so these would be the only threads present in gluster. This is
especially important when brick-multiplex is used.

I've done some simple tests using a replica 3 volume and a disperse 4+2
volume. These tests are executed on a single machine using an HDD for each
brick (not the best scenario, but it should be fine for comparison). The
machine is quite powerful (dual Intel Xeon Silver 4114 @2.2 GHz, with 128
GiB RAM).

These tests have shown that the limiting factor has been the disk in most
cases, so it's hard to tell if the change has really improved things. There
is only one clear exception: self-heal on a dispersed volume completes
12.7% faster. CPU utilization has also dropped drastically:

Old implementation: 12.30 user, 41.78 sys, 43.16 idle,  0.73 wait

New implementation: 4.91 user,  5.52 sys, 81.60 idle,  5.91 wait
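
For a quick sense of scale, the figures above can be compared directly. The
snippet assumes they are percentages of total CPU time and counts
user + sys as busy time; both are assumptions about how the samples were
taken.

```python
old = {"user": 12.30, "sys": 41.78, "idle": 43.16, "wait": 0.73}
new = {"user": 4.91, "sys": 5.52, "idle": 81.60, "wait": 5.91}


def busy(sample):
    """CPU time spent running code: user plus system."""
    return sample["user"] + sample["sys"]


busy_ratio = busy(old) / busy(new)
sys_ratio = old["sys"] / new["sys"]

print(f"busy CPU: {busy(old):.2f} -> {busy(new):.2f}")
print(f"busy CPU is {busy_ratio:.1f}x lower, sys time is {sys_ratio:.1f}x lower")
```

Under those assumptions, busy CPU goes from about 54 to about 10, with most
of the drop coming from system time.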

Now I'm running some more tests on NVMe to try to see the effects of the
change when the disk is not the limiting factor. I'll post an update once I
have more data.

Xavi

[1] https://review.gluster.org/c/glusterfs/+/20636