[Gluster-devel] Gluster performance updates

Mon Oct 1 11:45:28 UTC 2018

Hi,

this is an update on some of the work done on performance and
consistency over the last few weeks. We'll try to build a complete list of
all known issues and track them through this email thread. Please let me
know of any performance issue not included in this email so that we can
track all of them.

*New improvements*

While testing performance on Red Hat products, we identified a problem
in the way eager-locking worked on replicate volumes in some scenarios
(virtualization and database workloads were affected). It caused an
unnecessary number of finodelk and fxattrop requests, which increased the
latency of write operations.

This has already been fixed with patches [1] and [2].
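
To give a feeling for why extra finodelk and fxattrop requests hurt write
latency, here is a toy request count. It is not Gluster code. It assumes a
simplified model in which each write transaction costs four bookkeeping
requests (finodelk lock, fxattrop pre-op, fxattrop post-op, finodelk
unlock), and that a working eager lock lets several writes share one
transaction. The constant and the function name are hypothetical.

```python
# Toy model only, NOT Gluster code.
# Assumption: every transaction needs four bookkeeping requests
# (finodelk lock, fxattrop pre-op, fxattrop post-op, finodelk unlock).
BOOKKEEPING_PER_TRANSACTION = 4


def extra_requests(num_writes: int, writes_per_transaction: int) -> int:
    """Count bookkeeping requests when writes share transactions."""
    if num_writes < 0 or writes_per_transaction < 1:
        raise ValueError("invalid write counts")
    # Ceiling division: a partially filled transaction still pays in full.
    transactions = -(-num_writes // writes_per_transaction)
    return transactions * BOOKKEEPING_PER_TRANSACTION


if __name__ == "__main__":
    # The lock is never reused: one transaction per write.
    print(extra_requests(1000, 1))
    # The lock is reused across 100 consecutive writes.
    print(extra_requests(1000, 100))
```

The real numbers depend on the workload and on how long the lock can be
held, so this only illustrates the direction of the effect.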

We have also identified some additional settings that provide better
performance for database workloads. A patch [3] to update the default
database profile with the new settings has been merged.

Combining all these changes (AFR fix and settings), pgbench performance has
improved ~300% on bare metal using NVMe, and a random I/O fio test running
on a VM has also improved by more than 300%.

*Known issues*

We have identified two issues in fuse mounts:

   - Because of selinux on the client machine, a getxattr request is sent
   by fuse before each write request. This adds some latency, although
   currently the request is answered directly by the fuse xlator when
   selinux is not enabled in gluster (default setting).

   - When *fopen-keep-cache* is enabled (default setting), kernel fuse
   sends a stat request before each read. Even with fopen-keep-cache
   disabled, fuse still sends half of the stat requests. This has been
   tracked down to the atime update; however, mounting the volume with
   noatime doesn't solve the issue because kernel fuse doesn't handle the
   noatime setting correctly.

Some other issues have been detected:

   - Bad performance of write-behind when stats and writes to the same
   file are mixed. Right now, when a stat is received, all previously cached
   writes are flushed before processing the new request. The same happens
   for a read when it overlaps with a previously cached write. This makes
   write-behind useless in this scenario.

*Note*: fuse is currently sending stat requests before reads (see previous
known issue), making reads almost as problematic as stat requests.
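
The write-behind behaviour described above can be sketched with a toy
model. This is not the write-behind xlator. The class and its counters are
hypothetical and only mimic the two rules stated above: a stat flushes all
cached writes, and a read flushes them when it overlaps a cached write.

```python
class ToyWriteBehind:
    """Toy stand-in for a write-behind cache (hypothetical, not Gluster)."""

    def __init__(self):
        self.pending = []  # cached (offset, data) writes not yet sent
        self.flushes = 0   # how many times cached writes had to be sent

    def write(self, offset, data):
        # Writes are only cached; nothing is sent to the bricks yet.
        self.pending.append((offset, data))

    def _flush(self):
        if self.pending:
            self.flushes += 1
            self.pending.clear()

    def stat(self):
        # A stat forces every cached write out before it is processed.
        self._flush()

    def read(self, offset, length):
        # A read forces a flush only if it overlaps a cached write.
        overlaps = any(
            off < offset + length and offset < off + len(data)
            for off, data in self.pending
        )
        if overlaps:
            self._flush()


if __name__ == "__main__":
    batched = ToyWriteBehind()
    for i in range(8):
        batched.write(i * 4, b"abcd")
    batched.stat()

    mixed = ToyWriteBehind()
    for i in range(8):
        mixed.write(i * 4, b"abcd")
        mixed.stat()

    print(batched.flushes, mixed.flushes)
```

In the mixed case every write is sent on its own, so nothing is gained
from caching it.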

   - Self-heal seems to be slow. It's still being investigated, but there
   are some indications that we have a considerable amount of contention in
   io-threads. This contention could be the cause of some other performance
   issues, but we need to investigate this further. There is already
   some work [4] trying to reduce it.

   - 'ls' performance is not good in some cases. When the volume has many
   bricks, 'ls' performance tends to degrade. We are still investigating the
   cause, but one important factor is that DHT sends readdir(p) requests to
   all its subvolumes. This means that 'ls' will run at the speed of the
   slowest brick. If any brick has an issue, or a spike in load, even
   if it's transitory, it will have a bad impact on 'ls' performance. This
   can be alleviated by enabling the parallel-readdir and readdir-ahead
   options.

*Note*: There have been reports that enabling parallel-readdir causes some
entries to apparently disappear after some time (though they are still
present on the bricks). I'm not aware of the root cause yet.
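
As a rough illustration of why a single slow brick hurts 'ls', here is a
toy latency model. It is not DHT code. It assumes a listing needs one
reply from every subvolume, and it models two hypothetical cases: requests
issued one after another (latencies add up) and requests issued at the
same time (the slowest reply decides). The function name and the numbers
are made up for the example.

```python
def listing_latency(brick_latencies_ms, parallel):
    """Toy estimate of one directory listing (hypothetical model)."""
    if not brick_latencies_ms:
        raise ValueError("at least one brick is required")
    if parallel:
        # All requests are in flight together: wait for the slowest.
        return max(brick_latencies_ms)
    # Requests are issued one after another: the latencies add up.
    return sum(brick_latencies_ms)


if __name__ == "__main__":
    healthy = [5, 5, 5, 5, 5, 5]
    one_slow = [5, 5, 5, 5, 5, 200]  # one brick with a load spike
    for name, bricks in (("healthy", healthy), ("one slow", one_slow)):
        print(name,
              listing_latency(bricks, parallel=False),
              listing_latency(bricks, parallel=True))
```

In both cases the slow brick dominates the result. Issuing the requests
in parallel only removes the cost of waiting for the healthy bricks one
by one.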

   - The number of threads in a server is quite high when multiple bricks
   are present, even if brick-mux is used. There are some efforts [5] trying
   to reduce this number.

*New features*

We have recently started the design [6] of a new caching infrastructure
that should provide much better performance, especially for small files or
metadata-intensive workloads. It should also provide a safe infrastructure
to keep cached information consistent on all clients.

This framework will make caching features available, in an easy and safe
way, to any xlator that needs them.

The current thinking is that the existing caching xlators (mostly
md-cache, io-cache and write-behind) will probably be reworked into a
single complete caching xlator, since this makes things easier.

Any feedback or ideas will be highly appreciated.

Xavi

[1] https://review.gluster.org/21107
[2] https://review.gluster.org/21210
[3] https://review.gluster.org/21247
[4] https://review.gluster.org/21039
[5] https://review.gluster.org/20859
[6] https://docs.google.com/document/d/1elX-WZfPWjfTdJxXhgwq37CytRehPO4D23aaVowtiE8/edit?usp=sharing