[Gluster-devel] Some performance issues in mount/fuse
xhernandez at datalab.es
Mon Mar 11 10:49:47 UTC 2013
I've recently performed some tests with gluster on a fast network (IP
over infiniband) and got some unexpected results. It seems that
mount/fuse is becoming a bottleneck when the network and disk are very fast.
I started with a simple distributed volume with 2 bricks mounted on a
ramdisk to avoid possible disk bottlenecks (however I repeated the tests
with an SSD and, later, with a normal hard disk and the results were the
same, probably due to the good work of performance translators). With
this configuration, a single write reached a throughput of ~420 MB/s.
It's way below the maximum network limit, but for a single write it's
quite acceptable. However with two concurrent writes (carefully chosen
so that each one goes to a different brick), the throughput was ~200
MB/s (for each transfer). That was totally unexpected. As there was
plenty of bandwidth available and no I/O limitation, I was expecting
something near 800 MB/s.
In fact, any combination of concurrent writes always led to the same
combined throughput of ~400 MB/s.
Trying to determine the cause of this odd behavior, I noticed that
mount/fuse uses a single thread to serve kernel requests, and once a
request is received, it is sent down the xlator stack to process it,
only reading additional requests once the stack returns. This means that
to reach a 420 MB/s throughput using 128KB per request (the current
maximum block size), it needs to serve, at least, 3360 requests per
second. In other words, it processes each request in about 300 us. If
we take into account that every translator will allocate memory and
make some system calls, it's quite possible that it really takes 300 us
to serve a request.
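
As a quick check of the arithmetic above, the following sketch recomputes
the request rate and the per-request time budget. It assumes that MB and
KB are binary units (2**20 and 2**10 bytes); the variable names are only
illustrative.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumption: MB = 2**20 bytes and KB = 2**10 bytes.
MB = 1024 * 1024
KB = 1024

throughput_bytes_per_s = 420 * MB   # observed single-write throughput
request_size_bytes = 128 * KB       # maximum block size per request

# How many requests per second the single thread must serve.
requests_per_second = throughput_bytes_per_s / request_size_bytes

# Time available to serve each request, in microseconds.
service_time_us = 1_000_000 / requests_per_second

print(f"{requests_per_second:.0f} requests/s")    # 3360 requests/s
print(f"{service_time_us:.1f} us per request")    # 297.6 us per request
```

If a single thread really spends about 300 us on each request, the
combined throughput stays capped near 420 MB/s no matter how many
writers are active, which would match the ~400 MB/s ceiling observed.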
To see if this is the case, I added the performance/io-threads
translator just below mount/fuse. This queues each request to a
different thread, freeing the current one to read another request well
before the 300 us have elapsed. This should improve the concurrent
writes case.
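
The hand-off idea can be illustrated with a small toy model. This is not
GlusterFS code: process(), serve_inline(), serve_with_handoff() and the
10 ms service time are all hypothetical, chosen only to show the
difference between processing each request inline and queueing it to
worker threads so the reader can move on to the next one.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-request cost, exaggerated so the effect is visible.
SERVICE_TIME_S = 0.01


def process(request_id):
    """Stand-in for sending one request down the translator stack."""
    time.sleep(SERVICE_TIME_S)
    return request_id, threading.current_thread().name


def serve_inline(requests):
    """One thread reads a request, processes it, then reads the next."""
    return [process(r) for r in requests]


def serve_with_handoff(requests, workers=4):
    """The reader only queues requests; worker threads process them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process, r) for r in requests]
        return [f.result() for f in futures]


def timed(fn, requests):
    """Return (results, elapsed seconds) for one serving strategy."""
    start = time.perf_counter()
    results = fn(requests)
    return results, time.perf_counter() - start


requests = list(range(8))
_, inline_s = timed(serve_inline, requests)
_, handoff_s = timed(serve_with_handoff, requests)
print(f"inline:   {inline_s * 1000:.0f} ms")
print(f"hand-off: {handoff_s * 1000:.0f} ms")
```

In this model the hand-off version finishes sooner because several
requests are in flight at once. It ignores the cost of the hand-off
itself, so it says nothing about how a single stream of requests
behaves.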
The results are good. Using this simple modification, 2 concurrent
writes reached ~300 MB/s each. However, the throughput for a single
write dropped to ~250 MB/s. Anyway, this solution is not valid because
there is some incompatibility with this configuration and some things
do not work well (for example, a simple 'ls' does not show all the
files).
Then I modified the mount/fuse xlator to start several threads to serve
kernel requests. With this modification everything seems to work as
expected and throughput is considerably better: a single write still
performs at 420 MB/s, and 2 concurrent writes reach 330 MB/s each. In
fact, any combination of 2 or more concurrent writes has a combined
throughput of ~650 MB/s.
However, a replicate volume does not improve at all. I'm not sure why.
It seems that there must be some kind of serialization point in
cluster/afr. A single write has a throughput of ~175 MB/s, and 2
concurrent writes reach ~85 MB/s each. I'll have to investigate this
further.
Does all this make sense?
Is this something that would be worth investing more time in?