[Gluster-devel] Some performance issues in mount/fuse

Mon Mar 11 10:49:47 UTC 2013

Hello,

I've recently performed some tests with gluster on a fast network (IP 
over InfiniBand) and got some unexpected results. It seems that 
mount/fuse becomes a bottleneck when the network and disk are very fast.

I started with a simple distributed volume with 2 bricks mounted on a 
ramdisk to avoid possible disk bottlenecks (however I repeated the tests 
with an SSD and, later, with a normal hard disk and the results were the 
same, probably due to the good work of performance translators). With 
this configuration, a single write reached a throughput of ~420 MB/s. 
It's way below the maximum network limit, but for a single write it's 
quite acceptable. However with two concurrent writes (carefully chosen 
so that each one goes to a different brick), the throughput was ~200 
MB/s (for each transfer). That was totally unexpected. As there was 
plenty of bandwidth available and no I/O limitation, I was expecting 
something near 800 MB/s.

In fact, any combination of concurrent writes always led to the same 
combined throughput of ~400 MB/s.

Trying to determine the cause of this odd behavior, I noticed that 
mount/fuse uses a single thread to serve kernel requests, and once a 
request is received, it is sent down the xlator stack to process it, 
only reading additional requests once the stack returns. This means that 
to reach a 420 MB/s throughput using 128KB per request (the current 
maximum block size), it needs to serve at least 3360 requests per 
second. In other words, it has about 300 us to process each request. If 
we take into account that every translator will allocate memory and do 
some system calls, it's quite possible that it really takes 300 us to 
serve each request.
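
As a rough check of the arithmetic above, here is a minimal Python 
sketch. The function names are made up for illustration, and it assumes 
1 MB = 1024 KB:

```python
def requests_per_second(throughput_mb_s, request_kb=128):
    """Requests/s needed to sustain a throughput with fixed-size requests."""
    return throughput_mb_s * 1024 / request_kb


def per_request_budget_us(throughput_mb_s, request_kb=128):
    """Time available per request (in microseconds) when requests are
    served strictly one after another by a single thread."""
    return 1_000_000 / requests_per_second(throughput_mb_s, request_kb)


if __name__ == "__main__":
    print(requests_per_second(420))    # requests per second at 420 MB/s
    print(per_request_budget_us(420))  # microseconds available per request
```

For 420 MB/s this gives 3360 requests per second and a budget of roughly 
298 us per request, which is where the 300 us figure comes from.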

To see if this is the case, I added the performance/io-threads 
translator just below mount/fuse. This queues each request to a 
different thread, freeing the current one to read another request well 
before the 300 us have elapsed. This should improve the concurrent 
writes case.
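
To illustrate the idea of handing requests off to other threads, here is 
a toy Python sketch. It is not gluster code: serve() and handle() are 
hypothetical names, handle() stands in for the whole xlator stack, and 
the sketch only shows the structure, not the performance:

```python
import queue
import threading


def serve(requests, handle, workers=4):
    """Read requests in a single loop, but let a pool of workers run
    handle() so the reading loop never waits for a request to finish."""
    pending = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            req = pending.get()
            if req is None:          # sentinel: no more work for this worker
                break
            out = handle(req)        # stand-in for the xlator stack
            with lock:
                results.append(out)

    pool = [threading.Thread(target=worker) for _ in range(workers)]
    for t in pool:
        t.start()
    for req in requests:             # the "reader": it only enqueues
        pending.put(req)
    for _ in pool:                   # one sentinel per worker
        pending.put(None)
    for t in pool:
        t.join()
    return results
```

Note that with this structure requests may complete in a different 
order than they were read.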

The results are good. With this simple modification, 2 concurrent 
writes reached ~300 MB/s each. However, the throughput for a single 
write dropped to ~250 MB/s. In any case, this solution is not valid 
because this configuration has some incompatibility and some things do 
not work well (for example, a simple 'ls' does not show all the files).

Then I modified the mount/fuse xlator to start several threads to serve 
kernel requests. With this modification everything seems to work as 
expected and throughput is considerably better: a single write still 
performs at 420 MB/s, and 2 concurrent writes reach 330 MB/s each. In 
fact, any combination of 2 or more concurrent writes has a combined 
throughput of ~650 MB/s.
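
The structure of this second approach can be sketched in the same toy 
style. Again, this is not the actual mount/fuse change: 
serve_multi_reader() is a hypothetical name, a pre-filled queue.Queue 
stands in for the kernel side, and handle() stands in for the xlator 
stack:

```python
import queue
import threading


def serve_multi_reader(source, handle, readers=4):
    """Start several reader threads. Each one takes a request from the
    shared source and processes it inline, so a slow request only blocks
    the thread that picked it up."""
    results = []
    lock = threading.Lock()

    def reader():
        while True:
            try:
                req = source.get_nowait()   # stand-in for reading a request
            except queue.Empty:
                return                      # nothing left to serve
            out = handle(req)               # processed in the reader itself
            with lock:
                results.append(out)

    threads = [threading.Thread(target=reader) for _ in range(readers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The difference from the io-threads experiment is that there is no 
hand-off between threads: the thread that reads a request is also the 
one that runs it.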

However, a replicate volume does not improve at all. I'm not sure why. 
It seems that there must be some kind of serialization point in 
cluster/afr. A single write has a throughput of ~175 MB/s, and 2 
concurrent writes reach ~85 MB/s each. I'll have to investigate this 
further.

Does all this make sense?

Is this something worth investing more time in?

Regards,

Xavi