[Gluster-devel] Some performance issues in mount/fuse

Xavier Hernandez xhernandez at datalab.es
Tue Mar 12 08:30:56 UTC 2013

AFAIK the kernel does not allow requests bigger than 128KB, and gluster has 
this limit hardcoded in fuse-bridge.c. Currently it is not possible to 
increase or decrease this value.

I ran the tests using the maximum block size.

On 12/03/13 08:16, lierihanmei wrote:
> When glusterfs mounts fuse, it uses the max_read=128KB option. Any bigger
> request is split. Tuning the option would make big reads and writes
> faster, but would not help with small files.
> At 2013-03-11 18:49:47,"Xavier Hernandez" <xhernandez at datalab.es> wrote:
> >Hello,
> >
> >I've recently performed some tests with gluster on a fast network (IP
> >over infiniband) and got some unexpected results. It seems that
> >mount/fuse is becoming a bottleneck when the network and disk are very fast.
> >
> >I started with a simple distributed volume with 2 bricks mounted on a
> >ramdisk to avoid possible disk bottlenecks (however I repeated the tests
> >with an SSD and, later, with a normal hard disk and the results were the
> >same, probably due to the good work of performance translators). With
> >this configuration, a single write reached a throughput of ~420 MB/s.
> >It's way below the maximum network limit, but for a single write it's
> >quite acceptable. However with two concurrent writes (carefully chosen
> >so that each one goes to a different brick), the throughput was ~200
> >MB/s (for each transfer). That was totally unexpected. As there was
> >plenty of bandwidth available and no IO limitation, I was expecting
> >something near 800 MB/s.
> >
> >In fact, any combination of concurrent writes always led to the same
> >combined throughput of ~400 MB/s.
> >
> >Trying to determine the cause of this odd behavior, I noticed that
> >mount/fuse uses a single thread to serve kernel requests, and once a
> >request is received, it is sent down the xlator stack to process it,
> >only reading additional requests once the stack returns. This means that
> >to reach a 420 MB/s throughput using 128KB per request (the current
> >maximum block size), it needs to serve, at least, 3360 requests per
> >second. In other words, it processes each request in 300 us. If we take
> >into account that every translator will allocate memory, and do some
> >system calls, it's quite possible that it really takes 300 us to serve
> >each request.
> >
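
The arithmetic in the paragraph above can be checked with a few lines of
Python. This is only a back-of-the-envelope sketch: it assumes that MB and KB
are binary units (2**20 and 2**10 bytes), and the function names are
illustrative, not anything from gluster.

```python
# Back-of-the-envelope check of the request rate implied by a throughput.
# Assumption: MB and KB are binary units (2**20 and 2**10 bytes).
MIB = 2**20
KIB = 2**10


def requests_per_second(throughput_mib_s, block_kib=128):
    """Requests per second needed to sustain the given throughput."""
    return throughput_mib_s * MIB / (block_kib * KIB)


def service_time_us(throughput_mib_s, block_kib=128):
    """Time budget per request, in microseconds, if requests are served
    strictly one after another by a single thread."""
    return 1e6 / requests_per_second(throughput_mib_s, block_kib)


print(f"{requests_per_second(420):.0f} requests/s")    # 3360 requests/s
print(f"{service_time_us(420):.0f} us per request")    # 298 us per request
```

With a single serving thread this per-request time is a hard ceiling: no
matter how many writers there are, the combined throughput cannot exceed
block size divided by service time.
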
> >To see if this is the case, I added performance/io-threads just
> >below mount/fuse. This would queue each request to a different
> >thread, freeing the current one to read another request well before the
> >300 us have elapsed. This should improve the concurrent writes case.
> >
> >The results are good. Using this simple modification, 2 concurrent
> >writes performed at ~300 MB/s each one. However the throughput for a
> >single write dropped to ~250 MB/s. Anyway, this solution is not valid
> >because there is some incompatibility with this configuration and some
> >things do not work well (for example a simple 'ls' does not show all the
> >files).
> >
> >Then I modified the mount/fuse xlator to start some threads to serve
> >kernel requests. With this modification all seems to work as expected
> >and throughput is considerably better: a single write still performs at 420
> >MB/s, and 2 concurrent writes reach 330 MB/s each. In fact, any combination
> >of 2 or more concurrent writes has a combined throughput of ~650 MB/s.
> >
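
The idea of several threads serving kernel requests can be sketched roughly
as follows. This is a toy model in Python, not the actual mount/fuse change:
the queue stands in for the fuse device, `handler` stands in for the trip
down the xlator stack, and all names are hypothetical.

```python
import queue
import threading


def serve_requests(requests, handler, num_readers):
    """Drain `requests` with `num_readers` threads. Each thread pulls the
    next pending request and processes it to completion, so one slow
    request does not stop the other threads from picking up new ones."""
    pending = queue.Queue()  # stands in for the fuse device (assumption)
    for req in requests:
        pending.put(req)

    replies = []
    replies_lock = threading.Lock()

    def reader_loop():
        while True:
            try:
                req = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to read
            reply = handler(req)  # stands in for the xlator stack
            with replies_lock:
                replies.append(reply)

    readers = [threading.Thread(target=reader_loop) for _ in range(num_readers)]
    for t in readers:
        t.start()
    for t in readers:
        t.join()
    return replies


# With num_readers=1 this degenerates to the single-threaded behaviour.
print(sorted(serve_requests(range(8), lambda r: r * 2, num_readers=4)))
```

The sketch only illustrates the structure; CPython threads will not show the
throughput gain, and replies may complete in any order, which is why the
example sorts them before printing.
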
> >However, a replicate volume does not improve at all. I'm not sure why.
> >It seems that there must be some kind of serialization point in
> >cluster/afr. A single write has a throughput of ~175 MB/s, and 2
> >concurrent writes ~85 MB/s. I'll have to investigate this further.
> >
> >Does all this make sense?
> >
> >Is this something that would be worth investing more time in?
> >
> >Regards,
> >
> >Xavi
> >
