[Gluster-devel] Jeff Darcy's objections to multi-thread-epoll and proposal to use own-thread alternative

Tue Oct 14 14:23:50 UTC 2014

This e-mail is specifically about the use of the multi-thread-epoll optimization (originally prototyped by Anand Avati) to solve a Gluster performance problem: single-threaded reception of protocol messages (for non-SSL sockets), and the consequent inability to fully utilize the available CPU on the server.  A discussion of its pros and cons follows, along with the alternative suggested by Jeff Darcy, referred to as "own-thread" below.  Thanks to Shyam Ranganathan for helping me to clarify my thoughts on this.  Attached is some performance data about multi-thread-epoll.

To see why this threading discussion matters, consider that the hardware found in enterprise servers is rapidly speeding up, with 40-Gbps networks and SSDs, but CPUs are not speeding up nearly as much.  Instead, we have more cores per socket.  So adequate performance for Gluster will require enough threads to match CPU throughput to network and storage.

One way to get the server's idle CPU horsepower engaged is JBOD (just a bunch of disks, no RAID), since there is one glusterfsd, and hence 1 epoll thread, per brick (disk).  However, JBOD causes scalability problems for small-file creates (cluster.lookup-unhashed=on is the default), and it limits the throughput of an individual file to the speed of a single disk drive, so until these problems are addressed, the utility of the JBOD approach is limited.

----- Original Message -----
> From: "Jeff Darcy" <jdarcy at redhat.com>
> To: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Wednesday, October 8, 2014 4:20:34 PM
> Subject: [Gluster-devel] jdarcy status (October 2014)
> 
> Multi-threading is even more controversial.  It has also been in the
> tree for two years (it was developed to address the problem of SSL code
> slowing down our entire transport stack).  This feature, controlled by
> the "own-thread" transport option, uses a thread per connection - not my
> favorite concurrency model, but kind of necessary to deal with the
> OpenSSL API.  More recently, a *completely separate* approach to
> multi-threading - "multi-threaded epoll" - has been getting some
> attention.  Here's what I see as the pros and cons of this new approach.
> 
>  * PRO: greater parallelism of requests on a single connection.  I think
>    the actual performance benefits vs. own-thread are unproven and
>    likely to be small, but they're real.
>

We should try comparing the performance of multi-thread-epoll to own-thread; it shouldn't be hard to hack own-thread into the non-SSL-socket case.

HOWEVER, if "own-thread" implies a thread per network connection, then as you scale out a Gluster volume with N bricks, you have O(N) clients, and therefore O(N) threads on each glusterfsd (libgfapi adoption would make it far worse)!  Suppose we are implementing a 64-brick configuration with 200 clients, not an unreasonably sized Gluster volume for a scalable filesystem.  We then have 200 threads per glusterfsd, one glusterfsd per brick, just listening for RPC messages.  On a 60-drive server there can be a lot more than 1 brick per server, so multiply threads/glusterfsd by the per-server brick count!  It doesn't make sense to have total threads >= CPUs, and modern processors make context switching between threads more and more expensive.
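
To make the thread-count arithmetic above concrete, here is a rough sketch.  It assumes every client holds one connection to every brick and that own-thread dedicates one listener thread per connection; the figure of 60 bricks per server (one brick per drive on a 60-drive box) is an illustrative assumption, not a measured configuration.

```python
def listener_threads(clients, bricks_per_server, threads_per_connection=1):
    """Estimate RPC listener threads under a thread-per-connection model.

    Assumes each client connects to every brick, so each glusterfsd
    (one per brick) needs one listener thread per client.
    """
    per_glusterfsd = clients * threads_per_connection
    per_server = per_glusterfsd * bricks_per_server
    return per_glusterfsd, per_server


# 200 clients; 60 bricks per server is a hypothetical worst case.
per_glusterfsd, per_server = listener_threads(clients=200, bricks_per_server=60)
print(per_glusterfsd, per_server)  # 200 12000
```

Even with far fewer bricks per server, the per-server listener thread count quickly exceeds the core count of any current machine.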

Shyam mentioned a refinement to own-thread where we equally partition the set of TCP connections among a pool of threads (own-thread is a special case of this).  This still cannot supply an individual client with more than 1 thread to receive RPCs, even when most of the CPU cores on the server are idle.  Why impose this constraint (see below)?  To see why this is important, consider a common use case: KVM virtualization.

SSDs require orders of magnitude more IOPS from glusterfsd and glusterfs than a traditional rotating disk.  So even if you dedicate a thread to a single network connection, this thread may still have trouble keeping up with the high-speed network and the SSD.  Multi-thread-epoll is the only proposal so far that offers a way to apply enough CPU to this problem.  Consider that some SSDs have throughput on the order of a million IOPS (I/O operations per second).  In the past, we have worked around this problem by placing multiple bricks on a single SSD, but this causes other problems (scalability, free space measurement).
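
A back-of-the-envelope sketch of why one thread per connection may not be enough: if receiving and dispatching one RPC costs a fixed amount of CPU time, a single thread has a hard IOPS ceiling.  The 20-microsecond per-RPC cost below is purely hypothetical, chosen for illustration; real numbers would have to come from profiling.

```python
import math


def max_iops_per_thread(cpu_us_per_rpc):
    """IOPS a single receiving thread can sustain at this CPU cost per RPC."""
    return 1_000_000 / cpu_us_per_rpc


def threads_needed(target_iops, cpu_us_per_rpc):
    """Receiving threads required to keep up with target_iops."""
    return math.ceil(target_iops * cpu_us_per_rpc / 1_000_000)


# Hypothetical cost: 20 microseconds of CPU per RPC (not a measured value).
print(max_iops_per_thread(20))        # 50000.0
print(threads_needed(1_000_000, 20))  # 20
```

Under that assumption a single connection driving a million-IOPS SSD would need on the order of 20 receiving threads, which a one-thread-per-connection model cannot provide.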

>  * CON: with greater concurrency comes greater potential to uncover race
>    conditions in other modules used to being single-threaded.  We've
>    already seen this somewhat with own-thread, and we'd see it more with
>    multi-epoll.
> 

On the Gluster server side, because of the io-threads translator, an RPC listener thread is effectively just handing the request to a worker thread and then going back to read another RPC.  So with own-thread, although RPC requests are received in order, there is no guarantee that the requests will be processed in the order that they were received from the network.  On the client side, we have operations such as readdir that will fan out parallel FOPs.  If you use the own-thread approach, then the replies to these parallel FOPs can all be processed in parallel by the listener threads, so you get at least the same level of race-condition exposure that you would get with multi-thread-epoll.
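
The ordering point can be illustrated with a toy model (this is not Gluster code): a single listener submits requests in arrival order to a worker pool standing in for io-threads, and completion order depends on how long each request takes.  The function names and the service times are made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def handle(request_id, service_time):
    """Stand-in for a worker thread processing one request."""
    time.sleep(service_time)
    return request_id


def receive_and_dispatch(requests, workers=4):
    """requests: (id, service_time) pairs in arrival order.

    Returns the request ids in the order their processing completed.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(handle, rid, t) for rid, t in requests]
        return [f.result() for f in as_completed(futures)]


# Received in order 1, 2, 3, but the slowest request arrives first,
# so completion order will typically differ from arrival order.
arrival = [(1, 0.3), (2, 0.2), (3, 0.1)]
print(receive_and_dispatch(arrival))
```

A single in-order listener therefore does not by itself protect downstream translators from concurrent, out-of-order processing.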

>  * CON: multi-epoll does not work with SSL.  It *can't* work with
>    OpenSSL at all, short of adopting a hybrid model where SSL
>    connections use own-thread while others use multi-epoll, which is a
>    bit of a testing nightmare.
> 

Why is it a testing nightmare?  Once the RPC message is received, both multi-epoll and own-thread are doing the same thing and handing off to a translator that can do a stack wind/unwind to start the message processing, am I right?  So the code path unifies at that point.  As stated above, both approaches have a similar level of race conditions that might be exposed.  Shyam is of the opinion that we have already exposed many of them with Ganesha and SMB.    

> Obviously I'm not a fan of multi-epoll.  The first point suggests little
> or no benefit.  The second suggests greater risk.  The third is almost
> fatal all by itself, and BTW it was known all along.  Don't we have
> better things to do?

IMHO it's worth it to carefully trade off architectural purity in a few places to achieve improved performance.  In summary, to back the own-thread alternative I would need to see that a) the own-thread approach is scalable, and b) performance data shows that own-thread is comparable to multi-thread-epoll.  Otherwise, in the absence of any other candidates, we have to go with multi-thread-epoll.  I don't think we can make much progress without multi-threading RPC message reception.  We have already reduced the number of system calls needed to receive an RPC as much as we can in some cases; this has helped, but it's just not enough (see bz 800892, 821087).

Opinions appreciated; now is the time to speak up...

-ben
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mtep.pdf
Type: application/pdf
Size: 33428 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20141014/c4f9f707/attachment-0001.pdf>