[Gluster-devel] Objections to multi-thread-epoll and proposal to use own-thread alternative

Tue Oct 14 16:34:29 UTC 2014

On 10/14/2014 08:23 AM, Jeff Darcy wrote:
>> We should try comparing performance of multi-thread-epoll to
>> own-thread, shouldn't be hard to hack own-thread into non-SSL-socket
>> case.
> Own-thread has always been available on non-SSL sockets, from the day it
> was first implemented as part of HekaFS.
>
>> HOWEVER, if "own-thread" implies a thread per network connection, as
>> you scale out a Gluster volume with N bricks, you have O(N) clients,
>> and therefore you have O(N) threads on each glusterfsd (libgfapi
>> adoption would make it far worse)!  Suppose we are implementing a
>> 64-brick configuration with 200 clients, not an unreasonably sized
>> Gluster volume for a scalable filesystem.   We then have 200 threads
>> per Glusterfsd just listening for RPC messages on each brick.  On a
>> 60-drive server there can be a lot more than 1 brick per server, so
>> multiply threads/glusterfsd by brick count!  It doesn't make sense to
>> have total threads >= CPUs, and modern processors make context
>> switching between threads more and more expensive.
> It doesn't make sense to have total *busy* threads >= cores (not CPUs)
> because of context switches, but idle threads are very low-cost.  Also,
> note that multi-threaded epoll is also not free from context-switch
> issues.  The real problem with either approach is "whale" servers with
> large numbers of bricks apiece, vs. "piranha" servers with relatively
> few.  That's an unbalanced system, with too little CPU and memory (and
> probably disk/network bandwidth) relative to capacity.
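
For a rough sense of the numbers in this exchange, here is a minimal sketch of the thread-count arithmetic. It assumes every client holds one connection to every brick and that each brick runs in its own glusterfsd. The function name and the 12-bricks-per-server figure are made up for illustration; only the 200-client figure comes from the thread. It counts all listener threads, idle or busy, so it does not by itself settle the busy-versus-idle point above.

```python
def own_thread_listener_threads(clients: int, bricks_per_server: int) -> dict:
    """Listener threads under a thread-per-connection (own-thread) model.

    Assumption: each client keeps one connection to each brick, and each
    brick is served by its own glusterfsd process.
    """
    per_glusterfsd = clients  # one listener thread per client connection
    per_server = per_glusterfsd * bricks_per_server
    return {"per_glusterfsd": per_glusterfsd, "per_server": per_server}


# 200 clients (from the thread) on a hypothetical server hosting 12 bricks.
print(own_thread_listener_threads(clients=200, bricks_per_server=12))
```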

This is where we engineers come into play. Given a set of parameters,
it's our job to build the system to suit our use case. If the "whale"
server does not suit it, we shouldn't be building it. Conversely, if
performance is not an issue but cost density is, we can build those
whales and be happy with them. Simply document the design well enough
that we can make those decisions.

> That said, I've already conceded that there are probably cases where
> multi-threaded epoll will generate more parallelism than own-thread.
> However, that only matters up to the point where we hit some other
> bottleneck.  The question is whether the difference is apparent *to the
> user* for any configuration and workload we can actually test.  Only
> after we have that answer can we evaluate whether the benefit is greater
> than the risk (of uncovering even more race conditions in other
> components) and the drawback of being unable to support SSL.
>
>> Shyam mentioned a refinement to own-thread where we equally partition
>> the set of TCP connections among a pool of threads (own-thread is a
>> special case of this).
> Some form of this would dovetail very nicely with the idea of
> multiplexing multiple bricks onto a single glusterfsd process, which we
> need to do for other reasons.
>
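
The pooled refinement mentioned above could be sketched roughly as follows. This is only an illustration of the partitioning idea, not Gluster code: the function name, the round-robin assignment, and the connection labels are all assumptions. With a pool as large as the connection count it degenerates to own-thread.

```python
def partition_connections(connections, pool_size):
    """Assign connections round-robin to pool_size threads.

    Returns one list of connections per thread; list sizes differ by at
    most one, which is the "equal partition" described in the thread.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    buckets = [[] for _ in range(pool_size)]
    for index, conn in enumerate(connections):
        buckets[index % pool_size].append(conn)
    return buckets


conns = [f"client-{i}" for i in range(8)]  # hypothetical connection labels
print(partition_connections(conns, 3))           # three threads share eight connections
print(partition_connections(conns, len(conns)))  # own-thread: one connection per thread
```

A real implementation would also have to decide how to rebalance when connections come and go, which this sketch ignores.
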
>> On the Gluster server side, because of the io-threads translator, an
>> RPC listener thread is effectively just starting a worker thread and
>> then going back to read another RPC.  With own-thread, although RPC
>> requests are received in order, there is no guarantee that the
>> requests will be processed in the order that they were received from
>> the network.   On the client side, we have operations such as readdir
>> that will fan out parallel FOPS.  If you use own-thread approach, then
>> these parallel FOP replies can all be processed in parallel by the
>> listener threads, so you get at least the same level of race condition
>> that you would get with multi-thread-epoll.
> You get some race conditions, but not to the same level.  As you've
> already pointed out yourself, multi-threaded epoll can generate greater
> parallelism even among requests arriving on a single connection to a
> single volume.  That is guaranteed to cause data-structure collisions
> that would be impossible otherwise.  Also, let's not forget that either
> change is also applicable on the client side, in glusterd, in self-heal
> and rebalance, etc.  Many of these have their own unique concerns with
> respect to concurrency and reentrancy, and don't already have
> io-threads.  For example, I've had to fix several bugs in this area that
> were unique to glusterd.  At least we've begun to shake out some of
> these issues with own-thread, though I'm sure there are still plenty of
> bugs still to be found.  With multi-threaded epoll we're going to have
> even more issues in this area, and we've barely begun to discover them.
> That's not a fatal problem, but it's definitely a CON.
>
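
On the ordering point above (requests read in order from one connection can still complete out of order once handed to worker threads), here is a small deterministic simulation. It does not use real threads or any Gluster API; the service times and the "earliest free worker takes the next request" rule are assumptions chosen to make the effect visible.

```python
import heapq


def completion_order(service_times, workers):
    """Return request indices in the order they finish.

    Requests are dispatched in arrival order to whichever worker frees up
    first; service_times[i] is the hypothetical processing time of request i.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    free_at = [0.0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    finished = []
    for index, cost in enumerate(service_times):
        start = heapq.heappop(free_at)  # earliest available worker takes it
        end = start + cost
        heapq.heappush(free_at, end)
        finished.append((end, index))
    return [index for _, index in sorted(finished)]


print(completion_order([5, 1, 1], workers=2))  # slow first request finishes last
print(completion_order([5, 1, 1], workers=1))  # a single worker preserves order
```
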
>>>   * CON: multi-epoll does not work with SSL.  It *can't* work with
>>>   OpenSSL at all, short of adopting a hybrid model where SSL
>>>   connections use own-thread while others use multi-epoll, which is a
>>>   bit of a testing nightmare.
>> Why is it a testing nightmare?
> It means having to test *both* sets of code paths, plus the code to hand
> off between them or use them concurrently, in every environment - not
> just those where we hand off to io-threads.
>
>> IMHO it's worth it to carefully trade off architectural purity
> Where does this "architectural purity" idea come from?  This isn't about
> architectural purity.  It's about code that's known to work vs. code
> that might perform better *in theory* but also presents some new issues
> we'd need to address.  I don't like thread-per-connection.  I've
> recommended against it many times.  Whoever made the OpenSSL API so
> unfriendly to other concurrency approaches was a fool.  Nonetheless,
> that's the way the real world is, and *in this particular context* I
> think own-thread has a better risk:reward ratio.
>
>> In summary, to back own-thread alternative I would need to see that a)
>> the own-thread approach is scalable, and that b) performance data
>> shows that own-thread is comparable to multi-thread-epoll in
>> performance.
> Gee, I wonder who we could get to run those tests.  Maybe that would be
> better than mere conjecture (including mine).
>
>> Otherwise, in the absence of any other candidates, we have to go with
>> multi-thread-epoll.
> *Only* on the basis of performance, ignoring the other issues we've
> discussed?  I disagree.  If anything, there seem to be moves afoot to
> de-emphasize the traditional NAS-replacement role in favor of more
> archival/dispersed workloads.  I don't necessarily agree with that, but
> it would make the "performance at any cost" argument even less relevant.

I believe this is an expected Red Hat view since the acquisition of
Inktank. I'm not accusing anyone of taking a "company line"; I just
expect that there is going to be a shift of focus that will become
apparent with regard to bug reports, paid customer requirements, etc.
Upstream, it may not even be recognized. This is, of course, all my
personal opinion as an outside observer and I could just be talking out
of a posterior orifice.

>
> P.S. I changed the subject line because I think it's inappropriate to
> make this about person vs. person, taking my side or the opposition's.
> There has been entirely too much divisive behavior on the list already.
> Let's try to focus on the arguments themselves, not who's making them.

+11111111111111111 There are a lot of brilliant (and I don't just mean
the colloquial European definition) people here who have all done
amazing things, and are often all correct under their expected
parameters.