[Gluster-devel] Objections to multi-thread-epoll and proposal to use own-thread alternative

Tue Oct 14 15:23:01 UTC 2014

> We should try comparing performance of multi-thread-epoll to
> own-thread, shouldn't be hard to hack own-thread into non-SSL-socket
> case.

Own-thread has always been available on non-SSL sockets, from the day it
was first implemented as part of HekaFS.

> HOWEVER, if "own-thread" implies a thread per network connection, as
> you scale out a Gluster volume with N bricks, you have O(N) clients,
> and therefore you have O(N) threads on each glusterfsd (libgfapi
> adoption would make it far worse)!  Suppose we are implementing a
> 64-brick configuration with 200 clients, not an unreasonably sized
> Gluster volume for a scalable filesystem.   We then have 200 threads
> per Glusterfsd just listening for RPC messages on each brick.  On a
> 60-drive server there can be a lot more than 1 brick per server, so
> multiply threads/glusterfsd by brick count!  It doesn't make sense to
> have total threads >= CPUs, and modern processors make context
> switching between threads more and more expensive.

It doesn't make sense to have total *busy* threads >= cores (not CPUs)
because of context switches, but idle threads are very low-cost.  Note
that multi-threaded epoll is not free from context-switch issues
either.  The real problem with either approach is "whale" servers with
large numbers of bricks apiece, vs. "piranha" servers with relatively
few.  That's an unbalanced system, with too little CPU and memory (and
probably disk/network bandwidth) relative to capacity.
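
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python.  The 200 clients come from the example quoted above; the
bricks-per-server counts and the share of threads that are busy at any
instant are made-up values for illustration, not measurements.

```python
def own_thread_estimate(clients, bricks_per_server, busy_percent):
    """Estimate own-thread listener threads on one server.

    Assumes one listener thread per client connection per brick, as in
    the quoted example.  busy_percent is a hypothetical share of threads
    that are runnable at any instant.
    """
    threads_per_brick = clients
    threads_per_server = threads_per_brick * bricks_per_server
    busy_threads = threads_per_server * busy_percent // 100
    return threads_per_brick, threads_per_server, busy_threads


if __name__ == "__main__":
    # Hypothetical server shapes: few bricks ("piranha") vs. many ("whale").
    for label, bricks in (("piranha", 2), ("whale", 30)):
        per_brick, per_server, busy = own_thread_estimate(200, bricks, 1)
        print(f"{label}: {per_brick} threads/brick, "
              f"{per_server} threads/server, ~{busy} busy")
```

The total grows with brick count, but the figure to compare against
core count is the busy one, not the total.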

That said, I've already conceded that there are probably cases where
multi-threaded epoll will generate more parallelism than own-thread.
However, that only matters up to the point where we hit some other
bottleneck.  The question is whether the difference is apparent *to the
user* for any configuration and workload we can actually test.  Only
after we have that answer can we evaluate whether the benefit is greater
than the risk (of uncovering even more race conditions in other
components) and the drawback of being unable to support SSL.

> Shyam mentioned a refinement to own-thread where we equally partition
> the set of TCP connections among a pool of threads (own-thread is a
> special case of this).

Some form of this would dovetail very nicely with the idea of
multiplexing multiple bricks onto a single glusterfsd process, which we
need to do for other reasons.
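
A minimal sketch of that partitioning, in Python for brevity.  The
function name and the round-robin assignment are my own assumptions
for illustration, not anything that exists in the code today; a real
version would also have to handle connections coming and going.

```python
def partition_connections(connections, pool_size):
    """Assign connections round-robin to pool_size listener threads.

    Returns one list of connections per thread.  Hypothetical helper,
    not an existing GlusterFS function.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    buckets = [[] for _ in range(pool_size)]
    for index, conn in enumerate(connections):
        buckets[index % pool_size].append(conn)
    return buckets


if __name__ == "__main__":
    conns = [f"client-{n}" for n in range(8)]
    # A pool smaller than the connection count shares threads.
    print(partition_connections(conns, 3))
    # A pool as large as the connection count is plain own-thread.
    print(partition_connections(conns, len(conns)))
```

With pool_size equal to the number of connections every thread owns
exactly one connection, which is own-thread; with pool_size of 1 a
single thread polls everything.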

> On the Gluster server side, because of the io-threads translator, an
> RPC listener thread is effectively just starting a worker thread and
> then going back to read another RPC.  With own-thread, although RPC
> requests are received in order, there is no guarantee that the
> requests will be processed in the order that they were received from
> the network.   On the client side, we have operations such as readdir
> that will fan out parallel FOPS.  If you use own-thread approach, then
> these parallel FOP replies can all be processed in parallel by the
> listener threads, so you get at least the same level of race condition
> that you would get with multi-thread-epoll.

You get some race conditions, but not to the same level.  As you've
already pointed out yourself, multi-threaded epoll can generate greater
parallelism even among requests arriving on a single connection to a
single volume.  That is guaranteed to cause data-structure collisions
that would be impossible otherwise.  Let's not forget that either
change also applies on the client side, in glusterd, in self-heal
and rebalance, etc.  Many of these have their own unique concerns with
respect to concurrency and reentrancy, and don't already have
io-threads.  For example, I've had to fix several bugs in this area that
were unique to glusterd.  At least we've begun to shake out some of
these issues with own-thread, though I'm sure there are plenty of
bugs still to be found.  With multi-threaded epoll we're going to have
even more issues in this area, and we've barely begun to discover them.
That's not a fatal problem, but it's definitely a CON.

> >  * CON: multi-epoll does not work with SSL.  It *can't* work with
> >  OpenSSL at all, short of adopting a hybrid model where SSL
> >  connections use own-thread while others use multi-epoll, which is a
> >  bit of a testing nightmare.
>
> Why is it a testing nightmare?

It means having to test *both* sets of code paths, plus the code to hand
off between them or use them concurrently, in every environment - not
just those where we hand off to io-threads.

> IMHO it's worth it to carefully trade off architectural purity

Where does this "architectural purity" idea come from?  This isn't about
architectural purity.  It's about code that's known to work vs. code
that might perform better *in theory* but also presents some new issues
we'd need to address.  I don't like thread-per-connection.  I've
recommended against it many times.  Whoever made the OpenSSL API so
unfriendly to other concurrency approaches was a fool.  Nonetheless,
that's the way the real world is, and *in this particular context* I
think own-thread has a better risk:reward ratio.

> In summary, to back own-thread alternative I would need to see that a)
> the own-thread approach is scalable, and that b) performance data
> shows that own-thread is comparable to multi-thread-epoll in
> performance.

Gee, I wonder who we could get to run those tests.  Maybe that would be
better than mere conjecture (including mine).

> Otherwise, in the absence of any other candidates, we have to go with
> multi-thread-epoll.

*Only* on the basis of performance, ignoring the other issues we've
discussed?  I disagree.  If anything, there seem to be moves afoot to
de-emphasize the traditional NAS-replacement role in favor of more
archival/dispersed workloads.  I don't necessarily agree with that, but
it would make the "performance at any cost" argument even less relevant.

P.S. I changed the subject line because I think it's inappropriate to
make this about person vs. person, taking my side or the opposition's.
There has been entirely too much divisive behavior on the list already.
Let's try to focus on the arguments themselves, not who's making them.