[Gluster-devel] Should it be possible to disable own-thread for encrypted RPC?

Kaushal M kshlmster at gmail.com
Thu Apr 21 07:22:16 UTC 2016


On Fri, Apr 15, 2016 at 5:28 PM, Jeff Darcy <jdarcy at redhat.com> wrote:
>> I've been testing release-3.7 in the lead up to tagging 3.7.11, and
>> found that the fix I did to allow daemons to start when management
>> encryption is enabled, doesn't work always. The daemons fail to start
>> because they can't connect to glusterd to fetch the volfiles, and the
>> connection failure is partly due to own-thread not being enabled.
>>
>> I'd like to know why own-thread is kept optional, and is not the
>> default for any encrypted connection?
>> Encrypted RPC in GlusterFS can only work with a poller on its
>> own-thread, and cannot work with epoll. When this is the case, why is
>> it even possible to disable own-thread?
>>
>> In GlusterFS currently, own-thread gets enabled for most encrypted
>> connections by default. But in certain cases, it doesn't get enabled
>> when it should be and leads to connection failures. This sort of
>> failure is most visible when a glusterfs/glusterfsd process attempts
>> to fetch volfiles from glusterd.
>>
>> I'm going to be sending a change that removes the option of disabling
>> own-thread, and makes all encrypted connections use it. Do you see any
>> reasons not to do this?
>
> The reasons are basically historical.  Own-thread was implemented along
> with SSL as a way to make up for the performance impact of doing SSL in
> our single polling thread.  At the time any combination worked, but the
> defaults were aligned together because they were both new and kind of
> experimental.  I figured people would be willing to risk losing a bit of
> stability to avoid a significant performance loss when they were already
> using another experimental feature to get better security, but they
> wouldn't want to make an already-stable system less so to get (what I
> thought would be) a modest performance gain otherwise.
>
> Time passed.  The performance benefit of own-thread without SSL turned
> out to be greater than I'd thought, I implemented SSL in the management
> as well as the I/O path, SSL became TLS, etc.  Somewhere along the line
> we should have made own-thread the default.  We would have seen a
> performance benefit, and epoll might not have happened, but my attention
> was elsewhere so own-thread didn't become the default and epoll did
> happen.  Just as I had warned people many times, loudly, it broke TLS.
> It also uncovered race conditions elsewhere, and introduced many other
> forms of instability - as I'm sure you know.  IMO it was one of the
> dumber ideas in the history of the project.
>
> So, what do we do *now*?  There are good reasons for us to consider
> making TLS the default, for both I/O and management.  Allowing
> unauthenticated connections is just bad in principle, especially in the
> cloud.  If the quickest route to making TLS stable is to disallow its
> use without own-thread, then I say let's do that . . . and if we're
> going to do that then we might as well get rid of epoll.  If TLS is the
> default, and requires own-thread, then epoll is only applicable in an
> insecure non-default setting.  We don't need to be spending our precious
> time on bugs - both those we already know about and those we have yet to
> find - that only exist in such a context.  "Thread per connection" isn't
> my favorite approach to network concurrency any more than it's anyone
> else's, but for the connection counts we're dealing with it's
> sufficient.  I'd rather maximize stability and development velocity than
> academic elegance.

Thanks for all the background, Jeff; it was informative.

I've recently become aware of another problem with own-thread. The
threads it launches are never reaped (pthread_join is not called on
them) after a TLS connection disconnects.
This is especially problematic for GlusterD, as it launches a lot of
threads to handle generally short-lived connections (volfile fetch,
portmapper).
This causes GlusterD's memory usage to grow continually, and finally
leads to other failures due to memory shortage.
I've recently seen a setup where GlusterD was using 10s of GBs of
reserved memory and TBs of virtual memory. This is easily reproducible
as well. I'm still working out a solution for this.
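
To illustrate the leak: under POSIX threads, a joinable thread that has
exited keeps its stack and bookkeeping allocated until another thread
calls pthread_join on it, while a detached thread releases them on exit.
The following is only a minimal sketch of the two usual remedies, not
the GlusterFS socket code; conn_worker, spawn_detached_worker and
spawn_and_join_worker are hypothetical names made up for this example.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-connection worker; it stands in for an own-thread
 * poller that serves one short-lived connection and then returns. */
static void *conn_worker(void *arg)
{
    (void)arg;
    return NULL;
}

/* Remedy 1: create the thread detached, so its resources are released
 * automatically when it exits and nobody has to join it. */
static int spawn_detached_worker(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    int ret;

    ret = pthread_attr_init(&attr);
    if (ret != 0)
        return ret;

    ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (ret == 0)
        ret = pthread_create(&tid, &attr, conn_worker, NULL);

    pthread_attr_destroy(&attr);
    return ret;
}

/* Remedy 2: keep the thread joinable and reap it explicitly, e.g. from
 * the disconnect path. Here the join happens right away for brevity. */
static int spawn_and_join_worker(void)
{
    pthread_t tid;
    int ret;

    ret = pthread_create(&tid, NULL, conn_worker, NULL);
    if (ret != 0)
        return ret;

    return pthread_join(tid, NULL);
}
```

Both functions return 0 on success and a pthread error number otherwise.
Which remedy fits depends on whether anything needs the thread's exit
status or has to wait for it to finish during cleanup.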

While allowing TLS connections only with own-thread will lead to a
more stable experience, it is really bad in terms of our memory
consumption.
This will badly affect our chances of supporting 1000s of clients. Making
TLS work with epoll would fix this, but I'm not very sure of the
effort involved.
Could we fix this for 3.8? For 4.0, if we want to default to TLS, we
definitely need to fix this.
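
As a rough way to reason about the per-connection cost: every thread
reserves a stack, so the address space reserved for stacks grows
linearly with the number of live (or exited but unjoined) threads. The
sketch below queries the default stack size and multiplies it out. The
helper names are hypothetical, and the 8 MiB figure in the comment is
only an assumption about a common Linux default, not a measurement from
any GlusterD setup. Note that this is reserved virtual address space,
not resident memory.

```c
#include <pthread.h>
#include <stddef.h>

/* Default stack size a newly created thread would get, in bytes.
 * Returns 0 if the attribute object cannot be queried. */
static size_t default_thread_stack_size(void)
{
    pthread_attr_t attr;
    size_t size = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    if (pthread_attr_getstacksize(&attr, &size) != 0)
        size = 0;
    pthread_attr_destroy(&attr);
    return size;
}

/* Address space reserved for stacks by `threads` threads, in bytes.
 * Example (assumed 8 MiB stacks): 1000 threads -> 8388608000 bytes. */
static unsigned long long stack_reservation_bytes(unsigned long long threads,
                                                  size_t stack_size)
{
    return threads * (unsigned long long)stack_size;
}
```

A smaller stack can be requested per thread with
pthread_attr_setstacksize, which reduces the reservation but does not
remove the one-thread-per-connection scaling.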

