[Bugs] [Bug 1240564] New: Gluster commands timeout on SSL enabled system, after adding new node to trusted storage pool

Tue Jul 7 09:01:15 UTC 2015

https://bugzilla.redhat.com/show_bug.cgi?id=1240564

            Bug ID: 1240564
           Summary: Gluster commands timeout on SSL enabled system, after
                    adding new node to trusted storage pool
           Product: GlusterFS
           Version: mainline
         Component: glusterd
          Severity: high
          Assignee: kaushal at redhat.com
          Reporter: kaushal at redhat.com
                CC: amukherj at redhat.com, annair at redhat.com,
                    asrivast at redhat.com, bugs at gluster.org,
                    dshetty at redhat.com, gluster-bugs at redhat.com,
                    kaushal at redhat.com, kramdoss at redhat.com,
                    nsathyan at redhat.com, rcyriac at redhat.com,
                    rraja at redhat.com, sasundar at redhat.com,
                    seamurph at redhat.com, vagarwal at redhat.com,
                    vbhat at redhat.com
            Blocks: 1239108

+++ This bug was initially created as a clone of Bug #1239108 +++

Description of problem:

After adding a server to an existing trusted pool (SSL enabled on both the IO
and management paths), gluster commands time out on all nodes except the one
from which the peer probe was run.

Version-Release number of selected component (if applicable):

glusterfs-3.7.1-7.el7rhgs.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Set up two nodes with SSL on both the IO and management paths.
2. Create a 1x2 volume and enable SSL on it.
3. On the new node, create the necessary certificates, touch
/var/lib/glusterd/secure-access, and start the glusterd service.
4. Do a peer probe from one of the existing two nodes to add the new node.
5. Run any gluster command on any node other than the one from which the peer
probe was run.

Actual results:

Gluster commands time out

Expected results:

After the new node is added to the trusted storage pool, gluster commands should
succeed on any node

Additional info:

A core was generated to troubleshoot the issue. The backtrace follows.

#0  0x00007f30ca58f25d in read () from /lib64/libpthread.so.0
#1  0x00007f30ca274f4b in sock_read () from /lib64/libcrypto.so.10
#2  0x00007f30ca272f2b in BIO_read () from /lib64/libcrypto.so.10
#3  0x00007f30bdd69204 in ssl3_read_n () from /lib64/libssl.so.10
#4  0x00007f30bdd6a3b5 in ssl3_read_bytes () from /lib64/libssl.so.10
#5  0x00007f30bdd6bd4b in ssl3_get_message () from /lib64/libssl.so.10
#6  0x00007f30bdd60c02 in ssl3_get_server_hello () from /lib64/libssl.so.10
#7  0x00007f30bdd650c1 in ssl3_connect () from /lib64/libssl.so.10
#8  0x00007f30bdfafd7a in ssl_do (buf=buf@entry=0x0, len=len@entry=0, func=0x7f30bdd83650 <SSL_connect>, this=0x7f30ac17b810, this=0x7f30ac17b810) at socket.c:294
#9  0x00007f30bdfb15ef in ssl_setup_connection (this=this@entry=0x7f30ac17b810, server=server@entry=0) at socket.c:380
#10 0x00007f30bdfb1e35 in socket_connect (this=0x7f30ac17b810, port=<optimized out>) at socket.c:3079
#11 0x00007f30cb4ec009 in rpc_clnt_reconnect (conn_ptr=0x7f30ac1a3470) at rpc-clnt.c:419
#12 0x00007f30cb4ecc51 in rpc_clnt_start (rpc=rpc@entry=0x7f30ac1a3440) at rpc-clnt.c:1116
#13 0x00007f30c02686c8 in glusterd_rpc_create (rpc=rpc@entry=0x7f30ac154b70, options=<optimized out>, notify_fn=notify_fn@entry=0x7f30c0264c30 <glusterd_peer_rpc_notify>, notify_data=notify_data@entry=0x7f30ac1551e0) at glusterd-handler.c:3306
#14 0x00007f30c0268be8 in glusterd_friend_rpc_create (this=this@entry=0x7f30cdb36120, peerinfo=peerinfo@entry=0x7f30ac154ad0, args=args@entry=0x7f30b7ffea00) at glusterd-handler.c:3433
#15 0x00007f30c026960e in glusterd_friend_add_from_peerinfo (friend=friend@entry=0x7f30ac154ad0, restore=restore@entry=_gf_false, args=args@entry=0x7f30b7ffea00) at glusterd-handler.c:3544
#16 0x00007f30c0269dff in __glusterd_handle_friend_update (req=req@entry=0x7f30be1c106c) at glusterd-handler.c:2781
#17 0x00007f30c0264c70 in glusterd_big_locked_handler (req=0x7f30be1c106c, actor_fn=0x7f30c02696c0 <__glusterd_handle_friend_update>) at glusterd-handler.c:83
#18 0x00007f30cb4e8549 in rpcsvc_handle_rpc_call (svc=0x7f30cdb41140, trans=trans@entry=0x7f30ac136b60, msg=msg@entry=0x7f30ac1a2970) at rpcsvc.c:703
#19 0x00007f30cb4e87ab in rpcsvc_notify (trans=0x7f30ac136b60, mydata=<optimized out>, event=<optimized out>, data=0x7f30ac1a2970) at rpcsvc.c:797
#20 0x00007f30cb4ea873 in rpc_transport_notify (this=this@entry=0x7f30ac136b60, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f30ac1a2970) at rpc-transport.c:543
#21 0x00007f30bdfb3ba6 in socket_event_poll_in (this=this@entry=0x7f30ac136b60) at socket.c:2290
#22 0x00007f30bdfb6a94 in socket_event_handler (fd=fd@entry=8, idx=idx@entry=3, data=0x7f30ac136b60, poll_in=1, poll_out=0, poll_err=0) at socket.c:2403
#23 0x00007f30cb78158a in event_dispatch_epoll_handler (event=0x7f30b7ffee80, event_pool=0x7f30cdb1ec90) at event-epoll.c:575
#24 event_dispatch_epoll_worker (data=0x7f30cdba6fe0) at event-epoll.c:678
#25 0x00007f30ca588df5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007f30c9ecf1ad in clone () from /lib64/libc.so.6

--- Additional comment from Kaushal on 2015-07-06 12:16:52 IST ---

This is caused by a configuration decision for GlusterD SSL connections that
was made because of multi-threaded epoll.

Some history first.

SSL connections have a blocking connect, which could block the main epoll
thread during connection initialization. To avoid this, SSL own-thread support
was implemented: it creates a separate thread and handles the SSL connection
on that thread. Later, when multi-threaded epoll (mt-epoll) was implemented,
SSL own-thread was disabled for GlusterD because own-thread and mt-epoll didn't
work well together (I've been told that this should no longer be the case, but
I need to verify).

But mt-epoll was disabled for GlusterD as well. As a result, SSL connections in
GlusterD are established on the same thread that calls socket_connect. In the
case above the calling thread is the epoll thread, which gets blocked, causing
every other request to fail.

Assume that peers A and B form the current cluster, and that A probes C to add
it to the cluster.

Peer A first validates that peer C can be a member of the cluster. Once it has
done the validation, it sends a friend_update to B and C to notify them of the
other members in the cluster.

This notification is handled on the epoll thread on B and C. B starts a
connection to C, and C starts one to B, each on its own epoll thread. If both
B and C begin establishing their connections simultaneously, we end up in a
deadlock: SSL_connect blocks the epoll threads of both GlusterDs, and because
the epoll threads are blocked, neither GlusterD can serve new requests,
including the other peer's incoming handshake. Both SSL_connect calls therefore
stay blocked indefinitely.
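
To make the timing dependence concrete, here is a minimal Python sketch of the
scenario. It is a toy model, not GlusterD code: the Daemon class, the message
names and the HANDSHAKE_TIMEOUT value are all made up for illustration, and the
timeout exists only so the demo terminates (the real SSL_connect blocks
indefinitely). Each toy daemon has a single event thread, and the blocking
handshake runs on that thread.

```python
import queue
import threading

HANDSHAKE_TIMEOUT = 0.5  # seconds; hypothetical, only so the demo terminates


class Daemon:
    """Toy daemon with ONE event thread that handles every incoming message."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()
        self.peer = None
        self.result = None
        self.done = threading.Event()
        self.thread = threading.Thread(target=self._event_loop, daemon=True)

    def _blocking_connect(self):
        # Stand-in for a blocking SSL_connect: send a hello, wait for the reply.
        reply = queue.Queue()
        self.peer.inbox.put(("client_hello", reply))
        try:
            reply.get(timeout=HANDSHAKE_TIMEOUT)
            return "connected"
        except queue.Empty:
            return "timeout"

    def _event_loop(self):
        while True:
            kind, payload = self.inbox.get()
            if kind == "stop":
                return
            if kind == "client_hello":
                payload.put("server_hello")  # only the event thread can answer
            elif kind == "friend_update":
                self.result = self._blocking_connect()
                self.done.set()
                if self.result == "timeout":
                    return  # model the daemon as hung from here on


def run(simultaneous):
    b, c = Daemon("B"), Daemon("C")
    b.peer, c.peer = c, b
    b.inbox.put(("friend_update", None))
    if simultaneous:
        c.inbox.put(("friend_update", None))
    b.thread.start()
    c.thread.start()
    if not simultaneous:
        b.done.wait()  # C gets its friend_update only after B has connected
        c.inbox.put(("friend_update", None))
    b.done.wait()
    c.done.wait()
    for d in (b, c):
        d.inbox.put(("stop", None))
    return b.result, c.result


print("simultaneous:", run(simultaneous=True))
print("staggered:   ", run(simultaneous=False))
```

In the simultaneous case both toy daemons sit in the blocking connect and
nobody is left to answer the hello; in the staggered case each handshake finds
the other side's event thread free.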

If either peer B or peer C got their friend_update requests a little later,
then this deadlock wouldn't occur. But this cannot be guaranteed.

Peer A doesn't get blocked this way because the CLI RPC program has
multi-threading enabled: the CLI RPC service launches a new thread for each CLI
request. The Peer RPC service (which includes the friend_update RPC) isn't
multi-threaded, so incoming Peer RPC requests are handled on the epoll thread.
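
The difference between the two dispatch models can be sketched in a few lines
of Python. This is only an illustration under made-up names: serve(), the
request names and the sleep durations are hypothetical, and time.sleep stands
in for a blocking call such as SSL_connect.

```python
import threading
import time


def serve(requests, threaded):
    """Dispatch requests in order; return the order in which they complete."""
    completed = []
    lock = threading.Lock()
    workers = []

    def handle(name, seconds):
        time.sleep(seconds)  # stand-in for a blocking call in the handler
        with lock:
            completed.append(name)

    for name, seconds in requests:
        if threaded:
            # CLI-style dispatch: each request gets its own thread.
            t = threading.Thread(target=handle, args=(name, seconds))
            t.start()
            workers.append(t)
        else:
            # Peer-style dispatch: the handler runs on the event thread itself,
            # so later requests wait until it returns.
            handle(name, seconds)

    for t in workers:
        t.join()
    return completed


requests = [("friend_update", 0.3), ("volume_status", 0.0)]
print("in-thread: ", serve(requests, threaded=False))
print("per-thread:", serve(requests, threaded=True))
```

With in-thread dispatch the quick request cannot finish until the blocking one
returns; with a thread per request it finishes first.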

We are currently evaluating two solutions for this:
1. Enable own-thread for GlusterD ssl connections
2. Enable multi-threading for Peer RPC service

Both require proper testing to ensure we don't break anything else.
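
As a rough illustration of why option 1 should help, the Python sketch below
models two peers that receive friend_update at the same time, once with the
blocking connect on the event thread and once with the connect moved to its
own thread. Everything in it (the Peer class, the own_thread flag, the message
names, TIMEOUT) is hypothetical and not GlusterD's actual code or
configuration; the timeout only keeps the blocked case from hanging the demo.

```python
import queue
import threading

TIMEOUT = 0.5  # seconds; hypothetical, only so the blocked case terminates


class Peer:
    def __init__(self, own_thread):
        self.inbox = queue.Queue()
        self.own_thread = own_thread
        self.other = None
        self.result = None
        self.done = threading.Event()

    def connect(self):
        # Stand-in for a blocking SSL_connect towards the other peer.
        reply = queue.Queue()
        self.other.inbox.put(("client_hello", reply))
        try:
            reply.get(timeout=TIMEOUT)
            self.result = "connected"
        except queue.Empty:
            self.result = "timeout"
        self.done.set()

    def event_loop(self):
        while True:
            kind, payload = self.inbox.get()
            if kind == "stop":
                return
            if kind == "client_hello":
                payload.put("server_hello")
            elif kind == "friend_update":
                if self.own_thread:
                    # The connect blocks a helper thread; the loop stays free.
                    threading.Thread(target=self.connect, daemon=True).start()
                else:
                    self.connect()  # blocks the event loop itself
                    if self.result == "timeout":
                        return  # model the peer as hung


def probe(own_thread):
    b, c = Peer(own_thread), Peer(own_thread)
    b.other, c.other = c, b
    for p in (b, c):
        p.inbox.put(("friend_update", None))  # both notified simultaneously
    for p in (b, c):
        threading.Thread(target=p.event_loop, daemon=True).start()
    for p in (b, c):
        p.done.wait()
    for p in (b, c):
        p.inbox.put(("stop", None))
    return b.result, c.result


print("connect on event thread:", probe(own_thread=False))
print("connect on own thread:  ", probe(own_thread=True))
```

Option 2 would have a similar effect in this model, since running the
friend_update handler on a separate thread also leaves the event loop free to
answer the incoming hello.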

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1239108
[Bug 1239108] Gluster commands timeout on SSL enabled system, after adding
new node to trusted storage pool