[Bugs] [Bug 1241784] New: Gluster commands timeout on SSL enabled system, after adding new node to trusted storage pool

bugzilla at redhat.com bugzilla at redhat.com
Fri Jul 10 06:01:35 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1241784

            Bug ID: 1241784
           Summary: Gluster commands timeout on SSL enabled system, after
                    adding new node to trusted storage pool
           Product: GlusterFS
           Version: 3.7.2
         Component: glusterd
          Severity: high
          Assignee: kaushal at redhat.com
          Reporter: kaushal at redhat.com
                CC: amukherj at redhat.com, annair at redhat.com,
                    asrivast at redhat.com, bugs at gluster.org,
                    dshetty at redhat.com, gluster-bugs at redhat.com,
                    kaushal at redhat.com, kramdoss at redhat.com,
                    nsathyan at redhat.com, rcyriac at redhat.com,
                    rraja at redhat.com, sasundar at redhat.com,
                    seamurph at redhat.com, vagarwal at redhat.com,
                    vbhat at redhat.com
        Depends On: 1240564
            Blocks: 1233025 (glusterfs-3.7.3)



+++ This bug was initially created as a clone of Bug #1240564 +++

+++ This bug was initially created as a clone of Bug #1239108 +++

Description of problem:

After adding a server to an existing trusted storage pool (SSL enabled on
both the I/O and management paths), gluster commands time out on all nodes
except the one from which the peer probe was done.



Version-Release number of selected component (if applicable):

glusterfs-3.7.1-7.el7rhgs.x86_64


How reproducible:
100%

Steps to Reproduce (a shell sketch of these steps follows the list):
1. Set up two nodes with SSL enabled on both the I/O and management paths.
2. Create a 1x2 volume and enable SSL on it.
3. On a new node, create the necessary certificates, touch
/var/lib/glusterd/secure-access, and start the glusterd service.
4. Do a peer probe from one of the existing two nodes to add the new node.
5. Try any gluster command on any node besides the one from which the peer
probe was run.
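
A minimal shell sketch of the above, assuming the conventional GlusterFS SSL
file locations (/etc/ssl/glusterfs.key, /etc/ssl/glusterfs.pem,
/etc/ssl/glusterfs.ca); the hostnames node1/node2/node3 and the brick paths
are hypothetical:

    # On every node: create a key and a self-signed certificate, build the
    # shared CA file, and enable management encryption.
    openssl genrsa -out /etc/ssl/glusterfs.key 2048
    openssl req -new -x509 -key /etc/ssl/glusterfs.key \
        -subj "/CN=$(hostname)" -out /etc/ssl/glusterfs.pem
    # concatenate every node's glusterfs.pem into /etc/ssl/glusterfs.ca
    touch /var/lib/glusterd/secure-access
    systemctl start glusterd

    # On node1: form the two-node pool and an SSL-enabled 1x2 volume.
    gluster peer probe node2
    gluster volume create repvol replica 2 node1:/bricks/b1 node2:/bricks/b1
    gluster volume set repvol client.ssl on
    gluster volume set repvol server.ssl on
    gluster volume start repvol

    # On node1: probe the freshly prepared third node (step 4).
    gluster peer probe node3

    # On node2 or node3: any command now hangs (step 5).
    gluster volume status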

Actual results:

Gluster commands time out on all nodes except the one from which the peer
probe was run.

Expected results:

After the new node is added to the trusted storage pool, gluster commands
succeed on any node.
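
For illustration, the failure on the affected peers looks like this (a
sketch; "Error : Request timed out" is the Gluster CLI's generic timeout
message, and the node is hypothetical):

    # On node2 or node3:
    gluster volume status
    # => Error : Request timed out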

Additional info:

A core was captured to troubleshoot the issue; the backtrace (bt) is pasted
below.

#0  0x00007f30ca58f25d in read () from /lib64/libpthread.so.0
#1  0x00007f30ca274f4b in sock_read () from /lib64/libcrypto.so.10
#2  0x00007f30ca272f2b in BIO_read () from /lib64/libcrypto.so.10
#3  0x00007f30bdd69204 in ssl3_read_n () from /lib64/libssl.so.10
#4  0x00007f30bdd6a3b5 in ssl3_read_bytes () from /lib64/libssl.so.10
#5  0x00007f30bdd6bd4b in ssl3_get_message () from /lib64/libssl.so.10
#6  0x00007f30bdd60c02 in ssl3_get_server_hello () from /lib64/libssl.so.10
#7  0x00007f30bdd650c1 in ssl3_connect () from /lib64/libssl.so.10
#8  0x00007f30bdfafd7a in ssl_do (buf=buf@entry=0x0, len=len@entry=0,
func=0x7f30bdd83650 <SSL_connect>, this=0x7f30ac17b810, this=0x7f30ac17b810) at
socket.c:294
#9  0x00007f30bdfb15ef in ssl_setup_connection (this=this@entry=0x7f30ac17b810,
server=server@entry=0) at socket.c:380
#10 0x00007f30bdfb1e35 in socket_connect (this=0x7f30ac17b810, port=<optimized
out>) at socket.c:3079
#11 0x00007f30cb4ec009 in rpc_clnt_reconnect (conn_ptr=0x7f30ac1a3470) at
rpc-clnt.c:419
#12 0x00007f30cb4ecc51 in rpc_clnt_start (rpc=rpc@entry=0x7f30ac1a3440) at
rpc-clnt.c:1116
#13 0x00007f30c02686c8 in glusterd_rpc_create (rpc=rpc@entry=0x7f30ac154b70,
options=<optimized out>, notify_fn=notify_fn@entry=0x7f30c0264c30
<glusterd_peer_rpc_notify>,
    notify_data=notify_data@entry=0x7f30ac1551e0) at glusterd-handler.c:3306
#14 0x00007f30c0268be8 in glusterd_friend_rpc_create
(this=this@entry=0x7f30cdb36120, peerinfo=peerinfo@entry=0x7f30ac154ad0,
args=args@entry=0x7f30b7ffea00) at glusterd-handler.c:3433
#15 0x00007f30c026960e in glusterd_friend_add_from_peerinfo
(friend=friend@entry=0x7f30ac154ad0, restore=restore@entry=_gf_false,
args=args@entry=0x7f30b7ffea00) at glusterd-handler.c:3544
#16 0x00007f30c0269dff in __glusterd_handle_friend_update
(req=req@entry=0x7f30be1c106c) at glusterd-handler.c:2781
#17 0x00007f30c0264c70 in glusterd_big_locked_handler (req=0x7f30be1c106c,
actor_fn=0x7f30c02696c0 <__glusterd_handle_friend_update>) at
glusterd-handler.c:83
#18 0x00007f30cb4e8549 in rpcsvc_handle_rpc_call (svc=0x7f30cdb41140,
trans=trans@entry=0x7f30ac136b60, msg=msg@entry=0x7f30ac1a2970) at rpcsvc.c:703
#19 0x00007f30cb4e87ab in rpcsvc_notify (trans=0x7f30ac136b60,
mydata=<optimized out>, event=<optimized out>, data=0x7f30ac1a2970) at
rpcsvc.c:797
#20 0x00007f30cb4ea873 in rpc_transport_notify (this=this@entry=0x7f30ac136b60,
event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f30ac1a2970)
at rpc-transport.c:543
#21 0x00007f30bdfb3ba6 in socket_event_poll_in (this=this@entry=0x7f30ac136b60)
at socket.c:2290
#22 0x00007f30bdfb6a94 in socket_event_handler (fd=fd@entry=8, idx=idx@entry=3,
data=0x7f30ac136b60, poll_in=1, poll_out=0, poll_err=0) at socket.c:2403
#23 0x00007f30cb78158a in event_dispatch_epoll_handler (event=0x7f30b7ffee80,
event_pool=0x7f30cdb1ec90) at event-epoll.c:575
#24 event_dispatch_epoll_worker (data=0x7f30cdba6fe0) at event-epoll.c:678
#25 0x00007f30ca588df5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007f30c9ecf1ad in clone () from /lib64/libc.so.6
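
A backtrace like the one above can also be pulled from a live, hung glusterd
(a sketch; assumes gdb and the matching debuginfo packages are installed):

    gdb --batch -p "$(pidof glusterd)" -ex 'thread apply all bt'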

--- Additional comment from Kaushal on 2015-07-06 12:16:52 IST ---

This happens due to a configuration decision made for GlusterD's SSL
connections because of multi-threaded epoll.

Some history first.

SSL connections have a blocking connect, which could block the main epoll
thread during connection initialization. To avoid this blocking, SSL
own-thread support was implemented: a separate thread is created and the SSL
connection is handled on that thread. Later, when multi-threaded epoll
(mt-epoll) was implemented, SSL own-thread was disabled for GlusterD because
own-thread and mt-epoll didn't work well together (I've been notified that
this should no longer be the case, but I need to verify).

However, mt-epoll is also disabled for GlusterD. Because of this, SSL
connections in GlusterD are established from the same thread that calls
socket_connect. In the case above the calling thread is the epoll thread,
which gets blocked, leading to every other request failing.
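
One way to see this from the outside (a sketch; assumes procps-ng's ps) is
to list glusterd's threads and their kernel wait channels while a command
hangs; the blocked epoll worker shows up stuck in a read-style wait:

    ps -T -p "$(pidof glusterd)" -o tid,comm,wchan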

Assume that peers A and B form the current cluster, and that A probed C to
add it to the cluster.

Peer A first validates that peer C can be a member of the cluster. Once it has
done the validation, it sends a friend_update to B and C to notify them of the
other members in the cluster.

This notification is handled in the epoll thread on B and C: B starts a
connection to C, and C starts one to B, each on its epoll thread. If B and C
begin establishing their connections simultaneously, we end up in a
deadlock. SSL_connect blocks the epoll thread of each GlusterD, and because
their epoll threads are blocked, neither can serve new requests, so each
side's SSL connect blocks indefinitely waiting on the other.

If either peer B or peer C received its friend_update request a little
later, this deadlock wouldn't occur, but that timing cannot be guaranteed.

Peer A doesn't get blocked this way because the CLI RPC program is
multi-threaded: the CLI RPC service launches a new thread for each CLI
request. The Peer RPC service (which includes the friend_update RPC) isn't
multi-threaded, so incoming Peer RPC requests are handled in the epoll
thread.

We are currently evaluating two solutions for this:
1. Enable own-thread for GlusterD ssl connections
2. Enable multi-threading for Peer RPC service

Both require proper testing to ensure we don't break anything else.
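
Whichever solution is taken, re-running the reproduction steps above and
then checking every peer is a quick sanity test (hostnames hypothetical):

    # On each of node1, node2 and node3:
    gluster peer status
    gluster volume status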

--- Additional comment from Anand Avati on 2015-07-07 14:48:21 IST ---

REVIEW: http://review.gluster.org/11559 (glusterd: Fix management encryption
issues with GlusterD) posted (#1) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Anand Avati on 2015-07-08 00:59:25 IST ---

REVIEW: http://review.gluster.org/11559 (glusterd: Fix management encryption
issues with GlusterD) posted (#2) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Anand Avati on 2015-07-09 11:59:36 IST ---

REVIEW: http://review.gluster.org/11559 (glusterd: Fix management encryption
issues with GlusterD) posted (#3) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Anand Avati on 2015-07-10 07:14:25 IST ---

COMMIT: http://review.gluster.org/11559 committed in master by Krishnan
Parthasarathi (kparthas at redhat.com) 
------
commit 01b82c66155a8d92893a386d7a314c95e0f0702b
Author: Kaushal M <kaushal at redhat.com>
Date:   Tue Jul 7 12:52:30 2015 +0530

    glusterd: Fix management encryption issues with GlusterD

    Management encryption was enabled incorrectly in GlusterD, leading to
    cluster deadlocks. This commit fixes that. The fix is in two parts:

    1. Correctly enable encryption for the TCP listener in GlusterD and
    re-enable own-threads for encrypted connections.
      Without this, GlusterD could try to establish blocking SSL
      connects in the epoll thread, e.g. when handling friend updates,
      which could lead to cluster deadlocks.

    2. Explicitly enable encryption for outgoing peer connections.
      Without this, SSL socket events for outgoing connections were
      handled in the epoll thread. Some events, like disconnects during
      peer detach, could lead to connection attempts happening in the
      epoll thread, again leading to deadlocks.

    Change-Id: I438c2b43f7b1965c0e04d95c000144118d36272c
    BUG: 1240564
    Signed-off-by: Kaushal M <kaushal at redhat.com>
    Reviewed-on: http://review.gluster.org/11559
    Tested-by: NetBSD Build System <jenkins at build.gluster.org>
    Reviewed-by: Krishnan Parthasarathi <kparthas at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1233025
[Bug 1233025] GlusterFS 3.7.3 tracker
https://bugzilla.redhat.com/show_bug.cgi?id=1240564
[Bug 1240564] Gluster commands timeout on SSL enabled system, after adding
new node to trusted storage pool