[Bugs] [Bug 1272940] New: Shd can't reconnect after ping-timeout (error in polling loop; invalid argument: this->private)

Mon Oct 19 09:23:50 UTC 2015

https://bugzilla.redhat.com/show_bug.cgi?id=1272940

            Bug ID: 1272940
           Summary: Shd can't reconnect after ping-timeout (error in
                    polling loop; invalid argument: this->private)
           Product: GlusterFS
           Version: 3.7.5
         Component: glusterd
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: mlanz-redhat-bugzilla at rokkt.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com

Description of problem:
SHD (the self-heal daemon) can't reconnect after the other server has died.
Setup: Distributed-Replicate volume, 2x2 bricks, two servers.

[2015-10-16 17:06:02.069511] D [socket.c:280:ssl_do] 0-gv0-client-3: syscall
error (probably remote disconnect)
[2015-10-16 17:06:02.069555] W [socket.c:588:__socket_rwv] 0-gv0-client-3:
readv on xxxx1:49153 failed (No data available)
[2015-10-16 17:06:02.069559] D [socket.c:280:ssl_do] 0-gv0-client-0: syscall
error (probably remote disconnect)
[2015-10-16 17:06:02.069582] E [socket.c:2501:socket_poller] 0-gv0-client-3:
error in polling loop
[2015-10-16 17:06:02.069606] W [socket.c:588:__socket_rwv] 0-gv0-client-0:
readv on xxxx1:49152 failed (No data available)
[2015-10-16 17:06:02.069656] E [socket.c:2501:socket_poller] 0-gv0-client-0:
error in polling loop
[2015-10-16 17:06:02.069694] I [MSGID: 114018]
[client.c:2042:client_rpc_notify] 0-gv0-client-3: disconnected from
gv0-client-3. Client process will keep trying to connect to glusterd until
brick's port is available
[2015-10-16 17:06:02.069834] I [MSGID: 114018]
[client.c:2042:client_rpc_notify] 0-gv0-client-0: disconnected from
gv0-client-0. Client process will keep trying to connect to glusterd until
brick's port is available

After that, the following repeats every 3 seconds:
[2015-10-16 17:06:15.348616] T [rpc-clnt.c:418:rpc_clnt_reconnect]
0-gv0-client-3: attempting reconnect
[2015-10-16 17:06:15.348725] E [socket.c:2863:socket_connect]
(-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfb) [0x7f3bf4cf66bb]
-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9) [0x7f3bf4aa8c59]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.5/rpc-transport/socket.so(+0x755d) [0x7f3bf034d55d] )
0-socket: invalid argument: this->private [Invalid argument]
[2015-10-16 17:06:15.348792] T [rpc-clnt.c:418:rpc_clnt_reconnect]
0-gv0-client-0: attempting reconnect
[2015-10-16 17:06:15.348858] E [socket.c:2863:socket_connect]
(-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfb) [0x7f3bf4cf66bb]
-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9) [0x7f3bf4aa8c59]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.5/rpc-transport/socket.so(+0x755d) [0x7f3bf034d55d] )
0-socket: invalid argument: this->private [Invalid argument]
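
To check whether another installation shows the same pattern, the log can be
scanned for reconnect attempts that are immediately followed by the
this->private error. The following is a minimal Python sketch, assuming that
every entry starts with a bracketed timestamp and may be wrapped over several
lines as in the excerpt above. The function names are hypothetical and not part
of GlusterFS.

```python
import re
from collections import Counter

# A new log entry starts with a timestamp such as "[2015-10-16 17:06:15.348725]".
ENTRY_START = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\]")
# Client translator name as it appears in the log, e.g. "0-gv0-client-3".
CLIENT = re.compile(r"\b\d+-[\w-]+-client-\d+\b")
ERROR_TEXT = "invalid argument: this->private"


def join_entries(lines):
    """Merge wrapped continuation lines back into one string per log entry."""
    entry = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) and entry:
            yield " ".join(entry)
            entry = []
        entry.append(line)
    if entry:
        yield " ".join(entry)


def count_failed_reconnects(lines):
    """Count reconnect attempts directly followed by the this->private error.

    The error entry is logged as "0-socket" and does not name the client,
    so it is attributed to the client of the preceding reconnect attempt.
    """
    failures = Counter()
    pending = None  # client named by the most recent reconnect attempt
    for entry in join_entries(lines):
        if "attempting reconnect" in entry:
            match = CLIENT.search(entry)
            pending = match.group(0) if match else None
        elif ERROR_TEXT in entry and pending:
            failures[pending] += 1
            pending = None
        else:
            pending = None
    return failures
```

Passing the lines of glustershd.log to count_failed_reconnects (for example an
open file object; the log location depends on the installation) returns a count
per client. Counts that keep growing for two or more clients match the
behaviour described here.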

- The problem does not occur with SSL disabled (server.ssl: off; client.ssl:
off).
- The problem does not occur when the second brick is deleted, so that only one
reconnect happens.
- Version 3.6.6 is also affected.

Version-Release number of selected component (if applicable):
Ubuntu 14.04.3 LTS
glusterfs-server                    3.7.5-ubuntu1~trusty1
glusterfs-client                    3.7.5-ubuntu1~trusty1
glusterfs-common                    3.7.5-ubuntu1~trusty1
libssl1.0.0:amd64                   1.0.1f-1ubuntu2.15

How reproducible:
Setup: 
Two nodes, two bricks, replica 2
server.ssl: on; client.ssl: on; auth.ssl-allow *; ssl.cipher-list HIGH:!SSLv2

Steps to Reproduce:
1. pkill -f gluster on node1
2. Check glustershd.log on node2: it shows "error in polling loop"

After restart everything works fine:
3. pkill -f gluster on node2
4. restart gluster on both nodes
5. -> reconnection works and healing starts

Actual results:
No reconnect, no healing, and an error message in the log every few seconds.
No outgoing SYN packets are sent.

Expected results:
Reconnect, healing

Speculation (almost 100%):
Since this only happens with SSL, and only when 0-gv0-client-3 and
0-gv0-client-0 try to reconnect simultaneously: is this a race condition in the
SSL handling?
https://bugzilla.redhat.com/show_bug.cgi?id=906763

Additional info:
Already talked to JoeJulian on #gluster.
