[Bugs] [Bug 1373723] New: glusterd experiencing repeated connect/disconnect messages when shd is down

bugzilla at redhat.com bugzilla at redhat.com
Wed Sep 7 04:19:42 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1373723

            Bug ID: 1373723
           Summary: glusterd experiencing repeated connect/disconnect
                    messages when shd is down
           Product: GlusterFS
           Version: 3.8.3
         Component: protocol
          Keywords: Triaged
          Assignee: bugs at gluster.org
          Reporter: amukherj at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    ravishankar at redhat.com, rgowdapp at redhat.com
        Depends On: 1372356



+++ This bug was initially created as a clone of Bug #1372356 +++

Description of problem:
Found this while testing the afr eventing framework patch:

On a 1x2 replicate volume, single node setup, when I kill glustershd, I see
glusterd sending events toggling repeatedly between connected and disconnected.

2016-09-01 16:35:20  SVC_DISCONNECTED e3876d55-89ab-406b-9a49-e32899f70e5d
svc_name=glustershd
2016-09-01 16:35:21  SVC_CONNECTED e3876d55-89ab-406b-9a49-e32899f70e5d
svc_name=glustershd
2016-09-01 16:35:21  SVC_DISCONNECTED e3876d55-89ab-406b-9a49-e32899f70e5d
svc_name=glustershd
2016-09-01 16:35:24  SVC_CONNECTED e3876d55-89ab-406b-9a49-e32899f70e5d
svc_name=glustershd
2016-09-01 16:35:24  SVC_DISCONNECTED e3876d55-89ab-406b-9a49-e32899f70e5d
svc_name=glustershd
2016-09-01 16:35:27  SVC_CONNECTED e3876d55-89ab-406b-9a49-e32899f70e5d
svc_name=glustershd
2016-09-01 16:35:27  SVC_DISCONNECTED e3876d55-89ab-406b-9a49-e32899f70e5d
svc_name=glustershd


How reproducible:
Always.

Steps to Reproduce:
1. Start glusterd in debug mode
2. Create a 1x2 replica volume and start it.
3. Kill the self-heal daemon
4. From the glusterd logs:


[2016-09-01 13:37:59.915866] D [socket.c:3058:socket_connect] 0-glustershd:
connection attempt on /var/run/gluster/628f74c872e682fe7c5003e222cab86a.socket
failed, (Connection refused)
[2016-09-01 13:37:59.916053] D [MSGID: 0]
[glusterd-svc-mgmt.c:318:glusterd_svc_common_rpc_notify] 0-management:
glustershd has connected with glusterd.
[2016-09-01 13:37:59.916268] D [socket.c:2384:socket_event_handler]
0-transport: disconnecting now
[2016-09-01 13:37:59.917535] D
[rpc-clnt-ping.c:93:rpc_clnt_remove_ping_timer_locked] (-->
/usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x1e9)[0x7f9e23ab4191] (-->
/usr/local/lib/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0xc5)[0x7f9e2387fe02]
(-->
/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xcb)[0x7f9e23879ba9]
(--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x1aa)[0x7f9e2387a689] (-->
/usr/local/lib/libgfrpc.so.0(rpc_transport_notify+0x10f)[0x7f9e23876b70] )))))
0-: /var/run/gluster/628f74c872e682fe7c5003e222cab86a.socket: ping timer event
already removed
[2016-09-01 13:37:59.917649] I [MSGID: 106006]
[glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management:
glustershd has disconnected from glusterd.
[2016-09-01 13:37:59.917802] D [MSGID: 0]
[event-epoll.c:587:event_dispatch_epoll_handler] 0-epoll: generation bumped on
idx=2 from gen=13 to slot->gen=14, fd=6, slot->fd=6

[2016-09-01 13:38:02.916508] D [socket.c:3058:socket_connect] 0-glustershd:
connection attempt on /var/run/gluster/628f74c872e682fe7c5003e222cab86a.socket
failed, (Connection refused)
[2016-09-01 13:38:02.916567] D [MSGID: 0]
[glusterd-svc-mgmt.c:318:glusterd_svc_common_rpc_notify] 0-management:
glustershd has connected with glusterd.
[2016-09-01 13:38:02.916667] D [socket.c:2384:socket_event_handler]
0-transport: disconnecting now
[2016-09-01 13:38:02.916811] D
[rpc-clnt-ping.c:93:rpc_clnt_remove_ping_timer_locked] (-->
/usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x1e9)[0x7f9e23ab4191] (-->
/usr/local/lib/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0xc5)[0x7f9e2387fe02]
(-->
/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xcb)[0x7f9e23879ba9]
(--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x1aa)[0x7f9e2387a689] (-->
/usr/local/lib/libgfrpc.so.0(rpc_transport_notify+0x10f)[0x7f9e23876b70] )))))
0-: /var/run/gluster/628f74c872e682fe7c5003e222cab86a.socket: ping timer event
already removed
[2016-09-01 13:38:02.916830] I [MSGID: 106006]
[glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management:
glustershd has disconnected from glusterd.
[2016-09-01 13:38:02.916888] D [MSGID: 0]
[event-epoll.c:587:event_dispatch_epoll_handler] 0-epoll: generation bumped on
idx=2 from gen=16 to slot->gen=17, fd=6, slot->fd=6

[2016-09-01 13:38:05.917660] D [socket.c:3058:socket_connect] 0-glustershd:
connection attempt on /var/run/gluster/628f74c872e682fe7c5003e222cab86a.socket
failed, (Connection refused)
[2016-09-01 13:38:05.917720] D [MSGID: 0]
[glusterd-svc-mgmt.c:318:glusterd_svc_common_rpc_notify] 0-management:
glustershd has connected with glusterd.
[2016-09-01 13:38:05.917940] D [socket.c:2384:socket_event_handler]
0-transport: disconnecting now
[2016-09-01 13:38:05.918274] D
[rpc-clnt-ping.c:93:rpc_clnt_remove_ping_timer_locked] (-->
/usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x1e9)[0x7f9e23ab4191] (-->
/usr/local/lib/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0xc5)[0x7f9e2387fe02]
(-->
/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xcb)[0x7f9e23879ba9]
(--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x1aa)[0x7f9e2387a689] (-->
/usr/local/lib/libgfrpc.so.0(rpc_transport_notify+0x10f)[0x7f9e23876b70] )))))
0-: /var/run/gluster/628f74c872e682fe7c5003e222cab86a.socket: ping timer event
already removed
[2016-09-01 13:38:05.918299] I [MSGID: 106006]
[glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management:
glustershd has disconnected from glusterd.
[2016-09-01 13:38:05.918467] D [MSGID: 0]
[event-epoll.c:587:event_dispatch_epoll_handler] 0-epoll: generation bumped on
idx=2 from gen=19 to slot->gen=20, fd=6, slot->fd=6

--- Additional comment from Raghavendra G on 2016-09-06 01:35:17 EDT ---

Analysis from Atin:

<analysis>

I am seeing glusterd receive RPC_CLNT_CONNECT in
glusterd_svc_common_rpc_notify () even though the shd service is not up (note:
this applies to all daemon services, including brick processes). On further
debugging I found that socket_connect_finish () calls
__socket_connect_finish (), which returns 0, meaning it was able to connect to
the socket, and hence an RPC_CLNT_CONNECT event is sent to the transport. Is
this right? My understanding is that __socket_connect_finish () should not be
returning 0 here.

</analysis>

--- Additional comment from Raghavendra G on 2016-09-06 01:40:15 EDT ---

(In reply to Raghavendra G from comment #1)
> Analysis from Atin:
> 
> <analysis>
> 
> I am seeing glusterd is receiving RPC_CLNT_CONNECT in
> glusterd_svc_common_rpc_notify () even though shd service is not up (note -
> this is applicable for all daemon services including brick process). In
> further debugging what I found is socket_connect_finish () calls
> __socket_connect_finish () which returns 0 which means its able to connect
> to the socket and hence RPC_CLNT_CONNECT event is sent to the transport. Is
> this right? My understanding is __socket_connect_finish () shouldn't be
> returning 0 here.

Yes. __socket_connect_finish (or getsockopt (socket, SOL_SOCKET, SO_ERROR)
specifically) should have returned ECONNREFUSED, as shd is down. I am not sure
why this is not the case; I wonder whether the issue is in the TCP/IP (socket)
layer or in the way we are using getsockopt.

> 
> </analysis>

--- Additional comment from Raghavendra G on 2016-09-06 01:42:47 EDT ---

> 
> Yes. __socket_connect_finish (or getsockopt (socket, SOL_SOCKET, SO_ERROR) 
> specifically) should've returned ECONNREFUSED as shd is down. I am not sure
> why this is not the case. Wondering whether there is an issue in TCP/IP
> (socket) layer or the way we are using getsockopt.

Is it possible to confirm that no other process is listening on the same
address (IP:PORT) as shd?

--- Additional comment from Ravishankar N on 2016-09-06 04:29:09 EDT ---

Looked into the issue with Du; the problem is in the gluster RPC code (no
other process is listening on the same IP/port).

--- Additional comment from Raghavendra G on 2016-09-06 05:45:55 EDT ---

Following is the control flow we observed:

* connect(2) returned -1 with ECONNREFUSED.
* A shutdown is called on the socket, with the intention of processing the
failure in the poller thread (why does the error have to be processed in the
poller thread when connect has failed with an error other than
EAGAIN/EBUSY/EINPROGRESS?).
* The socket is added for polling.
* socket_event_handle is called with all three events - pollin, pollout and
pollerr - set.
* As this is a pollin event received while priv->connected is 0,
socket_connect_finish is called, which in turn does not find any error set on
the socket. getsockopt not returning an error might be valid, as connect had
_failed_ with an errno other than EINPROGRESS. The connection is deemed
successful and the higher layers are notified with an RPC_CONNECT event.
* On further execution of socket_event_handle, pollerr is processed and an
RPC_DISCONNECT event is sent to the higher layer.

Some things to ponder:
* Why did we receive a pollin/pollout event on the socket when connect had
clearly failed?
* Why does the error have to be processed in the poller thread when connect
has failed with an error other than EAGAIN/EBUSY/EINPROGRESS?


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1372356
[Bug 1372356] glusterd experiencing repeated connect/disconnect messages
when shd is down