[Bugs] [Bug 1351672] New: Possible bug in portmap on client reconnection

bugzilla at redhat.com
Thu Jun 30 14:23:01 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1351672

            Bug ID: 1351672
           Summary: Possible bug in portmap on client reconnection
           Product: GlusterFS
           Version: 3.7.11
         Component: protocol
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: oleksandr at natalenko.name
                CC: bugs at gluster.org



Occasionally, in a replica 2 setup, one of the nodes (1.2.3.22) logs the
following errors in /var/log/glusterfs/etc-glusterfs-glusterd.vol.log:

===
[2016-06-30 11:00:27.034268] W [rpcsvc.c:270:rpcsvc_program_actor]
0-rpc-service: RPC program not available (req 1298437 330) for 1.2.3.22:65488
[2016-06-30 11:00:27.034487] E [rpcsvc.c:565:rpcsvc_check_and_reply_error]
0-rpcsvc: rpc actor failed to complete successfully
[2016-06-30 11:00:30.029924] W [rpcsvc.c:270:rpcsvc_program_actor]
0-rpc-service: RPC program not available (req 1298437 330) for 1.2.3.77:65530  
[2016-06-30 11:00:30.029957] E [rpcsvc.c:565:rpcsvc_check_and_reply_error]
0-rpcsvc: rpc actor failed to complete successfully  
[2016-06-30 11:00:30.029995] W [rpcsvc.c:270:rpcsvc_program_actor]
0-rpc-service: RPC program not available (req 1298437 330) for 1.2.3.77:65530
[2016-06-30 11:00:30.030005] E [rpcsvc.c:565:rpcsvc_check_and_reply_error]
0-rpcsvc: rpc actor failed to complete successfully
[2016-06-30 11:00:32.380737] W [rpcsvc.c:270:rpcsvc_program_actor]
0-rpc-service: RPC program not available (req 1298437 330) for 1.2.3.60:65534 
===

1.2.3.22 and 1.2.3.77 are GlusterFS server nodes (replica 2), and 1.2.3.60 is
FUSE client.
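The "RPC program not available (req 1298437 330)" warning means glusterd received a request for an RPC program/version pair it has not registered: the FOP program that clients use to talk to bricks (program 1298437, version 330, per the log above) is served only by brick processes, not by glusterd. A minimal sketch of that dispatch step, with hypothetical names and a made-up management program number (the real logic is in `rpcsvc_program_actor()` in rpcsvc.c):

```python
# Hedged sketch of RPC dispatch by (program, version). All names and
# the management program number (424242) are invented for illustration;
# only 1298437/330 (the FOP program) and proc 13 (WRITE) come from the
# bug's own logs.

def make_server(programs):
    """Build a dispatcher from {(prog_num, prog_ver): handler}."""
    def dispatch(prog_num, prog_ver, proc):
        handler = programs.get((prog_num, prog_ver))
        if handler is None:
            # glusterd ends up on this path when a FOP request meant
            # for a brick arrives on its management connection.
            return "W: RPC program not available (req %d %d)" % (prog_num, prog_ver)
        return handler(proc)
    return dispatch

# A glusterd-like server registers only management programs,
# never the FOP program.
glusterd = make_server({(424242, 1): lambda proc: "mgmt ok"})

# A client mistakenly sending a WRITE FOP to glusterd is rejected,
# matching the warnings in the server log.
print(glusterd(1298437, 330, proc=13))
```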

After those warnings/errors, the following happens on the client:

===
[2016-06-30 11:00:32.351619] C
[rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired]
0-asterisk_agents_videos-client-2: server 1.2.3.22:49214 has not responded in
the last 10 seconds, disconnecting.
[2016-06-30 11:00:32.361727] E [rpc-clnt.c:370:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fd4a47f8b42] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fd4a45c376e] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fd4a45c387e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fd4a45c4fe4] (-->
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7fd4a45c5828] )))))
0-asterisk_agents_videos-client-2: forced unwinding frame type(GlusterFS 3.3)
op(WRITE(13)) called at 2016-06-30 11:00:17.764449 (xid=0xa9865d)
[2016-06-30 11:00:32.361755] W [MSGID: 114031]
[client-rpc-fops.c:907:client3_3_writev_cbk] 0-asterisk_agents_videos-client-2:
remote operation failed [Transport endpoint is not connected]
[2016-06-30 11:00:32.378754] I [socket.c:3304:socket_submit_request]
0-asterisk_agents_videos-client-2: not connected (priv->connected = 0)
[2016-06-30 11:00:32.380399] W [rpc-clnt.c:1644:rpc_clnt_submit]
0-asterisk_agents_videos-client-2: failed to submit rpc-request (XID: 0xa98681
Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport
(asterisk_agents_videos-client-2)

…

[2016-06-30 11:00:38.177288] W [MSGID: 114031]
[client-rpc-fops.c:2974:client3_3_lookup_cbk]
0-asterisk_agents_videos-client-2: remote operation failed. Path: /2016
(ab47c5bd-c117-4423-b624-b6f88e243304) [Transport endpoint is not connected]  
[2016-06-30 11:00:38.182921] W [MSGID: 114031]
[client-rpc-fops.c:2974:client3_3_lookup_cbk]
0-asterisk_agents_videos-client-2: remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]  
[2016-06-30 11:00:38.186938] E [MSGID: 114031]
[client-rpc-fops.c:1730:client3_3_entrylk_cbk]
0-asterisk_agents_videos-client-2: remote operation failed [Transport endpoint
is not connected]
…
===

As far as I understand, the client has lost its connection to one of the bricks
in the distributed-replicated volume. Kaushal explained to me in #gluster:

===
[15:51] <kshlm> That error says that the process doesn't understand the
GlusterFS FOP RPCs.
[15:51] <kshlm> Sometimes a client tries to send requests to glusterd instead
of bricks.
[15:52] <kshlm> I've seen it sporadically, but never bothered to root cause it.
[15:52] <kshlm> It could possibly occur because of a client portmap request
failing.
[15:53] <kshlm> A client translator needs to connect to glusterd to get the
port of a brick by doing a portmap query.
[15:53] <kshlm> It should reconnect to the brick port once it gets the port.
[15:54] <kshlm> But maybe sometimes this reconnection doesn't happen.
[15:54] <kshlm> A client xlator continues to use the glusterd connection
itself to send IO requests.
[15:55] <kshlm> The portmap query and reconnection code is not very clean. This
is something I'm hoping to get fixed soon.
[15:56] <kshlm> You may have to restart those client processes to get them to
reconnect correctly to the bricks.
===
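The flow Kaushal describes can be sketched as a small state machine. This is a hedged illustration with invented names (`PortmapClient`, `query_and_reconnect`, etc.), not GlusterFS code: the client xlator first talks to glusterd, issues a portmap query for the brick's port, and should then switch its transport to the brick port; if that reconnect step is skipped, subsequent FOPs keep going to glusterd, which rejects them.

```python
# Hedged sketch of the client-side portmap flow described above.
# All names are invented for illustration; the real code lives in
# GlusterFS's protocol/client xlator.

GLUSTERD_PORT = 24007  # glusterd's well-known management port

class PortmapClient:
    def __init__(self, portmap):
        self.portmap = portmap            # brick name -> brick port
        self.target_port = GLUSTERD_PORT  # initially connected to glusterd

    def query_and_reconnect(self, brick, reconnect=True):
        brick_port = self.portmap[brick]  # the portmap query
        if reconnect:
            # Correct behaviour: switch the transport to the brick port.
            self.target_port = brick_port
        # Buggy behaviour (reconnect=False): target stays glusterd.
        return self.target_port

    def send_fop(self):
        if self.target_port == GLUSTERD_PORT:
            # glusterd does not serve the FOP program, so the request
            # fails -- the "RPC program not available" warnings above.
            return "rpc actor failed"
        return "fop ok"

healthy = PortmapClient({"brick-2": 49214})
healthy.query_and_reconnect("brick-2")
print(healthy.send_fop())   # prints "fop ok"

stuck = PortmapClient({"brick-2": 49214})
stuck.query_and_reconnect("brick-2", reconnect=False)
print(stuck.send_fop())     # prints "rpc actor failed"
```

Restarting the client process effectively resets this state machine, which matches the workaround below.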

Restarting the client fixed the issue, and the entries in the replica started
to heal.

Also, another hint from Kaushal:

===
[17:04] <kshlm> This patch https://review.gluster.org/14254 is cleaning up
portmap reconnection.
[17:05] <kshlm> It mentions encryption in the title, but the changes to
reconnection apply to unencrypted connections as well.
===

The issue has happened several times for us, and once it led to a split-brain
that required manual intervention.

Also, a custom gfapi application was affected by this as well (though a
different volume and a different brick were involved), and the app had to be
restarted to fix the issue. Here is a short log from that app:

===
Jun 30 11:00:27 1.2.3.22 ottd[6785]: [2016-06-30 11:00:27.021060] C
[rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-ott_chunks-client-2: server
1.2.3.22:49217 has not responded in the last 10 seconds, disconnecting.
===
