[Gluster-users] Intermittent mount disconnect due to socket poller error

Ryan Lee ryanlee at zepheira.com
Wed Feb 28 21:30:46 UTC 2018


We've been on the Gluster 3.7 series for several years with things 
pretty stable.  Given that it's reached EOL, yesterday I upgraded to 
3.13.2.  Every Gluster mount and server was disabled then brought back 
up after the upgrade, changing the op-version to 31302 and then trying 
it all out.

It went poorly.  Every sizable read and write (100's MB) lead to 
'Transport endpoint not connected' errors on the command line and 
immediate unavailability of the mount.  After unsuccessfully trying to 
search for similar problems with solutions, I ended up downgrading to 
3.12.6 and changing the op-version to 31202.  That brought us back to 
usability with the majority of those operations succeeding enough to 
consider it online, but there are still occasional mount disconnects 
that we never saw with 3.7 - about 6 in the past 18 hours.  It seems 
these disconnects would never come back, either, unless manually 
re-mounted.  Manually remounting reconnects immediately.  They only 
disconnect the affected client, though some simultaneous disconnects 
have occurred due to simultaneous activity.  The lower-level log info 
seems to indicate a socket problem, potentially broken on the client 
side based on timing (but the timing is narrow, and I won't claim the 
clocks are that well synchronized across all our servers).  The client 
and one server claim a socket polling error with no data available, and 
the other server claims a writev error.  This seems to lead the client 
to the 'all subvolumes are down' state, even though all other clients 
are still connected.  Has anybody run into this?  Did I miss anything 
moving so many versions ahead?

I've included the output of volume info and some excerpts from the logs. 
   We have two servers running glusterd and two replica volumes with a 
brick on each server.  Both experience disconnects; there are about 10 
clients for each, with one using both.  We use SSL over internal IPv4. 
Names in all caps were replaced, as were IP addresses.

Let me know if there's anything else I can provide.

% gluster v info VOL
Volume Name: VOL
Type: Replicate
Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: SERVER1:/glusterfs/VOL-brick1/data
Brick2: SERVER2:/glusterfs/VOL-brick2/data
Options Reconfigured:
config.transport: tcp
features.selinux: off
transport.address-family: inet
nfs.disable: on
client.ssl: on
performance.readdir-ahead: on
auth.ssl-allow: [NAMES, including CLIENT]
server.ssl: on
ssl.certificate-depth: 3

Log excerpts (there was nothing related in glusterd.log):

CLIENT:/var/log/glusterfs/mnt-VOL.log
[2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller] 
0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)
[2018-02-28 19:35:58.477154] E [MSGID: 108006] 
[afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0: All 
subvolumes are down. Going offline until atleast one of them comes back up.
[2018-02-28 19:35:58.486146] E [MSGID: 101046] 
[dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67 times>
<lots of saved_frames_unwind messages>
[2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller] 
0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
[2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller] 
0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)

<manual umount / mount>


SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
[2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller] 
0-tcp.VOL-server: poll error on socket
[2018-02-28 19:35:58.380530] I [MSGID: 115036] 
[server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection 
from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
[2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply] 
0-tcp.VOL-server: not connected (priv->connected = -1)
[2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic] 
0-rpc-service: failed to submit message (XID: 0xa4e, Program: GlusterFS 
3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.uploads-server)
[2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply] 
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a) 
[0x7f97bd37ee6a] 
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) 
[0x7f97bcf1f4c8] 
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5) 
[0x7f97bcf0abd5] ) 0-: Reply submission failed
[2018-02-28 19:35:58.381196] I [MSGID: 101055] 
[client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection 
CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
[2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update] 
0-/glusterfs/uploads-brick2/data: allowed = "*", received addr = "CLIENT"
[2018-02-28 19:40:58.351684] I [login.c:34:gf_auth] 0-auth/login: 
connecting user name: CLIENT

SERVER1:/var/log/glusterfs/bricks/VOL-brick1.log
[2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv] 
0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)
[2018-02-28 19:35:58.509839] E [socket.c:2632:socket_poller] 
0-tcp.VOL-server: poll error on socket
[2018-02-28 19:35:58.509957] I [MSGID: 115036] 
[server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection 
from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
[2018-02-28 19:35:58.510258] I [socket.c:3672:socket_submit_reply] 
0-tcp.VOL-server: not connected (priv->connected = -1)
[2018-02-28 19:35:58.510281] E [rpcsvc.c:1364:rpcsvc_submit_generic] 
0-rpc-service: failed to submit message (XID: 0x4b3f, Program: GlusterFS 
3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.VOL-server)
[2018-02-28 19:35:58.510357] E [server.c:195:server_submit_reply] 
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a) 
[0x7f85bb7a8e6a] 
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) 
[0x7f85bb3494c8] 
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5) 
[0x7f85bb334bd5] ) 0-: Reply submission failed
[2018-02-28 19:35:58.510409] I [MSGID: 101055] 
[client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection 
CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
[2018-02-28 19:40:58.364068] I [addr.c:55:compare_addr_and_update] 
0-/glusterfs/uploads-brick1/data: allowed = "*", received addr = "CLIENT"
[2018-02-28 19:40:58.364137] I [login.c:34:gf_auth] 0-auth/login: 
connecting user name: CLIENT


More information about the Gluster-users mailing list