[Gluster-users] Intermittent mount disconnect due to socket poller error
Ryan Lee
ryanlee at zepheira.com
Wed Feb 28 21:30:46 UTC 2018
We've been on the Gluster 3.7 series for several years with things
pretty stable. Given that it's reached EOL, yesterday I upgraded to
3.13.2. Every Gluster mount and server was disabled then brought back
up after the upgrade, changing the op-version to 31302 and then trying
it all out.
It went poorly. Every sizable read and write (100's MB) lead to
'Transport endpoint not connected' errors on the command line and
immediate unavailability of the mount. After unsuccessfully trying to
search for similar problems with solutions, I ended up downgrading to
3.12.6 and changing the op-version to 31202. That brought us back to
usability with the majority of those operations succeeding enough to
consider it online, but there are still occasional mount disconnects
that we never saw with 3.7 - about 6 in the past 18 hours. It seems
these disconnects would never come back, either, unless manually
re-mounted. Manually remounting reconnects immediately. They only
disconnect the affected client, though some simultaneous disconnects
have occurred due to simultaneous activity. The lower-level log info
seems to indicate a socket problem, potentially broken on the client
side based on timing (but the timing is narrow, and I won't claim the
clocks are that well synchronized across all our servers). The client
and one server claim a socket polling error with no data available, and
the other server claims a writev error. This seems to lead the client
to the 'all subvolumes are down' state, even though all other clients
are still connected. Has anybody run into this? Did I miss anything
moving so many versions ahead?
I've included the output of volume info and some excerpts from the logs.
We have two servers running glusterd and two replica volumes with a
brick on each server. Both experience disconnects; there are about 10
clients for each, with one using both. We use SSL over internal IPv4.
Names in all caps were replaced, as were IP addresses.
Let me know if there's anything else I can provide.
% gluster v info VOL
Volume Name: VOL
Type: Replicate
Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: SERVER1:/glusterfs/VOL-brick1/data
Brick2: SERVER2:/glusterfs/VOL-brick2/data
Options Reconfigured:
config.transport: tcp
features.selinux: off
transport.address-family: inet
nfs.disable: on
client.ssl: on
performance.readdir-ahead: on
auth.ssl-allow: [NAMES, including CLIENT]
server.ssl: on
ssl.certificate-depth: 3
Log excerpts (there was nothing related in glusterd.log):
CLIENT:/var/log/glusterfs/mnt-VOL.log
[2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller]
0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)
[2018-02-28 19:35:58.477154] E [MSGID: 108006]
[afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0: All
subvolumes are down. Going offline until atleast one of them comes back up.
[2018-02-28 19:35:58.486146] E [MSGID: 101046]
[dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67 times>
<lots of saved_frames_unwind messages>
[2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller]
0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
[2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller]
0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
<manual umount / mount>
SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
[2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller]
0-tcp.VOL-server: poll error on socket
[2018-02-28 19:35:58.380530] I [MSGID: 115036]
[server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection
from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
[2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply]
0-tcp.VOL-server: not connected (priv->connected = -1)
[2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0xa4e, Program: GlusterFS
3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.uploads-server)
[2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
[0x7f97bd37ee6a]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8)
[0x7f97bcf1f4c8]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
[0x7f97bcf0abd5] ) 0-: Reply submission failed
[2018-02-28 19:35:58.381196] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection
CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
[2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update]
0-/glusterfs/uploads-brick2/data: allowed = "*", received addr = "CLIENT"
[2018-02-28 19:40:58.351684] I [login.c:34:gf_auth] 0-auth/login:
connecting user name: CLIENT
SERVER1:/var/log/glusterfs/bricks/VOL-brick1.log
[2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv]
0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)
[2018-02-28 19:35:58.509839] E [socket.c:2632:socket_poller]
0-tcp.VOL-server: poll error on socket
[2018-02-28 19:35:58.509957] I [MSGID: 115036]
[server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection
from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
[2018-02-28 19:35:58.510258] I [socket.c:3672:socket_submit_reply]
0-tcp.VOL-server: not connected (priv->connected = -1)
[2018-02-28 19:35:58.510281] E [rpcsvc.c:1364:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x4b3f, Program: GlusterFS
3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.VOL-server)
[2018-02-28 19:35:58.510357] E [server.c:195:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
[0x7f85bb7a8e6a]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8)
[0x7f85bb3494c8]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
[0x7f85bb334bd5] ) 0-: Reply submission failed
[2018-02-28 19:35:58.510409] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection
CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
[2018-02-28 19:40:58.364068] I [addr.c:55:compare_addr_and_update]
0-/glusterfs/uploads-brick1/data: allowed = "*", received addr = "CLIENT"
[2018-02-28 19:40:58.364137] I [login.c:34:gf_auth] 0-auth/login:
connecting user name: CLIENT
More information about the Gluster-users
mailing list