[Gluster-users] Intermittent mount disconnect due to socket poller error

Raghavendra Gowdappa rgowdapp at redhat.com
Thu Mar 1 02:38:06 UTC 2018


Is it possible to attach the log files of the problematic client and bricks?

On Thu, Mar 1, 2018 at 3:00 AM, Ryan Lee <ryanlee at zepheira.com> wrote:

> We've been on the Gluster 3.7 series for several years with things
> pretty stable.  Given that 3.7 has reached EOL, yesterday I upgraded to
> 3.13.2.  Every Gluster mount and server was stopped and then brought
> back up after the upgrade; I then changed the op-version to 31302 and
> tried it all out.
>
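> For the record, the op-version bump was done with the usual volume-set
> command; roughly the following (exact invocation from memory, so treat
> it as a sketch rather than exactly what I ran):
>
>   % gluster volume set all cluster.op-version 31302
>   % gluster volume get all cluster.op-version    # confirm the new value
>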
> It went poorly.  Every sizable read and write (hundreds of MB) led to
> 'Transport endpoint not connected' errors on the command line and
> immediate unavailability of the mount.  After unsuccessfully searching
> for similar problems with solutions, I ended up downgrading to 3.12.6
> and changing the op-version to 31202.  That brought us back to
> usability, with the majority of those operations succeeding enough to
> consider it online, but there are still occasional mount disconnects
> that we never saw with 3.7 - about 6 in the past 18 hours.  A
> disconnected mount never seems to recover on its own; manually
> remounting (commands below) reconnects it immediately.  Each disconnect
> affects only the one client involved, though some clients have
> disconnected simultaneously, apparently due to simultaneous activity.
> The lower-level log messages seem to indicate a socket problem,
> possibly broken on the client side based on timing (but the timing is
> narrow, and I won't claim the clocks are that well synchronized across
> all our servers).  The client and one server report a socket polling
> error with no data available, and the other server reports a writev
> error.  This seems to put the client into the 'all subvolumes are
> down' state, even though all other clients remain connected.  Has
> anybody run into this?  Did I miss anything when moving so many
> versions ahead?
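>
> For reference, the remount itself is nothing special; roughly the
> following, with the mount point as a placeholder for our real one:
>
>   % umount /mnt/VOL
>   % mount -t glusterfs SERVER1:/VOL /mnt/VOL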
>
> I've included the output of volume info and some excerpts from the
> logs.  We have two servers running glusterd and two replica volumes,
> with a brick for each volume on each server.  Both volumes experience
> disconnects; there are about 10 clients for each volume, and one client
> uses both.  We use SSL over internal IPv4.  Names in all caps have been
> replaced, as have IP addresses.
>
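> Each client mounts over the native glusterfs FUSE client; the fstab
> entries look roughly like this (the mount point is a placeholder and
> the backup-server option is from memory):
>
>   SERVER1:/VOL /mnt/VOL glusterfs defaults,_netdev,backupvolfile-server=SERVER2 0 0
>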
> Let me know if there's anything else I can provide.
>
> % gluster v info VOL
> Volume Name: VOL
> Type: Replicate
> Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: SERVER1:/glusterfs/VOL-brick1/data
> Brick2: SERVER2:/glusterfs/VOL-brick2/data
> Options Reconfigured:
> config.transport: tcp
> features.selinux: off
> transport.address-family: inet
> nfs.disable: on
> client.ssl: on
> performance.readdir-ahead: on
> auth.ssl-allow: [NAMES, including CLIENT]
> server.ssl: on
> ssl.certificate-depth: 3
>
> Log excerpts (there was nothing related in glusterd.log):
>
> CLIENT:/var/log/glusterfs/mnt-VOL.log
> [2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller]
> 0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)
> [2018-02-28 19:35:58.477154] E [MSGID: 108006]
> [afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0: All
> subvolumes are down. Going offline until atleast one of them comes back up.
> [2018-02-28 19:35:58.486146] E [MSGID: 101046]
> [dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67 times>
> <lots of saved_frames_unwind messages>
> [2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller]
> 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
> [2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller]
> 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
>
> <manual umount / mount>
>
>
> SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
> [2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller]
> 0-tcp.VOL-server: poll error on socket
> [2018-02-28 19:35:58.380530] I [MSGID: 115036]
> [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection
> from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
> [2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply]
> 0-tcp.VOL-server: not connected (priv->connected = -1)
> [2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic]
> 0-rpc-service: failed to submit message (XID: 0xa4e, Program: GlusterFS
> 3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.uploads-server)
> [2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply]
> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
> [0x7f97bd37ee6a] -->/usr/lib/x86_64-linux-gnu/g
> lusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) [0x7f97bcf1f4c8]
> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
> [0x7f97bcf0abd5] ) 0-: Reply submission failed
> [2018-02-28 19:35:58.381196] I [MSGID: 101055]
> [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection
> CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
> [2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update]
> 0-/glusterfs/uploads-brick2/data: allowed = "*", received addr = "CLIENT"
> [2018-02-28 19:40:58.351684] I [login.c:34:gf_auth] 0-auth/login:
> connecting user name: CLIENT
>
> SERVER1:/var/log/glusterfs/bricks/VOL-brick1.log
> [2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv]
> 0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)
> [2018-02-28 19:35:58.509839] E [socket.c:2632:socket_poller]
> 0-tcp.VOL-server: poll error on socket
> [2018-02-28 19:35:58.509957] I [MSGID: 115036]
> [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection
> from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
> [2018-02-28 19:35:58.510258] I [socket.c:3672:socket_submit_reply]
> 0-tcp.VOL-server: not connected (priv->connected = -1)
> [2018-02-28 19:35:58.510281] E [rpcsvc.c:1364:rpcsvc_submit_generic]
> 0-rpc-service: failed to submit message (XID: 0x4b3f, Program: GlusterFS
> 3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.VOL-server)
> [2018-02-28 19:35:58.510357] E [server.c:195:server_submit_reply]
> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
> [0x7f85bb7a8e6a] -->/usr/lib/x86_64-linux-gnu/g
> lusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) [0x7f85bb3494c8]
> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
> [0x7f85bb334bd5] ) 0-: Reply submission failed
> [2018-02-28 19:35:58.510409] I [MSGID: 101055]
> [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection
> CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
> [2018-02-28 19:40:58.364068] I [addr.c:55:compare_addr_and_update]
> 0-/glusterfs/uploads-brick1/data: allowed = "*", received addr = "CLIENT"
> [2018-02-28 19:40:58.364137] I [login.c:34:gf_auth] 0-auth/login:
> connecting user name: CLIENT
>