[Gluster-users] Intermittent mount disconnect due to socket poller error

Ryan Lee ryanlee at zepheira.com
Wed Mar 7 02:45:11 UTC 2018


I happened to review the status of volume clients and realized they were 
reporting a mix of different op-versions: 3.13 clients were still 
connecting to the downgraded 3.12 server (likely a timing issue between 
downgrading clients and mounting volumes).  Remounting the reported 
clients has resulted in the correct op-version all around and about a 
week free of these errors.
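
For anyone who wants to run the same check, here is a minimal sketch of spotting clients that report a different op-version from the one the servers are set to. The client names and the `reported` mapping are hypothetical placeholders, not output from our cluster; only the op-version numbers 31202 and 31302 come from this thread. It assumes the client list has already been transcribed by hand into a dictionary.

```python
from collections import defaultdict


def find_mismatched_clients(client_op_versions, expected):
    """Group clients whose reported op-version differs from `expected`.

    Returns a dict mapping each unexpected op-version to a sorted list of
    the clients reporting it.
    """
    mismatched = defaultdict(list)
    for client, op_version in sorted(client_op_versions.items()):
        if op_version != expected:
            mismatched[op_version].append(client)
    return dict(mismatched)


# Hypothetical data: client identifier -> op-version it reported.
reported = {
    "CLIENT-A:49150": 31202,
    "CLIENT-B:49148": 31302,  # still mounted with the 3.13 client
    "CLIENT-C:49151": 31202,
}

stale = find_mismatched_clients(reported, expected=31202)
for op_version, clients in stale.items():
    print(f"op-version {op_version}: remount {', '.join(clients)}")
```

Any client listed by the loop is a candidate for an unmount and remount, which is what cleared the mismatch here.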

On 2018-03-01 12:38, Ryan Lee wrote:
> Thanks for your response - is there more that would be useful in 
> addition to what I already attached?  We're logging at default level on 
> the brick side and at error on clients.  I could turn it up for a few 
> days to try to catch this problem in action (it's happened several more 
> times since I first wrote).
> 
> On 2018-02-28 18:38, Raghavendra Gowdappa wrote:
>> Is it possible to attach logfiles of problematic client and bricks?
>>
>> On Thu, Mar 1, 2018 at 3:00 AM, Ryan Lee <ryanlee at zepheira.com> wrote:
>>
>>     We've been on the Gluster 3.7 series for several years with things
>>     pretty stable.  Given that it's reached EOL, yesterday I upgraded to
>>     3.13.2.  Every Gluster mount and server was disabled then brought
>>     back up after the upgrade, changing the op-version to 31302 and then
>>     trying it all out.
>>
>>     It went poorly.  Every sizable read and write (hundreds of MB) led to
>>     'Transport endpoint not connected' errors on the command line and
>>     immediate unavailability of the mount.  After unsuccessfully trying
>>     to search for similar problems with solutions, I ended up
>>     downgrading to 3.12.6 and changing the op-version to 31202.  That
>>     brought us back to usability with the majority of those operations
>>     succeeding enough to consider it online, but there are still
>>     occasional mount disconnects that we never saw with 3.7 - about 6 in
>>     the past 18 hours.  A disconnected mount never seems to recover on
>>     its own; remounting it manually reconnects immediately.  Each
>>     disconnect affects only the one client, though several clients have
>>     dropped at the same time when they were active simultaneously.  The
>>     lower-level log info seems to indicate a
>>     socket problem, potentially broken on the client side based on
>>     timing (but the timing is narrow, and I won't claim the clocks are
>>     that well synchronized across all our servers).  The client and one
>>     server claim a socket polling error with no data available, and the
>>     other server claims a writev error.  This seems to lead the client
>>     to the 'all subvolumes are down' state, even though all other
>>     clients are still connected.  Has anybody run into this?  Did I miss
>>     anything moving so many versions ahead?
>>
>>     I've included the output of volume info and some excerpts from the
>>     logs.   We have two servers running glusterd and two replica volumes
>>     with a brick on each server.  Both experience disconnects; there are
>>     about 10 clients for each, with one using both.  We use SSL over
>>     internal IPv4. Names in all caps were replaced, as were IP addresses.
>>
>>     Let me know if there's anything else I can provide.
>>
>>     % gluster v info VOL
>>     Volume Name: VOL
>>     Type: Replicate
>>     Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
>>     Status: Started
>>     Snapshot Count: 0
>>     Number of Bricks: 1 x 2 = 2
>>     Transport-type: tcp
>>     Bricks:
>>     Brick1: SERVER1:/glusterfs/VOL-brick1/data
>>     Brick2: SERVER2:/glusterfs/VOL-brick2/data
>>     Options Reconfigured:
>>     config.transport: tcp
>>     features.selinux: off
>>     transport.address-family: inet
>>     nfs.disable: on
>>     client.ssl: on
>>     performance.readdir-ahead: on
>>     auth.ssl-allow: [NAMES, including CLIENT]
>>     server.ssl: on
>>     ssl.certificate-depth: 3
>>
>>     Log excerpts (there was nothing related in glusterd.log):
>>
>>     CLIENT:/var/log/glusterfs/mnt-VOL.log
>>     [2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller]
>>     0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)
>>     [2018-02-28 19:35:58.477154] E [MSGID: 108006]
>>     [afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0:
>>     All subvolumes are down. Going offline until atleast one of them
>>     comes back up.
>>     [2018-02-28 19:35:58.486146] E [MSGID: 101046]
>>     [dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67
>>     times>
>>     <lots of saved_frames_unwind messages>
>>     [2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller]
>>     0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
>>     [2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller]
>>     0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
>>
>>     <manual umount / mount>
>>
>>
>>     SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
>>     [2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller]
>>     0-tcp.VOL-server: poll error on socket
>>     [2018-02-28 19:35:58.380530] I [MSGID: 115036]
>>     [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting
>>     connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
>>     [2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply]
>>     0-tcp.VOL-server: not connected (priv->connected = -1)
>>     [2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic]
>>     0-rpc-service: failed to submit message (XID: 0xa4e, Program:
>>     GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport
>>     (tcp.VOL-server)
>>     [2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply]
>>     (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
>>     [0x7f97bd37ee6a]
>>     -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8)
>>     [0x7f97bcf1f4c8]
>>     -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
>>     [0x7f97bcf0abd5] ) 0-: Reply submission failed
>>     [2018-02-28 19:35:58.381196] I [MSGID: 101055]
>>     [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down
>>     connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
>>     [2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update]
>>     0-/glusterfs/VOL-brick2/data: allowed = "*", received addr =
>>     "CLIENT"
>>     [2018-02-28 19:40:58.351684] I [login.c:34:gf_auth] 0-auth/login:
>>     connecting user name: CLIENT
>>
>>     SERVER1:/var/log/glusterfs/bricks/VOL-brick1.log
>>     [2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv]
>>     0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)
>>     [2018-02-28 19:35:58.509839] E [socket.c:2632:socket_poller]
>>     0-tcp.VOL-server: poll error on socket
>>     [2018-02-28 19:35:58.509957] I [MSGID: 115036]
>>     [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting
>>     connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
>>     [2018-02-28 19:35:58.510258] I [socket.c:3672:socket_submit_reply]
>>     0-tcp.VOL-server: not connected (priv->connected = -1)
>>     [2018-02-28 19:35:58.510281] E [rpcsvc.c:1364:rpcsvc_submit_generic]
>>     0-rpc-service: failed to submit message (XID: 0x4b3f, Program:
>>     GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport
>>     (tcp.VOL-server)
>>     [2018-02-28 19:35:58.510357] E [server.c:195:server_submit_reply]
>>     (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
>>     [0x7f85bb7a8e6a]
>>     -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8)
>>     [0x7f85bb3494c8]
>>     -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
>>     [0x7f85bb334bd5] ) 0-: Reply submission failed
>>     [2018-02-28 19:35:58.510409] I [MSGID: 101055]
>>     [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down
>>     connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
>>     [2018-02-28 19:40:58.364068] I [addr.c:55:compare_addr_and_update]
>>     0-/glusterfs/VOL-brick1/data: allowed = "*", received addr =
>>     "CLIENT"
>>     [2018-02-28 19:40:58.364137] I [login.c:34:gf_auth] 0-auth/login:
>>     connecting user name: CLIENT
>>
>>
> 
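
The original report reasons about which side broke first from the timing of the log entries. A rough sketch of that comparison follows, assuming each wrapped log entry has first been joined back into a single line. The regular expression is inferred only from the excerpts quoted above, not from any Gluster log format specification, and the function names are made up for illustration. The three sample lines are the first error from each host, copied from the excerpts.

```python
import re
from datetime import datetime

# Pattern inferred from the quoted excerpts: timestamp, one-letter level,
# optional MSGID, source location, then the message.
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\] "
    r"(?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<src>[^\]]+)\] "
    r"(?P<msg>.*)$"
)


def parse_line(host, line):
    """Parse one unwrapped log line; return None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return {"host": host, "ts": ts, "level": m.group("level"),
            "src": m.group("src"), "msg": m.group("msg")}


def timeline(events):
    """Sort events by timestamp; return (host, offset in ms, source) tuples."""
    ordered = sorted(events, key=lambda e: e["ts"])
    if not ordered:
        return []
    first = ordered[0]["ts"]
    return [(e["host"],
             round((e["ts"] - first).total_seconds() * 1000, 3),
             e["src"]) for e in ordered]


SAMPLE = [
    ("SERVER1", "[2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv] "
                "0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)"),
    ("CLIENT", "[2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller] "
               "0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)"),
    ("SERVER2", "[2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller] "
                "0-tcp.VOL-server: poll error on socket"),
]

events = [e for e in (parse_line(h, l) for h, l in SAMPLE) if e is not None]
for host, offset_ms, src in timeline(events):
    print(f"{host:8s} +{offset_ms:9.3f} ms  {src}")
```

The ordering this produces is only as trustworthy as the clock synchronization between the hosts, which the original report already flags as uncertain. A gap of a millisecond or two between client and server should not be read as proof of which end failed first.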