[Gluster-users] Intermittent mount disconnect due to socket poller error
Ryan Lee
ryanlee at zepheira.com
Wed Mar 7 02:45:11 UTC 2018
I happened to review the status of volume clients and realized they were
reporting a mix of different op-versions: 3.13 clients were still
connecting to the downgraded 3.12 server (likely a timing issue between
downgrading clients and mounting volumes). Remounting the reported
clients has resulted in the correct op-version all around and about a
week free of these errors.
On 2018-03-01 12:38, Ryan Lee wrote:
> Thanks for your response - is there more that would be useful in
> addition to what I already attached? We're logging at default level on
> the brick side and at error on clients. I could turn it up for a few
> days to try to catch this problem in action (it's happened several more
> times since I first wrote).
>
> On 2018-02-28 18:38, Raghavendra Gowdappa wrote:
>> Is it possible to attach logfiles of problematic client and bricks?
>>
>> On Thu, Mar 1, 2018 at 3:00 AM, Ryan Lee <ryanlee at zepheira.com
>> <mailto:ryanlee at zepheira.com>> wrote:
>>
>> We've been on the Gluster 3.7 series for several years with things
>> pretty stable. Given that it's reached EOL, yesterday I upgraded to
>> 3.13.2. Every Gluster mount and server was disabled then brought
>> back up after the upgrade, changing the op-version to 31302 and then
>> trying it all out.
>>
>> It went poorly. Every sizable read and write (100's MB) lead to
>> 'Transport endpoint not connected' errors on the command line and
>> immediate unavailability of the mount. After unsuccessfully trying
>> to search for similar problems with solutions, I ended up
>> downgrading to 3.12.6 and changing the op-version to 31202. That
>> brought us back to usability with the majority of those operations
>> succeeding enough to consider it online, but there are still
>> occasional mount disconnects that we never saw with 3.7 - about 6 in
>> the past 18 hours. It seems these disconnects would never come
>> back, either, unless manually re-mounted. Manually remounting
>> reconnects immediately. They only disconnect the affected client,
>> though some simultaneous disconnects have occurred due to
>> simultaneous activity. The lower-level log info seems to indicate a
>> socket problem, potentially broken on the client side based on
>> timing (but the timing is narrow, and I won't claim the clocks are
>> that well synchronized across all our servers). The client and one
>> server claim a socket polling error with no data available, and the
>> other server claims a writev error. This seems to lead the client
>> to the 'all subvolumes are down' state, even though all other
>> clients are still connected. Has anybody run into this? Did I miss
>> anything moving so many versions ahead?
>>
>> I've included the output of volume info and some excerpts from the
>> logs. We have two servers running glusterd and two replica volumes
>> with a brick on each server. Both experience disconnects; there are
>> about 10 clients for each, with one using both. We use SSL over
>> internal IPv4. Names in all caps were replaced, as were IP addresses.
>>
>> Let me know if there's anything else I can provide.
>>
>> % gluster v info VOL
>> Volume Name: VOL
>> Type: Replicate
>> Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: SERVER1:/glusterfs/VOL-brick1/data
>> Brick2: SERVER2:/glusterfs/VOL-brick2/data
>> Options Reconfigured:
>> config.transport: tcp
>> features.selinux: off
>> transport.address-family: inet
>> nfs.disable: on
>> client.ssl: on
>> performance.readdir-ahead: on
>> auth.ssl-allow: [NAMES, including CLIENT]
>> server.ssl: on
>> ssl.certificate-depth: 3
>>
>> Log excerpts (there was nothing related in glusterd.log):
>>
>> CLIENT:/var/log/glusterfs/mnt-VOL.log
>> [2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller]
>> 0-VOL-client-1: socket_poller SERVER2:49153 failed (No data
>> available)
>> [2018-02-28 19:35:58.477154] E [MSGID: 108006]
>> [afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0:
>> All subvolumes are down. Going offline until atleast one of them
>> comes back up.
>> [2018-02-28 19:35:58.486146] E [MSGID: 101046]
>> [dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67
>> times>
>> <lots of saved_frames_unwind messages>
>> [2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller]
>> 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data
>> available)
>> [2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller]
>> 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data
>> available)
>>
>> <manual umount / mount>
>>
>>
>> SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
>> [2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller]
>> 0-tcp.VOL-server: poll error on socket
>> [2018-02-28 19:35:58.380530] I [MSGID: 115036]
>> [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting
>> connection from
>> CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
>> [2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply]
>> 0-tcp.VOL-server: not connected (priv->connected = -1)
>> [2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic]
>> 0-rpc-service: failed to submit message (XID: 0xa4e, Program:
>> GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport
>> (tcp.uploads-server)
>> [2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply]
>>
>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
>>
>> [0x7f97bd37ee6a]
>>
>> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8)
>>
>> [0x7f97bcf1f4c8]
>>
>> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
>>
>> [0x7f97bcf0abd5] ) 0-: Reply submission failed
>> [2018-02-28 19:35:58.381196] I [MSGID: 101055]
>> [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down
>> connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
>> [2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update]
>> 0-/glusterfs/uploads-brick2/data: allowed = "*", received addr =
>> "CLIENT"
>> [2018-02-28 19:40:58.351684] I [login.c:34:gf_auth] 0-auth/login:
>> connecting user name: CLIENT
>>
>> SERVER1:/var/log/glusterfs/bricks/VOL-brick1.log
>> [2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv]
>> 0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)
>> [2018-02-28 19:35:58.509839] E [socket.c:2632:socket_poller]
>> 0-tcp.VOL-server: poll error on socket
>> [2018-02-28 19:35:58.509957] I [MSGID: 115036]
>> [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting
>> connection from
>> CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
>> [2018-02-28 19:35:58.510258] I [socket.c:3672:socket_submit_reply]
>> 0-tcp.VOL-server: not connected (priv->connected = -1)
>> [2018-02-28 19:35:58.510281] E [rpcsvc.c:1364:rpcsvc_submit_generic]
>> 0-rpc-service: failed to submit message (XID: 0x4b3f, Program:
>> GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport
>> (tcp.VOL-server)
>> [2018-02-28 19:35:58.510357] E [server.c:195:server_submit_reply]
>>
>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
>>
>> [0x7f85bb7a8e6a]
>>
>> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8)
>>
>> [0x7f85bb3494c8]
>>
>> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
>>
>> [0x7f85bb334bd5] ) 0-: Reply submission failed
>> [2018-02-28 19:35:58.510409] I [MSGID: 101055]
>> [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down
>> connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
>> [2018-02-28 19:40:58.364068] I [addr.c:55:compare_addr_and_update]
>> 0-/glusterfs/uploads-brick1/data: allowed = "*", received addr =
>> "CLIENT"
>> [2018-02-28 19:40:58.364137] I [login.c:34:gf_auth] 0-auth/login:
>> connecting user name: CLIENT
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org <mailto:Gluster-users at gluster.org>
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>> <http://lists.gluster.org/mailman/listinfo/gluster-users>
>>
>>
>
More information about the Gluster-users
mailing list