[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800

Thu Oct 25 14:16:09 UTC 2018

Thank you.

I just upgraded my nodes to 5.0.1 and everything seems to be running
smoothly. Plus the cluster.data-self-heal=off reconfigured option has
gone away during the update, so I guess I'm back to nominal.

    Hoggins!

On 24/10/2018 at 13:57, Ravishankar N wrote:
>
>
>
> On 10/24/2018 05:16 PM, Hoggins! wrote:
>> Thank you, it's working as expected.
>>
>> I guess it's only safe to put cluster.data-self-heal back on when I get
>> an updated version of GlusterFS?
> Yes, correct. Also, you would still need to restart shd whenever you
> hit this issue until you upgrade.
> -Ravi
>>     Hoggins!
>>
>> On 24/10/2018 at 11:53, Ravishankar N wrote:
>>> On 10/24/2018 02:38 PM, Hoggins! wrote:
>>>> Thanks, that's helping a lot, I will do that.
>>>>
>>>> One more question: should the glustershd restart be performed on the
>>>> arbiter only, or on each node of the cluster?
>>> If you do a 'gluster volume start volname force' it will restart the
>>> shd on all nodes.
>>> -Ravi
>>>> Thanks!
>>>>
>>>>     Hoggins!
>>>>
>>>> On 24/10/2018 at 02:55, Ravishankar N wrote:
>>>>> On 10/23/2018 10:01 PM, Hoggins! wrote:
>>>>>> Hello there,
>>>>>>
>>>>>> I'm stumbling upon the *exact same issue*, and unfortunately setting the
>>>>>> server.tcp-user-timeout to 42 does not help.
>>>>>> Any other suggestion?
>>>>>>
>>>>>> I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running
>>>>>> version 4.1.5 (Fedora 28), and /sometimes/ the workaround (rebooting a
>>>>>> node) suggested by Sam works, but it often doesn't.
>>>>>>
>>>>>> You may ask how I got into this, well it's simple: I needed to replace
>>>>>> my brick 1 and brick 2 with two brand new machines, so here's what I did:
>>>>>>     - add brick 3 and brick 4 into the cluster (gluster peer probe,
>>>>>> gluster volume add-brick, etc., with the issue regarding the arbiter
>>>>>> node that has to be first removed from the cluster before being able to
>>>>>> add bricks 3 and 4)
>>>>>>     - wait for all the files on my volumes to heal. It took a few days.
>>>>>>     - remove bricks 1 and 2
>>>>>>     - after having "reset" the arbiter, re-add the arbiter into the cluster
>>>>>>
>>>>>> And now it's intermittently hanging on writing *on existing files*.
>>>>>> There is *no problem for writing new files* on the volumes.
>>>>> Hi,
>>>>>
>>>>> There was an arbiter volume hang issue that was fixed [1] recently.
>>>>> The fix has been back-ported to all release branches.
>>>>>
>>>>> One workaround to overcome hangs is to (1) turn off
>>>>> 'cluster.data-self-heal' on the volume ('testvol' in this example),
>>>>> remount the clients *and* (2) restart glustershd (via volume start
>>>>> force). The hang is observed due to an unreleased lock from
>>>>> self-heal. There are other ways to release the stale lock, via the
>>>>> gluster clear-locks command or by tweaking
>>>>> features.locks-revocation-secs, but restarting shd whenever you see
>>>>> the issue is the easiest and safest way.
>>>>>
>>>>> -Ravi
>>>>>
>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1637802
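For anyone scripting the workaround above, here is a minimal sketch in Python. It only assembles the two gluster CLI calls named in the message (turning off cluster.data-self-heal, then a forced volume start to restart glustershd). The function names `workaround_commands` and `apply_workaround` are made up for illustration, and remounting the clients is not covered because it has to happen on each client.

```python
import subprocess


def workaround_commands(volname):
    """Return the gluster CLI calls for the workaround described above.

    Step 1 turns off data self-heal on the volume; step 2 force-starts
    the volume, which restarts glustershd. Remounting the clients is a
    separate manual step on each client and is not handled here.
    """
    return [
        ["gluster", "volume", "set", volname, "cluster.data-self-heal", "off"],
        ["gluster", "volume", "start", volname, "force"],
    ]


def apply_workaround(volname, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in workaround_commands(volname):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if gluster exits non-zero
            subprocess.run(cmd, check=True)
```

Calling `apply_workaround("testvol")` only prints the commands; pass `dry_run=False` on a server node to actually run them.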
>>>>>
>>>>>
>>>>>> I'm lost here, thanks for your inputs!
>>>>>>
>>>>>>     Hoggins!
>>>>>>
>>>>>> On 14/09/2018 at 04:16, Amar Tumballi wrote:
>>>>>>> On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod <mailinglists at smcleod.net> wrote:
>>>>>>>
>>>>>>>     I apologise for this being posted twice - I'm not sure if that was
>>>>>>>     user error or a bug in the mailing list, but the list wasn't
>>>>>>>     showing my post after quite some time so I sent a second email
>>>>>>>     which near immediately showed up - that's mailing lists I guess...
>>>>>>>
>>>>>>>     Anyway, if anyone has any input, advice or abuse, I'd welcome it!
>>>>>>>
>>>>>>>
>>>>>>> We are a little late getting back on this. But after running tests
>>>>>>> internally, we found that a missing volume option is possibly the
>>>>>>> reason for this.
>>>>>>>
>>>>>>> Try
>>>>>>>
>>>>>>> gluster volume set <volname> server.tcp-user-timeout 42
>>>>>>>
>>>>>>> on your volume. Let us know if this helps.
>>>>>>> (Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
>>>>>>>  
>>>>>>>
>>>>>>>     --
>>>>>>>     Sam McLeod
>>>>>>>     https://smcleod.net
>>>>>>>     https://twitter.com/s_mcleod
>>>>>>>
>>>>>>>>     On 3 Sep 2018, at 1:20 pm, Sam McLeod <mailinglists at smcleod.net> wrote:
>>>>>>>>
>>>>>>>>     We've got an odd problem where clients are blocked from writing
>>>>>>>>     to Gluster volumes until the first node of the Gluster cluster is
>>>>>>>>     rebooted.
>>>>>>>>
>>>>>>>>     I suspect I've either configured something incorrectly with the
>>>>>>>>     arbiter / replica configuration of the volumes, or there is some
>>>>>>>>     sort of bug in the gluster client-server connection that we're
>>>>>>>>     triggering.
>>>>>>>>
>>>>>>>>     I was wondering if anyone has seen this or could point me in the
>>>>>>>>     right direction?
>>>>>>>>
>>>>>>>>
>>>>>>>>     *Environment:*
>>>>>>>>
>>>>>>>>       * Topology: 3 node cluster, replica 2, arbiter 1 (third node is
>>>>>>>>         metadata only).
>>>>>>>>       * Version: Client and Servers both running 4.1.3, both on
>>>>>>>>         CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast
>>>>>>>>         networked SSD storage backing them, XFS.
>>>>>>>>       * Client: Native Gluster FUSE client mounting via the
>>>>>>>>         kubernetes provider
>>>>>>>>
>>>>>>>>
>>>>>>>>     *Problem:*
>>>>>>>>
>>>>>>>>       * Seemingly randomly some clients will be blocked / are unable
>>>>>>>>         to write to what should be a highly available gluster volume.
>>>>>>>>       * The client gluster logs show it failing to do new file
>>>>>>>>         operations across various volumes and all three nodes of the
>>>>>>>>         gluster.
>>>>>>>>       * The server gluster (or OS) logs do not show any warnings or
>>>>>>>>         errors.
>>>>>>>>       * The client recovers and is able to write to volumes again
>>>>>>>>         after the first node of the gluster cluster is rebooted.
>>>>>>>>       * Until the first node of the gluster cluster is rebooted, the
>>>>>>>>         client fails to write to the volume that is (or should be)
>>>>>>>>         available on the second node (a replica) and third node (an
>>>>>>>>         arbiter only node).
>>>>>>>>
>>>>>>>>
>>>>>>>>     *What 'fixes' the issue:*
>>>>>>>>
>>>>>>>>       * Although the clients (kubernetes hosts) connect to all 3
>>>>>>>>         nodes of the Gluster cluster - restarting the first gluster
>>>>>>>>         node always unblocks the IO and allows the client to continue
>>>>>>>>         writing.
>>>>>>>>       * Stopping and starting the glusterd service on the gluster
>>>>>>>>         server is not enough to fix the issue, nor is restarting its
>>>>>>>>         networking.
>>>>>>>>       * This suggests to me that the volume is unavailable for
>>>>>>>>         writing for some reason, and that restarting the first node
>>>>>>>>         in the cluster clears some sort of TCP session, either
>>>>>>>>         between client and server or between the servers
>>>>>>>>         (replication).
>>>>>>>>
>>>>>>>>
>>>>>>>>     *Expected behaviour:*
>>>>>>>>
>>>>>>>>       * If the first gluster node / server had failed or was blocked
>>>>>>>>         from performing operations for some reason (which it doesn't
>>>>>>>>         seem it is), I'd expect the clients to access data from the
>>>>>>>>         second gluster node and write metadata to the third gluster
>>>>>>>>         node as well, since it's an arbiter / metadata only node.
>>>>>>>>       * If for some reason a gluster node was not able to serve
>>>>>>>>         connections to clients, I'd expect to see errors in the
>>>>>>>>         volume, glusterd or brick log files (there are none on the
>>>>>>>>         first gluster node).
>>>>>>>>       * If the first gluster node was for some reason blocking IO on
>>>>>>>>         a volume, I'd expect that node either to show as unhealthy or
>>>>>>>>         unavailable in the gluster peer status or gluster volume status.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>     *Client gluster errors:*
>>>>>>>>
>>>>>>>>       * staging_static in this example is a volume name.
>>>>>>>>       * You can see the client trying to connect to the second and
>>>>>>>>         third nodes of the gluster cluster and failing (unsure as to
>>>>>>>>         why?)
>>>>>>>>       * The server side logs on the first gluster node do not show
>>>>>>>>         any errors or problems, but the second / third node show
>>>>>>>>         errors in the glusterd.log when trying to 'unlock' the
>>>>>>>>         0-management volume on the first node.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>     *On a gluster client* (a kubernetes host using the kubernetes
>>>>>>>>     connector which uses the native fuse client) when it's blocked
>>>>>>>>     from writing but the gluster appears healthy (other than the
>>>>>>>>     errors mentioned later):
>>>>>>>>
>>>>>>>>     [2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02
>>>>>>>>     15:03:22.417773. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-02 15:33:22.750989] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-1: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02
>>>>>>>>     15:33:22.765751. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>>     [2018-09-02 16:03:23.097988] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-1: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02
>>>>>>>>     16:03:23.098133. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-02 16:33:23.439282] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-1: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02
>>>>>>>>     16:33:23.455171. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>>     [2018-09-02 17:03:23.786971] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-1: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 17:33:24.160607] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x1dc8 sent = 2018-09-02
>>>>>>>>     17:03:23.787120. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-02 17:33:24.160720] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 18:03:24.505092] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-1: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x2faf sent = 2018-09-02
>>>>>>>>     17:33:24.173153. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>>     [2018-09-02 18:03:24.505185] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-1: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 18:33:24.841248] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x1e45 sent = 2018-09-02
>>>>>>>>     18:03:24.505328. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-02 18:33:24.841311] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 19:03:25.204711] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-1: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x3074 sent = 2018-09-02
>>>>>>>>     18:33:24.855372. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>>     [2018-09-02 19:03:25.204784] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-1: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 19:33:25.533545] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x1ec2 sent = 2018-09-02
>>>>>>>>     19:03:25.204977. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-02 19:33:25.533611] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 20:03:25.877020] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-1: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x3138 sent = 2018-09-02
>>>>>>>>     19:33:25.545921. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>>     [2018-09-02 20:03:25.877098] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-1: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 20:33:26.217858] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x1f3e sent = 2018-09-02
>>>>>>>>     20:03:25.877264. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-02 20:33:26.217973] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 21:03:26.588237] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-1: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x31ff sent = 2018-09-02
>>>>>>>>     20:33:26.233010. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>>     [2018-09-02 21:03:26.588316] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-1: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 21:33:26.912334] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x1fbb sent = 2018-09-02
>>>>>>>>     21:03:26.588456. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-02 21:33:26.912449] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 22:03:37.258915] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-1: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x32c5 sent = 2018-09-02
>>>>>>>>     21:33:32.091009. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>>     [2018-09-02 22:03:37.259000] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-1: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 22:33:37.615497] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x2039 sent = 2018-09-02
>>>>>>>>     22:03:37.259147. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-02 22:33:37.615574] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 23:03:37.940969] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-1: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x3386 sent = 2018-09-02
>>>>>>>>     22:33:37.629655. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>>     [2018-09-02 23:03:37.941049] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-1: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-02 23:33:38.270998] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x20b5 sent = 2018-09-02
>>>>>>>>     23:03:37.941199. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-02 23:33:38.271078] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-03 00:03:38.607186] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-1: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x3447 sent = 2018-09-02
>>>>>>>>     23:33:38.285899. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>>     [2018-09-03 00:03:38.607263] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-1: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-03 00:33:38.934385] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x2131 sent = 2018-09-03
>>>>>>>>     00:03:38.607410. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-03 00:33:38.934479] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-03 01:03:39.256842] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-1: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x350c sent = 2018-09-03
>>>>>>>>     00:33:38.948570. timeout = 1800 for <ip of second gluster node>:49154
>>>>>>>>     [2018-09-03 01:03:39.256972] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-1: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
>>>>>>>>     [2018-09-03 01:33:39.614402] E [rpc-clnt.c:184:call_bail]
>>>>>>>>     0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
>>>>>>>>     v1) op(INODELK(29)) xid = 0x21ae sent = 2018-09-03
>>>>>>>>     01:03:39.258166. timeout = 1800 for <ip of third gluster node>:49154
>>>>>>>>     [2018-09-03 01:33:39.614483] E [MSGID: 114031]
>>>>>>>>     [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk]
>>>>>>>>     0-staging_static-client-2: remote operation failed [Transport
>>>>>>>>     endpoint is not connected]
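To check that each of the call_bail entries above fires right at the printed timeout, a small parser can compare the log timestamp with the `sent` timestamp. The following is only a sketch in Python: `parse_bails` and `SAMPLE` are names invented for this example, the regex is written against the log lines as they appear in this mail (other GlusterFS versions may format them differently), and the quote markers must be stripped before feeding text in.

```python
import re
from datetime import datetime

# One call_bail entry, after the mail's wrapped lines have been re-joined.
BAIL_RE = re.compile(
    r"\[(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] E "
    r"\[rpc-clnt\.c:\d+:call_bail\] (?P<client>\S+): bailing out frame "
    r".*?op\((?P<op>\w+)\(\d+\)\) xid = (?P<xid>0x[0-9a-f]+) "
    r"sent = (?P<sent>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\. "
    r"timeout = (?P<timeout>\d+)"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# First entry from the client log above, wrapped as in the mail.
SAMPLE = """\
[2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail]
0-staging_static-client-2: bailing out frame type(GlusterFS 4.x
v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02
15:03:22.417773. timeout = 1800 for <ip of third gluster node>:49154
"""


def parse_bails(log_text):
    """Return one dict per call_bail entry found in log_text."""
    joined = " ".join(log_text.split())  # undo the mail's line wrapping
    entries = []
    for m in BAIL_RE.finditer(joined):
        logged = datetime.strptime(m.group("logged"), TS_FORMAT)
        sent = datetime.strptime(m.group("sent"), TS_FORMAT)
        entries.append({
            "client": m.group("client"),
            "op": m.group("op"),
            "xid": m.group("xid"),
            "timeout": int(m.group("timeout")),
            # seconds between sending the request and giving up on it
            "elapsed": (logged - sent).total_seconds(),
        })
    return entries
```

For the sample entry the elapsed time comes out a fraction of a second over the 1800-second timeout, which is consistent with the INODELK request simply never being answered rather than failing early.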
>>>>>>>>
>>>>>>>>
>>>>>>>>     *On the second gluster server:*
>>>>>>>>
>>>>>>>>
>>>>>>>>     We are seeing the following error in the glusterd.log file when
>>>>>>>>     the client is blocked from writing to the volume. I think this is
>>>>>>>>     probably the most important information about the error; it
>>>>>>>>     suggests a problem with the first node but doesn't explain the
>>>>>>>>     client behaviour:
>>>>>>>>
>>>>>>>>     [2018-09-02 08:31:03.902272] E [MSGID: 106115]
>>>>>>>>     [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management:
>>>>>>>>     Unlocking failed on <FQDN of the first gluster node>. Please
>>>>>>>>     check log file for details.
>>>>>>>>     [2018-09-02 08:31:03.902477] E [MSGID: 106151]
>>>>>>>>     [glusterd-syncop.c:1640:gd_unlock_op_phase] 0-management: Failed
>>>>>>>>     to unlock on some peer(s)
>>>>>>>>
>>>>>>>>     Note in the above error:
>>>>>>>>
>>>>>>>>     1. I'm not sure which log to check (there doesn't seem to be a
>>>>>>>>     management brick / brick log)?
>>>>>>>>     2. If there's a problem with the first node, why isn't it
>>>>>>>>     rejected from the gluster / taken offline / the health of the
>>>>>>>>     peers or volume list degraded?
>>>>>>>>     3. Why does the client fail to write to the volume rather than
>>>>>>>>     (I'm assuming) trying the second (or third I guess) node to write
>>>>>>>>     to the volume?
>>>>>>>>
>>>>>>>>
>>>>>>>>     We are also seeing the following errors repeated a lot in the
>>>>>>>>     brick log
>>>>>>>>     (/var/log/glusterfs/bricks/mnt-gluster-storage-staging_static.log),
>>>>>>>>     both when the volumes are working and when there's an issue:
>>>>>>>>
>>>>>>>>     [2018-09-03 01:58:35.128923] E [server.c:137:server_submit_reply]
>>>>>>>>     (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14)
>>>>>>>>     [0x7f8470319d14]
>>>>>>>>     -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a)
>>>>>>>>     [0x7f846bdde24a]
>>>>>>>>     -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce)
>>>>>>>>     [0x7f846bd89fce] ) 0-: Reply submission failed
>>>>>>>>     [2018-09-03 01:58:35.128957] E
>>>>>>>>     [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to
>>>>>>>>     submit message (XID: 0x3d60, Program: GlusterFS 4.x v1, ProgVers:
>>>>>>>>     400, Proc: 29) to rpc-transport (tcp.staging_static-server)
>>>>>>>>     [2018-09-03 01:58:35.128983] E [server.c:137:server_submit_reply]
>>>>>>>>     (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14)
>>>>>>>>     [0x7f8470319d14]
>>>>>>>>     -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a)
>>>>>>>>     [0x7f846bdde24a]
>>>>>>>>     -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce)
>>>>>>>>     [0x7f846bd89fce] ) 0-: Reply submission failed
>>>>>>>>     [2018-09-03 01:58:35.129016] E
>>>>>>>>     [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to
>>>>>>>>     submit message (XID: 0x3e2a, Program: GlusterFS 4.x v1, ProgVers:
>>>>>>>>     400, Proc: 29) to rpc-transport (tcp.staging_static-server)
>>>>>>>>     [2018-09-03 01:58:35.129042] E [server.c:137:server_submit_reply]
>>>>>>>>     (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14)
>>>>>>>>     [0x7f8470319d14]
>>>>>>>>     -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a)
>>>>>>>>     [0x7f846bdde24a]
>>>>>>>>     -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce)
>>>>>>>>     [0x7f846bd89fce] ) 0-: Reply submission failed
>>>>>>>>     [2018-09-03 01:58:35.129077] E
>>>>>>>>     [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to
>>>>>>>>     submit message (XID: 0x3ef6, Program: GlusterFS 4.x v1, ProgVers:
>>>>>>>>     400, Proc: 29) to rpc-transport (tcp.staging_static-server)
>>>>>>>>     [2018-09-03 01:58:35.129149] E [server.c:137:server_submit_reply]
>>>>>>>>     (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14)
>>>>>>>>     [0x7f8470319d14]
>>>>>>>>     -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a)
>>>>>>>>     [0x7f846bdde24a]
>>>>>>>>     -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce)
>>>>>>>>     [0x7f846bd89fce] ) 0-: Reply submission failed
>>>>>>>>     [2018-09-03 01:58:35.129191] E
>>>>>>>>     [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to
>>>>>>>>     submit message (XID: 0x3fc6, Program: GlusterFS 4.x v1, ProgVers:
>>>>>>>>     400, Proc: 29) to rpc-transport (tcp.staging_static-server)
>>>>>>>>     [2018-09-03 01:58:35.129219] E [server.c:137:server_submit_reply]
>>>>>>>>     (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14)
>>>>>>>>     [0x7f8470319d14]
>>>>>>>>     -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a)
>>>>>>>>     [0x7f846bdde24a]
>>>>>>>>     -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce)
>>>>>>>>     [0x7f846bd89fce] ) 0-: Reply submission failed
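The `Proc: 29` in these brick-log lines is the same procedure number that the client prints as `op(INODELK(29))` for the same program (GlusterFS 4.x v1), so the failed replies look like lock replies the brick could not deliver. A rough Python sketch for tallying such failures per program and procedure follows. `count_failed_replies` and `SAMPLE` are hypothetical names, and the pattern is derived only from the lines quoted above.

```python
import re
from collections import Counter

# Matches the rpcsvc_submit_generic message once wrapped lines are joined.
SUBMIT_RE = re.compile(
    r"failed to submit message \(XID: (0x[0-9a-f]+), "
    r"Program: ([^,]+), ProgVers: (\d+), Proc: (\d+)\)"
)

# Two entries from the brick log above, wrapped as in the mail.
SAMPLE = """\
[2018-09-03 01:58:35.128957] E
[rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to
submit message (XID: 0x3d60, Program: GlusterFS 4.x v1, ProgVers:
400, Proc: 29) to rpc-transport (tcp.staging_static-server)
[2018-09-03 01:58:35.129016] E
[rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to
submit message (XID: 0x3e2a, Program: GlusterFS 4.x v1, ProgVers:
400, Proc: 29) to rpc-transport (tcp.staging_static-server)
"""


def count_failed_replies(log_text):
    """Count failed reply submissions, keyed by (program, procedure number)."""
    joined = " ".join(log_text.split())  # undo the mail's line wrapping
    counts = Counter()
    for _xid, program, _vers, proc in SUBMIT_RE.findall(joined):
        counts[(program, int(proc))] += 1
    return counts
```

If almost every failure lands on procedure 29, that points at lock traffic specifically rather than at general RPC trouble on the brick.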
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>     *Gluster volume information:*
>>>>>>>>
>>>>>>>>
>>>>>>>>     # gluster volume info staging_static
>>>>>>>>
>>>>>>>>     Volume Name: staging_static
>>>>>>>>     Type: Replicate
>>>>>>>>     Volume ID: 7f3b8e91-afea-4fc6-be83-3399a089b6f3
>>>>>>>>     Status: Started
>>>>>>>>     Snapshot Count: 0
>>>>>>>>     Number of Bricks: 1 x (2 + 1) = 3
>>>>>>>>     Transport-type: tcp
>>>>>>>>     Bricks:
>>>>>>>>     Brick1: <first gluster node.fqdn>:/mnt/gluster-storage/staging_static
>>>>>>>>     Brick2: <second gluster node.fqdn>:/mnt/gluster-storage/staging_static
>>>>>>>>     Brick3: <third gluster node.fqdn>:/mnt/gluster-storage/staging_static (arbiter)
>>>>>>>>     Options Reconfigured:
>>>>>>>>     storage.fips-mode-rchecksum: true
>>>>>>>>     cluster.self-heal-window-size: 16
>>>>>>>>     cluster.shd-wait-qlength: 4096
>>>>>>>>     cluster.shd-max-threads: 8
>>>>>>>>     performance.cache-min-file-size: 2KB
>>>>>>>>     performance.rda-cache-limit: 1GB
>>>>>>>>     network.inode-lru-limit: 50000
>>>>>>>>     server.outstanding-rpc-limit: 256
>>>>>>>>     transport.listen-backlog: 2048
>>>>>>>>     performance.write-behind-window-size: 512MB
>>>>>>>>     performance.stat-prefetch: true
>>>>>>>>     performance.io-thread-count: 16
>>>>>>>>     performance.client-io-threads: true
>>>>>>>>     performance.cache-size: 1GB
>>>>>>>>     performance.cache-refresh-timeout: 60
>>>>>>>>     performance.cache-invalidation: true
>>>>>>>>     cluster.use-compound-fops: true
>>>>>>>>     cluster.readdir-optimize: true
>>>>>>>>     cluster.lookup-optimize: true
>>>>>>>>     cluster.favorite-child-policy: size
>>>>>>>>     cluster.eager-lock: true
>>>>>>>>     client.event-threads: 4
>>>>>>>>     nfs.disable: on
>>>>>>>>     transport.address-family: inet
>>>>>>>>     diagnostics.brick-log-level: ERROR
>>>>>>>>     diagnostics.client-log-level: ERROR
>>>>>>>>     features.cache-invalidation-timeout: 300
>>>>>>>>     features.cache-invalidation: true
>>>>>>>>     network.ping-timeout: 15
>>>>>>>>     performance.cache-max-file-size: 3MB
>>>>>>>>     performance.md-cache-timeout: 300
>>>>>>>>     server.event-threads: 4
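Since the thread later hinges on whether cluster.data-self-heal still shows up among the reconfigured options, here is a small Python sketch that turns the "Options Reconfigured" part of `gluster volume info` output into a dictionary. `reconfigured_options` and `SAMPLE` are illustrative names only, and the parsing assumes the plain `key: value` layout shown above.

```python
# A shortened copy of the volume info output above.
SAMPLE = """\
Volume Name: staging_static
Type: Replicate
Status: Started
Options Reconfigured:
storage.fips-mode-rchecksum: true
cluster.eager-lock: true
network.ping-timeout: 15
server.event-threads: 4
"""


def reconfigured_options(volume_info):
    """Return {option: value} for lines after 'Options Reconfigured:'."""
    options = {}
    in_options = False
    for raw in volume_info.splitlines():
        line = raw.strip()
        if line == "Options Reconfigured:":
            in_options = True
            continue
        if in_options and ": " in line:
            key, value = line.split(": ", 1)
            options[key] = value
    return options
```

With the result in hand, `"cluster.data-self-heal" in reconfigured_options(text)` tells you whether the option is still explicitly set on the volume.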
>>>>>>>>
>>>>>>>>     Thanks in advance,
>>>>>>>>
>>>>>>>>
>>>>>>>>     --
>>>>>>>>     Sam McLeod (protoporpoise on IRC)
>>>>>>>>     https://smcleod.net
>>>>>>>>     https://twitter.com/s_mcleod
>>>>>>>>
>>>>>>>>     Words are my own opinions and do not necessarily represent those
>>>>>>>>     of my employer or partners.
>>>>>>>>
>>>>>>>>     _______________________________________________
>>>>>>>>     Gluster-users mailing list
>>>>>>>>     Gluster-users at gluster.org
>>>>>>>>     https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>
>>>>>>> -- 
>>>>>>> Amar Tumballi (amarts)
>>>>>>>
