[Gluster-users] Production Volume will not start

Atin Mukherjee amukherj at redhat.com
Mon Dec 18 07:26:15 UTC 2017


On Sat, Dec 16, 2017 at 12:45 AM, Matt Waymack <mwaymack at nsgdv.com> wrote:

> Hi all,
>
>
>
> I have an issue where our volume will not start from any node.  When
> attempting to start the volume it will eventually return:
>
> Error: Request timed out
>
>
>
> For some time after that, the volume is locked and we either have to wait
> or restart Gluster services.  In the glusterd.log, it shows the following:
>
>
>
> [2017-12-15 18:00:12.423478] I [glusterd-utils.c:5926:glusterd_brick_start]
> 0-management: starting a fresh brick process for brick /exp/b1/gv0
>
> [2017-12-15 18:03:12.673885] I [glusterd-locks.c:729:gd_mgmt_v3_unlock_timer_cbk]
> 0-management: In gd_mgmt_v3_unlock_timer_cbk
>
> [2017-12-15 18:06:34.304868] I [MSGID: 106499] [glusterd-handler.c:4303:__glusterd_handle_status_volume]
> 0-management: Received status volume req for volume gv0
>
> [2017-12-15 18:06:34.306603] E [MSGID: 106301] [glusterd-syncop.c:1353:gd_stage_op_phase]
> 0-management: Staging of operation 'Volume Status' failed on localhost :
> Volume gv0 is not started
>
> [2017-12-15 18:11:39.412700] I [glusterd-utils.c:5926:glusterd_brick_start]
> 0-management: starting a fresh brick process for brick /exp/b2/gv0
>
> [2017-12-15 18:11:42.405966] I [MSGID: 106143] [glusterd-pmap.c:280:pmap_registry_bind]
> 0-pmap: adding brick /exp/b2/gv0 on port 49153
>
> [2017-12-15 18:11:42.406415] I [rpc-clnt.c:1044:rpc_clnt_connection_init]
> 0-management: setting frame-timeout to 600
>
> [2017-12-15 18:11:42.406669] I [glusterd-utils.c:5926:glusterd_brick_start]
> 0-management: starting a fresh brick process for brick /exp/b3/gv0
>
> [2017-12-15 18:14:39.737192] I [glusterd-locks.c:729:gd_mgmt_v3_unlock_timer_cbk]
> 0-management: In gd_mgmt_v3_unlock_timer_cbk
>
> [2017-12-15 18:35:20.856849] I [MSGID: 106143] [glusterd-pmap.c:280:pmap_registry_bind]
> 0-pmap: adding brick /exp/b1/gv0 on port 49152
>
> [2017-12-15 18:35:20.857508] I [rpc-clnt.c:1044:rpc_clnt_connection_init]
> 0-management: setting frame-timeout to 600
>
> [2017-12-15 18:35:20.858277] I [glusterd-utils.c:5926:glusterd_brick_start]
> 0-management: starting a fresh brick process for brick /exp/b4/gv0
>
> [2017-12-15 18:46:07.953995] I [MSGID: 106143] [glusterd-pmap.c:280:pmap_registry_bind]
> 0-pmap: adding brick /exp/b3/gv0 on port 49154
>
> [2017-12-15 18:46:07.954432] I [rpc-clnt.c:1044:rpc_clnt_connection_init]
> 0-management: setting frame-timeout to 600
>
> [2017-12-15 18:46:07.971355] I [rpc-clnt.c:1044:rpc_clnt_connection_init]
> 0-snapd: setting frame-timeout to 600
>
> [2017-12-15 18:46:07.989392] I [rpc-clnt.c:1044:rpc_clnt_connection_init]
> 0-nfs: setting frame-timeout to 600
>
> [2017-12-15 18:46:07.989543] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop]
> 0-management: nfs already stopped
>
> [2017-12-15 18:46:07.989562] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop]
> 0-management: nfs service is stopped
>
> [2017-12-15 18:46:07.989575] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager]
> 0-management: nfs/server.so xlator is not installed
>
> [2017-12-15 18:46:07.989601] I [rpc-clnt.c:1044:rpc_clnt_connection_init]
> 0-glustershd: setting frame-timeout to 600
>
> [2017-12-15 18:46:08.003011] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop]
> 0-management: glustershd already stopped
>
> [2017-12-15 18:46:08.003039] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop]
> 0-management: glustershd service is stopped
>
> [2017-12-15 18:46:08.003079] I [MSGID: 106567] [glusterd-svc-mgmt.c:197:glusterd_svc_start]
> 0-management: Starting glustershd service
>
> [2017-12-15 18:46:09.005173] I [rpc-clnt.c:1044:rpc_clnt_connection_init]
> 0-quotad: setting frame-timeout to 600
>
> [2017-12-15 18:46:09.005569] I [rpc-clnt.c:1044:rpc_clnt_connection_init]
> 0-bitd: setting frame-timeout to 600
>
> [2017-12-15 18:46:09.005673] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop]
> 0-management: bitd already stopped
>
> [2017-12-15 18:46:09.005689] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop]
> 0-management: bitd service is stopped
>
> [2017-12-15 18:46:09.005712] I [rpc-clnt.c:1044:rpc_clnt_connection_init]
> 0-scrub: setting frame-timeout to 600
>
> [2017-12-15 18:46:09.005892] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop]
> 0-management: scrub already stopped
>
> [2017-12-15 18:46:09.005912] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop]
> 0-management: scrub service is stopped
>
> [2017-12-15 18:46:09.026559] I [socket.c:3672:socket_submit_reply]
> 0-socket.management: not connected (priv->connected = -1)
>
> [2017-12-15 18:46:09.026568] E [rpcsvc.c:1364:rpcsvc_submit_generic]
> 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc
> cli, ProgVers: 2, Proc: 27) to rpc-transport (socket.management)
>
> [2017-12-15 18:46:09.026582] E [MSGID: 106430] [glusterd-utils.c:568:glusterd_submit_reply]
> 0-glusterd: Reply submission failed
>
> [2017-12-15 18:56:17.962251] E [rpc-clnt.c:185:call_bail] 0-management:
> bailing out frame type(glusterd mgmt v3) op(--(4)) xid = 0x14 sent =
> 2017-12-15 18:46:09.005976. timeout = 600 for 10.17.100.208:24007
>

There's a call bail here, which means glusterd never received a callback
(cbk) response from nsgtpcfs02.corp.nsgdv.com.

I am guessing you have ended up with a duplicate peerinfo entry for
nsgtpcfs02.corp.nsgdv.com in the /var/lib/glusterd/peers folder on the node
where the CLI failed. Can you please share the output of "gluster peer
status" along with the content of "cat /var/lib/glusterd/peers/*" from all
the nodes?
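As a quick sketch of what to look for (assuming the stock layout, where each file under /var/lib/glusterd/peers carries a "uuid=" line), a UUID that appears in more than one peer file points at a duplicate peerinfo entry. The helper name and the overridable directory argument here are illustrative, not part of any Gluster tooling:

```shell
# Sketch: report duplicate peer UUIDs under glusterd's peers directory.
# Assumes each peer file starts with a "uuid=" line (stock glusterd layout).
find_duplicate_peer_uuids() {
    peers_dir="${1:-/var/lib/glusterd/peers}"
    # Collect every uuid= line across all peer files; any line that
    # survives `uniq -d` occurs in more than one file, i.e. a duplicate.
    grep -h '^uuid=' "$peers_dir"/* 2>/dev/null | sort | uniq -d
}
```

Run it on each node; any line of output names a UUID recorded in more than one peer file, which glusterd would then see as a duplicate peer.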

> [2017-12-15 18:56:17.962324] E [MSGID: 106116]
> [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors]
> 0-management: Commit failed on nsgtpcfs02.corp.nsgdv.com. Please check
> log file for details.
>
> [2017-12-15 18:56:17.962408] E [MSGID: 106123] [glusterd-mgmt.c:1677:glusterd_mgmt_v3_commit]
> 0-management: Commit failed on peers
>
> [2017-12-15 18:56:17.962656] E [MSGID: 106123] [glusterd-mgmt.c:2209:
> glusterd_mgmt_v3_initiate_all_phases] 0-management: Commit Op Failed
>
> [2017-12-15 18:56:17.964004] E [MSGID: 106116]
> [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Unlocking
> failed on nsgtpcfs02.corp.nsgdv.com. Please check log file for details.
>
> [2017-12-15 18:56:17.965184] E [MSGID: 106116]
> [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Unlocking
> failed on tpc-arbiter1-100617. Please check log file for details.
>
> [2017-12-15 18:56:17.965277] E [MSGID: 106118] [glusterd-mgmt.c:2087:
> glusterd_mgmt_v3_release_peer_locks] 0-management: Unlock failed on peers
>
> [2017-12-15 18:56:17.965372] W [glusterd-locks.c:843:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe5631)
> [0x7f48e44a1631] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe543e)
> [0x7f48e44a143e] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe4625)
> [0x7f48e44a0625] ) 0-management: Lock for vol gv0 not held
>
> [2017-12-15 18:56:17.965394] E [MSGID: 106118] [glusterd-locks.c:356:
> glusterd_mgmt_v3_unlock_entity] 0-management: Failed to release lock for
> vol gv0 on behalf of 711ffb0c-57b7-46ec-ba8d-185de969e6cc.
>
> [2017-12-15 18:56:17.965409] E [MSGID: 106147] [glusterd-locks.c:483:
> glusterd_multiple_mgmt_v3_unlock] 0-management: Unable to unlock all vol
>
> [2017-12-15 18:56:17.965424] E [MSGID: 106118] [glusterd-mgmt.c:2240:
> glusterd_mgmt_v3_initiate_all_phases] 0-management: Failed to release
> mgmt_v3 locks on localhost
>
> [2017-12-15 18:56:17.965469] I [socket.c:3672:socket_submit_reply]
> 0-socket.management: not connected (priv->connected = -1)
>
> [2017-12-15 18:56:17.965474] E [rpcsvc.c:1364:rpcsvc_submit_generic]
> 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc
> cli, ProgVers: 2, Proc: 8) to rpc-transport (socket.management)
>
> [2017-12-15 18:56:17.965486] E [MSGID: 106430] [glusterd-utils.c:568:glusterd_submit_reply]
> 0-glusterd: Reply submission failed
>
>
>
> This issue started after a gluster volume stop followed by a reboot of all
> nodes.  We also updated to the latest available in the CentOS repo and are
> at version 3.12.3.  I’m not sure where to look as the log doesn’t seem to
> show me anything other than it just not working.
>
>
>
> gluster peer status shows all peers connected across all nodes, firewall
> has all ports opened and was disabled for troubleshooting.  The volume is a
> distributed-replicated with arbiter for a total of 3 nodes.
>
>
>
> The volume is a production volume with over 120TB of data so I’d really
> like to not have to start over with the volume.  Anyone have any
> suggestions on where else to look?
>
>
>
> Thank you!
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>