[Gluster-users] Remove Brick Rebalance Hangs With No Activity

Strahil hunter86_bg at yahoo.com
Sat Oct 26 09:21:49 UTC 2019


According to the logs, there is some communication problem.

Check that glusterd is running everywhere and that every brick process has a PID and a port ('gluster volume status' should point out any issues).
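
For example (assuming systemd, and substituting your volume name, which looks like 'scratch' from your logs):

    # On every node: is the management daemon running?
    systemctl status glusterd

    # Are all peers connected?
    gluster peer status

    # Every brick should show Online 'Y' with a PID and a port
    gluster volume status scratch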

Best Regards,
Strahil Nikolov

On Oct 26, 2019 06:25, Timothy Orme <torme at ancestry.com> wrote:
>
> It looks like this does eventually fail. I'm at a bit of a loss as to what to do here... At this point I'm unable to remove any nodes from the cluster. Any help is greatly appreciated!
>
> Here's the log from one of the nodes:
>
> [2019-10-26 01:54:35.912284] E [rpc-clnt.c:183:call_bail] 0-scratch-client-4: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x38, unique = 0, sent = 2019-10-26 01:24:35.787361, timeout = 1800 for 10.158.10.2:49152
> [2019-10-26 01:54:35.912304] E [MSGID: 114031] [client-rpc-fops_v2.c:1345:client4_0_inodelk_cbk] 0-scratch-client-4: remote operation failed [Transport endpoint is not connected]
> [2019-10-26 02:04:35.000560] I [MSGID: 0] [dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: scratch-replicate-0,cnt = 1076350152704
> [2019-10-26 02:04:35.000589] I [MSGID: 0] [dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size files = 1076350152704
> [2019-10-26 02:04:35.000595] I [dht-rebalance.c:4355:dht_file_counter_thread] 0-dht: tmp data size =1076350152704
> [2019-10-26 02:14:35.000669] I [MSGID: 0] [dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: scratch-replicate-0,cnt = 1076350152704
> [2019-10-26 02:14:35.000697] I [MSGID: 0] [dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size files = 1076350152704
> [2019-10-26 02:14:35.000703] I [dht-rebalance.c:4355:dht_file_counter_thread] 0-dht: tmp data size =1076350152704
> [2019-10-26 02:24:35.000682] I [MSGID: 0] [dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: scratch-replicate-0,cnt = 1076350152704
> [2019-10-26 02:24:35.000712] I [MSGID: 0] [dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size files = 1076350152704
> [2019-10-26 02:24:35.000718] I [dht-rebalance.c:4355:dht_file_counter_thread] 0-dht: tmp data size =1076350152704
> [2019-10-26 02:24:35.867168] C [rpc-clnt.c:437:rpc_clnt_fill_request_info] 0-scratch-client-3: cannot lookup the saved frame corresponding to xid (55)
> [2019-10-26 02:24:35.867505] W [socket.c:2183:__socket_read_reply] 0-scratch-client-3: notify for event MAP_XID failed for 10.158.10.1:49152
> [2019-10-26 02:24:35.867530] I [MSGID: 114018] [client.c:2323:client_rpc_notify] 0-scratch-client-3: disconnected from scratch-client-3. Client process will keep trying to connect to glusterd until brick's port is available
> [2019-10-26 02:24:35.867641] C [rpc-clnt.c:437:rpc_clnt_fill_request_info] 0-scratch-client-4: cannot lookup the saved frame corresponding to xid (56)
> [2019-10-26 02:24:35.867657] W [socket.c:2183:__socket_read_reply] 0-scratch-client-4: notify for event MAP_XID failed for 10.158.10.2:49152
> [2019-10-26 02:24:35.867670] I [MSGID: 114018] [client.c:2323:client_rpc_notify] 0-scratch-client-4: disconnected from scratch-client-4. Client process will keep trying to connect to glusterd until brick's port is available
> [2019-10-26 02:24:35.867679] W [MSGID: 108001] [afr-common.c:5608:afr_notify] 0-scratch-replicate-0: Client-quorum is not met
> [2019-10-26 02:24:35.868083] E [MSGID: 109119] [dht-lock.c:1084:dht_blocking_inodelk_cbk] 0-scratch-dht: inodelk failed on subvol scratch-replicate-0, gfid:be318638-e8a0-4c6d-977d-7a937aa84806 [Transport endpoint is not connected]
> [2019-10-26 02:24:35.868151] E [MSGID: 109016] [dht-rebalance.c:3932:gf_defrag_fix_layout] 0-scratch-dht: Setxattr failed for /.shard [Transport endpoint is not connected]
> [2019-10-26 02:24:35.868904] E [MSGID: 109016] [dht-rebalance.c:3898:gf_defrag_fix_layout] 0-scratch-dht: Fix layout failed for /.shard
> [2019-10-26 02:24:35.870516] I [MSGID: 109028] [dht-rebalance.c:5047:gf_defrag_status_get] 0-scratch-dht: Rebalance is failed. Time taken is 5401.00 secs
> [2019-10-26 02:24:35.870531] I [MSGID: 109028] [dht-rebalance.c:5053:gf_defrag_status_get] 0-scratch-dht: Files migrated: 0, size: 0, lookups: 0, failures: 3, skipped: 0
> [2019-10-26 02:24:35.871330] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x754b) [0x7febd4c9154b] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xfd) [0x55ec1a066b9d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55ec1a0669e4] ) 0-: received signum (15), shutting down
>
> Thanks!
> Tim
>
>
> ________________________________
> From: Timothy Orme
> Sent: Friday, October 25, 2019 11:51 AM
> To: gluster-users <gluster-users at gluster.org>
> Subject: Remove Brick Rebalance Hangs With No Activity
>  
> Hello All,
>
> I'm trying to remove a set of bricks from our cluster.  I've done this operation a few times now with success, but on one set of bricks, the operation starts and never seems to progress.  It just sits here:
>
>                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
>                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
>             ip-10-158-10-1.ec2.internal                0        0Bytes             0             0             0          in progress        0:22:35
>             ip-10-158-10-2.ec2.internal                0        0Bytes             0             0             0          in progress        0:22:35
>             ip-10-158-10-3.ec2.internal                0        0Bytes             0             0             0          in progress        0:22:35
> Rebalance estimated time unavailable. Please try again later.
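>
> (For reference, that's the output of the status sub-command, something like the following, where the host:brick arguments stand in for our actual brick paths:)
>
>     gluster volume remove-brick scratch <host>:<brick-path> ... status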
>
> The rebalance logs on the server don't seem to indicate any issues.  I see no error statements or anything.  The servers themselves also seem very idle.  CPU and network activity are stuck near 0, whereas during other removals they would spike almost immediately.
>
> There's almost no activity in the log either.  The only thing that I've seen is a message like:
>
> [2019-10-25 18:42:21.000753] I [MSGID: 0] [dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: scratch-replicate-2,cnt = 596361801728
> [2019-10-25 18:42:21.000799] I [MSGID: 0] [dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size files = 596361801728
> [2019-10-25 18:42:21.000808] I [dht-rebalance.c:4355:dht_file_counter_thread] 0-dht: tmp data size =596361801728
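>
> (I'm tailing the rebalance log with something like the following, assuming the default log location:)
>
>     tail -f /var/log/glusterfs/scratch-rebalance.log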
>
> Any idea what might be happening?
>
> Thanks,
> Tim
>