<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
That's what I thought as well. All instances seem to be alive and responding according to the volume status. I was also able to run a `rebalance fix-layout` without any issues, so communication between the nodes seems to be OK. I also tried replacing
the 10.158.10.1 brick with an entirely new server, since that seemed to be the common one in the logs. Self-heal ran just fine in that replica set. However, the removal still just hangs when I then try to remove those bricks.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I might try a full rebalance as well, just to verify that it works.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
The only other thing I can think to note is that I'm using SSL for both client and server, and maybe that's masking a more important error message, but that would still seem odd given that other communication between the nodes is just fine.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Any other suggestions for things to try, or other log locations to check on?</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Thanks,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Tim<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Strahil <hunter86_bg@yahoo.com><br>
<b>Sent:</b> Saturday, October 26, 2019 2:21 AM<br>
<b>To:</b> Timothy Orme <torme@ancestry.com>; gluster-users <gluster-users@gluster.org><br>
<b>Subject:</b> [EXTERNAL] Re: [Gluster-users] Remove Brick Rebalance Hangs With No Activity</font>
<div> </div>
</div>
<div>
<p dir="ltr">According to the logs, there is some communication problem.</p>
<p dir="ltr">Check that glusterd is running everywhere and that every brick process has a PID and port (gluster volume status should point out any issues).</p>
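<p dir="ltr">A rough sketch of that check in Python, assuming the usual column layout of the status output (name, TCP port, RDMA port, Online, Pid). The function name, brick paths and PIDs below are hypothetical, and rows that wrap onto a second line because of long brick names are skipped and need checking by hand.</p>
<pre>
```python
def bricks_missing_pid_or_port(status_text):
    """Return brick names whose row shows no TCP port, no PID, or Online != Y."""
    problems = []
    for line in status_text.splitlines():
        if not line.startswith("Brick "):
            continue
        # Assumed column order: "Brick", name, TCP port, RDMA port, Online, Pid.
        parts = line.split()
        if len(parts) != 6:
            continue  # wrapped or unexpected row; inspect it by hand
        _, name, tcp_port, _rdma_port, online, pid = parts
        if tcp_port == "N/A" or pid == "N/A" or online != "Y":
            problems.append(name)
    return problems


# Hypothetical sample; brick paths and PIDs are made up for illustration.
SAMPLE = """\
Status of volume: scratch
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.158.10.1:/data/brick               49152     0          Y       2301
Brick 10.158.10.2:/data/brick               N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       2410
"""

print(bricks_missing_pid_or_port(SAMPLE))
```
</pre>
<p dir="ltr">On the made-up sample this reports only the 10.158.10.2 brick; an empty list would mean every brick row shows a port and a PID.</p>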
<p dir="ltr">Best Regards,<br>
Strahil Nikolov</p>
<div class="x_quote">On Oct 26, 2019 06:25, Timothy Orme <torme@ancestry.com> wrote:<br type="attribution">
<blockquote class="x_quote" style="margin:0 0 0 .8ex; border-left:1px #ccc solid; padding-left:1ex">
<div dir="ltr">
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
It looks like this does eventually fail. I'm at a bit of a loss as to what to do here. At this point I'm unable to remove any nodes from the cluster. Any help is greatly appreciated!<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
Here's the log from one of the nodes:<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
</div>
[2019-10-26 01:54:35.912284] E [rpc-clnt.c:183:call_bail] 0-scratch-client-4: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x38, unique = 0, sent = 2019-10-26 01:24:35.787361, timeout = 1800 for 10.158.10.2:49152<br>
<div>[2019-10-26 01:54:35.912304] E [MSGID: 114031] [client-rpc-fops_v2.c:1345:client4_0_inodelk_cbk] 0-scratch-client-4: remote operation failed [Transport endpoint is not connected]<br>
</div>
<div>[2019-10-26 02:04:35.000560] I [MSGID: 0] [dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: scratch-replicate-0,cnt = 1076350152704<br>
</div>
<div>[2019-10-26 02:04:35.000589] I [MSGID: 0] [dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size files = 1076350152704<br>
</div>
<div>[2019-10-26 02:04:35.000595] I [dht-rebalance.c:4355:dht_file_counter_thread] 0-dht: tmp data size =1076350152704<br>
</div>
<div>[2019-10-26 02:14:35.000669] I [MSGID: 0] [dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: scratch-replicate-0,cnt = 1076350152704<br>
</div>
<div>[2019-10-26 02:14:35.000697] I [MSGID: 0] [dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size files = 1076350152704<br>
</div>
<div>[2019-10-26 02:14:35.000703] I [dht-rebalance.c:4355:dht_file_counter_thread] 0-dht: tmp data size =1076350152704<br>
</div>
<div>[2019-10-26 02:24:35.000682] I [MSGID: 0] [dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: scratch-replicate-0,cnt = 1076350152704<br>
</div>
<div>[2019-10-26 02:24:35.000712] I [MSGID: 0] [dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size files = 1076350152704<br>
</div>
<div>[2019-10-26 02:24:35.000718] I [dht-rebalance.c:4355:dht_file_counter_thread] 0-dht: tmp data size =1076350152704<br>
</div>
<div>[2019-10-26 02:24:35.867168] C [rpc-clnt.c:437:rpc_clnt_fill_request_info] 0-scratch-client-3: cannot lookup the saved frame corresponding to xid (55)<br>
</div>
<div>[2019-10-26 02:24:35.867505] W [socket.c:2183:__socket_read_reply] 0-scratch-client-3: notify for event MAP_XID failed for 10.158.10.1:49152<br>
</div>
<div>[2019-10-26 02:24:35.867530] I [MSGID: 114018] [client.c:2323:client_rpc_notify] 0-scratch-client-3: disconnected from scratch-client-3. Client process will keep trying to connect to glusterd until brick's port is available<br>
</div>
<div>[2019-10-26 02:24:35.867641] C [rpc-clnt.c:437:rpc_clnt_fill_request_info] 0-scratch-client-4: cannot lookup the saved frame corresponding to xid (56)<br>
</div>
<div>[2019-10-26 02:24:35.867657] W [socket.c:2183:__socket_read_reply] 0-scratch-client-4: notify for event MAP_XID failed for 10.158.10.2:49152<br>
</div>
<div>[2019-10-26 02:24:35.867670] I [MSGID: 114018] [client.c:2323:client_rpc_notify] 0-scratch-client-4: disconnected from scratch-client-4. Client process will keep trying to connect to glusterd until brick's port is available<br>
</div>
<div>[2019-10-26 02:24:35.867679] W [MSGID: 108001] [afr-common.c:5608:afr_notify] 0-scratch-replicate-0: Client-quorum is not met<br>
</div>
<div>[2019-10-26 02:24:35.868083] E [MSGID: 109119] [dht-lock.c:1084:dht_blocking_inodelk_cbk] 0-scratch-dht: inodelk failed on subvol scratch-replicate-0, gfid:be318638-e8a0-4c6d-977d-7a937aa84806 [Transport endpoint is not connected]<br>
</div>
<div>[2019-10-26 02:24:35.868151] E [MSGID: 109016] [dht-rebalance.c:3932:gf_defrag_fix_layout] 0-scratch-dht: Setxattr failed for /.shard [Transport endpoint is not connected]<br>
</div>
<div>[2019-10-26 02:24:35.868904] E [MSGID: 109016] [dht-rebalance.c:3898:gf_defrag_fix_layout] 0-scratch-dht: Fix layout failed for /.shard<br>
</div>
<div>[2019-10-26 02:24:35.870516] I [MSGID: 109028] [dht-rebalance.c:5047:gf_defrag_status_get] 0-scratch-dht: Rebalance is failed. Time taken is 5401.00 secs<br>
</div>
<div>[2019-10-26 02:24:35.870531] I [MSGID: 109028] [dht-rebalance.c:5053:gf_defrag_status_get] 0-scratch-dht: Files migrated: 0, size: 0, lookups: 0, failures: 3, skipped: 0<br>
</div>
<div>[2019-10-26 02:24:35.871330] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x754b) [0x7febd4c9154b] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xfd) [0x55ec1a066b9d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55ec1a0669e4] ) 0-: received signum (15), shutting down</div>
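<div>For anyone skimming the log above, here is a small Python sketch that pulls out the non-informational lines and checks the call_bail timing. It assumes every line starts with a bracketed timestamp followed by a one-letter level; the helper names are hypothetical. The <code>sent</code> time plus the 1800 second timeout lands on the second at which the bail-out was logged, which suggests the INODELK request sat unanswered for the full 30 minutes.</div>
<pre>
```python
import re
from datetime import datetime, timedelta

# Assumed line shape: "[YYYY-MM-DD HH:MM:SS.ffffff] L ..." where L is the level.
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] ([A-Z]) ")
BAIL_RE = re.compile(
    r"sent = (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+), timeout = (\d+)"
)
FMT = "%Y-%m-%d %H:%M:%S.%f"


def non_info_lines(log_text):
    """Return (timestamp, level) for every line whose level is not I."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group(2) != "I":
            hits.append((m.group(1), m.group(2)))
    return hits


def bail_deadline(line):
    """Return sent + timeout for a call_bail line, or None if it is not one."""
    m = BAIL_RE.search(line)
    if m is None:
        return None
    sent = datetime.strptime(m.group(1), FMT)
    return sent + timedelta(seconds=int(m.group(2)))


# Three lines copied from the log in this message.
LOG = """\
[2019-10-26 01:54:35.912284] E [rpc-clnt.c:183:call_bail] 0-scratch-client-4: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x38, unique = 0, sent = 2019-10-26 01:24:35.787361, timeout = 1800 for 10.158.10.2:49152
[2019-10-26 02:04:35.000595] I [dht-rebalance.c:4355:dht_file_counter_thread] 0-dht: tmp data size =1076350152704
[2019-10-26 02:24:35.867679] W [MSGID: 108001] [afr-common.c:5608:afr_notify] 0-scratch-replicate-0: Client-quorum is not met
"""

print(non_info_lines(LOG))
print(bail_deadline(LOG.splitlines()[0]))
```
</pre>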
<div><br>
</div>
<div>Thanks!<br>
Tim<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<div><br>
</div>
<br>
</div>
<div></div>
<hr style="display:inline-block; width:98%">
<div dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Timothy Orme<br>
<b>Sent:</b> Friday, October 25, 2019 11:51 AM<br>
<b>To:</b> gluster-users <gluster-users@gluster.org><br>
<b>Subject:</b> Remove Brick Rebalance Hangs With No Activity</font>
<div> </div>
</div>
<div dir="ltr">
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
Hello All,</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
I'm trying to remove a set of bricks from our cluster. I've done this operation a few times now with success, but on one set of bricks, the operation starts and seems to never progress. It just sits here:</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<pre>
Node                         Rebalanced-files  size         scanned      failures     skipped      status        run time in h:m:s
---------                    -----------       -----------  -----------  -----------  -----------  ------------  --------------
ip-10-158-10-1.ec2.internal  0                 0Bytes       0            0            0            in progress   0:22:35
ip-10-158-10-2.ec2.internal  0                 0Bytes       0            0            0            in progress   0:22:35
ip-10-158-10-3.ec2.internal  0                 0Bytes       0            0            0            in progress   0:22:35
</pre>
<div>Rebalance estimated time unavailable. Please try again later.<br>
</div>
</div>
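<div>To watch for this condition without eyeballing the table, something like the Python sketch below could flag nodes that are still "in progress" but have scanned nothing. It assumes the eight-column layout shown above, with a status of one or two words; the function name is hypothetical.</div>
<pre>
```python
def stalled_nodes(status_text):
    """Return nodes whose row is 'in progress' with a scanned count of 0."""
    stalled = []
    for line in status_text.splitlines():
        parts = line.split()
        # Data rows: node, files, size, scanned, failures, skipped,
        # status (one or two words), run time. Header and rule lines are skipped
        # because their second field is not a number.
        if len(parts) not in (8, 9) or not parts[1].isdigit():
            continue
        node = parts[0]
        scanned = int(parts[3])
        status = " ".join(parts[6:-1])
        if status == "in progress" and scanned == 0:
            stalled.append(node)
    return stalled


# Status output as posted in this message.
STATUS = """\
Node Rebalanced-files size scanned failures skipped status run time in h:m:s
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
ip-10-158-10-1.ec2.internal 0 0Bytes 0 0 0 in progress 0:22:35
ip-10-158-10-2.ec2.internal 0 0Bytes 0 0 0 in progress 0:22:35
ip-10-158-10-3.ec2.internal 0 0Bytes 0 0 0 in progress 0:22:35
Rebalance estimated time unavailable. Please try again later.
"""

print(stalled_nodes(STATUS))
```
</pre>
<div>A node that has only just started would also show zero scanned files, so the result only means something once the run time is well past the point where earlier removals had begun scanning.</div>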
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
The rebalance logs on the server don't seem to indicate any issues; I see no error statements at all. The servers themselves also seem very idle. CPU and network activity are stuck near 0, whereas during other removals they would spike almost immediately.</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
There's almost no activity in the log either. The only thing that I've seen is a message like:</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
[2019-10-25 18:42:21.000753] I [MSGID: 0] [dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: scratch-replicate-2,cnt = 596361801728<br>
<div>[2019-10-25 18:42:21.000799] I [MSGID: 0] [dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size files = 596361801728<br>
</div>
<div>[2019-10-25 18:42:21.000808] I [dht-rebalance.c:4355:dht_file_counter_thread] 0-dht: tmp data size =596361801728<br>
</div>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
Any idea what might be happening?</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
Thanks,</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
Tim<br>
</div>
<div style="font-family:'calibri' ,'arial' ,'helvetica' ,sans-serif; font-size:12pt; color:rgb(0 ,0 ,0)">
<br>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</body>
</html>