[Gluster-users] How to diagnose volume rebalance failure?

Susant Palai spalai at redhat.com
Tue Dec 15 07:21:51 UTC 2015


Hi PuYun,
  We need to figure out some mechanism to get the huge log files. Until then, here is one thing I can think of that could be affecting performance.

Rebalance normally starts at the medium throttle level [performance-wise], which in your case means two migration threads, and those can hog your 2 cores. If you run rebalance again, run it in lazy mode. Here is the command:

"gluster v set <VOLUME-NAME> rebal-throttle lazy". This should spawn just one thread for migration.

For logs: can you grep for errors in the rebalance log file and upload the result (until we figure out a way to get the full logs)?
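
One way to do that filtering without loading the ~2GB file into memory is sketched below in Python. It is only a sketch under assumptions: it relies on the "[timestamp] LEVEL ..." line prefix visible in the excerpts in this thread, it assumes that E and C are the letters used for error and critical entries (the excerpts here only show I and W), and the function names are illustrative.

```python
import re
from typing import Iterable, Iterator

# Matches the "[YYYY-MM-DD HH:MM:SS.ffffff] <LEVEL> " prefix seen in the
# log excerpts in this thread; group 1 is the one-letter severity.
LEVEL_RE = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\] ([A-Z]) ")


def filter_levels(lines: Iterable[str], levels: str = "EC") -> Iterator[str]:
    """Yield only the lines whose severity letter is in `levels`.

    Assumption: "E" marks errors and "C" critical entries.
    """
    for line in lines:
        match = LEVEL_RE.match(line)
        if match and match.group(1) in levels:
            yield line


def filter_file(src_path: str, dst_path: str, levels: str = "EC") -> int:
    """Stream src_path line by line and write matching lines to dst_path.

    Returns the number of lines written. The file is never held in
    memory as a whole, so a multi-gigabyte log is fine.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in filter_levels(src, levels):
            dst.write(line)
            written += 1
    return written
```

For example, `filter_file("/var/log/glusterfs/FastVol-rebalance.log", "rebalance-errors.log", levels="EWC")` would also keep warnings. The log path here is assumed and may differ on your system.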

Thanks,
Susant

----- Original Message -----
From: "PuYun" <cloudor at 126.com>
To: "gluster-users" <gluster-users at gluster.org>
Sent: Tuesday, 15 December, 2015 5:51:00 AM
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?



Hi, 


Another odd piece of information that may be useful: the failed task had actually been running for hours, but the status command reports a run time of only 3621 secs. 


============== shell ============== 
[root@d001 glusterfs]# gluster volume rebalance FastVol status
     Node  Rebalanced-files         size      scanned     failures      skipped        status  run time in secs
---------       -----------  -----------  -----------  -----------  -----------  ------------    --------------
localhost                 0       0Bytes       952767            0            0        failed           3621.00
volume rebalance: FastVol: success: 

================================ 


As you can see, I started the rebalance task only once. 
======== cmd_history.log-20151215 ====== 
[2015-12-14 12:50:41.443937] : volume start FastVol : SUCCESS 
[2015-12-14 12:55:01.367519] : volume rebalance FastVol start : SUCCESS 
[2015-12-14 13:55:22.132199] : volume rebalance FastVol status : SUCCESS 
[2015-12-14 23:04:01.780885] : volume rebalance FastVol status : SUCCESS 
[2015-12-14 23:35:56.708077] : volume rebalance FastVol status : SUCCESS 

================================= 


Because the task failed at [2015-12-14 21:46:54.179xx], something might have gone wrong 3621 secs earlier, that is, at [2015-12-14 20:46:33.179xx]. I checked the logs around that time and found nothing special. 
========== FastVol-rebalance.log ======== 
[2015-12-14 20:46:33.166748] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/userPoint: attempting to move from FastVol-client-0 to FastVol-client-1 
[2015-12-14 20:46:33.171009] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/t2/n1/VSXZlm65KjfhbgoM/flag_finished from subvolume FastVol-client-0 to FastVol-client-1 
[2015-12-14 20:46:33.174851] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_origin.jpg: attempting to move from FastVol-client-0 to FastVol-client-1 
[2015-12-14 20:46:33.181448] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/userPoint from subvolume FastVol-client-0 to FastVol-client-1 
[2015-12-14 20:46:33.184996] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_small.jpg: attempting to move from FastVol-client-0 to FastVol-client-1 
[2015-12-14 20:46:33.191681] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_origin.jpg from subvolume FastVol-client-0 to FastVol-client-1 
[2015-12-14 20:46:33.195396] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_big_square.jpg: attempting to move from FastVol-client-0 to FastVol-client-1 

================================== 


Also, there are no log entries around [2015-12-14 20:46:33.179xx] in mnt-b1-brick.log, mnt-c1-brick.log or etc-glusterfs-glusterd.vol.log. 
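
The timestamp arithmetic above can be double-checked with a few lines of Python. This is a small sketch, not a diagnosis: the start and first-status times are copied from cmd_history.log above, the disconnect time is the full value from the mnt-b1-brick.log excerpt quoted further down in this thread, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"


def ts(text: str) -> datetime:
    """Parse a timestamp in the format used by the gluster logs quoted here."""
    return datetime.strptime(text, FMT)


rebalance_start = ts("2015-12-14 12:55:01.367519")  # cmd_history.log: rebalance start
first_status = ts("2015-12-14 13:55:22.132199")     # cmd_history.log: first status call
disconnect = ts("2015-12-14 21:46:54.179662")       # mnt-b1-brick.log: disconnect
reported = timedelta(seconds=3621)                   # "run time in secs" from status

# Wall-clock time between starting the rebalance and the disconnect.
wall_clock = disconnect - rebalance_start
print("wall-clock run time:", wall_clock)

# The moment 3621 secs before the disconnect.
print("disconnect minus reported run time:", disconnect - reported)

# Interval between the start command and the first status command.
print("start to first status call:", first_status - rebalance_start)
```

The wall-clock interval comes out at roughly 8 h 52 min (about 31,913 secs), far more than the reported 3621 secs. The interval between the start command and the first status call is about 3,620.8 secs, which is close to the reported figure; whether that is a coincidence cannot be told from these logs alone.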




PuYun 





From: PuYun 
Date: 2015-12-15 07:30 
To: gluster-users 
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure? 


Hi, 


It failed again. I can see disconnections in the logs, but no further details. 


=========== mnt-b1-brick.log =========== 
[2015-12-14 21:46:54.179662] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting connection from d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0 
[2015-12-14 21:46:54.181764] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on / 
[2015-12-14 21:46:54.181815] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir 
[2015-12-14 21:46:54.181856] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user 
[2015-12-14 21:46:54.181918] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg 
[2015-12-14 21:46:54.181961] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an 
[2015-12-14 21:46:54.182003] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif 
[2015-12-14 21:46:54.182036] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji 
[2015-12-14 21:46:54.182076] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay 
[2015-12-14 21:46:54.182110] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an/ling00 
[2015-12-14 21:46:54.182203] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 0-FastVol-server: Shutting down connection d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0 

====================================== 


============== mnt-c1-brick.log ============== 
[2015-12-14 21:46:54.179597] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting connection from d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0 
[2015-12-14 21:46:54.180428] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-FastVol-server: releasing lock on 5e300cdb-7298-44c0-90eb-5b50018daed6 held by {client=0x7effc810cce0, pid=-3 lk-owner=fdffffff} 
[2015-12-14 21:46:54.180454] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-FastVol-server: releasing lock on 3c9a1cd5-84c8-4967-98d5-e75a402b1f74 held by {client=0x7effc810cce0, pid=-3 lk-owner=fdffffff} 
[2015-12-14 21:46:54.180483] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on / 
[2015-12-14 21:46:54.180525] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir 
[2015-12-14 21:46:54.180570] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user 
[2015-12-14 21:46:54.180604] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg 
[2015-12-14 21:46:54.180634] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji 
[2015-12-14 21:46:54.180678] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay 
[2015-12-14 21:46:54.180725] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an/ling00 
[2015-12-14 21:46:54.180779] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif 
[2015-12-14 21:46:54.180820] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an 
[2015-12-14 21:46:54.180859] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 0-FastVol-server: Shutting down connection d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0 

====================================== 




============== etc-glusterfs-glusterd.vol.log ========== 
[2015-12-14 21:46:54.179819] W [socket.c:588:__socket_rwv] 0-management: readv on /var/run/gluster/gluster-rebalance-dbee250a-e3fe-4448-b905-b76c5ba80b25.sock failed (No data available) 
[2015-12-14 21:46:54.209586] I [MSGID: 106007] [glusterd-rebalance.c:162:__glusterd_defrag_notify] 0-management: Rebalance process for volume FastVol has disconnected. 
[2015-12-14 21:46:54.209627] I [MSGID: 101053] [mem-pool.c:616:mem_pool_destroy] 0-management: size=588 max=1 total=1 
[2015-12-14 21:46:54.209640] I [MSGID: 101053] [mem-pool.c:616:mem_pool_destroy] 0-management: size=124 max=1 total=1 

============================================= 




================== FastVol-rebalance.log ============ 
... 
[2015-12-14 21:46:53.423719] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/07.jpg from subvolume FastVol-client-0 to FastVol-client-1 
[2015-12-14 21:46:53.423976] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/126724/1d0ca0de913c4e50f85f2b29694e4e64.html from subvolume FastVol-client-0 to FastVol-client-1 
[2015-12-14 21:46:53.436268] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg: attempting to move from FastVol-client-0 to FastVol-client-1 
[2015-12-14 21:46:53.436597] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif: attempting to move from FastVol-client-0 to FastVol-client-1 

<EOF> 
============================================== 
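
To see which files were still being migrated when the log ends, the "attempting to move" entries can be paired with the "completed migration" entries. The following Python sketch does that; it assumes the unwrapped one-entry-per-line format shown in the excerpt above, and the regular expressions and function name are illustrative rather than part of any gluster tooling.

```python
import re
from typing import Iterable, List

# "<xlator>: <path>: attempting to move from A to B"
ATTEMPT_RE = re.compile(r"dht_migrate_file\] \S+: (.+): attempting to move from ")
# "completed migration of <path> from subvolume A to B"
DONE_RE = re.compile(r"completed migration of (.+) from subvolume ")


def in_flight(lines: Iterable[str]) -> List[str]:
    """Return paths with an 'attempting to move' entry but no later
    'completed migration' entry, in the order they were attempted."""
    pending = {}  # dict keeps insertion order
    for line in lines:
        match = ATTEMPT_RE.search(line)
        if match:
            pending[match.group(1)] = True
            continue
        match = DONE_RE.search(line)
        if match:
            pending.pop(match.group(1), None)
    return list(pending)
```

Applied to the four entries above, this leaves `.../129836/0.jpg` and `.../129836/icon_loading_white22c04a.gif`, the same two files that appear in the "fd cleanup" entries of both brick logs. On the real file, pass the open file object, e.g. `in_flight(open(path, errors="replace"))`, so the log is read line by line.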





PuYun 





From: PuYun 
Date: 2015-12-14 21:51 
To: gluster-users 
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure? 


Hi, 


Thank you for your reply. I don't know how to send you the rebalance log file, which is huge (about 2GB). 


However, I might have found the reason why the task failed. My gluster server has only 2 CPU cores and carries 2 SSD bricks. When the rebalance task began, the top 3 processes were at 70%~80%, 30%~40% and 30%~40% CPU usage; all others were below 1%. After a while, though, both CPU cores were completely used up, and I could not even log in until the rebalance task failed. 


It seems 2 bricks require at least 4 CPU cores. I have now upgraded the virtual server to 8 CPU cores and started the rebalance task again. Everything is going well so far. 


I will report again when the current task completes or fails. 





PuYun 





From: Nithya Balachandran 
Date: 2015-12-14 18:57 
To: PuYun 
CC: gluster-users 
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure? 

Hi, 

Can you send us the rebalance log? 

Regards, 
Nithya 

----- Original Message ----- 
> From: "PuYun" <cloudor at 126.com> 
> To: "gluster-users" <gluster-users at gluster.org> 
> Sent: Monday, December 14, 2015 11:33:40 AM 
> Subject: Re: [Gluster-users] How to diagnose volume rebalance failure? 
> 
> Here is the tail of the failed rebalance log, any clue? 
> 
> [2015-12-13 21:30:31.527493] I [dht-rebalance.c:2340:gf_defrag_process_dir] 
> 0-FastVol-dht: Migration operation on dir 
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/5F/1MsH5--BcoGRAJPI took 20.95 secs 
> [2015-12-13 21:30:31.528704] I [dht-rebalance.c:1010:dht_migrate_file] 
> 0-FastVol-dht: 
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished: 
> attempting to move from FastVol-client-0 to FastVol-client-1 
> [2015-12-13 21:30:31.543901] I [dht-rebalance.c:1010:dht_migrate_file] 
> 0-FastVol-dht: 
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint: 
> attempting to move from FastVol-client-0 to FastVol-client-1 
> [2015-12-13 21:31:37.210496] I [MSGID: 109081] 
> [dht-common.c:3780:dht_setxattr] 0-FastVol-dht: fixing the layout of 
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q 
> [2015-12-13 21:31:37.722825] I [MSGID: 109045] 
> [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 0 
> (FastVol-client-0): 1032124 chunks 
> [2015-12-13 21:31:37.722837] I [MSGID: 109045] 
> [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 1 
> (FastVol-client-1): 1032124 chunks 
> [2015-12-13 21:33:03.955539] I [MSGID: 109064] 
> [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol: 
> FastVol-client-0; inode layout - 0 - 2146817919 - 1; disk layout - 
> 2146817920 - 4294967295 - 1 
> [2015-12-13 21:33:04.069859] I [MSGID: 109018] 
> [dht-common.c:806:dht_revalidate_cbk] 0-FastVol-dht: Mismatching layouts for 
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q, gfid = 
> f38c4ed2-a26a-4d83-adfd-6b0331831738 
> [2015-12-13 21:33:04.118800] I [MSGID: 109064] 
> [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol: 
> FastVol-client-1; inode layout - 2146817920 - 4294967295 - 1; disk layout - 
> 0 - 2146817919 - 1 
> [2015-12-13 21:33:19.979507] I [MSGID: 109022] 
> [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration 
> of 
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished 
> from subvolume FastVol-client-0 to FastVol-client-1 
> [2015-12-13 21:33:19.979459] I [MSGID: 109022] 
> [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration 
> of /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint 
> from subvolume FastVol-client-0 to FastVol-client-1 
> [2015-12-13 21:33:25.543941] I [dht-rebalance.c:1010:dht_migrate_file] 
> 0-FastVol-dht: 
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_origin.jpg: 
> attempting to move from FastVol-client-0 to FastVol-client-1 
> [2015-12-13 21:33:25.962547] I [dht-rebalance.c:1010:dht_migrate_file] 
> 0-FastVol-dht: 
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_small.jpg: 
> attempting to move from FastVol-client-0 to FastVol-client-1 
> 
> 
> Cloudor 
> 
> 
> 
> From: Sakshi Bansal 
> Date: 2015-12-12 13:02 
> To: 蒲云 
> CC: gluster-users 
> Subject: Re: [Gluster-users] How to diagnose volume rebalance failure? 
> In the rebalance log file you can check the file/directory for which the 
> rebalance has failed. The log can also show which fop (file operation) the 
> failure happened for. 
> 
> _______________________________________________ 
> Gluster-users mailing list 
> Gluster-users at gluster.org 
> http://www.gluster.org/mailman/listinfo/gluster-users 