[Gluster-users] How to diagnose volume rebalance failure?
PuYun
cloudor at 126.com
Tue Dec 15 13:58:38 UTC 2015
Hi Susant,
I found errors in /var/log/messages: the glusterfs process was killed by the kernel OOM killer.
According to my previously posted logs, the rebalance task stopped at 2015-12-14 21:46:53:
================== FastVol-rebalance.log ============
...
[2015-12-14 21:46:53.423719] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/07.jpg from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.423976] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/126724/1d0ca0de913c4e50f85f2b29694e4e64.html from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436268] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436597] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif: attempting to move from FastVol-client-0 to FastVol-client-1
<EOF>
==============================================
My timezone is GMT+8:00, so that corresponds to 2015-12-15 05:46:53 local time. Here is the matching entry in /var/log/messages:
================== /var/log/messages ===============
Dec 15 05:46:53 d001 kernel: QThread invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Dec 15 05:46:53 d001 kernel: QThread cpuset=/ mems_allowed=0
Dec 15 05:46:53 d001 kernel: Pid: 1656, comm: QThread Not tainted 2.6.32-431.23.3.el6.x86_64 #1
Dec 15 05:46:53 d001 kernel: Call Trace:
Dec 15 05:46:53 d001 kernel: [<ffffffff810d0431>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Dec 15 05:46:53 d001 kernel: [<ffffffff81122810>] ? dump_header+0x90/0x1b0
Dec 15 05:46:53 d001 kernel: [<ffffffff8122833c>] ? security_real_capable_noaudit+0x3c/0x70
Dec 15 05:46:53 d001 kernel: [<ffffffff81122c92>] ? oom_kill_process+0x82/0x2a0
Dec 15 05:46:53 d001 kernel: [<ffffffff81122bd1>] ? select_bad_process+0xe1/0x120
Dec 15 05:46:53 d001 kernel: [<ffffffff811230d0>] ? out_of_memory+0x220/0x3c0
Dec 15 05:46:53 d001 kernel: [<ffffffff8112f9ef>] ? __alloc_pages_nodemask+0x89f/0x8d0
Dec 15 05:46:53 d001 kernel: [<ffffffff811678ea>] ? alloc_pages_current+0xaa/0x110
Dec 15 05:46:53 d001 kernel: [<ffffffff8111fc07>] ? __page_cache_alloc+0x87/0x90
Dec 15 05:46:53 d001 kernel: [<ffffffff8111f5ee>] ? find_get_page+0x1e/0xa0
Dec 15 05:46:53 d001 kernel: [<ffffffff81120ba7>] ? filemap_fault+0x1a7/0x500
Dec 15 05:46:53 d001 kernel: [<ffffffff81149ed4>] ? __do_fault+0x54/0x530
Dec 15 05:46:53 d001 kernel: [<ffffffff8114a4a7>] ? handle_pte_fault+0xf7/0xb00
Dec 15 05:46:53 d001 kernel: [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
Dec 15 05:46:53 d001 kernel: [<ffffffff8114b0da>] ? handle_mm_fault+0x22a/0x300
Dec 15 05:46:53 d001 kernel: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
Dec 15 05:46:53 d001 kernel: [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
Dec 15 05:46:53 d001 kernel: [<ffffffff810a01d2>] ? hrtimer_cancel+0x22/0x30
Dec 15 05:46:53 d001 kernel: [<ffffffff8152a7e3>] ? do_nanosleep+0x93/0xc0
Dec 15 05:46:53 d001 kernel: [<ffffffff810a02a4>] ? hrtimer_nanosleep+0xc4/0x180
Dec 15 05:46:53 d001 kernel: [<ffffffff8109f100>] ? hrtimer_wakeup+0x0/0x30
Dec 15 05:46:53 d001 kernel: [<ffffffff8152e74e>] ? do_page_fault+0x3e/0xa0
Dec 15 05:46:53 d001 kernel: [<ffffffff8152bb05>] ? page_fault+0x25/0x30
Dec 15 05:46:53 d001 kernel: Mem-Info:
Dec 15 05:46:53 d001 kernel: Node 0 DMA per-cpu:
Dec 15 05:46:53 d001 kernel: CPU 0: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 1: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 2: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 3: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 4: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 5: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 6: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 7: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: Node 0 DMA32 per-cpu:
Dec 15 05:46:53 d001 kernel: CPU 0: hi: 186, btch: 31 usd: 199
Dec 15 05:46:53 d001 kernel: CPU 1: hi: 186, btch: 31 usd: 58
Dec 15 05:46:53 d001 kernel: CPU 2: hi: 186, btch: 31 usd: 24
Dec 15 05:46:53 d001 kernel: CPU 3: hi: 186, btch: 31 usd: 75
Dec 15 05:46:53 d001 kernel: CPU 4: hi: 186, btch: 31 usd: 156
Dec 15 05:46:53 d001 kernel: CPU 5: hi: 186, btch: 31 usd: 60
Dec 15 05:46:53 d001 kernel: CPU 6: hi: 186, btch: 31 usd: 32
Dec 15 05:46:53 d001 kernel: CPU 7: hi: 186, btch: 31 usd: 7
Dec 15 05:46:53 d001 kernel: Node 0 Normal per-cpu:
Dec 15 05:46:53 d001 kernel: CPU 0: hi: 186, btch: 31 usd: 155
Dec 15 05:46:53 d001 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 3: hi: 186, btch: 31 usd: 40
Dec 15 05:46:53 d001 kernel: CPU 4: hi: 186, btch: 31 usd: 167
Dec 15 05:46:53 d001 kernel: CPU 5: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 6: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 7: hi: 186, btch: 31 usd: 2
Dec 15 05:46:53 d001 kernel: active_anon:1955292 inactive_anon:38 isolated_anon:0
Dec 15 05:46:53 d001 kernel: active_file:358 inactive_file:1728 isolated_file:0
Dec 15 05:46:53 d001 kernel: unevictable:0 dirty:14 writeback:0 unstable:0
Dec 15 05:46:53 d001 kernel: free:25714 slab_reclaimable:2453 slab_unreclaimable:7498
Dec 15 05:46:53 d001 kernel: mapped:39 shmem:43 pagetables:4664 bounce:0
Dec 15 05:46:53 d001 kernel: Node 0 DMA free:15752kB min:124kB low:152kB high:184kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15364kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Dec 15 05:46:53 d001 kernel: lowmem_reserve[]: 0 3000 8050 8050
Dec 15 05:46:53 d001 kernel: Node 0 DMA32 free:44860kB min:25140kB low:31424kB high:37708kB active_anon:2737816kB inactive_anon:0kB active_file:732kB inactive_file:5856kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072096kB mlocked:0kB dirty:32kB writeback:0kB mapped:160kB shmem:0kB slab_reclaimable:1696kB slab_unreclaimable:1352kB kernel_stack:72kB pagetables:5460kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2447 all_unreclaimable? yes
Dec 15 05:46:53 d001 kernel: lowmem_reserve[]: 0 0 5050 5050
Dec 15 05:46:53 d001 kernel: Node 0 Normal free:42244kB min:42316kB low:52892kB high:63472kB active_anon:5083352kB inactive_anon:152kB active_file:700kB inactive_file:1056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5171200kB mlocked:0kB dirty:24kB writeback:4kB mapped:0kB shmem:172kB slab_reclaimable:8116kB slab_unreclaimable:28640kB kernel_stack:2176kB pagetables:13196kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2817 all_unreclaimable? yes
Dec 15 05:46:53 d001 kernel: lowmem_reserve[]: 0 0 0 0
Dec 15 05:46:53 d001 kernel: Node 0 DMA: 2*4kB 2*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15752kB
Dec 15 05:46:53 d001 kernel: Node 0 DMA32: 11186*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44744kB
Dec 15 05:46:53 d001 kernel: Node 0 Normal: 10729*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42940kB
Dec 15 05:46:53 d001 kernel: 2634 total pagecache pages
Dec 15 05:46:53 d001 kernel: 0 pages in swap cache
Dec 15 05:46:53 d001 kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 15 05:46:53 d001 kernel: Free swap = 0kB
Dec 15 05:46:53 d001 kernel: Total swap = 0kB
Dec 15 05:46:53 d001 kernel: 2097151 pages RAM
Dec 15 05:46:53 d001 kernel: 81926 pages reserved
Dec 15 05:46:53 d001 kernel: 1752 pages shared
Dec 15 05:46:53 d001 kernel: 1983480 pages non-shared
Dec 15 05:46:53 d001 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Dec 15 05:46:53 d001 kernel: [ 477] 0 477 2662 105 0 -17 -1000 udevd
Dec 15 05:46:53 d001 kernel: [ 916] 0 916 374 21 0 0 0 aliyun-service
Dec 15 05:46:53 d001 kernel: [ 1156] 0 1156 62798 149 0 0 0 rsyslogd
Dec 15 05:46:53 d001 kernel: [ 1178] 32 1178 4744 62 0 0 0 rpcbind
Dec 15 05:46:53 d001 kernel: [ 1198] 29 1198 5837 112 1 0 0 rpc.statd
Dec 15 05:46:53 d001 kernel: [ 1226] 0 1226 201835 4673 0 0 0 glusterd
Dec 15 05:46:53 d001 kernel: [ 1382] 28 1382 157544 115 0 0 0 nscd
Dec 15 05:46:53 d001 kernel: [ 1414] 0 1414 118751 673 1 0 0 AliYunDunUpdate
Dec 15 05:46:53 d001 kernel: [ 1448] 0 1448 16657 178 4 -17 -1000 sshd
Dec 15 05:46:53 d001 kernel: [ 1463] 38 1463 6683 151 0 0 0 ntpd
Dec 15 05:46:53 d001 kernel: [ 1473] 0 1473 29325 154 1 0 0 crond
Dec 15 05:46:53 d001 kernel: [ 1516] 0 1516 1016 19 1 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1518] 0 1518 1016 17 3 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1520] 0 1520 1016 18 5 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1522] 0 1522 2661 104 4 -17 -1000 udevd
Dec 15 05:46:53 d001 kernel: [ 1523] 0 1523 2661 104 4 -17 -1000 udevd
Dec 15 05:46:53 d001 kernel: [ 1524] 0 1524 1016 18 2 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1526] 0 1526 1016 19 4 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1528] 0 1528 1016 19 1 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1581] 0 1581 362551 34027 0 0 0 glusterfsd
Dec 15 05:46:53 d001 kernel: [ 1599] 0 1599 919605 653045 0 0 0 glusterfsd
Dec 15 05:46:53 d001 kernel: [ 1652] 0 1652 191799 1125 5 0 0 AliYunDun
Dec 15 05:46:53 d001 kernel: [ 1670] 0 1670 230065 974 0 0 0 AliHids
Dec 15 05:46:53 d001 kernel: [ 1800] 0 1800 1455594 1259392 0 0 0 glusterfs
Dec 15 05:46:53 d001 kernel: [ 2238] 0 2238 4872 55 1 0 0 anacron
Dec 15 05:46:53 d001 kernel: [ 2242] 0 2242 26524 43 1 0 0 run-parts
Dec 15 05:46:53 d001 kernel: [ 2413] 0 2413 26524 39 5 0 0 mlocate.cron
Dec 15 05:46:53 d001 kernel: [ 2414] 0 2414 26488 35 4 0 0 awk
Dec 15 05:46:53 d001 kernel: [ 2419] 0 2419 25272 117 2 0 0 updatedb
Dec 15 05:46:53 d001 kernel: Out of memory: Kill process 1800 (glusterfs) score 596 or sacrifice child
Dec 15 05:46:53 d001 kernel: Killed process 1800, UID 0, (glusterfs) total-vm:5822376kB, anon-rss:5037552kB, file-rss:16kB
==============================================
Moreover, I can find an OOM kill in /var/log/messages matching every previous rebalance failure. Is this a memory leak bug?
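For reference, this is roughly how I lined up the two logs ("Asia/Shanghai" is just my guess at a GMT+8 zone name for date):
============== shell ==============
# All OOM kills recorded by the kernel
grep -i 'oom-killer\|Out of memory' /var/log/messages

# Gluster logs use UTC; convert a failure timestamp to my local time (GMT+8)
TZ='Asia/Shanghai' date -d '2015-12-14 21:46:53 UTC'
# -> Tue Dec 15 05:46:53 CST 2015, which matches the oom-killer entry above
================================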
Thank you.
PuYun
From: PuYun
Date: 2015-12-15 21:08
To: gluster-users
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
Hi Susant,
Thank you for your advice. I have set that parameter and started the rebalance again in lazy mode. It may not fail this time, because every previous failure happened only after file migration had started. Another important data point: a fix-layout rebalance, which does not migrate files, was able to complete after running for 25 hours.
Also, I cannot find any errors in the rebalance log file using "grep ' E ' FastVol-rebalance.log".
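If a wider search would help, I can also run something along these lines on the same file and upload whatever it finds:
============== shell ==============
# Errors and warnings from the rebalance log
grep -E ' [EW] ' FastVol-rebalance.log | head -n 1000

# Count of log entries per message ID, most frequent first
grep -oE 'MSGID: [0-9]+' FastVol-rebalance.log | sort | uniq -c | sort -rn
================================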
Thank you.
PuYun
From: Susant Palai
Date: 2015-12-15 15:21
To: PuYun
CC: gluster-users
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
Hi PuYun,
We need to figure out some mechanism to get the huge log files. Until then, here is one thing I can think of that could affect performance.
Rebalance normally runs at the medium throttle level [performance-wise], which in your case means it spawns two migration threads that can hog those 2 cores. If you run rebalance again, run it in lazy mode. Here is the command:
"gluster v set <VOLUME-NAME> rebal-throttle lazy". This should spawn just one migration thread.
For logs: can you grep for errors in the rebalance log file and upload the result? <until we figure out a way to get the full logs>
Thanks,
Susant
----- Original Message -----
From: "PuYun" <cloudor at 126.com>
To: "gluster-users" <gluster-users at gluster.org>
Sent: Tuesday, 15 December, 2015 5:51:00 AM
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
Hi,
Another odd piece of information may be useful. The failed task had actually been running for hours, but the status command reported a run time of only 3621 seconds.
============== shell ==============
[root at d001 glusterfs]# gluster volume rebalance FastVol status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 0 0Bytes 952767 0 0 failed 3621.00
volume rebalance: FastVol: success:
================================
As you can see, I started the rebalance task only once:
======== cmd_history.log-20151215 ======
[2015-12-14 12:50:41.443937] : volume start FastVol : SUCCESS
[2015-12-14 12:55:01.367519] : volume rebalance FastVol start : SUCCESS
[2015-12-14 13:55:22.132199] : volume rebalance FastVol status : SUCCESS
[2015-12-14 23:04:01.780885] : volume rebalance FastVol status : SUCCESS
[2015-12-14 23:35:56.708077] : volume rebalance FastVol status : SUCCESS
=================================
Because the task failed at [2015-12-14 21:46:54.179xx], something may have gone wrong 3621 seconds earlier, i.e. around [2015-12-14 20:46:33.179xx] (the quick calculation below shows the offset). I checked the logs around that time and found nothing special.
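Just as a sanity check, the offset works out like this with GNU date:
============== shell ==============
# Subtract the reported run time from the failure timestamp
date -u -d '2015-12-14 21:46:54 UTC - 3621 seconds' '+%Y-%m-%d %H:%M:%S'
# -> 2015-12-14 20:46:33
================================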
========== FastVol-rebalance.log ========
[2015-12-14 20:46:33.166748] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/userPoint: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.171009] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/t2/n1/VSXZlm65KjfhbgoM/flag_finished from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.174851] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_origin.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.181448] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/userPoint from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.184996] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_small.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.191681] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_origin.jpg from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.195396] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_big_square.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
==================================
Also, there are no log entries around [2015-12-14 20:46:33.179xx] in mnt-b1-brick.log, mnt-c1-brick.log or etc-glusterfs-glusterd.vol.log.
PuYun
From: PuYun
Date: 2015-12-15 07:30
To: gluster-users
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
Hi,
It failed again. I can see disconnections in the logs, but no further details.
=========== mnt-b1-brick.log ===========
[2015-12-14 21:46:54.179662] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting connection from d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0
[2015-12-14 21:46:54.181764] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /
[2015-12-14 21:46:54.181815] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir
[2015-12-14 21:46:54.181856] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user
[2015-12-14 21:46:54.181918] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg
[2015-12-14 21:46:54.181961] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an
[2015-12-14 21:46:54.182003] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif
[2015-12-14 21:46:54.182036] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji
[2015-12-14 21:46:54.182076] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay
[2015-12-14 21:46:54.182110] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an/ling00
[2015-12-14 21:46:54.182203] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 0-FastVol-server: Shutting down connection d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0
======================================
============== mnt-c1-brick.log ==============
[2015-12-14 21:46:54.179597] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting connection from d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0
[2015-12-14 21:46:54.180428] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-FastVol-server: releasing lock on 5e300cdb-7298-44c0-90eb-5b50018daed6 held by {client=0x7effc810cce0, pid=-3 lk-owner=fdffffff}
[2015-12-14 21:46:54.180454] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-FastVol-server: releasing lock on 3c9a1cd5-84c8-4967-98d5-e75a402b1f74 held by {client=0x7effc810cce0, pid=-3 lk-owner=fdffffff}
[2015-12-14 21:46:54.180483] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /
[2015-12-14 21:46:54.180525] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir
[2015-12-14 21:46:54.180570] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user
[2015-12-14 21:46:54.180604] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg
[2015-12-14 21:46:54.180634] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji
[2015-12-14 21:46:54.180678] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay
[2015-12-14 21:46:54.180725] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an/ling00
[2015-12-14 21:46:54.180779] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif
[2015-12-14 21:46:54.180820] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an
[2015-12-14 21:46:54.180859] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 0-FastVol-server: Shutting down connection d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0
======================================
============== etc-glusterfs-glusterd.vol.log ==========
[2015-12-14 21:46:54.179819] W [socket.c:588:__socket_rwv] 0-management: readv on /var/run/gluster/gluster-rebalance-dbee250a-e3fe-4448-b905-b76c5ba80b25.sock failed (No data available)
[2015-12-14 21:46:54.209586] I [MSGID: 106007] [glusterd-rebalance.c:162:__glusterd_defrag_notify] 0-management: Rebalance process for volume FastVol has disconnected.
[2015-12-14 21:46:54.209627] I [MSGID: 101053] [mem-pool.c:616:mem_pool_destroy] 0-management: size=588 max=1 total=1
[2015-12-14 21:46:54.209640] I [MSGID: 101053] [mem-pool.c:616:mem_pool_destroy] 0-management: size=124 max=1 total=1
=============================================
================== FastVol-rebalance.log ============
...
[2015-12-14 21:46:53.423719] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/07.jpg from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.423976] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/126724/1d0ca0de913c4e50f85f2b29694e4e64.html from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436268] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436597] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif: attempting to move from FastVol-client-0 to FastVol-client-1
<EOF>
==============================================
PuYun
From: PuYun
Date: 2015-12-14 21:51
To: gluster-users
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
Hi,
Thank you for your reply. I am not sure how to send you the rebalance log file, which is huge (about 2 GB).
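If chunks are acceptable, I could compress and split the log with something like this (assuming the default log location on my node):
============== shell ==============
# Compress the ~2 GB rebalance log and split it into 100 MB pieces
gzip -c /var/log/glusterfs/FastVol-rebalance.log | split -b 100M - FastVol-rebalance.log.gz.

# Reassemble on the other side
cat FastVol-rebalance.log.gz.* | gunzip > FastVol-rebalance.log
================================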
However, I may have found the reason the task failed. My gluster server has only 2 CPU cores and carries 2 SSD bricks. When the rebalance task began, the top 3 processes were at 70%~80%, 30%~40% and 30%~40% CPU usage, and everything else was below 1%. But after a while both CPU cores were completely saturated, and I could not even log in until the rebalance task failed.
It seems 2 bricks require at least 4 CPU cores. I have now upgraded the virtual server to 8 CPU cores and started the rebalance task again. Everything is going well so far.
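In case the numbers are useful later, something like this loop would record per-process CPU and memory during the run (interval picked arbitrarily):
============== shell ==============
# Record gluster-related CPU/memory usage every 30 seconds while rebalance runs
while true; do
    date
    ps -eo pid,pcpu,pmem,rss,comm | grep -i '[g]luster'
    sleep 30
done >> /tmp/gluster-usage.log
================================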
I will report back when the current task completes or fails.
PuYun
From: Nithya Balachandran
Date: 2015-12-14 18:57
To: PuYun
CC: gluster-users
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
Hi,
Can you send us the rebalance log?
Regards,
Nithya
----- Original Message -----
> From: "PuYun" <cloudor at 126.com>
> To: "gluster-users" <gluster-users at gluster.org>
> Sent: Monday, December 14, 2015 11:33:40 AM
> Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
>
> Here is the tail of the failed rebalance log, any clue?
>
> [2015-12-13 21:30:31.527493] I [dht-rebalance.c:2340:gf_defrag_process_dir]
> 0-FastVol-dht: Migration operation on dir
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/5F/1MsH5--BcoGRAJPI took 20.95 secs
> [2015-12-13 21:30:31.528704] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished:
> attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:30:31.543901] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint:
> attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:31:37.210496] I [MSGID: 109081]
> [dht-common.c:3780:dht_setxattr] 0-FastVol-dht: fixing the layout of
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q
> [2015-12-13 21:31:37.722825] I [MSGID: 109045]
> [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 0
> (FastVol-client-0): 1032124 chunks
> [2015-12-13 21:31:37.722837] I [MSGID: 109045]
> [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 1
> (FastVol-client-1): 1032124 chunks
> [2015-12-13 21:33:03.955539] I [MSGID: 109064]
> [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol:
> FastVol-client-0; inode layout - 0 - 2146817919 - 1; disk layout -
> 2146817920 - 4294967295 - 1
> [2015-12-13 21:33:04.069859] I [MSGID: 109018]
> [dht-common.c:806:dht_revalidate_cbk] 0-FastVol-dht: Mismatching layouts for
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q, gfid =
> f38c4ed2-a26a-4d83-adfd-6b0331831738
> [2015-12-13 21:33:04.118800] I [MSGID: 109064]
> [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol:
> FastVol-client-1; inode layout - 2146817920 - 4294967295 - 1; disk layout -
> 0 - 2146817919 - 1
> [2015-12-13 21:33:19.979507] I [MSGID: 109022]
> [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration
> of
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished
> from subvolume FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:19.979459] I [MSGID: 109022]
> [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration
> of /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint
> from subvolume FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:25.543941] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_origin.jpg:
> attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:25.962547] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_small.jpg:
> attempting to move from FastVol-client-0 to FastVol-client-1
>
>
> Cloudor
>
>
>
> From: Sakshi Bansal
> Date: 2015-12-12 13:02
> To: 蒲云
> CC: gluster-users
> Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
> In the rebalance log file you can check the file/directory for which the
> rebalance failed. It should mention the fop for which the failure
> happened.
>
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users