[Bugs] [Bug 1225997] New: rebalance failing on one of the node
bugzilla at redhat.com
Thu May 28 17:18:20 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1225997
Bug ID: 1225997
Summary: rebalance failing on one of the node
Product: GlusterFS
Version: 3.7.0
Component: distribute
Severity: high
Assignee: bugs at gluster.org
Reporter: srangana at redhat.com
CC: bugs at gluster.org, gluster-bugs at redhat.com,
nbalacha at redhat.com, sasundar at redhat.com
Depends On: 1221696
Blocks: 1221656
+++ This bug was initially created as a clone of Bug #1221696 +++
+++ This bug was initially created as a clone of Bug #1221656 +++
Description of problem:
-----------------------
I installed glusterfs-3.7.0beta2 builds on two RHEL 6.6 machines. Each
node has 3 bricks. After forming a cluster of these 2 nodes by peer probing,
I created a 2x2 distributed-replicate volume.
Adding another pair of bricks to this volume and rebalancing resulted in
the rebalance failing on one of the nodes.
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.7.0beta2 build
How reproducible:
------------------
Always
Steps to Reproduce:
-------------------
1. Create a 2 node cluster with 3 bricks per node
2. Create a distributed-replicate volume of 2X2
3. Start the volume
4. Mount the volume ( fuse, nfs )
5. Create a few files on the mount
6. Add a pair of bricks to the volume
7. Perform rebalance
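The steps above can be sketched as a short script that only builds and prints the CLI calls, without running them. This is a rough sketch, not the reporter's exact commands: the hostnames node1 and node2, the brick paths under /bricks, and the mount point are hypothetical; only the volume name vmstore and the rebalance commands appear in this report. It covers the FUSE mount only and assumes the brick directories already exist on dedicated file systems.

```python
import shlex


def reproduction_commands(volume, node1, node2, mount_point):
    """Build the CLI calls for steps 1-7.

    Hostnames, brick paths and the mount point are hypothetical placeholders.
    """
    def brick(node, n):
        return f"{node}:/bricks/b{n}"

    return [
        # Step 1: form the two-node cluster (run on node1).
        ["gluster", "peer", "probe", node2],
        # Step 2: 2x2 distributed-replicate volume; each replica pair spans both nodes.
        ["gluster", "volume", "create", volume, "replica", "2",
         brick(node1, 1), brick(node2, 1), brick(node1, 2), brick(node2, 2)],
        # Step 3: start the volume.
        ["gluster", "volume", "start", volume],
        # Step 4: FUSE mount (the NFS mount is omitted here).
        ["mount", "-t", "glusterfs", f"{node1}:/{volume}", mount_point],
        # Step 5: create a few files.
        ["touch"] + [f"{mount_point}/file{i}" for i in range(1, 4)],
        # Step 6: add one more replica pair.
        ["gluster", "volume", "add-brick", volume, brick(node1, 3), brick(node2, 3)],
        # Step 7: rebalance, then check the per-node status.
        ["gluster", "volume", "rebalance", volume, "start"],
        ["gluster", "volume", "rebalance", volume, "status"],
    ]


commands = reproduction_commands("vmstore", "node1", "node2", "/mnt/vmstore")
for argv in commands:
    print(shlex.join(argv))
```

With this brick ordering, the first brick of every replica pair is on node1, which matches the layout the fix's commit message describes as triggering the failure.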
Actual results:
---------------
Rebalance failed on the second node
Expected results:
-----------------
Rebalance should complete successfully
--- Additional comment from SATHEESARAN on 2015-05-14 09:57:44 EDT ---
[root@~]# gluster volume rebalance vmstore start
volume rebalance: vmstore: success: Rebalance on vmstore has been started
successfully. Use rebalance status command to check status of the rebalance
process.
ID: 9372b71c-e6f4-44fb-a2e4-9707443f3457
[root@ ~]# gluster volume rebalance vmstore status
Node         Rebalanced-files  size    scanned  failures  skipped  status     run time in secs
-----------  ----------------  ------  -------  --------  -------  ---------  ----------------
localhost    0                 0Bytes  3        0         1        completed  1.00
10.70.37.58  0                 0Bytes  0        3         0        failed     0.00
volume rebalance: vmstore: success:
<snip_rebalance_logs>
[2015-05-14 19:17:41.419890] I [dht-rebalance.c:2112:gf_defrag_process_dir] 0-vmstore-dht: migrate data called on /
[2015-05-14 19:17:41.424661] I [dht-common.c:3539:dht_setxattr] 0-vmstore-dht: fixing the layout of /.trashcan
[2015-05-14 19:17:41.424688] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 0 (vmstore-replicate-0): 101834 chunks
[2015-05-14 19:17:41.424699] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 1 (vmstore-replicate-1): 101834 chunks
[2015-05-14 19:17:41.424708] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 2 (vmstore-replicate-2): 101834 chunks
[2015-05-14 19:17:41.434411] I [dht-rebalance.c:2112:gf_defrag_process_dir] 0-vmstore-dht: migrate data called on /.trashcan
[2015-05-14 19:17:41.446254] I [dht-common.c:3539:dht_setxattr] 0-vmstore-dht: fixing the layout of /.trashcan/internal_op
[2015-05-14 19:17:41.446279] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 0 (vmstore-replicate-0): 101834 chunks
[2015-05-14 19:17:41.446290] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 1 (vmstore-replicate-1): 101834 chunks
[2015-05-14 19:17:41.446298] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 2 (vmstore-replicate-2): 101834 chunks
[2015-05-14 19:17:41.453365] I [dht-rebalance.c:2112:gf_defrag_process_dir] 0-vmstore-dht: migrate data called on /.trashcan/internal_op
[2015-05-14 19:17:41.458214] I [dht-common.c:3539:dht_setxattr] 0-vmstore-dht: fixing the layout of /.trashcan/internal_op
[2015-05-14 19:17:41.458542] E [dht-rebalance.c:2368:gf_defrag_settle_hash] 0-vmstore-dht: fix layout on /.trashcan/internal_op failed
[2015-05-14 19:17:41.458824] E [MSGID: 109016] [dht-rebalance.c:2528:gf_defrag_fix_layout] 0-vmstore-dht: Fix layout failed for /.trashcan
</snip_rebalance_logs>
--- Additional comment from SATHEESARAN on 2015-05-14 09:59:17 EDT ---
The following is from Nithya's mail to gluster-devel about this issue:
<snip>
The rebalance failure is due to the interaction of the lookup-unhashed changes
and rebalance local crawl changes.
</snip>
--- Additional comment from Anand Avati on 2015-05-14 11:09:36 EDT ---
REVIEW: http://review.gluster.org/10786 (dht/rebalance : Fixed rebalance
failure) posted (#1) for review on master by N Balachandran
(nbalacha at redhat.com)
--- Additional comment from Anand Avati on 2015-05-14 11:17:36 EDT ---
REVIEW: http://review.gluster.org/10786 (dht/rebalance : Fixed rebalance
failure) posted (#2) for review on master by N Balachandran
(nbalacha at redhat.com)
--- Additional comment from Anand Avati on 2015-05-14 15:13:45 EDT ---
COMMIT: http://review.gluster.org/10786 committed in master by Shyamsundar
Ranganathan (srangana at redhat.com)
------
commit 1cabc769c7b636f89f6f28aaa0d534401a82d4a8
Author: Nithya Balachandran <nbalacha at redhat.com>
Date: Thu May 14 19:33:44 2015 +0530
dht/rebalance : Fixed rebalance failure
The rebalance process determines the local subvols for the
node it is running on and only acts on files in those subvols.
If a dist-rep or dist-disperse volume is created on 2 nodes by
dividing the bricks equally across the nodes, one process might
determine it has no local_subvols.
When trying to update the commit hash, the function attempts to
lock all local subvols. On the node with no local_subvols the dht
inode lock operation fails, in turn causing the rebalance to fail.
In a dist-rep volume with 2 nodes, if brick 0 of each replica
set is on node1 and brick 1 is on node2, node2 will find that it has
no local subvols.
Change-Id: I7d73b5b4bf1c822eae6df2e6f79bd6a1606f4d1c
BUG: 1221696
Signed-off-by: Nithya Balachandran <nbalacha at redhat.com>
Reviewed-on: http://review.gluster.org/10786
Reviewed-by: Shyamsundar Ranganathan <srangana at redhat.com>
Reviewed-by: Susant Palai <spalai at redhat.com>
Tested-by: Gluster Build System <jenkins at build.gluster.com>
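
The failure mode described in the commit message above can be illustrated with a small model. This is only a sketch and not GlusterFS code: the function names, hostnames and brick paths are hypothetical. It assumes that a replica set counts as local to the node hosting its first brick (which is what the commit message's two-node example implies), and that locking an empty set of local subvols is reported as a failure.

```python
# Hypothetical model of the failure; none of these names are GlusterFS identifiers.

def local_subvols(replica_sets, node):
    """Return the replica sets whose first brick is on `node` (assumed rule)."""
    return [name for name, bricks in replica_sets.items() if bricks[0][0] == node]


def lock_local_subvols(subvols):
    """Model of the failing step: with nothing to lock, the lock call fails."""
    return len(subvols) > 0


# Three replica sets after add-brick; brick 0 of each set is on node1.
replica_sets = {
    "vmstore-replicate-0": [("node1", "/bricks/b1"), ("node2", "/bricks/b1")],
    "vmstore-replicate-1": [("node1", "/bricks/b2"), ("node2", "/bricks/b2")],
    "vmstore-replicate-2": [("node1", "/bricks/b3"), ("node2", "/bricks/b3")],
}

for node in ("node1", "node2"):
    subvols = local_subvols(replica_sets, node)
    status = "ok" if lock_local_subvols(subvols) else "failed"
    print(node, subvols, status)
```

In this model node1 owns all three subvols and node2 owns none, so only node2's lock step fails. That is consistent with the status output in this report, where one node completed and the other failed.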
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1221656
[Bug 1221656] rebalance failing on one of the node
https://bugzilla.redhat.com/show_bug.cgi?id=1221696
[Bug 1221696] rebalance failing on one of the node
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.