[Bugs] [Bug 1224116] server crashed during rebalance in dht_selfheal_layout_new_directory
bugzilla at redhat.com
Tue Jun 16 12:40:50 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1224116
Amit Chaurasia <achauras at redhat.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ON_QA                       |ASSIGNED
                 CC|                            |achauras at redhat.com
--- Comment #3 from Amit Chaurasia <achauras at redhat.com> ---
Marking this bug as failed because the rebalance could not complete successfully.
Here is what I did:
1. Created a dist-disperse volume as below (a reconstruction of the create command follows the volume info).
[root at dht-rhs-24 ~]# gluster v info
Volume Name: testvol
Type: Distributed-Disperse
Volume ID: 40b53648-3a68-4c15-baf2-ccb04cd82356
Status: Started
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.47.114:/bricks/brick1/testvol
Brick2: 10.70.47.174:/bricks/brick1/testvol
Brick3: 10.70.47.114:/bricks/brick2/testvol
Brick4: 10.70.47.174:/bricks/brick2/testvol
Brick5: 10.70.47.114:/bricks/brick3/testvol
Brick6: 10.70.47.174:/bricks/brick3/testvol
Brick7: 10.70.47.114:/bricks/brick4/testvol
Brick8: 10.70.47.174:/bricks/brick4/testvol
Brick9: 10.70.47.114:/bricks/brick5/testvol
Brick10: 10.70.47.174:/bricks/brick5/testvol
Brick11: 10.70.47.114:/bricks/brick6/testvol
Brick12: 10.70.47.174:/bricks/brick6/testvol
Options Reconfigured:
performance.readdir-ahead: on
[root at dht-rhs-24 ~]#
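The create command itself was not captured in this comment; for a layout like the one shown above it would have been along these lines (brick ordering is inferred from the 'gluster v info' output, and with only two servers the CLI may insist on the trailing 'force' -- this is a reconstruction, not the exact command run):
  gluster volume create testvol disperse-data 4 redundancy 2 \
      10.70.47.114:/bricks/brick1/testvol 10.70.47.174:/bricks/brick1/testvol \
      10.70.47.114:/bricks/brick2/testvol 10.70.47.174:/bricks/brick2/testvol \
      10.70.47.114:/bricks/brick3/testvol 10.70.47.174:/bricks/brick3/testvol \
      10.70.47.114:/bricks/brick4/testvol 10.70.47.174:/bricks/brick4/testvol \
      10.70.47.114:/bricks/brick5/testvol 10.70.47.174:/bricks/brick5/testvol \
      10.70.47.114:/bricks/brick6/testvol 10.70.47.174:/bricks/brick6/testvol force
  gluster volume start testvol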
2. FUSE-mounted the volume on 2 clients and created files of 16M each; total data
on the volume was 35G (a sketch of the client commands follows the df output below):
10.70.47.174:/testvol
109G 35G 74G 33% /mnt/glusterfs
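The exact client-side commands were not captured; on each client the setup amounts to something like the following (the file count is an assumption chosen so that two clients together reach roughly 35G):
  mount -t glusterfs 10.70.47.174:/testvol /mnt/glusterfs
  # ~1100 files of 16M per client -> ~35G in total across 2 clients (assumed numbers)
  for i in $(seq 1 1100); do
      dd if=/dev/urandom of=/mnt/glusterfs/file_$(hostname)_$i bs=1M count=16
  done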
3. After file creation, I added a set of bricks to the volume:
[root at dht-rhs-24 ~]# gluster v add-brick testvol
10.70.47.114:/bricks/brick7/testvol 10.70.47.174:/bricks/brick7/testvol
10.70.47.114:/bricks/brick8/testvol 10.70.47.174:/bricks/brick8/testvol
10.70.47.114:/bricks/brick9/testvol 10.70.47.174:/bricks/brick9/testvol
10.70.47.114:/bricks/brick10/testvol 10.70.47.174:/bricks/brick10/testvol
10.70.47.114:/bricks/brick11/testvol 10.70.47.174:/bricks/brick11/testvol
10.70.47.114:/bricks/brick12/testvol 10.70.47.174:/bricks/brick12/testvol
volume add-brick: success
[root at dht-rhs-24 ~]#
[root at dht-rhs-24 ~]#
[root at dht-rhs-24 ~]# gluster v info
Volume Name: testvol
Type: Distributed-Disperse
Volume ID: 40b53648-3a68-4c15-baf2-ccb04cd82356
Status: Started
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: 10.70.47.114:/bricks/brick1/testvol
Brick2: 10.70.47.174:/bricks/brick1/testvol
Brick3: 10.70.47.114:/bricks/brick2/testvol
Brick4: 10.70.47.174:/bricks/brick2/testvol
Brick5: 10.70.47.114:/bricks/brick3/testvol
Brick6: 10.70.47.174:/bricks/brick3/testvol
Brick7: 10.70.47.114:/bricks/brick4/testvol
Brick8: 10.70.47.174:/bricks/brick4/testvol
Brick9: 10.70.47.114:/bricks/brick5/testvol
Brick10: 10.70.47.174:/bricks/brick5/testvol
Brick11: 10.70.47.114:/bricks/brick6/testvol
Brick12: 10.70.47.174:/bricks/brick6/testvol
Brick13: 10.70.47.114:/bricks/brick7/testvol
Brick14: 10.70.47.174:/bricks/brick7/testvol
Brick15: 10.70.47.114:/bricks/brick8/testvol
Brick16: 10.70.47.174:/bricks/brick8/testvol
Brick17: 10.70.47.114:/bricks/brick9/testvol
Brick18: 10.70.47.174:/bricks/brick9/testvol
Brick19: 10.70.47.114:/bricks/brick10/testvol
Brick20: 10.70.47.174:/bricks/brick10/testvol
Brick21: 10.70.47.114:/bricks/brick11/testvol
Brick22: 10.70.47.174:/bricks/brick11/testvol
Brick23: 10.70.47.114:/bricks/brick12/testvol
Brick24: 10.70.47.174:/bricks/brick12/testvol
Options Reconfigured:
performance.readdir-ahead: on
[root at dht-rhs-24 ~]#
4. I started the rebalance, which failed.
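(The start command itself is not in the capture; it would have been the standard invocation:)
  gluster volume rebalance testvol start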
[root at dht-rhs-24 ~]# gluster v rebalance testvol status
        Node   Rebalanced-files     size   scanned   failures   skipped      status   run time in secs
   ---------   ----------------   ------   -------   --------   -------   ---------   ----------------
   localhost                  0   0Bytes         0          0         0   completed               2.00
10.70.47.114                  0   0Bytes         3          2         0      failed               1.00
volume rebalance: testvol: success:
[root at dht-rhs-24 ~]#
5. I noticed I didn't have the latest build, so I upgraded it; the update installed
successfully. The tail of the yum transaction output is below.
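(The exact update command wasn't captured either; presumably something along the lines of:)
  yum update 'glusterfs*'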
  Cleanup        : glusterfs-server-3.7.1-1.el6rhs.x86_64                  7/12
  Cleanup        : glusterfs-fuse-3.7.1-1.el6rhs.x86_64                    8/12
  Cleanup        : glusterfs-api-3.7.1-1.el6rhs.x86_64                     9/12
  Cleanup        : glusterfs-3.7.1-1.el6rhs.x86_64                        10/12
  Cleanup        : glusterfs-cli-3.7.1-1.el6rhs.x86_64                    11/12
  Cleanup        : glusterfs-libs-3.7.1-1.el6rhs.x86_64                   12/12
  Verifying      : glusterfs-api-3.7.1-3.el6rhs.x86_64                     1/12
  Verifying      : glusterfs-cli-3.7.1-3.el6rhs.x86_64                     2/12
  Verifying      : glusterfs-libs-3.7.1-3.el6rhs.x86_64                    3/12
  Verifying      : glusterfs-fuse-3.7.1-3.el6rhs.x86_64                    4/12
  Verifying      : glusterfs-3.7.1-3.el6rhs.x86_64                         5/12
  Verifying      : glusterfs-server-3.7.1-3.el6rhs.x86_64                  6/12
  Verifying      : glusterfs-api-3.7.1-1.el6rhs.x86_64                     7/12
  Verifying      : glusterfs-libs-3.7.1-1.el6rhs.x86_64                    8/12
  Verifying      : glusterfs-server-3.7.1-1.el6rhs.x86_64                  9/12
  Verifying      : glusterfs-cli-3.7.1-1.el6rhs.x86_64                    10/12
  Verifying      : glusterfs-fuse-3.7.1-1.el6rhs.x86_64                   11/12
  Verifying      : glusterfs-3.7.1-1.el6rhs.x86_64                        12/12
Updated:
glusterfs.x86_64 0:3.7.1-3.el6rhs
glusterfs-api.x86_64 0:3.7.1-3.el6rhs
glusterfs-cli.x86_64 0:3.7.1-3.el6rhs
glusterfs-fuse.x86_64 0:3.7.1-3.el6rhs
glusterfs-libs.x86_64 0:3.7.1-3.el6rhs
glusterfs-server.x86_64 0:3.7.1-3.el6rhs
Complete!
[root at dht-rhs-24 /]#
[root at dht-rhs-24 /]# rpm -qa | grep -i gluster
glusterfs-client-xlators-3.7.1-3.el6rhs.x86_64
glusterfs-3.7.1-3.el6rhs.x86_64
glusterfs-libs-3.7.1-3.el6rhs.x86_64
glusterfs-cli-3.7.1-3.el6rhs.x86_64
glusterfs-fuse-3.7.1-3.el6rhs.x86_64
samba-vfs-glusterfs-4.1.17-7.el6rhs.x86_64
glusterfs-api-3.7.1-3.el6rhs.x86_64
glusterfs-server-3.7.1-3.el6rhs.x86_64
[root at dht-rhs-24 /]#
6. I started the rebalance again with the force option.
This time the rebalance showed 'in progress' but was not processing any files.
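(The force restart would have used the standard syntax:)
  gluster volume rebalance testvol start force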
7. So I stopped the rebalance, which stopped successfully.
[root at dht-rhs-24 /]# gluster v rebalance testvol stop
        Node   Rebalanced-files     size   scanned   failures   skipped    status   run time in secs
   ---------   ----------------   ------   -------   --------   -------   -------   ----------------
   localhost                  0   0Bytes         0          0         0   stopped             847.00
10.70.47.114                  0   0Bytes         3          0         0   stopped             846.00
volume rebalance: testvol: success: rebalance process may be in the middle of a
file migration.
The process will be fully stopped once the migration of the file is complete.
Please check rebalance process for completion before doing any further brick
related tasks on the volume.
8.
[root at dht-rhs-24 /]# gluster v rebalance testvol status
volume rebalance: testvol: failed: Rebalance not started.
[root at dht-rhs-24 /]#
In log files:
[2015-06-16 17:50:55.370277] I [MSGID: 109029] [dht-rebalance.c:3056:gf_defrag_stop] 0-: Received stop command on rebalance
[2015-06-16 17:50:55.370382] I [MSGID: 109028] [dht-rebalance.c:3023:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 847.00 secs
[2015-06-16 17:50:55.370412] I [MSGID: 109028] [dht-rebalance.c:3027:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
9. Now, when I try to add another set of bricks, it says a rebalance is in
progress, even though both the console output and the log files clearly show
that it is stopped.
[root at dht-rhs-24 /]#
[root at dht-rhs-24 /]# gluster v add-brick testvol
10.70.47.114:/bricks/brick13/testvol 10.70.47.174:/bricks/brick13/testvol
10.70.47.114:/bricks/brick14/testvol 10.70.47.174:/bricks/brick14/testvol
10.70.47.114:/bricks/brick15/testvol 10.70.47.174:/bricks/brick15/testvol
10.70.47.114:/bricks/brick16/testvol 10.70.47.174:/bricks/brick16/testvol
10.70.47.114:/bricks/brick17/testvol 10.70.47.174:/bricks/brick17/testvol
10.70.47.114:/bricks/brick18/testvol 10.70.47.174:/bricks/brick18/testvol
volume add-brick: failed: Volume name testvol rebalance is in progress. Please
retry after completion
[root at dht-rhs-24 /]#
[root at dht-rhs-24 /]# gluster v rebalance testvol status
volume rebalance: testvol: failed: Rebalance not started.
[root at dht-rhs-24 /]#
10. Expected results:
1. Rebalance should progress and not get stuck.
2. Adding bricks should succeed once the rebalance is stopped.
NOTE: I did not see the crash/core reported in this bug. A similar rebalance
failure has been seen in earlier releases too, but it cannot be ascertained in
this case whether the issue is specific to EC volumes until we have an RCA for
it. Rebalance on normal volumes has otherwise been successful so far.