[Bugs] [Bug 1224116] server crashed during rebalance in dht_selfheal_layout_new_directory

bugzilla at redhat.com bugzilla at redhat.com
Tue Jun 16 12:40:50 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1224116

Amit Chaurasia <achauras at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ON_QA                       |ASSIGNED
                 CC|                            |achauras at redhat.com



--- Comment #3 from Amit Chaurasia <achauras at redhat.com> ---
Marking this bug as failed because the rebalance could not complete successfully.

Here is what I did:

1. Created a distributed-disperse volume as below (the exact create command was not captured; an equivalent command is sketched after the volume info):
[root at dht-rhs-24 ~]# gluster v info

Volume Name: testvol
Type: Distributed-Disperse
Volume ID: 40b53648-3a68-4c15-baf2-ccb04cd82356
Status: Started
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.47.114:/bricks/brick1/testvol
Brick2: 10.70.47.174:/bricks/brick1/testvol
Brick3: 10.70.47.114:/bricks/brick2/testvol
Brick4: 10.70.47.174:/bricks/brick2/testvol
Brick5: 10.70.47.114:/bricks/brick3/testvol
Brick6: 10.70.47.174:/bricks/brick3/testvol
Brick7: 10.70.47.114:/bricks/brick4/testvol
Brick8: 10.70.47.174:/bricks/brick4/testvol
Brick9: 10.70.47.114:/bricks/brick5/testvol
Brick10: 10.70.47.174:/bricks/brick5/testvol
Brick11: 10.70.47.114:/bricks/brick6/testvol
Brick12: 10.70.47.174:/bricks/brick6/testvol
Options Reconfigured:
performance.readdir-ahead: on
[root at dht-rhs-24 ~]# 
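
The volume create command is not captured in the report; based on the layout above, an equivalent create would look roughly like the sketch below (the gluster CLI syntax is real, the exact invocation is an assumption):

# Sketch only: reconstructs the 2 x (4 + 2) layout shown above; not taken from the report.
# gluster may warn that bricks of one disperse set share a host; appending "force" would
# bypass that prompt.
gluster volume create testvol disperse 6 redundancy 2 \
    10.70.47.114:/bricks/brick1/testvol 10.70.47.174:/bricks/brick1/testvol \
    10.70.47.114:/bricks/brick2/testvol 10.70.47.174:/bricks/brick2/testvol \
    10.70.47.114:/bricks/brick3/testvol 10.70.47.174:/bricks/brick3/testvol \
    10.70.47.114:/bricks/brick4/testvol 10.70.47.174:/bricks/brick4/testvol \
    10.70.47.114:/bricks/brick5/testvol 10.70.47.174:/bricks/brick5/testvol \
    10.70.47.114:/bricks/brick6/testvol 10.70.47.174:/bricks/brick6/testvol
gluster volume start testvol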


2. FUSE-mounted the volume on 2 clients and created files of 16M each; the total data
on the volume was about 35G (commands assumed for this step are sketched after the df output):

10.70.47.174:/testvol
                      109G   35G   74G  33% /mnt/glusterfs
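
The exact mount and file-creation commands are not in the report; a minimal sketch of what this step would look like on each client is below (mount target, file names, loop count and data source are assumptions; only the 16M file size and the ~35G total come from the report):

# Assumed reproduction of step 2, run on each of the 2 clients.
mount -t glusterfs 10.70.47.174:/testvol /mnt/glusterfs
# ~1100 x 16M files per client gives roughly 35G in total across both clients.
for i in $(seq 1 1100); do
    dd if=/dev/urandom of=/mnt/glusterfs/file_${i} bs=1M count=16
done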


3. After file creation, I added a set of bricks to the volume:

[root at dht-rhs-24 ~]# gluster v add-brick testvol
10.70.47.114:/bricks/brick7/testvol 10.70.47.174:/bricks/brick7/testvol
10.70.47.114:/bricks/brick8/testvol 10.70.47.174:/bricks/brick8/testvol
10.70.47.114:/bricks/brick9/testvol 10.70.47.174:/bricks/brick9/testvol
10.70.47.114:/bricks/brick10/testvol 10.70.47.174:/bricks/brick10/testvol
10.70.47.114:/bricks/brick11/testvol 10.70.47.174:/bricks/brick11/testvol
10.70.47.114:/bricks/brick12/testvol 10.70.47.174:/bricks/brick12/testvol 

volume add-brick: success
[root at dht-rhs-24 ~]# 
[root at dht-rhs-24 ~]# 
[root at dht-rhs-24 ~]# gluster v info

Volume Name: testvol
Type: Distributed-Disperse
Volume ID: 40b53648-3a68-4c15-baf2-ccb04cd82356
Status: Started
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: 10.70.47.114:/bricks/brick1/testvol
Brick2: 10.70.47.174:/bricks/brick1/testvol
Brick3: 10.70.47.114:/bricks/brick2/testvol
Brick4: 10.70.47.174:/bricks/brick2/testvol
Brick5: 10.70.47.114:/bricks/brick3/testvol
Brick6: 10.70.47.174:/bricks/brick3/testvol
Brick7: 10.70.47.114:/bricks/brick4/testvol
Brick8: 10.70.47.174:/bricks/brick4/testvol
Brick9: 10.70.47.114:/bricks/brick5/testvol
Brick10: 10.70.47.174:/bricks/brick5/testvol
Brick11: 10.70.47.114:/bricks/brick6/testvol
Brick12: 10.70.47.174:/bricks/brick6/testvol
Brick13: 10.70.47.114:/bricks/brick7/testvol
Brick14: 10.70.47.174:/bricks/brick7/testvol
Brick15: 10.70.47.114:/bricks/brick8/testvol
Brick16: 10.70.47.174:/bricks/brick8/testvol
Brick17: 10.70.47.114:/bricks/brick9/testvol
Brick18: 10.70.47.174:/bricks/brick9/testvol
Brick19: 10.70.47.114:/bricks/brick10/testvol
Brick20: 10.70.47.174:/bricks/brick10/testvol
Brick21: 10.70.47.114:/bricks/brick11/testvol
Brick22: 10.70.47.174:/bricks/brick11/testvol
Brick23: 10.70.47.114:/bricks/brick12/testvol
Brick24: 10.70.47.174:/bricks/brick12/testvol
Options Reconfigured:
performance.readdir-ahead: on
[root at dht-rhs-24 ~]# 


4. I started the rebalance, which failed (the assumed commands are sketched after the status output below).
[root at dht-rhs-24 ~]# gluster v rebalance testvol status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0            completed               2.00
                            10.70.47.114                0        0Bytes             3             2             0               failed               1.00
volume rebalance: testvol: success: 
[root at dht-rhs-24 ~]# 
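
The start command itself is not captured above, only the status output; the assumed sequence for this step (real gluster CLI syntax, exact invocation assumed) is:

# Assumed commands for step 4; only the resulting status output appears in the report.
gluster volume rebalance testvol start
gluster volume rebalance testvol status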

5. I noticed I did not have the latest build, so I upgraded it; the update installed
successfully. The assumed update command and the tail of the yum transcript follow.
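
The update command itself is not captured; judging from the yum transcript below, it was something along these lines (the package glob is an assumption, it may simply have been a full "yum update"):

# Assumed update command; the Cleanup/Verifying lines below are the tail of this transaction.
yum update 'glusterfs*'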
  Cleanup    : glusterfs-server-3.7.1-1.el6rhs.x86_64                        7/12 
  Cleanup    : glusterfs-fuse-3.7.1-1.el6rhs.x86_64                          8/12 
  Cleanup    : glusterfs-api-3.7.1-1.el6rhs.x86_64                           9/12 
  Cleanup    : glusterfs-3.7.1-1.el6rhs.x86_64                              10/12 
  Cleanup    : glusterfs-cli-3.7.1-1.el6rhs.x86_64                          11/12 
  Cleanup    : glusterfs-libs-3.7.1-1.el6rhs.x86_64                         12/12 
  Verifying  : glusterfs-api-3.7.1-3.el6rhs.x86_64                           1/12 
  Verifying  : glusterfs-cli-3.7.1-3.el6rhs.x86_64                           2/12 
  Verifying  : glusterfs-libs-3.7.1-3.el6rhs.x86_64                          3/12 
  Verifying  : glusterfs-fuse-3.7.1-3.el6rhs.x86_64                          4/12 
  Verifying  : glusterfs-3.7.1-3.el6rhs.x86_64                               5/12 
  Verifying  : glusterfs-server-3.7.1-3.el6rhs.x86_64                        6/12 
  Verifying  : glusterfs-api-3.7.1-1.el6rhs.x86_64                           7/12 
  Verifying  : glusterfs-libs-3.7.1-1.el6rhs.x86_64                          8/12 
  Verifying  : glusterfs-server-3.7.1-1.el6rhs.x86_64                        9/12 
  Verifying  : glusterfs-cli-3.7.1-1.el6rhs.x86_64                          10/12 
  Verifying  : glusterfs-fuse-3.7.1-1.el6rhs.x86_64                         11/12 
  Verifying  : glusterfs-3.7.1-1.el6rhs.x86_64                              12/12 

Updated:
  glusterfs.x86_64 0:3.7.1-3.el6rhs                                             
  glusterfs-api.x86_64 0:3.7.1-3.el6rhs                                         
  glusterfs-cli.x86_64 0:3.7.1-3.el6rhs                                         
  glusterfs-fuse.x86_64 0:3.7.1-3.el6rhs                                        
  glusterfs-libs.x86_64 0:3.7.1-3.el6rhs                                        
  glusterfs-server.x86_64 0:3.7.1-3.el6rhs                                      

Complete!
[root at dht-rhs-24 /]# 
[root at dht-rhs-24 /]# rpm -qa | grep -i gluster
glusterfs-client-xlators-3.7.1-3.el6rhs.x86_64
glusterfs-3.7.1-3.el6rhs.x86_64
glusterfs-libs-3.7.1-3.el6rhs.x86_64
glusterfs-cli-3.7.1-3.el6rhs.x86_64
glusterfs-fuse-3.7.1-3.el6rhs.x86_64
samba-vfs-glusterfs-4.1.17-7.el6rhs.x86_64
glusterfs-api-3.7.1-3.el6rhs.x86_64
glusterfs-server-3.7.1-3.el6rhs.x86_64
[root at dht-rhs-24 /]# 

6. I started the rebalance again with the force option (the assumed command is sketched below).

This time, the rebalance status showed it as in progress, but it was not processing any files.
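
The force invocation itself is not shown in the transcript; the assumed command ("start force" is the documented way to force a rebalance) is:

# Assumed command for step 6.
gluster volume rebalance testvol start force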

7. So I stopped the rebalance; it stopped successfully.

[root at dht-rhs-24 /]# gluster v rebalance testvol stop
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             0             0              stopped             847.00
                            10.70.47.114                0        0Bytes             3             0             0              stopped             846.00
volume rebalance: testvol: success: rebalance process may be in the middle of a
file migration.
The process will be fully stopped once the migration of the file is complete.
Please check rebalance process for completion before doing any further brick
related tasks on the volume.


8.
[root at dht-rhs-24 /]# gluster v rebalance testvol status
volume rebalance: testvol: failed: Rebalance not started.
[root at dht-rhs-24 /]# 

In the log files:


[2015-06-16 17:50:55.370277] I [MSGID: 109029]
[dht-rebalance.c:3056:gf_defrag_stop] 0-: Received stop command on rebalance
[2015-06-16 17:50:55.370382] I [MSGID: 109028]
[dht-rebalance.c:3023:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped.
Time taken is 847.00 secs
[2015-06-16 17:50:55.370412] I [MSGID: 109028]
[dht-rebalance.c:3027:gf_defrag_status_get] 0-glusterfs: Files migrated: 0,
size: 0, lookups: 0, failures: 0, skipped: 0


9. Now, when I try to add another set of bricks, the CLI reports that a rebalance is in
progress, even though both the console output and the log files clearly show that it has
stopped (a hedged cross-check is sketched after the output below).
[root at dht-rhs-24 /]# 
[root at dht-rhs-24 /]# gluster v add-brick testvol
10.70.47.114:/bricks/brick13/testvol 10.70.47.174:/bricks/brick13/testvol
10.70.47.114:/bricks/brick14/testvol 10.70.47.174:/bricks/brick14/testvol
10.70.47.114:/bricks/brick15/testvol 10.70.47.174:/bricks/brick15/testvol
10.70.47.114:/bricks/brick16/testvol 10.70.47.174:/bricks/brick16/testvol
10.70.47.114:/bricks/brick17/testvol 10.70.47.174:/bricks/brick17/testvol
10.70.47.114:/bricks/brick18/testvol 10.70.47.174:/bricks/brick18/testvol
volume add-brick: failed: Volume name testvol rebalance is in progress. Please
retry after completion
[root at dht-rhs-24 /]# 
[root at dht-rhs-24 /]# gluster v rebalance testvol status
volume rebalance: testvol: failed: Rebalance not started.
[root at dht-rhs-24 /]# 
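
As a cross-check (not run during this test): the task-status section of "gluster volume status" should show the rebalance task as glusterd itself records it, which appears to be the state the add-brick check is reading; comparing it with the rebalance status output above should make the mismatch visible.

# Suggested cross-check only; not part of the original reproduction steps.
gluster volume status testvol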

10. Expected results:

1. The rebalance should make progress and not get stuck.
2. Adding bricks should succeed once the rebalance is stopped.


NOTE: I did not see the crash/core reported in this bug. A similar rebalance failure has
been seen in earlier releases too, but it cannot be ascertained whether this issue is
specific to EC (disperse) volumes until we have an RCA. Rebalance on normal volumes has
otherwise been successful so far.
