[Bugs] [Bug 1310972] New: After GlusterD restart, Remove-brick commit happening even though data migration not completed.

bugzilla at redhat.com bugzilla at redhat.com
Tue Feb 23 05:42:57 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1310972

            Bug ID: 1310972
           Summary: After GlusterD restart,  Remove-brick commit happening
                    even though data migration not completed.
           Product: GlusterFS
           Version: 3.7.8
         Component: glusterd
          Keywords: Triaged
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: amukherj at redhat.com
                CC: bsrirama at redhat.com, bugs at gluster.org,
                    storage-qa-internal at redhat.com
        Depends On: 1303028, 1303125, 1303269



+++ This bug was initially created as a clone of Bug #1303269 +++

+++ This bug was initially created as a clone of Bug #1303125 +++

Description of problem:
=======================
Have a two-node cluster with a Distributed-Replicate volume, mounted via FUSE with
enough data. Started removing a replica brick set, which triggered a rebalance.
While the rebalance was in progress, glusterd was restarted on the node from which
data migration was happening. After that, a remove-brick commit was attempted; the
commit succeeded even though data migration had not completed.


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-17


How reproducible:
=================
Every time


Steps to Reproduce:
====================
1. Have a two-node cluster with a Distributed-Replicate volume (2 x 2).
2. Mount the volume via FUSE and write enough data.
3. Start removing a replica brick set. // this triggers data migration
4. Using remove-brick status, identify the brick node from which data migration
is happening.
5. Restart glusterd on the node identified in step 4 while the rebalance is in
progress.
6. Try to commit the remove-brick. // the commit succeeds when it should not

Actual results:
===============
remove-brick commit happens even though the rebalance has not completed.


Expected results:
=================
remove-brick commit should not happen while the rebalance is in progress.

Additional info:

--- Additional comment from Byreddy on 2016-01-29 10:55:45 EST ---

[root at dhcp42-84 ~]# gluster volume status
Status of volume: Dis-Rep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.84:/bricks/brick0/smp0       49157     0          Y       18500
Brick 10.70.43.6:/bricks/brick0/smp1        49162     0          Y       19368
Brick 10.70.42.84:/bricks/brick1/smp2       49158     0          Y       18519
Brick 10.70.43.6:/bricks/brick1/smp3        49163     0          Y       19387
NFS Server on localhost                     2049      0          Y       18541
Self-heal Daemon on localhost               N/A       N/A        Y       18546
NFS Server on 10.70.43.6                    2049      0          Y       19409
Self-heal Daemon on 10.70.43.6              N/A       N/A        Y       19414

Task Status of Volume Dis-Rep
------------------------------------------------------------------------------
There are no active volume tasks

[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# gluster peer status
Number of Peers: 1

Hostname: 10.70.43.6
Uuid: 2f8a267c-7e7c-488f-98b9-f816062aae58
State: Peer in Cluster (Connected)
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# gluster volume remove-brick Dis-Rep replica 2 10.70.42.84:/bricks/brick1/smp2 10.70.43.6:/bricks/brick1/smp3 start
volume remove-brick start: success
ID: fd0164f8-2cba-4b25-b881-bbeb7b323695
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# gluster volume remove-brick Dis-Rep replica 2 10.70.42.84:/bricks/brick1/smp2 10.70.43.6:/bricks/brick1/smp3 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost               59       351.4KB          417             0             0          in progress               7.00
                              10.70.43.6                0        0Bytes            0             0             0          in progress               7.00
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# gluster volume remove-brick Dis-Rep replica 2 10.70.42.84:/bricks/brick1/smp2 10.70.43.6:/bricks/brick1/smp3 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost               93       511.0KB          627             0             0          in progress              11.00
                              10.70.43.6                0        0Bytes            0             0             0          in progress              11.00
[root at dhcp42-84 ~]# gluster volume remove-brick Dis-Rep replica 2 10.70.42.84:/bricks/brick1/smp2 10.70.43.6:/bricks/brick1/smp3 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              113       569.2KB          710             0             0          in progress              13.00
                              10.70.43.6                0        0Bytes            0             0             0            completed              12.00
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# gluster volume remove-brick Dis-Rep replica 2 10.70.42.84:/bricks/brick1/smp2 10.70.43.6:/bricks/brick1/smp3 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: failed: use 'force' option as migration is in progress
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# systemctl restart glusterd
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# gluster volume remove-brick Dis-Rep replica 2 10.70.42.84:/bricks/brick1/smp2 10.70.43.6:/bricks/brick1/smp3 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes            0             0             0          in progress               0.00
                              10.70.43.6                0        0Bytes            0             0             0            completed              12.00
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# gluster volume remove-brick Dis-Rep replica 2 10.70.42.84:/bricks/brick1/smp2 10.70.43.6:/bricks/brick1/smp3 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success
Check the removed bricks to ensure all files are migrated.
If files with data are found on the brick path, copy them via a gluster mount point before re-purposing the removed brick.
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# 
[root at dhcp42-84 ~]# gluster volume status
Status of volume: Dis-Rep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.84:/bricks/brick0/smp0       49157     0          Y       18500
Brick 10.70.43.6:/bricks/brick0/smp1        49162     0          Y       19368
NFS Server on localhost                     2049      0          Y       19014
Self-heal Daemon on localhost               N/A       N/A        Y       19022
NFS Server on 10.70.43.6                    2049      0          Y       19582
Self-heal Daemon on 10.70.43.6              N/A       N/A        Y       19590

Task Status of Volume Dis-Rep
------------------------------------------------------------------------------
There are no active volume tasks

--- Additional comment from Vijay Bellur on 2016-01-29 22:21:00 EST ---

REVIEW: http://review.gluster.org/13323 (glusterd: set
decommission_is_in_progress flag for inprogress remove-brick op on glusterd
restart) posted (#1) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-01-30 01:50:12 EST ---

REVIEW: http://review.gluster.org/13323 (glusterd: set
decommission_is_in_progress flag for inprogress remove-brick op on glusterd
restart) posted (#2) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-02-01 23:34:07 EST ---

REVIEW: http://review.gluster.org/13323 (glusterd: set
decommission_is_in_progress flag for inprogress remove-brick op on glusterd
restart) posted (#3) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-02-23 00:42:34 EST ---

COMMIT: http://review.gluster.org/13323 committed in master by Atin Mukherjee
(amukherj at redhat.com) 
------
commit 3ca140f011faa9d92a4b3889607fefa33ae6de76
Author: Atin Mukherjee <amukherj at redhat.com>
Date:   Sat Jan 30 08:47:35 2016 +0530

    glusterd: set decommission_is_in_progress flag for inprogress remove-brick op on glusterd restart

    While remove-brick is in progress, if glusterd is restarted, the decommission
    flag is not persisted in the store, so its value is not retained on restore.
    As a result, glusterd does not block remove-brick commit even when the
    rebalance is still in progress.

    Change-Id: Ibbf12f3792d65ab1293fad1e368568be141a1cd6
    BUG: 1303269
    Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-on: http://review.gluster.org/13323
    Smoke: Gluster Build System <jenkins at build.gluster.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Gaurav Kumar Garg <ggarg at redhat.com>
    Reviewed-by: mohammed rafi  kc <rkavunga at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1303028
[Bug 1303028] Tiering status and rebalance status stops getting updated
https://bugzilla.redhat.com/show_bug.cgi?id=1303125
[Bug 1303125] After GlusterD restart,  Remove-brick commit happening even
though data migration not completed.
https://bugzilla.redhat.com/show_bug.cgi?id=1303269
[Bug 1303269] After GlusterD restart,  Remove-brick commit happening even
though data migration not completed.