[Bugs] [Bug 1346854] New: Disk failed, can't (cleanly) remove brick.
bugzilla at redhat.com
Wed Jun 15 13:12:47 UTC 2016
https://bugzilla.redhat.com/show_bug.cgi?id=1346854
Bug ID: 1346854
Summary: Disk failed, can't (cleanly) remove brick.
Product: GlusterFS
Version: 3.7.11
Component: core
Severity: medium
Assignee: bugs at gluster.org
Reporter: phil at solidstatescientific.com
CC: bugs at gluster.org
Created attachment 1168372
--> https://bugzilla.redhat.com/attachment.cgi?id=1168372&action=edit
compressed (bzip2) tarball of /var/log/glusterfs on server with failed disk
Description of problem:
The following is cut-n-paste from an email to gluster-users at gluster.org. I
received a reply saying it looked like a bug and would I please submit a BZ.
So here it is.
---- vvvv ---- Begin email cut-n-paste ---- vvvv ----
Just started trying gluster, to decide if we want to put it into production.
Running version 3.7.11-1
Replicated, distributed volume, two servers, 20 bricks per server:
[root at storinator1 ~]# gluster volume status gv0
Status of volume: gv0
Gluster process                               TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick storinator1:/export/brick1/gv0          49153     0          Y       2554
Brick storinator2:/export/brick1/gv0          49153     0          Y       9686
Brick storinator1:/export/brick2/gv0          49154     0          Y       2562
Brick storinator2:/export/brick2/gv0          49154     0          Y       9708
Brick storinator1:/export/brick3/gv0          49155     0          Y       2568
Brick storinator2:/export/brick3/gv0          49155     0          Y       9692
Brick storinator1:/export/brick4/gv0          49156     0          Y       2574
Brick storinator2:/export/brick4/gv0          49156     0          Y       9765
Brick storinator1:/export/brick5/gv0          49173     0          Y       16901
Brick storinator2:/export/brick5/gv0          49173     0          Y       9727
Brick storinator1:/export/brick6/gv0          49174     0          Y       16920
Brick storinator2:/export/brick6/gv0          49174     0          Y       9733
Brick storinator1:/export/brick7/gv0          49175     0          Y       16939
Brick storinator2:/export/brick7/gv0          49175     0          Y       9739
Brick storinator1:/export/brick8/gv0          49176     0          Y       16958
Brick storinator2:/export/brick8/gv0          49176     0          Y       9703
Brick storinator1:/export/brick9/gv0          49177     0          Y       16977
Brick storinator2:/export/brick9/gv0          49177     0          Y       9713
Brick storinator1:/export/brick10/gv0         49178     0          Y       16996
Brick storinator2:/export/brick10/gv0         49178     0          Y       9718
Brick storinator1:/export/brick11/gv0         49179     0          Y       17015
Brick storinator2:/export/brick11/gv0         49179     0          Y       9746
Brick storinator1:/export/brick12/gv0         49180     0          Y       17034
Brick storinator2:/export/brick12/gv0         49180     0          Y       9792
Brick storinator1:/export/brick13/gv0         49181     0          Y       17053
Brick storinator2:/export/brick13/gv0         49181     0          Y       9755
Brick storinator1:/export/brick14/gv0         49182     0          Y       17072
Brick storinator2:/export/brick14/gv0         49182     0          Y       9767
Brick storinator1:/export/brick15/gv0         49183     0          Y       17091
Brick storinator2:/export/brick15/gv0         N/A       N/A        N       N/A
Brick storinator1:/export/brick16/gv0         49184     0          Y       17110
Brick storinator2:/export/brick16/gv0         49184     0          Y       9791
Brick storinator1:/export/brick17/gv0         49185     0          Y       17129
Brick storinator2:/export/brick17/gv0         49185     0          Y       9756
Brick storinator1:/export/brick18/gv0         49186     0          Y       17148
Brick storinator2:/export/brick18/gv0         49186     0          Y       9766
Brick storinator1:/export/brick19/gv0         49187     0          Y       17167
Brick storinator2:/export/brick19/gv0         49187     0          Y       9745
Brick storinator1:/export/brick20/gv0         49188     0          Y       17186
Brick storinator2:/export/brick20/gv0         49188     0          Y       9783
NFS Server on localhost                       2049      0          Y       17206
Self-heal Daemon on localhost                 N/A       N/A        Y       17214
NFS Server on storinator2                     2049      0          Y       9657
Self-heal Daemon on storinator2               N/A       N/A        Y       9677

Task Status of Volume gv0
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 28c733e9-d618-44fc-873f-405d3b29a609
Status               : completed
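For reference, the layout above is a 20 x 2 distributed-replicate volume, with
brick N on storinator1 paired with brick N on storinator2. A volume like that
would be created with something along these lines (not the exact command that
was used; the middle brick pairs are elided here):
  gluster volume create gv0 replica 2 \
      storinator1:/export/brick1/gv0 storinator2:/export/brick1/gv0 \
      storinator1:/export/brick2/gv0 storinator2:/export/brick2/gv0 \
      ...
      storinator1:/export/brick20/gv0 storinator2:/export/brick20/gv0
  gluster volume start gv0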
Wouldn't you know it, within a week or two of pulling the hardware together and
getting gluster installed and configured, a disk dies. Note the dead process
for brick15 on server storinator2.
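(The dead brick should also be visible on storinator2 itself, for instance with
something like the following; the log file name is assumed to follow the usual
convention of the brick path with slashes turned into dashes:
  ps ax | grep '[g]lusterfsd' | grep brick15                # no brick process left for brick15
  tail /var/log/glusterfs/bricks/export-brick15-gv0.log     # last errors from the failed brick
The full /var/log/glusterfs from that server is in the attachment.)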
I would like to remove (not replace) the failed brick (and its replica). (I
don't have a spare disk handy, and there's plenty of room on the other bricks.)
But gluster doesn't seem to want to remove a brick if the brick is dead:
[root at storinator1 ~]# gluster volume remove-brick gv0 storinator{1..2}:/export/brick15/gv0 start
volume remove-brick start: failed: Staging failed on storinator2. Error: Found stopped brick storinator2:/export/brick15/gv0
So what do I do? I can't remove the brick while the brick is bad, but I want
to remove the brick *because* the brick is bad. Bit of a Catch-22.
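For comparison, my understanding is that against a healthy brick pair the
removal would go through the usual start/status/commit sequence, roughly:
  gluster volume remove-brick gv0 storinator{1..2}:/export/brick15/gv0 start
  gluster volume remove-brick gv0 storinator{1..2}:/export/brick15/gv0 status
  gluster volume remove-brick gv0 storinator{1..2}:/export/brick15/gv0 commit
It's the start step (the one that would migrate data off the pair) that is
refused here because one of the two bricks is down.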
Thanks in advance for any help you can give.
---- ^^^^ ---- End email cut-n-paste ---- ^^^^ ----
Version-Release number of selected component (if applicable):
Huh?
How reproducible:
I haven't a clue how readily reproducible this is.
Steps to Reproduce:
1. Create a volume like the one described above.
2. Wait for a disk to fail. (You will, of course, want to force/fake a disk
failure if there's a way to do so; one possible way is sketched after this list.)
3. Attempt to remove the failed brick and its replica.
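One way to fake the failed-disk state for step 2 (so that the brick shows up as
stopped, the way brick15 does above) would presumably be to kill that brick's
glusterfsd process by hand; its PID is in the status output:
  gluster volume status gv0 storinator2:/export/brick15/gv0    # note the Pid column
  kill -9 <pid-of-that-brick-process>                          # run on storinator2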
Actual results:
Can't (cleanly) remove failed brick and its replica
Expected results:
Can (cleanly) remove failed brick and its replica
Additional info:
No idea if I got the component right; it's a guess based on my very limited
understanding of the gluster architecture.
Got the job done in a roundabout way: I tried a remove-brick force. That
worked, but of course it left the data from the removed bricks missing from the
volume. Since the replica brick on storinator1 was still sound, though, I was
able to copy the removed replica's contents back into the volume through its
mount point. This cumbersome but effective workaround is the reason I did not
select a higher Severity for this bug report.
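Spelled out, that workaround looked roughly like this (the mount point and the
use of rsync are illustrative rather than a transcript; the --exclude keeps
gluster's internal .glusterfs directory from being copied back into the volume):
  gluster volume remove-brick gv0 storinator{1..2}:/export/brick15/gv0 force
  mount -t glusterfs storinator1:/gv0 /mnt/gv0                       # client mount of the volume
  rsync -a --exclude=.glusterfs /export/brick15/gv0/ /mnt/gv0/       # surviving replica on storinator1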