[Gluster-users] remove-brick removed unexpected bricks

Cool coolbsd at hotmail.com
Thu Aug 15 05:41:23 UTC 2013


I'm going to stop debugging this, as I still cannot figure out how to 
reproduce the problem reliably enough to debug it further. I did 4~5 
rounds of tests (all from scratch) yesterday and today, but only hit the 
problem once, on Monday afternoon; repeating the same steps didn't give 
the same result. I also checked the logs and found nothing wrong, except 
that rebalance was happening on the wrong bricks.

I will raise this again if I can gather any useful information.

-C.B.

On 8/13/2013 7:00 AM, Cool wrote:
> Thanks Ravi, I managed to reproduce the issue twice in the past several 
> days, but without anything significant in the logs. 'gluster volume 
> info' before and after shows the expected bricks (i.e. sdd1 got removed, 
> even though its data was not migrated out), while rebalance.log says it 
> was migrating data out of sdc1, not sdd1.
>
> I'm making another attempt now with -L TRACE to see if I can get more 
> log information. This will take some time; I will post here if I find 
> anything helpful.
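>
> (For reference, in case anyone wants to do the same without restarting 
> the daemons by hand: I believe the log level can also be raised per 
> volume via the diagnostics options, e.g.
>
>     # gluster volume set gfs_v0 diagnostics.brick-log-level TRACE
>     # gluster volume set gfs_v0 diagnostics.client-log-level TRACE
>
> though I haven't verified whether the rebalance process picks these up 
> on 3.4.0.)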
>
> -C.B.
> On 8/13/2013 6:49 AM, Ravishankar N wrote:
>> On 08/13/2013 06:21 PM, Cool wrote:
>>> I'm pretty sure I ran "watch ... remove-brick ... status" until it 
>>> reported that everything was completed before triggering the commit; 
>>> I should have made that clear in my previous mail.
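>>>
>>> To be concrete, the command I watched was along these lines (same 
>>> volume and brick names as in my original mail):
>>>
>>>     # watch gluster volume remove-brick gfs_v0 gfs11:/sdd1 gfs12:/sdd1 status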
>>>
>>> Actually you can read my mail again - in step #5, files on /sdc1 got 
>>> migrated instead of /sdd1, even though my command was trying to 
>>> remove-brick /sdd1, 
>> Ah, my bad. Got it now. This is strange..
>>> this is what caused the problem, as far as I can tell: data on /sdc1 
>>> was migrated to /sdb1 and /sdd1, and then the commit simply removed 
>>> /sdd1 from gfs_v0. It seems the volume definition information in 
>>> gluster got mixed up somewhere.
>> If you are able to reproduce the issue, does 'gluster volume info' 
>> show the correct bricks before and after the start/status/commit 
>> operations for removing sdd1? You could also check whether there are 
>> any error messages in /var/log/glusterfs/<volname>-rebalance.log.
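>>
>> With your volume gfs_v0, something along these lines should surface 
>> anything obvious:
>>
>>     # gluster volume info gfs_v0
>>     # grep " E " /var/log/glusterfs/gfs_v0-rebalance.log
>>
>> (the " E " pattern is just a rough filter for error-level lines in the 
>> gluster log format)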
>>
>> -Ravi
>>>
>>> -C.B.
>>>
>>> On 8/12/2013 9:51 PM, Ravishankar N wrote:
>>>> On 08/13/2013 03:43 AM, Cool wrote:
>>>>> remove-brick in 3.4.0 seems to remove the wrong bricks; can someone 
>>>>> help review the environment/steps below to see if I did anything stupid?
>>>>>
>>>>> setup - Ubuntu 12.04 LTS on gfs11 and gfs12, with the following 
>>>>> packages from the PPA; both nodes have 3 xfs partitions (sdb1, sdc1, 
>>>>> sdd1):
>>>>> ii  glusterfs-client  3.4.0final-ubuntu1~precise1  clustered file-system (client package)
>>>>> ii  glusterfs-common  3.4.0final-ubuntu1~precise1  GlusterFS common libraries and translator modules
>>>>> ii  glusterfs-server  3.4.0final-ubuntu1~precise1  clustered file-system (server package)
>>>>>
>>>>> steps to reproduce the problem (rough commands below):
>>>>> 1. create volume gfs_v0 as replica 2 with gfs11:/sdb1 and gfs12:/sdb1
>>>>> 2. add-brick gfs11:/sdc1 and gfs12:/sdc1
>>>>> 3. add-brick gfs11:/sdd1 and gfs12:/sdd1
>>>>> 4. rebalance so that files are distributed across all three pairs of 
>>>>> disks
>>>>> 5. remove-brick gfs11:/sdd1 and gfs12:/sdd1 start - files on 
>>>>> ***/sdc1*** start migrating out
>>>>> 6. remove-brick commit leads to data loss in gfs_v0
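>>>>>
>>>>> Roughly, the commands were along these lines (assuming the xfs 
>>>>> partitions are mounted at /sdb1, /sdc1 and /sdd1 on both nodes):
>>>>>
>>>>>     # gluster volume create gfs_v0 replica 2 gfs11:/sdb1 gfs12:/sdb1
>>>>>     # gluster volume start gfs_v0
>>>>>     # gluster volume add-brick gfs_v0 gfs11:/sdc1 gfs12:/sdc1
>>>>>     # gluster volume add-brick gfs_v0 gfs11:/sdd1 gfs12:/sdd1
>>>>>     # gluster volume rebalance gfs_v0 start
>>>>>     # gluster volume remove-brick gfs_v0 gfs11:/sdd1 gfs12:/sdd1 start
>>>>>     # gluster volume remove-brick gfs_v0 gfs11:/sdd1 gfs12:/sdd1 commit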
>>>>>
>>>>> If between steps 5 and 6 I initiate a remove-brick targeting /sdc1, 
>>>>> then after the commit I do not lose anything, since all data gets 
>>>>> migrated back to /sdb1.
>>>>>
>>>>
>>>> You should ensure that a 'remove-brick start' has completed, and then 
>>>> commit it, before initiating the second one. The correct way to do 
>>>> this would be:
>>>> 5.   # gluster volume remove-brick gfs_v0 gfs11:/sdd1 gfs12:/sdd1 start
>>>> 6.   Check that the data migration has been completed using the 
>>>>      status command:
>>>>      # gluster volume remove-brick gfs_v0 gfs11:/sdd1 gfs12:/sdd1 status
>>>> 7.   # gluster volume remove-brick gfs_v0 gfs11:/sdd1 gfs12:/sdd1 commit
>>>> 8.   # gluster volume remove-brick gfs_v0 gfs11:/sdc1 gfs12:/sdc1 start
>>>> 9.   # gluster volume remove-brick gfs_v0 gfs11:/sdc1 gfs12:/sdc1 status
>>>> 10.  # gluster volume remove-brick gfs_v0 gfs11:/sdc1 gfs12:/sdc1 commit
>>>>
>>>> This would leave you with the original replica 2 volume that you 
>>>> started with. Hope this helps.
>>>>
>>>> Note:
>>>> The latest version of glusterfs has a check that prevents a second 
>>>> remove-brick operation until the first one has been committed. (You 
>>>> would receive a message like: "volume remove-brick start: failed: An 
>>>> earlier remove-brick task exists for volume <volname>. Either commit 
>>>> it or stop it before starting a new task.")
>>>>
>>>> -Ravi
>>>>
>>>>
>>>>> -C.B.
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>
>



