[Gluster-users] gluster remove-brick
mohammad kashif
kashif.alig at gmail.com
Mon Feb 4 13:23:03 UTC 2019
Hi Nithya
I tried attaching the logs but they were too big, so I have put them in a
Google Drive folder accessible by everyone:
https://drive.google.com/drive/folders/1744WcOfrqe_e3lRPxLpQ-CBuXHp_o44T?usp=sharing
I am sharing the rebalance logs, which cover the period when I ran
fix-layout after adding the new nodes and then started the remove-brick
operation.
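For reference, the sequence of commands was roughly the following (a
sketch using the volume and brick names shown further down; the exact
hostnames are obfuscated):

  gluster volume rebalance atlasglust fix-layout start
  gluster volume rebalance atlasglust status
  gluster volume remove-brick atlasglust \
      pplxgluster07.**:/glusteratlas/brick007/gv0 start
  gluster volume remove-brick atlasglust \
      pplxgluster07.**:/glusteratlas/brick007/gv0 status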
All of the nodes have at least 8 TB of disk space available:
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/sdb         73T   65T   8.0T   90%  /glusteratlas/brick001
/dev/sdb         73T   65T   8.0T   90%  /glusteratlas/brick002
/dev/sdb         73T   65T   8.0T   90%  /glusteratlas/brick003
/dev/sdb         73T   65T   8.0T   90%  /glusteratlas/brick004
/dev/sdb         73T   65T   8.0T   90%  /glusteratlas/brick005
/dev/sdb         80T   67T    14T   83%  /glusteratlas/brick006
/dev/sdb         37T  1.6T    35T    5%  /glusteratlas/brick007
/dev/sdb         89T   15T    75T   17%  /glusteratlas/brick008
/dev/sdb         89T   14T    76T   16%  /glusteratlas/brick009
brick007 is the one I am removing
gluster volume info
Volume Name: atlasglust
Type: Distribute
Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
Status: Started
Snapshot Count: 0
Number of Bricks: 9
Transport-type: tcp
Bricks:
Brick1: pplxgluster01**:/glusteratlas/brick001/gv0
Brick2: pplxgluster02.**:/glusteratlas/brick002/gv0
Brick3: pplxgluster03.**:/glusteratlas/brick003/gv0
Brick4: pplxgluster04.**:/glusteratlas/brick004/gv0
Brick5: pplxgluster05.**:/glusteratlas/brick005/gv0
Brick6: pplxgluster06.**:/glusteratlas/brick006/gv0
Brick7: pplxgluster07.**:/glusteratlas/brick007/gv0
Brick8: pplxgluster08.**:/glusteratlas/brick008/gv0
Brick9: pplxgluster09.**:/glusteratlas/brick009/gv0
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
auth.allow: ***
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.md-cache-timeout: 600
performance.parallel-readdir: off
performance.cache-size: 1GB
performance.client-io-threads: on
cluster.lookup-optimize: on
client.event-threads: 4
server.event-threads: 4
performance.cache-invalidation: on
diagnostics.brick-log-level: WARNING
diagnostics.client-log-level: WARNING
Thanks
On Mon, Feb 4, 2019 at 11:37 AM Nithya Balachandran <nbalacha at redhat.com>
wrote:
> Hi,
>
>
> On Mon, 4 Feb 2019 at 16:39, mohammad kashif <kashif.alig at gmail.com>
> wrote:
>
>> Hi Nithya
>>
>> Thanks for replying so quickly. It is very much appreciated.
>>
>> There are lots of "[No space left on device]" errors, which I cannot
>> understand as there is plenty of space on all of the nodes.
>>
>
> This means that Gluster could not find sufficient space for the file.
> Would you be willing to share your rebalance log file?
> Please provide the following information (a sketch of the commands to
> gather it follows the list):
>
> - The gluster version
> - The gluster volume info for the volume
> - How full are the individual bricks for the volume?
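>
> A minimal sketch of commands that would gather this information,
> assuming the volume name atlasglust and the brick mount points listed
> in the reply above:
>
>   gluster --version
>   gluster volume info atlasglust
>   df -h /glusteratlas/brick007    # repeat for each brick mount point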
>
>
>
>> A little bit of background will be useful in this case. I had a cluster
>> of seven nodes of varying capacity (73, 73, 73, 46, 46, 46, 46 TB). The
>> cluster was almost 90% full, so every node had roughly 8 to 15 TB of
>> free space. I added two new nodes with 100 TB each and ran fix-layout,
>> which completed successfully.
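>>
>> A minimal sketch of that expansion step, assuming the two new bricks
>> are brick008 and brick009 as listed in the volume info above
>> (hostnames obfuscated); fix-layout then ran as already shown:
>>
>>   gluster volume add-brick atlasglust \
>>       pplxgluster08.**:/glusteratlas/brick008/gv0 \
>>       pplxgluster09.**:/glusteratlas/brick009/gv0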
>>
>> After that I started the remove-brick operation. I don't think that at
>> any point any of the nodes were 100% full. Looking at my Ganglia graphs,
>> there was always a minimum of 5 TB available on every node.
>>
>> I was keeping an eye on the remove-brick status; for a very long time
>> there were no failures, and then at some point these 17000 failures
>> appeared and stayed at that level.
>>
>> Thanks
>>
>> Kashif
>>
>>
>> On Mon, Feb 4, 2019 at 5:09 AM Nithya Balachandran <nbalacha at redhat.com>
>> wrote:
>>
>>> Hi,
>>>
>>> The status shows quite a few failures. Please check the rebalance logs
>>> to see why that happened. We can decide what to do based on the errors.
>>> Once you run a commit, the brick will no longer be part of the volume
>>> and you will not be able to access those files via the client.
>>> Do you have sufficient space on the remaining bricks for the files on
>>> the removed brick?
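>>>
>>> A minimal sketch of the relevant commands, assuming the volume and
>>> brick names used elsewhere in this thread (check the status output and
>>> the rebalance log errors before committing):
>>>
>>>   gluster volume remove-brick atlasglust \
>>>       nodename:/glusteratlas/brick007/gv0 status
>>>   gluster volume remove-brick atlasglust \
>>>       nodename:/glusteratlas/brick007/gv0 commit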
>>>
>>> Regards,
>>> Nithya
>>>
>>> On Mon, 4 Feb 2019 at 03:50, mohammad kashif <kashif.alig at gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I have a pure distributed gluster volume with nine nodes and am trying
>>>> to remove one node. I ran:
>>>> gluster volume remove-brick atlasglust
>>>> nodename:/glusteratlas/brick007/gv0 start
>>>>
>>>> It completed, but with around 17000 failures:
>>>>
>>>> Node      Rebalanced-files   size     scanned   failures   skipped   status      run time in h:m:s
>>>> --------  ----------------   ------   -------   --------   -------   ---------   -----------------
>>>> nodename  4185858            27.5TB   6746030   17488      0         completed   405:15:34
>>>>
>>>> I can see that there is still 1.5 TB of data on the node which I was
>>>> trying to remove.
>>>>
>>>> I am not sure what to do now. Should I run the remove-brick command
>>>> again so that the files which failed can be retried?
>>>>
>>>> Or should I run commit first and then try to remove the node again?
>>>>
>>>> Please advise, as I don't want to lose any files.
>>>>
>>>> Thanks
>>>>
>>>> Kashif
>>>>
>>>>
>>>>