[Gluster-users] Advice on rebuilding underlying filesystem

Andrew Smith smith.andrew.james at gmail.com
Fri Apr 11 22:22:37 UTC 2014


That sounds good. So, for a final question: I did a remove-brick
and then added back a brick of the same size. Will the added-back
brick inherit the hash space of the removed one? If so,
the rebalance could efficiently put data on the new brick.
Or does the hash space just get jumbled, with the new brick
showing up at the beginning or end or something, in which case
the rebalance will want to move nearly every file on my system?
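
In case it helps anyone following along: I believe the layout map lives in the
trusted.glusterfs.dht xattr on each directory of each brick, so dumping it
before and after the rebalance should show whether the added-back brick picked
up the old ranges (the brick path below is just a placeholder for my setup):

  # getfattr -n trusted.glusterfs.dht -e hex /bricks/brick1/some/dir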

On Apr 11, 2014, at 6:16 PM, Joe Julian <joe at julianfamily.org> wrote:

> The rebalance walks the directory tree and re-allocates the layout map to match the number of bricks you have. Then it walks the tree and moves files that don't match that map.
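> 
> In command terms it is roughly this (the volume name below is just a placeholder):
> 
>   # gluster volume rebalance myvol fix-layout start    (recalculate the layout maps only)
>   # gluster volume rebalance myvol start               (fix the layout, then migrate files)
>   # gluster volume rebalance myvol status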
> 
> On 04/11/2014 03:01 PM, Andrew Smith wrote:
>> Since I am concerned about how the rebalance will work, in particular the
>> sequence in which files are moved, can someone set me straight?
>> 
>> Does the system sort through the files one by one and figure
>> out their correct location, or does it sort out the bricks one by one,
>> identifying the files on each brick that need to be relocated and shipping
>> them off?
>> 
>> 
>> 
>> On Apr 11, 2014, at 5:49 PM, Joe Julian <joe at julianfamily.org> wrote:
>> 
>>> I knew that deprecating something that's that useful was a bad idea.
>>> 
>>> Remove brick, iirc, marks the brick for removal and does a rebalance, so all its files (and I would expect any others that need to be) are redistributed.
>>> Add brick does nothing but add it to the volume, after which a rebalance (at least the rebalance fix-layout) needs to be performed to allow it to accept files.
>>> Removing another brick is going to try to use the same rebalance algorithm to distribute the files that were on the brick being removed. They won't go to the "emptiest" brick but rather will be distributed based on the hash mapping.
>>> 
>>> I'm afraid the most efficient online method is going to be: remove-brick, add-brick, rebalance, repeat as necessary. Luckily with each brick you convert to xfs, the process should go faster and have less of an impact on your performance.
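>>> 
>>> Roughly, per brick (the volume, server, and brick names below are placeholders):
>>> 
>>>   # gluster volume remove-brick myvol server1:/bricks/b1 start
>>>   # gluster volume remove-brick myvol server1:/bricks/b1 status    (wait until it completes)
>>>   # gluster volume remove-brick myvol server1:/bricks/b1 commit
>>>   # mkfs.xfs -i size=512 /dev/sdX       (512-byte inodes leave room for the xattrs)
>>>   # mount /dev/sdX /bricks/b1
>>>   # gluster volume add-brick myvol server1:/bricks/b1
>>>   # gluster volume rebalance myvol start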
>>> 
>>> If you can afford downtime, you could always just copy from a BTRFS brick to an XFS one while the brick is offline, being sure to include the extended attributes.
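>>> 
>>> Something like rsync with xattrs and hard links preserved should cover that (the paths are placeholders; -H matters because of the hard links under .glusterfs):
>>> 
>>>   # rsync -aAXH /bricks/old-btrfs-brick/ /bricks/new-xfs-brick/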
>>> 
>>> 
>>> On 04/11/2014 02:38 PM, Andrew Smith wrote:
>>>> My understanding is that “replace-brick” is deprecated
>>>> 
>>>> http://www.gluster.org/pipermail/gluster-users/2012-October/034473.html
>>>> 
>>>> And that “add-brick” followed by “remove-brick” should behave
>>>> the same way.
>>>> 
>>>> It does not behave as predicted, I think, because my system is
>>>> unbalanced. I have no idea whether or not the “replace-brick”
>>>> command will behave differently.
>>>> 
>>>> Andy
>>>> 
>>>> 
>>>> On Apr 11, 2014, at 5:34 PM, Machiel Groeneveld <machielg at gmail.com> wrote:
>>>> 
>>>>> Isn't that what replace-brick is for?
>>>>> 
>>>>> 
>>>>>> On 11 Apr 2014, at 23:32, Andrew Smith <smith.andrew.james at gmail.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> Hi, I have a problem which, I hope for your sake, is uncommon.
>>>>>> 
>>>>>> I built a Gluster volume with 8 bricks, four of 80TB and four of 68TB,
>>>>>> for a total capacity of about 600TB. The underlying filesystem
>>>>>> is BTRFS.
>>>>>> 
>>>>>> I found out after the system was half full that BTRFS was a
>>>>>> bad idea. BTRFS doesn’t preallocate a fixed inode table; it allocates some
>>>>>> fraction of the disk space to metadata and, when that runs out, it allocates
>>>>>> more. This allocation process on large volumes is painfully slow
>>>>>> and brings effective write speeds down to only a few MB/s with long
>>>>>> timeouts. Writing to the volume is a big fat mess, but reading is still
>>>>>> fairly fast, so access to my data by users is acceptable.
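>>>>>> 
>>>>>> (For what it’s worth, the data vs. metadata allocation behind this shows up
>>>>>> in “btrfs filesystem df”; the mount point below is just an example:)
>>>>>> 
>>>>>>   # btrfs filesystem df /bricks/brick1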
>>>>>> 
>>>>>> I need to keep this volume available and I don’t have a second
>>>>>> copy of the hardware to rebuild the system on. So, I need to do
>>>>>> an in-situ transition from BTRFS to XFS.
>>>>>> 
>>>>>> To do this, I first cleared out some data to free up metadata space,
>>>>>> and then with much difficulty managed to do a
>>>>>> 
>>>>>>  # gluster volume remove-brick
>>>>>> 
>>>>>> I retired the removed brick and then reformatted it with XFS and added
>>>>>> it back to my Gluster volume. At this point, I thought I was nearly
>>>>>> home. I thought I could retire a second brick and the data would
>>>>>> be copied to the empty brick. However, this is not what happens.
>>>>>> Some data ends up on the newly added brick, but some of the data
>>>>>> flows elsewhere, which due to the BTRFS problem is a nightmare.
>>>>>> 
>>>>>> I assume this is because when I took my volume from 8 bricks to 7, it
>>>>>> became unbalanced. The data on the brick that I was retiring
>>>>>> belongs on several different bricks and so I am not just doing a
>>>>>> substitution.
>>>>>> 
>>>>>> I need to be able to tell my Gluster volume to include all the bricks,
>>>>>> but do not write files to any of the BTRFS bricks so that it puts data
>>>>>> only on the XFS brick. If I could somehow tell Gluster that these bricks
>>>>>> were full, that would suffice.
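>>>>>> 
>>>>>> The closest thing I have found is the cluster.min-free-disk volume option,
>>>>>> which I gather makes DHT steer new files away from bricks past a usage
>>>>>> threshold, but I am not sure it is enough on its own (the value below is
>>>>>> just an example):
>>>>>> 
>>>>>>   # gluster volume set myvol cluster.min-free-disk 10%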
>>>>>> 
>>>>>> I could do a "rebalance migrate-data" to make the data on the BTRFS
>>>>>> bricks more uniform, but I don’t know how this will work. Does it reposition
>>>>>> the data brick by brick or file by file? Brick by brick would be bad, since
>>>>>> the last brick to rebalance would need to receive all the data destined for it
>>>>>> before it could start writing data out to free up metadata space.
>>>>>> 
>>>>>> There is a “rebalance-brick” option listed in the man page, but I don’t see it
>>>>>> documented anywhere else. It may be useful, but I have no idea what it will do.
>>>>>> 
>>>>>> Is there a solution to my problem? Wiping it and starting over is not helpful.
>>>>>> Any advice on how I can predict where the data will go would also help.
>>>>>> 
>>>>>> Andy
> 



