[Gluster-devel] [Gluster-users] Phasing out replace-brick for data migration in favor of remove-brick.

James purpleidea at gmail.com
Fri Oct 11 04:20:20 UTC 2013

On Sun, 2013-09-29 at 22:41 -0700, Anand Avati wrote:
> I see what you are asking. First of all, when running a 2-replica volume
> you pretty much always want to have an even number of servers, and add
> servers in even numbers. Ideally the two "sides" of the replicas should
> be placed in separate failure zones - separate racks with separate power
> supplies or separate AZs in the cloud. Having an odd number of servers
> with 2 replicas is a very "odd" configuration. In all these years I am
> yet to come across a customer who has a production cluster with 2
> replicas and an odd number of servers. And setting up replicas in such a
> chained manner makes it hard to reason about availability, especially
> when you are trying to recover from a disaster. Having clear and separate
> "pairs" is definitely what is recommended.
Obviously I completely agree. In fact, I've written most of the code for
this scenario, however I'm trying to build out my code to support the
general case.

> That being said, nothing prevents one from setting up a chain like above
> as long as you are comfortable with the complexity of the configuration.
> And phasing out replace-brick in favor of add-brick/remove-brick does not
> make the above configuration impossible either. Let's say you have a
> chained configuration of N servers, with pairs formed between every:
> h(i):/b1 h((i+1) % N):/b2 | i := 0 -> N-1
Perfect... So far, so good.
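
To make the indexing concrete, here is a minimal sketch that expands the pairing rule quoted above for a given N. The helper name chain_pairs is made up for illustration, and the host and brick names are just the shorthand from this thread, not real paths.

```ruby
# Build the replica pairs of a chained 2-replica layout over n hosts.
# Host names "h0", "h1", ... and brick paths "/b1", "/b2" follow the
# shorthand used in this thread; chain_pairs is a hypothetical helper.
def chain_pairs(n)
  (0...n).map do |i|
    ["h#{i}:/b1", "h#{(i + 1) % n}:/b2"]
  end
end

# For N = 3 the last pair wraps around from h2 back to h0.
chain_pairs(3).each { |pair| puts pair.join(' ') }
```

For N = 3 this prints the pairs h0/h1, h1/h2 and h2/h0, so every host carries one /b1 and one /b2.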

> Now you add N+1th server.
This server will be "N", because we're zero-based in your example...
> Using replace-brick, you have been doing thus far:
> 1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was "part of a previous brick"
Here is that server; we complete the chain from hN to h0. Let's rename
h0:/b2a to h0:/b2-tmp instead. The problem is that this assumes we have
room for a b2-tmp on h0!

> 2. replace-brick h0:/b2 hN:/b2 start ... commit
If you meant h0:/b2a (aka h0:/b2-tmp) here instead of h0:/b2, doesn't
this break the chain? hN would then stand alone with b1 and b2 and not be
part of the chain. In fact, the b1 and b2 on hN would be replicas of each
other, so this is a SPOF.
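
Both readings of step 2 can be compared with a small model. This is only a sketch: bricks are plain strings, a volume is a list of replica pairs, and host_of, spof_pairs and replace_brick are hypothetical helpers, not anything gluster provides.

```ruby
# Hypothetical helpers; bricks are "host:/path" strings as in this thread.
def host_of(brick)
  brick.split(':', 2).first
end

# A replica pair is a SPOF when both of its bricks live on the same host.
def spof_pairs(pairs)
  pairs.select { |a, b| host_of(a) == host_of(b) }
end

# Swap one brick for another inside the list of pairs (models replace-brick).
def replace_brick(pairs, old, new)
  pairs.map { |pair| pair.map { |b| b == old ? new : b } }
end

n = 3 # existing hosts h0..h2; the new host is h3
chain = (0...n).map { |i| ["h#{i}:/b1", "h#{(i + 1) % n}:/b2"] }
after_add = chain + [["h#{n}:/b1", 'h0:/b2-tmp']] # step 1

as_written    = replace_brick(after_add, 'h0:/b2', "h#{n}:/b2")
other_reading = replace_brick(after_add, 'h0:/b2-tmp', "h#{n}:/b2")

puts "as written:    #{spof_pairs(as_written).inspect}"
puts "other reading: #{spof_pairs(other_reading).inspect}"
```

In this model, replacing h0:/b2 as written leaves no same-host pair, while replacing h0:/b2-tmp leaves the pair h3:/b1 h3:/b2 on the new host, which is the SPOF case.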

> In case you are doing an add-brick/remove-brick approach, you would now
> instead do:
> 1. add-brick h(N-1):/b1a hN:/b2
> 2. add-brick hN:/b1 h0:/b2a
> 3. remove-brick h(N-1):/b1 h0:/b2 start ... commit
I think this algorithm works, although I'd have to test it :P
The one downside (which I actually have a workaround for) is that the
new bricks have to be named differently from the original ones. Is
there a way around this?
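
As a sanity check before testing on a real cluster, the three quoted steps can be applied to a list of replica pairs. This is a sketch under assumptions: expand_chain and bricks_per_host are hypothetical helpers, the "a" suffix comes from the quoted example, and the model only looks at the end state, not at data migration.

```ruby
# Model a volume as a list of replica pairs of "host:/path" strings.
# expand_chain is a hypothetical helper; the "a" suffix on the new brick
# names is taken from the quoted example.
def expand_chain(pairs, n)
  last = n - 1
  added = [
    ["h#{last}:/b1a", "h#{n}:/b2"], # step 1: add-brick
    ["h#{n}:/b1", 'h0:/b2a']        # step 2: add-brick
  ]
  removed = ["h#{last}:/b1", 'h0:/b2'] # step 3: remove-brick
  (pairs + added) - [removed]
end

# Count how many bricks each host contributes to the volume.
def bricks_per_host(pairs)
  counts = Hash.new(0)
  pairs.flatten.each { |brick| counts[brick.split(':', 2).first] += 1 }
  counts
end

n = 3
chain = (0...n).map { |i| ["h#{i}:/b1", "h#{(i + 1) % n}:/b2"] }
grown = expand_chain(chain, n)
grown.each { |pair| puts pair.join(' ') }
p bricks_per_host(grown)
```

For N = 3 the result is four pairs over h0..h3, with two bricks per host and no pair on a single host, so the chain shape is preserved in this model.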

> You will not be left with only 1 copy of a file at any point in the
> process, and achieve the same "end result" as you were with replace-brick.
> As mentioned before, I once again request you to consider if you really
> want to deal with the configuration complexity of having chained
> replication, instead of just adding servers in pairs.
I am just trying to avoid corner cases in my code. Puppet won't work
well with those :P
> Please ask if there are any more questions or concerns.
I have some follow-up questions, but for the moment I have another
question to add to this thread. It's the same idea really... Suppose you
have a set of sanely named and ordered hosts and bricks. Is there one
(and only one) logical ordering for them? I've decided that the answer is
yes, and I've written the algorithm for ordering them:


Do you have any comments / objections ?

I've attached an easy standalone version of this code to run.

I also have a more complicated version of this code.
This code does almost the same thing as the first version.
The difference is that this version supports a proposed "brick
nomenclature". (See below)

What does this all mean? My theory: if you define a logical brick and
hostname naming convention and always use it, then for every given list
of bricks there should be only one logical "ordering" (where an ordering
is the linear order needed for a create volume command).
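
To illustrate the idea (this is not the attached algorithm), here is a minimal sketch that assumes zero-padded names, so that plain string comparison matches numeric order. The sort key of brick path first and host second, the helper name order_bricks, and the sample names are all assumptions.

```ruby
# Hypothetical ordering: sort by brick path first, then by host name.
# Because both are zero padded, string order equals numeric order.
def order_bricks(bricks)
  bricks.sort_by do |brick|
    host, path = brick.split(':', 2)
    [path, host]
  end
end

# Sample names are made up for illustration.
input = %w[
  hostname0002:/b0000002
  hostname0001:/b0000001
  hostname0002:/b0000001
  hostname0001:/b0000002
]

# Any permutation of the same bricks yields the same linear order.
puts order_bricks(input)
puts order_bricks(input.shuffle) == order_bricks(input)
```

With this key, consecutive entries for the same brick number sit on different hosts, which is what a replica 2 volume needs, although the sketch does not check that for arbitrary inputs.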

Secondly, if you want to add or remove bricks, and you do so by
following the naming convention, then the combined old list + new bricks
can also be sorted in a single linear ordering. Furthermore, there
exists an algorithm that can compute the needed add/remove brick
commands to transform from the initial set to the second set.
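
The simplest form of such a transform is a set difference between the two brick lists. The sketch below is that naive version only, not the attached code: brick_transform and order_bricks are hypothetical helpers, the sample names are made up, and replica grouping and migration order are ignored.

```ruby
# Hypothetical ordering key: brick path first, then host (both zero padded).
def order_bricks(bricks)
  bricks.sort_by { |brick| brick.split(':', 2).reverse }
end

# Naive transform: what must be added and what must be removed to get
# from old_bricks to new_bricks. Replica grouping is not considered.
def brick_transform(old_bricks, new_bricks)
  {
    add: order_bricks(new_bricks - old_bricks),
    remove: order_bricks(old_bricks - new_bricks)
  }
end

old_bricks = %w[hostname0001:/b0000001 hostname0002:/b0000001]
new_bricks = old_bricks + %w[hostname0004:/b0000001 hostname0003:/b0000001]

p brick_transform(old_bricks, new_bricks)
```

Here the result lists hostname0003 and hostname0004 under :add, in order, and nothing under :remove.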

I've attached this algorithm here:

The only other thing to mention is the brick nomenclature:
It is:

where b is a constant char 'b'
where xxxxxxx is a zero padded int for brick #
where #vzzzz is a constant '#v' followed by zzzz
where zzzz is a zero padded int for version #

each time new bricks are added, you increment the max visible version #
and use that. if no version number is specified, then we assume version
1. The length of padding must be decided on in advance and can't be

valid brick names include:



and so on...

Hostnames are simple: hostnameYYYY where YYYY is a padded int, and you
distribute your hosts sequentially across racks or switches or whatever
your commonality for SPOF is.
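
The naming rules above could be parsed along these lines. This is a sketch under a loud assumption: the pieces are taken to be concatenated in the order they are listed ('b', the padded brick number, then an optional '#v' and padded version). That composition, the pattern BRICK_RE, the helper names, and the sample names are all hypothetical.

```ruby
# ASSUMPTION: 'b' + padded brick number + optional '#v' + padded version.
BRICK_RE = /\Ab(\d+)(?:#v(\d+))?\z/

# Split a brick name into its number and version; a missing version is 1.
def parse_brick(name)
  m = BRICK_RE.match(name)
  raise ArgumentError, "bad brick name: #{name}" unless m
  { number: m[1].to_i, version: (m[2] || '1').to_i }
end

# Version for newly added bricks: one more than the max visible version.
def next_version(names)
  (names.map { |n| parse_brick(n)[:version] }.max || 0) + 1
end

# Host number from a name of the form hostnameYYYY (trailing padded int).
def host_number(hostname)
  digits = hostname[/\d+\z/]
  raise ArgumentError, "bad hostname: #{hostname}" unless digits
  digits.to_i
end

p parse_brick('b0000002#v0003')
p next_version(%w[b0000001 b0000002#v0003])
p host_number('hostname0012')
```

The padding widths are not enforced here; a stricter version would fix them up front, as the convention requires.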

Technically, for the transforms, I'm not even sure the version # is

The big problem with my algorithms is that they don't work for chained
configurations. I'd love to be able to make that so!!!

Why is all this relevant? Because if I can solve these problems,
Gluster users can have fully decentralized elastic volumes that
grow/shrink on demand, without ever having to manually run add/remove
brick commands. I'll be able to do all of this with puppet-gluster, for
example. Users will just run puppet, without changing any
configurations, and hosts will automatically come up and grow to the
size the hardware supports. Most of the code is already published. More
to come.

Hope that was all understandable. It's probably hard to talk about this
by email, but I'm trying. :)


> Avati

-------------- next part --------------
A non-text attachment was scrubbed...
Name: brick_logic_ordering_wip.rb
Type: application/x-ruby
Size: 7595 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20131011/90037bea/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: brick_logic_ordering_v2_wip.rb
Type: application/x-ruby
Size: 6050 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20131011/90037bea/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: brick_logic_transform_v1_wip.rb
Type: application/x-ruby
Size: 11354 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20131011/90037bea/attachment-0005.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20131011/90037bea/attachment-0001.sig>
