[Gluster-users] Automation of single server addition to replica

Jeff Darcy jdarcy at redhat.com
Wed Nov 9 20:13:02 UTC 2016


> And that's why I really prefer Gluster, without any metadata or
> similar.
> But a metadata server isn't mandatory to achieve automatic rebalance.
> Gluster is already able to rebalance and move data around the cluster,
> and it already has the tools to add a single server even in a replica 3.
> 
> What I'm asking is to automate this feature.  Gluster could be able to
> move bricks around without user intervention.

Some of us have thought long and hard about this.  The root of the
problem is that our I/O stack works on the basis of replicating bricks,
not files.  Changing that would be hard, but so is working with it.
Most ideas (like Joe's) involve splitting larger bricks into smaller
ones, so that the smaller units can be arranged into more flexible
configurations.  So, for example, let's say you have bricks X through Z
each split in two.  Define replica sets along the diagonal and place
some files A through L.

                     Brick X   Brick Y   Brick Z
                   +---------+---------+---------+
   Subdirectory 1  | A B C D | E F G H | I J K L |
                   +---------+---------+---------+
   Subdirectory 2  | I J K L | A B C D | E F G H |
                   +---------+---------+---------+
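
To make that concrete, here's a rough Python sketch of the placement
(the hash and the layout are simplified stand-ins of mine, not the
actual DHT code, and the letters won't land exactly as drawn above, but
the shape is the same): each file's hash picks a primary sub-brick under
subdirectory 1, and the replica lands in subdirectory 2 on the next
brick over, which is what produces the diagonal.

    import hashlib

    BRICKS = ["X", "Y", "Z"]

    def bucket(name, n):
        # Toy stand-in for the DHT hash: equal-width ranges over 32 bits.
        h = int(hashlib.md5(name.encode()).hexdigest(), 16) % 2**32
        return h * n // 2**32

    def place(name):
        n = len(BRICKS)
        primary = bucket(name, n)
        replica = (primary + 1) % n        # "diagonal" partner holds copy 2
        return (BRICKS[primary] + "/subdir1", BRICKS[replica] + "/subdir2")

    for f in "ABCDEFGHIJKL":
        print(f, place(f))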

Now suppose you want to add a fourth brick on a fourth machine.  Each
(divided) brick should then contain three files instead of four, so some
will have to move.  Here's one possibility, based on our algorithms to
maximize overlap between the old and new DHT hash ranges.

                     Brick X   Brick Y   Brick Z   Brick W
                   +---------+---------+---------+---------+
   Subdirectory 1  | A B C   | D E F   | J K L   | G H I   |
                   +---------+---------+---------+---------+
   Subdirectory 2  | G H I   | A B C   | D E F   | J K L   |
                   +---------+---------+---------+---------+
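
To see where that "one third" comes from, here's a small sketch (again
just an illustration, not the real rebalance code): take three equal old
hash ranges, carve out four new ones, assign them to bricks so as to
maximize overlap with what each brick already holds, and the fraction of
the hash space, and hence of the files, that has to move falls out.

    from itertools import permutations

    def ranges(n):
        # n equal-width hash ranges over [0, 1)
        return [(i / n, (i + 1) / n) for i in range(n)]

    def overlap(a, b):
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    old = ranges(3)        # bricks X, Y, Z before expansion
    new = ranges(4)        # four ranges after adding brick W

    # Try every assignment of new ranges to the three old bricks (the
    # leftover range goes to W) and keep the one that preserves the most
    # data in place.
    best = max(permutations(range(4), 3),
               key=lambda p: sum(overlap(old[i], new[p[i]]) for i in range(3)))
    kept = sum(overlap(old[i], new[best[i]]) for i in range(3))
    print("fraction of files that must move: %.2f" % (1 - kept))   # ~0.33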

Even trying to minimize data motion, a third of all the files have to be
moved.  This can be reduced still further by splitting the original
bricks into even smaller parts, and that actually meshes quite well with
the "virtual nodes" technique used by other systems that do similar
hash-based distribution, but it gets so messy that I won't even try to
draw the pictures.  The main point is that doing all this requires
significant I/O, with significant impact on other activity on the
system, so it's not necessarily true that we should just do it without
user intervention.
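
For the curious, the virtual-nodes idea can at least be sketched without
pictures.  This is a generic consistent-hashing toy, not anything
Gluster actually ships, but it shows why finer slicing keeps the
movement close to the theoretical minimum (roughly a quarter of the data
when going from three bricks to four):

    import bisect, hashlib

    def vnode_ring(bricks, vnodes=64):
        # Place many virtual nodes per brick on a hash ring.
        return sorted((int(hashlib.md5(f"{b}#{i}".encode()).hexdigest(), 16), b)
                      for b in bricks for i in range(vnodes))

    def locate(ring, name):
        # A file belongs to the first virtual node at or after its hash.
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        keys = [k for k, _ in ring]
        return ring[bisect.bisect(keys, h) % len(ring)][1]

    files = [f"file{i}" for i in range(10000)]
    before = vnode_ring(["X", "Y", "Z"])
    after = vnode_ring(["X", "Y", "Z", "W"])
    moved = sum(locate(before, f) != locate(after, f) for f in files)
    print("moved: %.1f%%" % (100 * moved / len(files)))   # typically ~25%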

Can we automate this process?  Yes, and we should.  This is already in
scope for GlusterD 2.  However, in addition to the obvious recalculation
and rebalancing, it also means setting up the bricks differently even
when a volume is first created, and making sure that we don't
double-count available space on two bricks that are really on the same
disks or LVs, and so on.  Otherwise, the initial setup will seem simple
but later side-effects could lead to confusion.
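
The double-counting point is easy to get wrong.  Something along these
lines (purely illustrative, and using st_dev as the grouping key, which
is my assumption rather than what GlusterD 2 will actually do) is what I
mean by having to know which bricks share a filesystem before trusting
the free-space numbers:

    import os
    from collections import defaultdict

    def group_bricks_by_device(brick_paths):
        # Bricks whose paths resolve to the same st_dev share a filesystem,
        # so their free space must only be counted once.
        by_dev = defaultdict(list)
        for path in brick_paths:
            by_dev[os.stat(path).st_dev].append(path)
        return by_dev

    def usable_space(brick_paths):
        total = 0
        for dev, paths in group_bricks_by_device(brick_paths).items():
            st = os.statvfs(paths[0])
            total += st.f_bavail * st.f_frsize   # count each device once
        return total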

