[Gluster-devel] Mechanisms for automatic management of Gluster

Wed Feb 12 17:24:08 UTC 2014

> This is along the lines of "tools for sysadmins". I plan on using
> these algorithms for puppet-gluster, but will try to maintain them
> separately as a standalone tool.
> 
> The problem: Given a set of bricks and servers, if they have a logical
> naming convention, can an algorithm decide the ideal order. This could
> allow parameters such as replica count, and
> chained=true/false/offset#.
> 
> The second problem: Given a set of bricks in a volume, if someone adds
> X bricks and removes Y bricks, is this valid, and what is the valid
> sequence of add/remove brick commands.
> 
> I've written some code with test cases to try and figure this all out.
> I've left out a lot of corner cases, but the boilerplate is there to
> make it happen. Hopefully it's self explanatory. (gluster.py) Read and
> run it.
> 
> Once this all works, the puppet-gluster use case is magic. It will be
> able to take care of these operations for you (if you want).
> 
> For non puppet users, this will give admins the confidence to know
> what commands they should _probably_ run in what order. I say probably
> because we assume that if there's an error, they'll stop and inspect
> first.
> 
> I haven't yet tried to implement the chained cases, or anything
> involving striping. There are also some corner cases with some of the
> current code. Once you add chaining and striping, etc, I realized it
> was time to step back and ask for help :)
> 
> I hope this all makes sense. Comments, code, test cases are appreciated!

It's a good start.  For the chained case, you'd probably want to start
with something like this:

        # Convert the input into a "list of lists" like this:
        #   [
        #      [ 'host1', [ 'path1', 'path2', ... ],
        #      [ 'host2', [ 'path1', 'path2', ... ],
        #      ...
        #   ]
        out_list = []
        while in_list:
                first_host = in_list.pop()
                first_path = first_host[1].pop()
                # If there are any bricks left on this host, move the host to
                # the end so the next iteration will start with the next host.
                # Otherwise, we've used all bricks from this host so discard.
                if first_host[1]:
                        in_list.append(first_host)
                second_host = in_list[0]
                second_host = second_host[1].pop()
                # Have we exhausted this host as well?
                if not second_host[1]:
                        del in_list[0]
                out_list.append({'host':first_host[0],'path',first_path})
                out_list.append({'host':second_host[0],'path',second_path})
        return out_list

(I haven't actually run this.  It's merely illustrative of the algorithm.)

Can you spot the bug?  If one host has more bricks than the others, it might
run out of bricks on other hosts to pair with, so it'll end up pairing with
itself.  For example, consider the following input:

        H1P1, H1P2, H1P3, H1P4, H2P1, H2P2, H3P1, H3P2

This algorithm would yield the following replica pairs.

        H1P1 + H2P1
        H2P2 + H3P1
        H3P2 + H1P2
        H1P3 + H1P4 (oops)

Instead, we need to find this:

        H1P1 + H2P1
        H2P2 + H1P2
        H1P3 + H3P1
        H3P2 + H1P4

I would actually not try to deal with this in the loop above.  Why not?
Because that loop's already going to get a bit hairy when it's enhanced to
handle replica counts greater than two.  Instead, I would deal with the
imbalance cases *up front* - check the number of bricks for each host, then
equalize them e.g. by splitting a host with many bricks into two virtual hosts
separated by enough others that they'll never pair with one another.

Alternatively, one could do a recursive implementation, roughly like this:

        if less than rep_factor hosts left, fail
        pick rep_factor bricks from different hosts
        loop:
                pass remainder to recursive call
                if result is valid, combine and return
                pick a *different* rep_factor bricks from different hosts

That will generate *some* valid order if any exists, but it will tend toward
sub-optimal orders where e.g. all of X's bricks are paired with all of Y's
instead of being spread around.  There might be some sort of "optimization"
pass we could do that would swap replica-set members to address this, but I'm
sure you can see it's already becoming a hard problem.  I'd have to code up
full versions of both algorithms and run them on many different inputs to say
with any confidence which is better.