[Gluster-users] Production cluster planning

Thu Oct 27 07:03:17 UTC 2016

2016-10-26 23:38 GMT+02:00 Joe Julian <joe at julianfamily.org>:
> Quickly = MTTR is within tolerances to continue to meet SLA. It's just math.

Obviously, yes. But in the real world you can have the best SLAs in
the world, and if you lose data, you lose customers.

> As for a dedicated heal network, split-horizon dns handles that just fine.
> Clients resolve a server's hostname to the "eth1" (for example) address and
> the servers themselves resolve the same hostname to the "eth0" address. We
> played with bonding but decided against the complexity.

Good idea, thanks. This way the cluster network is separated from
the client network, as with Ceph.
Just one question: you need two DNS infrastructures for this, right? ns1
and ns2 used by the clients, pointing to eth0,
and ns3 and ns4 used by gluster, pointing to eth1.

In a small environment the hosts file could be used, but I prefer the DNS way.
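
For the hosts-file variant, the two views could be generated from a single
inventory so they never drift apart. This is only a rough sketch: the
hostnames, the addresses (taken from the documentation ranges) and the
"client"/"cluster" labels are made up for illustration, they are not from
any real setup.

```python
# Hypothetical inventory: one hostname per server, one address per network.
# Names and addresses are placeholders, not a real deployment.
SERVERS = {
    "gluster1.example.com": {"client": "192.0.2.11", "cluster": "198.51.100.11"},
    "gluster2.example.com": {"client": "192.0.2.12", "cluster": "198.51.100.12"},
    "gluster3.example.com": {"client": "192.0.2.13", "cluster": "198.51.100.13"},
}


def hosts_view(servers, network):
    """Render /etc/hosts style lines resolving every server name
    to its address on the given network ("client" or "cluster")."""
    lines = [
        f"{addrs[network]}\t{name}"
        for name, addrs in sorted(servers.items())
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Same hostnames in both views, different addresses.
    print("# view for client machines")
    print(hosts_view(SERVERS, "client"))
    print("# view for the gluster servers themselves")
    print(hosts_view(SERVERS, "cluster"))
```

The same idea applies to DNS: one set of names, two answers depending on
who is asking.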

> There's preference and there's engineering to meet requirements. If your SLA
> is 5 nines and you engineer 6 nines, you may realize that the difference
> between a 99.99993% uptime and a 99.99997% uptime isn't worth the added
> expense of doing replication and raid-1.

How do you calculate the number of nines in this environment?
For example, to get 6 nines (for both availability and data consistency),
which configuration should I adopt?
I could have 6 nines for the whole cluster but only 2 nines for the data.
In the first case the whole cluster can't go completely down (tons of
nodes, for example); in the second, some data could
be lost (replica 1 or 2).
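
To make the question concrete, this is the arithmetic I have in mind for a
single component, using the usual steady-state formula
availability = MTBF / (MTBF + MTTR). It is a sketch, not a measurement; the
function names are mine and any MTBF/MTTR figures fed into it are assumptions.

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def nines(avail):
    """Number of nines, e.g. 0.999 -> about 3."""
    return -math.log10(1.0 - avail)


def downtime_minutes_per_year(avail):
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - avail) * 365 * 24 * 60


if __name__ == "__main__":
    # Assumed figures: fails once per 999 hours, takes 1 hour to repair.
    a = availability(999, 1)
    print(a, nines(a), downtime_minutes_per_year(a))
```

This also shows why MTTR matters so much: with the same MTBF, halving the
repair time roughly halves the expected downtime.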

> With 300 drives, 60 bricks, replica 3 (across 3 racks), I have a six nines
> availability for any one replica subvolume. If you really want to fudge the
> numbers, the reliability for any given file is not worth calculating in that
> volume. The odds of all three bricks failing for any 1 file among 20
> distribute subvolumes is statistically infinitesimal.
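
If I understand the reasoning, the replica numbers come from something like
the sketch below. It assumes brick failures are independent (which is exactly
what I doubt further down), it ignores quorum, and the 0.99 per-brick figure
is my own placeholder, not a number from your cluster.

```python
def replica_set_availability(brick_availability, replicas):
    """Probability that at least one brick of a replica set is up,
    assuming bricks fail independently of each other."""
    return 1.0 - (1.0 - brick_availability) ** replicas


def volume_availability(subvol_availability, subvolumes):
    """Probability that every distribute subvolume is up at once,
    again assuming independence between subvolumes."""
    return subvol_availability ** subvolumes


if __name__ == "__main__":
    # Assumed: each brick is up 99% of the time, replica 3.
    subvol = replica_set_availability(0.99, 3)
    # 20 distribute subvolumes, as in the 60-brick example.
    print(subvol, volume_availability(subvol, 20))
```

With those assumed inputs a single replica 3 subvolume lands at about six
nines, while the chance that all 20 subvolumes are reachable at the same
time is somewhat lower.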

How many servers?
300 drives bought within a very short time are likely to fail close together,
with multiple failures at once.
I had 2 drive failures in less than 1 hour some months ago. Luckily I
was using RAID-6.
Both drives were from the same manufacturer and had sequential serial numbers.