Best Practices for different failure scenarios?

Michael Peek peek at nimbios.org
Wed Feb 19 21:50:26 UTC 2014

Thanks for the quick reply.

On 02/19/2014 03:15 PM, James wrote:
> Short answer, it sounds like you'd benefit from playing with a test
> cluster... Would I be correct in guessing that you haven't setup a
> gluster pool yet? You might want to look at:
> https://ttboj.wordpress.com/2014/01/08/automatically-deploying-glusterfs-with-puppet-gluster-vagrant/
> This way you can try them out easily... 

You're close.  I've got a test cluster up and running now, and I'm about
to go postal on it to see in just how many different ways I can break
it, and what I need to know to bring it back to life.

> For some of those points... solve them with...
>>  Sort of a crib notes for things like:
>> 1) What do you do if you see that a drive is about to fail?
>> 2) What do you do if a drive has already failed?

Derp.  Shoulda seen that one.

>> 3) What do you do if a peer is about to fail?
> Get a new peer ready...

Here's what I think needs to happen, correct me if I've got this wrong:
1) Set up a new host with gluster installed
2) From the new host, probe one of the other peers (or from one of the
other peers, probe the new host)
3) gluster volume replace-brick volname failing-host:/failing/brick \
       new-host:/new/brick start

Find out how it's going with:
gluster volume replace-brick volname failing-host:/failing/brick \
    new-host:/new/brick status
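
The steps above can be sketched as a small dry-run script. This is only an
illustration under stated assumptions: volname and the two brick paths are
the placeholders used above, the DRY_RUN variable and the run and
replace_failing_brick helpers are hypothetical names invented for this
sketch, and the "gluster peer status" check is an extra sanity step that is
not in the list above. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch: prints the gluster commands instead of running them.
# Set DRY_RUN=0 to actually execute them (on a test cluster first).
DRY_RUN="${DRY_RUN:-1}"

# Placeholder names from the steps above; substitute real values.
VOL="volname"
OLD_BRICK="failing-host:/failing/brick"
NEW_BRICK="new-host:/new/brick"

# Hypothetical helper: echo the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

replace_failing_brick() {
    # Step 2: from an existing peer, probe the new host (the part of
    # NEW_BRICK before the first colon).
    run gluster peer probe "${NEW_BRICK%%:*}"
    # Extra check (an assumption, not in the steps above): list the peers.
    run gluster peer status
    # Step 3: start moving data off the failing brick.
    run gluster volume replace-brick "$VOL" "$OLD_BRICK" "$NEW_BRICK" start
    # Ask how the migration is going.
    run gluster volume replace-brick "$VOL" "$OLD_BRICK" "$NEW_BRICK" status
}

replace_failing_brick
```

Printing first makes it easy to review the exact commands before touching
the pool. The releases current at the time of writing also appear to have a
"commit" form of replace-brick that is run once status reports completion;
check "gluster volume help" on your version before relying on that.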

>> 4) What do you do if a peer has failed?
> Replace with new peer...

Same steps as for (3) above, then:
4) gluster volume heal volname
to begin copying data over from a replica.
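
A minimal sketch of that heal step, again as a dry run. The DRY_RUN
variable and the run and heal_after_replace helpers are hypothetical names
for this illustration, and volname is the placeholder used above. The
"heal volname info" call is an addition for watching progress, not part of
the step as written.

```shell
#!/bin/sh
# Dry-run sketch: prints the gluster commands instead of running them.
# Set DRY_RUN=0 to actually execute them (on a test cluster first).
DRY_RUN="${DRY_RUN:-1}"

# Placeholder volume name from the text above.
VOL="volname"

# Hypothetical helper: echo the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

heal_after_replace() {
    # Step 4: trigger self-heal on the files that need it.
    run gluster volume heal "$VOL"
    # Addition for this sketch: list entries still waiting to be healed.
    run gluster volume heal "$VOL" info
}

heal_after_replace
```

Whether a plain heal is enough for a brand-new, empty brick, or whether
"gluster volume heal volname full" is needed instead, is something to
confirm on the test cluster.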

>> 5) What do you do to reinstall a peer from scratch (i.e. what
>> configuration files/directories do you need to restore to get the host
>> back up and talking to the rest of the cluster)?
> Bring up a new peer. Add to cluster... Same as failed peer...

>> 6) What do you do with failed-heals?
>> 7) What do you do with split-brains?
> These are more complex issues and a number of people have written about them...
> Eg: http://joejulian.name/blog/fixing-split-brain-with-glusterfs-33/

This covers split-brain, but what about failed-heal?  Do you do the same


