[Gluster-users] Best Practices for different failure scenarios?

Wed Feb 19 22:08:05 UTC 2014

On Wed, Feb 19, 2014 at 4:50 PM, Michael Peek <peek at nimbios.org> wrote:
> Thanks for the quick reply.
>
> On 02/19/2014 03:15 PM, James wrote:
>> Short answer, it sounds like you'd benefit from playing with a test
>> cluster... Would I be correct in guessing that you haven't set up a
>> gluster pool yet? You might want to look at:
>> https://ttboj.wordpress.com/2014/01/08/automatically-deploying-glusterfs-with-puppet-gluster-vagrant/
>> This way you can try them out easily...
>
> You're close.  I've got a test cluster up and running now, and I'm about
> to go postal on it to see just in how many different ways I can break
> it, and what I need to know to bring it back to life.
"Go postal on it" -- I like this.
Remember: if you break it, you get to keep both pieces!

>
>> For some of those points... solve them with...
>>>  Sort of a crib notes for things like:
>>>
>>> 1) What do you do if you see that a drive is about to fail?
>>>
>>> 2) What do you do if a drive has already failed?
>> RAID6
>
> Derp.  Shoulda seen that one.
Typically on iron, people will have between 2 and N different bricks,
each composed of a RAID6 set. Other setups are possible depending on
what kind of engineering you're doing.

>
>>> 3) What do you do if a peer is about to fail?
>> Get a new peer ready...
>
> Here's what I think needs to happen, correct me if I've got this wrong:
> 1) Set up a new host with gluster installed
> 2) From the new host, probe one of the other peers (or from one of the
> other peers, probe the new host)
The pool has to probe the peer. Not the other way around...
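
A minimal sketch of that direction of probing, meant to be run from a
host that is already in the pool. The `run` wrapper, the `DRY_RUN`
switch, and the `add_peer` function are illustrative names of mine, not
part of gluster; `new-host` is just the placeholder from this thread. By
default it only prints the commands so you can read them first.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a host that is ALREADY in the pool to run them for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# add_peer HOST: probe HOST from the pool side, then list the pool's peers.
add_peer() {
    run gluster peer probe "$1"
    run gluster peer status
}

# "new-host" is the placeholder host name used in this thread.
add_peer "${NEW_HOST:-new-host}"
```

After a real run, `gluster peer status` on the probing host should list
the new peer as connected.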

> 3) gluster volume replace-brick volname failing-host:/failing/brick
> new-host:/new/brick start
In the latest gluster, replace-brick is going away... It's turning into
add-brick/remove-brick...
Try it out with a vagrant setup to get comfortable with it!
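
For illustration only, here is roughly what the add-brick/remove-brick
route could look like. It ASSUMES a pure replica 2 volume with a single
pair of bricks; a distributed-replicate volume needs bricks added per
replica set and will differ. The `run`, `grow`, and `shrink` helpers and
the `DRY_RUN` switch are my own names, and the volume and brick names
are the placeholders from this thread. Check the exact syntax against
your gluster version on a test cluster before trusting it.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholders taken from the thread; override via the environment.
VOLNAME="${VOLNAME:-volname}"
OLD_BRICK="${OLD_BRICK:-failing-host:/failing/brick}"
NEW_BRICK="${NEW_BRICK:-new-host:/new/brick}"

# Step 1: add the new brick as a third replica and start a full heal onto it.
grow() {
    run gluster volume add-brick "$VOLNAME" replica 3 "$NEW_BRICK"
    run gluster volume heal "$VOLNAME" full
    run gluster volume heal "$VOLNAME" info
}

# Step 2: only once "heal info" shows nothing pending, drop the old brick
# and go back to two replicas.
shrink() {
    run gluster volume remove-brick "$VOLNAME" replica 2 "$OLD_BRICK" force
}

case "${1:-plan}" in
    grow)   grow ;;
    shrink) shrink ;;
    *)      DRY_RUN=1; grow; shrink ;;  # default: just print the whole plan
esac
```

The two steps are kept separate on purpose: removing the old brick
before the heal has finished would leave you with fewer good copies
than you started with.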

>
> Find out how it's going with:
> gluster volume replace-brick volname failing-host:/failing/brick
> new-host:/new/brick status
>
>>> 4) What do you do if a peer has failed?
>> Replace with new peer...
>>
>
> Same steps as (3) above, then:
> 4) gluster volume heal volname
> to begin copying data over from a replica.
>
>>> 5) What do you do to reinstall a peer from scratch (i.e. what
>>> configuration files/directories do you need to restore to get the host
>>> back up and talking to the rest of the cluster)?
>> Bring up a new peer. Add to cluster... Same as failed peer...
>>
>
>
>>> 6) What do you do with failed-heals?
>>> 7) What do you do with split-brains?
>> These are more complex issues and a number of people have written about them...
>> Eg: http://joejulian.name/blog/fixing-split-brain-with-glusterfs-33/
>
> This covers split-brain, but what about failed-heal?  Do you do the same
> thing?
Depends on what has happened... Look at the logs, see what's going on.
Oh, make sure you aren't running out of disk space, because bad things
could happen... :P
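
A rough first-pass checklist for a failed heal, as a sketch rather than
a recipe. The `run` and `triage_heal` helpers and the `DRY_RUN` switch
are illustrative names of mine; `volname` and `/new/brick` are
placeholders. The log path is the usual default location for the
self-heal daemon log and may differ on your install. Releases from
around the time of this thread also accepted `info heal-failed`, which
I believe was dropped later, so it is left out here.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# triage_heal VOLUME BRICK_PATH
#   VOLUME      gluster volume name
#   BRICK_PATH  local filesystem path of a brick on this host
triage_heal() {
    # What is still waiting to be healed, and is any of it split-brain?
    run gluster volume heal "$1" info
    run gluster volume heal "$1" info split-brain
    # Are all bricks online, and how full are they?
    run gluster volume status "$1" detail
    run df -h "$2"
    # Recent self-heal daemon messages (assumed default log location).
    run tail -n 50 /var/log/glusterfs/glustershd.log
}

triage_heal "${VOLNAME:-volname}" "${BRICK_PATH:-/new/brick}"
```

If the bricks turn out to be full, free up space first; retrying the
heal before that is unlikely to help.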

>
> Michael

HTH
James