[Gluster-users] Quorum and reboots

Thu Mar 10 16:24:43 UTC 2016

So, like many, I probably thought I had done my research and understood what
would happen when rebooting a brick/node, only to find out I was wrong.

In my mind, I had a 1x3 replicate, so I could do a rolling reboot and the
bricks would heal up.  However, looking at the oVirt logs, shortly after the
rebooted brick came up all VMs started pausing/going unresponsive.  At the
time I was puzzled and freaked out.  The next morning on my run I think I
found the error in my logic and in my reading comprehension of my research.
Once the 3rd brick came up it had to heal any changes to all the VMs.
Healing is file based, not block based, so it saw multi-GB files that it had
to recopy over.  It had to halt all writes to those files while that
occurred, or it would be a never-ending cycle of re-copying the large
images.  So the fact that most VMs went haywire isn't that odd.  Based on
the timing of the alerts, it does look like the 2 bricks that were up kept
serving images until the 3rd brick came back.  It did heal all images just
fine.

So, knowing what I believe I now know, you can't really do what I had hoped
and just reboot one brick and have the VMs stay up the whole time.  In order
to achieve something like that I'd need a 2nd set of bricks I could live
storage migrate to.

Am I understanding correctly how that works?

I could also look at minimizing downtime by moving to sharding, so that
the heal would only need to copy smaller files.  However, I'd still
potentially end up with paused VMs unless those heals were pretty quick.
It's probably safest to plan downtime for the VMs, or to work out a storage
migration plan if I had a real need for a high number of 9's of uptime.
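
To put rough numbers on the sharding idea, here is a small Python sketch.
It only models the behavior as I described it above (a changed file gets
recopied whole, a sharded file only has its changed shards recopied); it is
not Gluster's actual heal algorithm.  The function names, the 50 GiB image,
the 64 MiB shard size and the write offsets are all made-up assumptions for
illustration.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def whole_file_heal_bytes(image_size, dirty_offsets):
    """Assumed model: any write while the brick was down means the
    whole image file is recopied."""
    return image_size if dirty_offsets else 0


def sharded_heal_bytes(image_size, shard_size, dirty_offsets):
    """Assumed model: only the shards containing a changed offset
    are recopied."""
    dirty_shards = {
        off // shard_size for off in dirty_offsets if 0 <= off < image_size
    }
    total = 0
    for idx in dirty_shards:
        start = idx * shard_size
        # The last shard of an image may be shorter than shard_size.
        total += min(shard_size, image_size - start)
    return total


# Hypothetical VM image and write pattern during the reboot window.
image = 50 * GiB
shard = 64 * MiB
writes = [0, 10 * MiB, 5 * GiB]  # the first two land in the same shard

print("whole file:", whole_file_heal_bytes(image, writes) // MiB, "MiB")
print("sharded:   ", sharded_heal_bytes(image, shard, writes) // MiB, "MiB")
```

Under those assumptions the heal has far less data to move per image with
sharding, but the total still depends on how many distinct shards the VMs
touched while the brick was down, so a busy VM could still take a while.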