[Gluster-users] Just coming out of a nightmare scenario

Fri Nov 16 18:23:50 UTC 2012

Wow.  I have 2 replica servers that host VM's via GlusterNFS.  uCARP handles the IP failover if one system dies.

Both systems were running fine.  In fact, I was logged into the backup NFS server watching a large file get created on a RAID-6 array exported via Gluster.

NAS-1 - primary GlusterNFS server
NAS-2 - backup GlusterNFS server

Suddenly, all VM's stopped responding.  NAS-1 showed 400% CPU usage (4 cores at 100%).  I waited about 30 seconds to see if things would come back to normal, but no go.  I shut down NAS-1 to let the failover take place and bring NAS-2 on-line.

NAS-2 grabbed the IP address, but my Citrix XenServers were not reconnecting.  Uh oh.  I rebooted NAS-1 to bring it back up, and it booted into initramfs.  Crap.

I kept monitoring NAS-2, trying to figure out what was going on.  Ten minutes later, I realized NAS-2 had lost 4 of the 6 drives in its RAID-6 array.  Double crap.  The GlusterNFS server kept returning errors, since the RAID device was really no longer there.

Did some Google searching, and ended up typing 'exit' at the initramfs prompt on NAS-1.  The system came up fine.  I killed NAS-2 so the IP would fail back to NAS-1.  XenServer reconnected, but all the VM's needed rebooting.  15 minutes with no disk is too long.

I fired up NAS-2 to see what the heck was going on.  I had lost a motherboard SATA controller.  The timing could not have been worse.  I moved the drives to a PCIe SATA card, booted the system, rebuilt the RAID, and voila, everything is back up and syncing.

Just goes to show, even tested failovers can fail.

Not a fun evening.

Gerald