[Gluster-infra] Busted VM's in Rackspace

Justin Clift justin at gluster.org
Tue Sep 9 23:16:25 UTC 2014


Interestingly, three of the slave VMs in Rackspace died today.

 * slave21, slave22, slave24

They stopped responding to ssh/ping/network traffic, and
rebooting doesn't help.  I've rebuilt slave22, and I'm about
to rebuild slave24.

But you mentioned a while ago that we should probably
investigate problems like this properly, instead of just
rebuilding the VMs.

So, instead of rebuilding the slave21 VM, I've put it into
"rescue mode" in case you want to take a look.

In Rackspace's rescue mode, a new temporary VM is created
using the same IP address as the busted VM.  The busted VM's
filesystems are attached to it, so fsck/xfs_check/etc. can
be run on them as desired:

  /dev/xvdb1 - root filesystem of busted VM. ext3.
  /dev/xvde  - regression testing filesystem of busted VM. xfs.
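For anyone else poking at it, the attached filesystems are best mounted read-only first so nothing on the damaged disk gets touched. A minimal sketch (device names as listed above; the /mnt mount points are my own choice, not anything Rackspace sets up):

```shell
#!/bin/sh
# Sketch: mount the busted VM's filesystems read-only from the rescue VM.
# Device names follow the list above; the mount points are arbitrary.
mount_ro() {
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "skipping $dev: no such block device on this host"
        return 0
    fi
    mnt="/mnt/$(basename "$dev")"
    mkdir -p "$mnt"
    # ro,noexec keeps the damaged filesystem untouched while we inspect it
    mount -o ro,noexec "$dev" "$mnt" && echo "$dev mounted at $mnt"
}

mount_ro /dev/xvdb1   # ext3 root filesystem of the busted VM
mount_ro /dev/xvde    # xfs regression-testing filesystem
```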

Running fsck and xfs_check/xfs_repair on the filesystems
turned up nothing; no filesystem corruption was reported.

Looking at the boot.log for slave21 (/dev/xvdb1 partition),
it definitely doesn't seem normal:

*************************************************************

# cat boot.log 
                Welcome to CentOS 
Starting udev: /bin/chown: invalid user: `root:disk'
/bin/chown: invalid user: `root:disk'
/bin/chown: invalid user: `root:disk'
/bin/chown: invalid user: `root:disk'
/bin/chown: invalid user: `root:disk'
/bin/chown: invalid user: `root:disk'
/bin/chown: invalid user: `root:disk'
/bin/chown: invalid user: `root:disk'
/bin/chown: invalid user: `root:lp'
/bin/chown: invalid user: `root:lp'
/bin/chown: invalid user: `root:lp'
/bin/chown: invalid user: `root:lp'
udevd[282]: specified user 'vcsa' unknown
udevd[282]: specified user 'vcsa' unknown

udevd[282]: can not read '/etc/udev/rules.d/70-persistent-net.rules'
udevd[282]: can not read '/etc/udev/rules.d/70-persistent-net.rules'

udevd[282]: can not read '/lib/udev/rules.d/75-persistent-net-generator.rules'
udevd[282]: can not read '/lib/udev/rules.d/75-persistent-net-generator.rules'

udevd[282]: can not read '/etc/udev/rules.d/80-net-name-slot.rules'
udevd[282]: can not read '/etc/udev/rules.d/80-net-name-slot.rules'

[  OK  ]
Setting hostname slave21.cloud.gluster.org:  [  OK  ]
Setting up Logical Volume Management:   No volume groups found
[  OK  ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/xvda1 
/dev/xvda1: clean, 79747/2621440 files, 1198520/10485504 blocks
[  OK  ]
Remounting root filesystem in read-write mode:  [  OK  ]
Mounting local filesystems:  [  OK  ]
chown: invalid user: `root:root'
Enabling /etc/fstab swaps:  [  OK  ]
telinit: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
init: rcS post-stop process (515) terminated with status 1

*************************************************************
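Those repeated `invalid user` lines make me wonder whether udev simply can't resolve those users/groups any more, i.e. whether the account files on the root filesystem are damaged. Just a guess, but one quick check from rescue mode (assuming the root filesystem is mounted at /mnt/xvdb1; the user/group names are taken from the log above):

```shell
#!/bin/sh
# Sketch: check whether the users/groups the boot log complained about
# still resolve from the busted VM's account files.  /mnt/xvdb1 is an
# assumed rescue-mode mount point for the root filesystem, not a given.
check_accounts() {
    etc=$1   # path to the mounted root's /etc directory
    rc=0
    for user in root vcsa; do
        grep -q "^$user:" "$etc/passwd" || { echo "missing user: $user"; rc=1; }
    done
    for group in disk lp; do
        grep -q "^$group:" "$etc/group" || { echo "missing group: $group"; rc=1; }
    done
    [ $rc -eq 0 ] && echo "all entries present"
    return $rc
}

check_accounts /mnt/xvdb1/etc || echo "account files look damaged or are not mounted"
```

If the entries are all present, the next suspects would be the NSS libraries themselves rather than the files.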

Any ideas?

Regards and best wishes,

Justin Clift

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift


