[Gluster-infra] Unplanned Gerrit Outage yesterday

Fri Nov 3 16:12:58 UTC 2017

On Thursday, 02 November 2017 at 16:16 +0100, Nigel Babu wrote:
> Hello folks,
> 
> Yesterday, we had an unplanned Gerrit outage. We have now determined
> that the machine rebooted for some reason. Michael is continuing to
> debug what led to this issue. At this point, Gerrit does not start
> automatically when the VM restarts.

So I investigated, and ... *drum roll*

it's a kernel crash.

I suspect a weird race condition somewhere, given the traceback I got
with crash:

    [exception RIP: shmem_free_inode+19]
    RIP: ffffffff81198c23  RSP: ffff8816a232fd28  RFLAGS: 00010246
    RAX: ffff8817092dd440  RBX: 0000000000000000  RCX: 0000000100400009
    RDX: 000000010040000a  RSI: ffffea005c24b740  RDI: ffff8812ad30a800
    RBP: ffff8816a232fd38   R8: ffff8817092dd440   R9: 0000000100400009
    R10: 00000000092dd201  R11: ffffea005c24b740  R12: ffff8817092dd440
    R13: ffff880f6d8cc000  R14: ffff880f6d8cc018  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8816a232fd40] shmem_evict_inode at ffffffff8119d21f
#11 [ffff8816a232fd70] evict at ffffffff8121a2e7
#12 [ffff8816a232fd98] iput at ffffffff8121ab85
#13 [ffff8816a232fdc8] devpts_del_ref at ffffffff81283868
#14 [ffff8816a232fde0] pty_unix98_shutdown at ffffffff813ec526
#15 [ffff8816a232fdf8] release_tty at ffffffff813e1477
#16 [ffff8816a232fe10] tty_release at ffffffff813e26dd
#17 [ffff8816a232fea8] __fput at ffffffff81200109
#18 [ffff8816a232fef0] ____fput at ffffffff812003be
#19 [ffff8816a232ff00] task_work_run at ffffffff810ace97
#20 [ffff8816a232ff30] do_notify_resume at ffffffff8102ab22
#21 [ffff8816a232ff50] int_signal at ffffffff81696dbd

This is from running:

    crash /usr/lib/debug/lib/modules/3.10.0-514.10.2.el7.x86_64/vmlinux ./vmcore

The only useful entry I found in dmesg is:

[20035599.848892] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds.  Have a nice day...

I didn't find an open bug about it, and I guess that unless I can
reproduce it, I can't do much.
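
If someone wants to check whether other machines have logged the same
warning, a minimal sketch along these lines could scan saved dmesg
output for it. The function name and the regular expression are my own
illustration, and the pattern assumes the usual "[seconds.micros]"
timestamp prefix seen in the entry above:

```python
import re
import sys

# Matches a dmesg line such as:
# "[20035599.848892] VFS: Busy inodes after unmount of tmpfs. ..."
# Assumption: the line starts with a "[seconds.micros]" timestamp.
BUSY_INODES = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"VFS: Busy inodes after unmount of (?P<fs>\S+?)\."
)


def find_busy_inode_warnings(lines):
    """Return (timestamp, filesystem) pairs for each matching line."""
    hits = []
    for line in lines:
        match = BUSY_INODES.match(line)
        if match:
            hits.append((float(match.group("ts")), match.group("fs")))
    return hits


if __name__ == "__main__":
    # Reads kernel log text from standard input.
    for ts, fs in find_busy_inode_warnings(sys.stdin):
        print(f"{ts:.6f}  busy inodes after unmount of {fs}")
```

It reads the log from standard input, so the output of dmesg can be
piped straight into it.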

So for me, the case is closed (apart from the systemd unit file, which
we have been testing for 4 days:
https://github.com/gluster/gluster.org_ansible_configuration/commit/9aa279acf9316eae6ff7afff36ad630fc42edeff )

> We are currently testing a systemd unit file for Gerrit in staging.
> Once that's in place, we can ensure that we start Gerrit automatically
> when we restart the server.
> 
> Timeline of events (in CET):

16:24:24  kernel crash
16:26     kernel start

> 16:29 - I receive an alert that Gerrit is down. This goes ignored
> because we're still working on Jenkins.
>
> 18:25 - I notice the alerts as we're packing up for the day and start
> Gerrit.
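
To put numbers on the timeline, here is a rough sketch that computes
how long after the crash each event happened. The event labels are my
own wording, and entries given without seconds are assumed to be at
:00, so the results are only accurate to about a minute:

```python
from datetime import datetime, timedelta

# Times are taken from the timeline above (CET).
# Assumption: entries listed without seconds happened at :00.
EVENTS = [
    ("kernel crash", "16:24:24"),
    ("kernel start", "16:26:00"),
    ("alert that Gerrit is down", "16:29:00"),
    ("Gerrit started by hand", "18:25:00"),
]


def elapsed_since_first(events):
    """Return (label, timedelta since the first event) for each event."""
    parsed = [(label, datetime.strptime(t, "%H:%M:%S")) for label, t in events]
    start = parsed[0][1]
    return [(label, when - start) for label, when in parsed]


if __name__ == "__main__":
    for label, delta in elapsed_since_first(EVENTS):
        print(f"{str(delta):>8}  {label}")
```

Under those assumptions, Gerrit was down for roughly two hours, almost
all of it after the machine itself was already back up.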

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
