[Gluster-infra] Unplanned Gerrit Outage yesterday
Michael Scherer
mscherer at redhat.com
Fri Nov 3 16:12:58 UTC 2017
On Thursday 02 November 2017 at 16:16 +0100, Nigel Babu wrote:
> Hello folks,
>
> Yesterday, we had an unplanned Gerrit outage. We have now determined
> that the machine rebooted for some reason. Michael is continuing to
> debug what led to this issue. At this point, Gerrit does not start
> automatically when the VM restarts.
So I investigated, and ..... *drum roll*
it was a kernel crash.
I suspect some weird race condition somewhere, given the
traceback I got with crash:
[exception RIP: shmem_free_inode+19]
RIP: ffffffff81198c23 RSP: ffff8816a232fd28 RFLAGS: 00010246
RAX: ffff8817092dd440 RBX: 0000000000000000 RCX: 0000000100400009
RDX: 000000010040000a RSI: ffffea005c24b740 RDI: ffff8812ad30a800
RBP: ffff8816a232fd38 R8: ffff8817092dd440 R9: 0000000100400009
R10: 00000000092dd201 R11: ffffea005c24b740 R12: ffff8817092dd440
R13: ffff880f6d8cc000 R14: ffff880f6d8cc018 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff8816a232fd40] shmem_evict_inode at ffffffff8119d21f
#11 [ffff8816a232fd70] evict at ffffffff8121a2e7
#12 [ffff8816a232fd98] iput at ffffffff8121ab85
#13 [ffff8816a232fdc8] devpts_del_ref at ffffffff81283868
#14 [ffff8816a232fde0] pty_unix98_shutdown at ffffffff813ec526
#15 [ffff8816a232fdf8] release_tty at ffffffff813e1477
#16 [ffff8816a232fe10] tty_release at ffffffff813e26dd
#17 [ffff8816a232fea8] __fput at ffffffff81200109
#18 [ffff8816a232fef0] ____fput at ffffffff812003be
#19 [ffff8816a232ff00] task_work_run at ffffffff810ace97
#20 [ffff8816a232ff30] do_notify_resume at ffffffff8102ab22
#21 [ffff8816a232ff50] int_signal at ffffffff81696dbd
crash /usr/lib/debug/lib/modules/3.10.0-514.10.2.el7.x86_64/vmlinux ./vmcore
The only useful entry in dmesg I found is:
[20035599.848892] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds. Have a nice day...
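As a side note, the bracketed dmesg timestamp can be read as seconds since boot, which gives a rough idea of how long the VM had been up before the message was logged. A minimal sketch of that arithmetic, assuming the timestamp is plain seconds since boot (the helper name dmesg_uptime is made up for illustration):

```python
from datetime import timedelta


def dmesg_uptime(line: str) -> timedelta:
    """Return the bracketed dmesg timestamp as a timedelta.

    Assumption: the value between '[' and ']' is seconds since boot.
    """
    stamp = line[line.index("[") + 1 : line.index("]")]
    return timedelta(seconds=float(stamp))


line = "[20035599.848892] VFS: Busy inodes after unmount of tmpfs."
uptime = dmesg_uptime(line)

# Split into whole days and the remaining hours:minutes:seconds.
print(uptime.days, "days")                 # 231 days
print(timedelta(seconds=uptime.seconds))   # 21:26:39
```

So the machine had been running for roughly 232 days when that line was written.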
I didn't find an open bug about it, and I guess that unless I can
reproduce it, I can't do much.
So for me, that's case closed (apart from the systemd unit, which we
have been testing for 4 days:
https://github.com/gluster/gluster.org_ansible_configuration/commit/9aa279acf9316eae6ff7afff36ad630fc42edeff )
> We are currently testing a systemd unit file for Gerrit in staging.
> Once that's in place, we can ensure that we start Gerrit
> automatically when we restart the server.
>
> Timeline of events (in CET):
16:24:24 kernel crash
16:26 kernel start
> 16:29 - I receive an alert that Gerrit is down. This goes ignored
> because we're still working on Jenkins.
>
> 18:25 - I notice the alerts as we're packing up for the day and
> start Gerrit.
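For reference, the outage length can be worked out from the timeline above. A small sketch, assuming the times given only to the minute (16:29 and 18:25) fall at :00 seconds and that all events happened on the same day; the helper name elapsed is made up for illustration:

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Time between two same-day clock times written as HH:MM:SS."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


crash = "16:24:24"    # kernel crash, from the timeline
alert = "16:29:00"    # Gerrit-down alert (seconds assumed)
restart = "18:25:00"  # Gerrit started by hand (seconds assumed)

print(elapsed(crash, restart))  # 2:00:36
print(elapsed(alert, restart))  # 1:56:00
```

That is about two hours of downtime, nearly all of it after the alert had already fired.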
--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS