[Gluster-infra] Investigating random votes in Gerrit

Michael Scherer mscherer at redhat.com
Thu Jun 9 09:31:11 UTC 2016


On Thursday, 09 June 2016 at 14:32 +0530, Kaushal M wrote:
> On Thu, Jun 9, 2016 at 12:14 PM, Kaushal M <kshlmster at gmail.com> wrote:
> > On Thu, Jun 9, 2016 at 12:03 PM, Saravanakumar Arumugam
> > <sarumuga at redhat.com> wrote:
> >> Hi Kaushal,
> >>
> >> One of the patches (http://review.gluster.org/#/c/14653/) is
> >> failing in NetBSD.
> >> Its log:
> >> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/15624/
> >>
> >> But the patch mentioned in the NetBSD job is a different
> >> one (http://review.gluster.org/#/c/13872/).
> >>
> >
> > Yup. We know this is happening, but don't know why yet. I'll keep this
> > thread updated with any findings I have.
> >
> >> Thanks,
> >> Saravana
> >>
> >>
> >>
> >> On 06/09/2016 11:52 AM, Kaushal M wrote:
> >>>
> >>> In addition to the builder issues we're having, we are also facing
> >>> problems with jenkins voting/commenting randomly.
> >>>
> >>> The comments generally link to older jobs for older patchsets, which
> >>> were run about 2 months back (beginning of April). For example,
> >>> https://review.gluster.org/14665 has a NetBSD regression +1 vote from
> >>> a job run in April for review 13873, which actually failed.
> >>>
> >>> Another observation that I've made is that these fake votes sometimes
> >>> provide a -1 Verified. Jenkins shouldn't be using this flag anymore.
> >>>
> >>> These 2 observations make me wonder if another Jenkins instance is
> >>> running somewhere, possibly from our old backups? Michael, could this
> >>> be possible?
> >>>
> >>> To check where these votes/comments were coming from, I tried
> >>> checking the Gerrit sshd logs. This wasn't helpful, because all logins
> >>> apparently happen from 127.0.0.1. This is probably some firewall rule
> >>> that was set up post-migration, because I see older logs giving
> >>> proper IPs. I'll need Michael's help with fixing this, if possible.
> >>>
> >>> I'll continue to investigate, and update this thread with anything I find.
> >>>
> 
> My guess was right!!
> 
> This problem should now be fixed, as well as the problem with the builders.
> The cause for both is the same: our old jenkins server, back from the
> dead (zombie-jenkins from now on).
> 
> The hypervisor at iWeb which hosted our services earlier, and which was
> supposed to be off, had started up about 4 days back. This brought back
> zombie-jenkins.
> 
> Zombie-jenkins continued from where it left off around early April. It
> started getting gerrit events, and started running jobs for them.
> Zombie-jenkins started numbering jobs from where it left off, and used
> these numbers when reporting back to gerrit.
> But these job numbers had already been used by new-jenkins about 2
> months back when it started.
> This is why the links in the comments pointed to the old jobs in new-jenkins.
> I've checked logs on Gerrit (with help from Michael) and can verify
> that these comments/votes did come from zombie-jenkins's IP.
> 
> Zombie-jenkins also explains the random build failures being seen on
> the builders.
> Zombie-jenkins and new-jenkins each thought they had the slaves to
> themselves and launched jobs on them, causing jobs to clash sometimes,
> which resulted in the random failures reported in new-jenkins.
> I've yet to log in to a slave and verify this, but I'm pretty sure this
> is what happened.
> 
> For now, Michael has stopped the iWeb hypervisor and zombie-jenkins.
> This should stop any more random comments in Gerrit and failures in Jenkins.

Well, I just stopped the 3 VMs, and disabled them on boot (both xen and
libvirt), so they shouldn't cause much trouble anymore.
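
For the record, the kind of commands involved are roughly the following
(just a sketch; the domain/config names below are placeholders, not the
real ones on that hypervisor):

    # libvirt guests: dominfo shows the "Autostart:" flag; disable it
    virsh dominfo old-jenkins                # placeholder domain name
    virsh autostart --disable old-jenkins
    # xen (xm/xendomains): autostart is driven by symlinks in /etc/xen/auto/
    ls -l /etc/xen/auto/
    rm /etc/xen/auto/old-jenkins.cfg         # placeholder config name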

> I'll get Michael (once he's back on Monday) to figure out why
> zombie-jenkins restarted,
> and write up a proper postmortem about the issues.

Oh, that part is easy to guess. We did ask iWeb to stop the server;
that was supposed to happen around the end of May (I need to dig through
my mail) and I guess they did. The logs stop at 29 May.

Then someone saw it was down and restarted it around the 4th of June
at 9h25. However, the server did not seem to have ntp running, so the
time was off by 4h, and I am not sure if someone started it at 9h25 EDT
or 5h25 EDT. As the server is in Montreal, I would assume 9h25 is a
reasonable time, but then, losing 4h in 4 days is a bit concerning.
(At the same time, someone working at 5h in the morning would explain
why the wrong server got restarted; I am also not that fresh at that
time usually.)
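
A quick way to double-check the ntp situation on that host for the
postmortem would be something like this (assuming the usual ntp tools
are installed there; nothing below changes the clock):

    # is ntpd running at all?
    service ntpd status
    # show configured peers and current offsets, if ntpd is up
    ntpq -p
    # one-off offset check against a public pool server (query only)
    ntpdate -q pool.ntp.org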

Then, as the VMs were configured to start on boot, they all came back
~4 days ago.

I guess digging more requires us to contact iWeb, which can be done (we
have one ex-iWeb person on the RDO project, who still has good insider
contacts).

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS




More information about the Gluster-infra mailing list