[Gluster-infra] Investigating random votes in Gerrit

Thu Jun 9 09:45:37 UTC 2016

On Thu, Jun 9, 2016 at 3:01 PM, Michael Scherer <mscherer at redhat.com> wrote:
> On Thursday, 09 June 2016 at 14:32 +0530, Kaushal M wrote:
>> On Thu, Jun 9, 2016 at 12:14 PM, Kaushal M <kshlmster at gmail.com> wrote:
>> > On Thu, Jun 9, 2016 at 12:03 PM, Saravanakumar Arumugam
>> > <sarumuga at redhat.com> wrote:
>> >> Hi Kaushal,
>> >>
>> >> One of the patches (http://review.gluster.org/#/c/14653/) is
>> >> failing in NetBSD.
>> >> Its log:
>> >> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/15624/
>> >>
>> >> But the patch mentioned in the NetBSD log is a different one
>> >> (http://review.gluster.org/#/c/13872/).
>> >>
>> >
>> > Yup. We know this is happening, but don't know why yet. I'll keep this
>> > thread updated with any findings I have.
>> >
>> >> Thanks,
>> >> Saravana
>> >>
>> >>
>> >>
>> >> On 06/09/2016 11:52 AM, Kaushal M wrote:
>> >>>
>> >>> In addition to the builder issues we're having, we are also facing
>> >>> problems with jenkins voting/commenting randomly.
>> >>>
>> >>> The comments generally link to older jobs for older patchsets, which
>> >>> were run about 2 months back (beginning of April). For example,
>> >>> https://review.gluster.org/14665 has a netbsd regression +1 vote, from
>> >>> a job run in April for review 13873, and which actually failed.
>> >>>
>> >>> Another observation that I've made is that these fake votes sometimes
>> >>> give a -1 Verified. Jenkins shouldn't be using this flag anymore.
>> >>>
>> >>> These two observations make me wonder if another jenkins instance is
>> >>> running somewhere, from our old backups possibly? Michael, could this
>> >>> be possible?
>> >>>
>> >>> To check where these votes/comments were coming from, I tried
>> >>> checking the Gerrit sshd logs. This wasn't helpful, because all logins
>> >>> apparently happen from 127.0.0.1. This is probably some firewall rule
>> >>> that has been set up post-migration, because I see older logs giving
>> >>> proper IPs. I'll require Michael's help with fixing this, if possible.
>> >>>
>> >>> I'll continue to investigate, and update this thread with anything I find.
>> >>>
>>
>> My guess was right!!
>>
>> This problem should now be fixed, as well as the problem with the builders.
>> The cause for both is the same: our old jenkins server, back from the
>> dead (zombie-jenkins from now on).
>>
>> The hypervisor at iWeb that hosted our services earlier, and which was
>> supposed to be off,
>> started up about 4 days back. This brought back zombie-jenkins.
>>
>> Zombie-jenkins continued from where it left off around early April. It
>> started getting gerrit events, and started running jobs for them.
>> Zombie-jenkins started numbering jobs from where it left off, and used
>> these numbers when reporting back to gerrit.
>> But these job numbers had already been used by new-jenkins about 2
>> months back when it started.
>> This is why the links in the comments pointed to the old jobs in new-jenkins.
>> I've checked logs on Gerrit (with help from Michael) and can verify
>> that these comments/votes did come from zombie-jenkins's IP.
>>
>> Zombie-jenkins also explains the random build failures being seen on
>> the builders.
>> Zombie-jenkins and new-jenkins each thought they had the slaves to
>> themselves and launched jobs on them,
>> causing jobs to clash sometimes, which resulted in random failures
>> reported in new-jenkins.
>> I have yet to log in to a slave and verify this, but I'm pretty sure this
>> is what happened.
>>
>> For now, Michael has stopped the iWeb hypervisor and zombie-jenkins.
>> This should stop any more random comments in Gerrit and failures in Jenkins.
>
> Well, I just stopped the 3 VMs and disabled them on boot (both xen and
> libvirt), so they should not cause much trouble.

I hope something better than fire was used this time, it wasn't
effective last time.

>
>> I'll get Michael (once he's back on Monday) to figure out why
>> zombie-jenkins restarted,
>> and write up a proper postmortem about the issues.
>
> Oh, that part is easy to guess. We asked iWeb to stop the server;
> that was supposed to happen around the end of May (I need to dig through
> my mail), and I guess they did. The logs stop on 29 May.
>
> Then someone saw it was down and restarted it around the 4th of June
> at 9h25. However, the server does not seem to have had NTP running, so
> the time was off by 4h, and I am not sure whether someone started it at
> 9h25 EDT or 5h25 EDT. As the server is in Montreal, I would assume 9h25
> is a reasonable time, but then, losing 4h in 4 days is a bit concerning.
> (At the same time, someone working at 5h in the morning would explain
> why the wrong server got restarted; I am also not that fresh at that
> time, usually.)
>
> Then, as the VMs were configured to start on boot, they all came back
> ~4 days ago.
>
> I guess digging more requires us to contact iWeb, which can be done (we
> have one ex-iWeb person on the RDO project who still has good insider
> contacts).

This should be enough for writing up the postmortem.
I'm now trying to get proper evidence of zombie-jenkins causing the
build failures.
I'll write up the postmortem after that.

>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>