[Gluster-infra] Investigating random votes in Gerrit

Kaushal M kshlmster at gmail.com
Thu Jun 9 09:02:04 UTC 2016

On Thu, Jun 9, 2016 at 12:14 PM, Kaushal M <kshlmster at gmail.com> wrote:
> On Thu, Jun 9, 2016 at 12:03 PM, Saravanakumar Arumugam
> <sarumuga at redhat.com> wrote:
>> Hi Kaushal,
>> One of the patches (http://review.gluster.org/#/c/14653/) is
>> failing in NetBSD.
>> Its log:
>> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/15624/
>> But the patch mentioned in the NetBSD job log is a different
>> one (http://review.gluster.org/#/c/13872/).
> Yup. We know this is happening, but don't know why yet. I'll keep this
> thread updated with any findings I have.
>> Thanks,
>> Saravana
>> On 06/09/2016 11:52 AM, Kaushal M wrote:
>>> In addition to the builder issues we're having, we are also facing
>>> problems with jenkins voting/commenting randomly.
>>> The comments generally link to older jobs for older patchsets, which
>>> were run about 2 months back (beginning of April). For example,
>>> https://review.gluster.org/14665 has a NetBSD regression +1 vote from
>>> a job run in April for review 13873, and that job actually failed.
>>> Another observation I've made is that these fake votes sometimes
>>> carry a -1 Verified. Jenkins shouldn't be using this flag anymore.
>>> These two observations make me wonder if another Jenkins instance is
>>> running somewhere, possibly from our old backups. Michael, could this
>>> be possible?
>>> To check where these votes/comments were coming from, I tried
>>> checking the Gerrit sshd logs. This wasn't helpful, because every
>>> login appears to come from the same address. This is probably some
>>> firewall rule that was set up post-migration, because I see older
>>> logs giving proper IPs. I'll require Michael's help with fixing
>>> this, if possible.
>>> I'll continue to investigate, and update this thread with anything I find.

My guess was right!!

This problem should now be fixed, as well as the problem with the builders.
The cause for both is the same: our old jenkins server, back from the
dead (zombie-jenkins from now on).

The iWeb hypervisor that hosted our services earlier, and which was
supposed to be off, started up again about 4 days back. This brought
back zombie-jenkins.

Zombie-jenkins continued from where it left off around early April. It
started getting Gerrit events, and started running jobs for them.
Zombie-jenkins numbered jobs from where it had left off, and used
these numbers when reporting back to Gerrit. But these job numbers
had already been used by new-jenkins, which started about 2 months
back. This is why the links in the comments pointed to the old jobs
on new-jenkins.
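The build-number collision is easy to picture: Jenkins keeps each job's next build number in a plain-text nextBuildNumber file under the job's directory, so two masters that start from the same saved state hand out the same numbers independently. A minimal sketch of this (the paths, job name, and starting number below are hypothetical):

```shell
# Simulate two Jenkins masters restored from the same state. Jenkins
# stores the next build number per job in a plain-text file at
# $JENKINS_HOME/jobs/<job>/nextBuildNumber.
mkdir -p /tmp/old-jenkins/jobs/regression /tmp/new-jenkins/jobs/regression
echo 15624 | tee /tmp/old-jenkins/jobs/regression/nextBuildNumber \
             > /tmp/new-jenkins/jobs/regression/nextBuildNumber

# Each master now increments its own copy independently, so both will
# report the same build numbers for different runs -- producing links
# in Gerrit that point at someone else's old job.
old=$(cat /tmp/old-jenkins/jobs/regression/nextBuildNumber)
new=$(cat /tmp/new-jenkins/jobs/regression/nextBuildNumber)
echo "old-jenkins next build: $old, new-jenkins next build: $new"
```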
I've checked logs on Gerrit (with help from Michael) and can verify
that these comments/votes did come from zombie-jenkins's IP.
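For anyone who wants to repeat the check, the filtering was along these lines. The sample log lines below are hypothetical stand-ins for Gerrit's sshd_log (the exact format varies by version); the approach just extracts and tallies source IPv4 addresses rather than parsing any particular layout:

```shell
# Hypothetical sample of a Gerrit sshd_log; real entries differ by
# version, but each login line carries the client's source address.
cat > /tmp/sshd_log.sample <<'EOF'
[2016-04-01 10:00:00,000] abcd1234 build-user LOGIN FROM 192.0.2.10
[2016-06-09 08:00:00,000] efgh5678 build-user LOGIN FROM 127.0.0.1
EOF

# Pull out every IPv4 address and count occurrences. One unexpected
# address dominating recent entries is what gave zombie-jenkins away.
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' /tmp/sshd_log.sample \
  | sort | uniq -c | sort -rn
```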

Zombie-jenkins also explains the random build failures being seen on
the builders. Zombie-jenkins and new-jenkins each thought they had
the slaves to themselves and launched jobs on them, so jobs sometimes
clashed, which resulted in the random failures reported in
new-jenkins. I'm yet to log in to a slave and verify this, but I'm
pretty sure this is what happened.
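When I do get onto a slave, verifying this should be as simple as counting agent processes, since each master launches its own slave.jar. A sketch of that check, run here against a captured process listing (the listing itself is hypothetical):

```shell
# Hypothetical ps output from a slave being driven by two masters;
# on a live slave you'd use the real ps/ss output instead.
cat > /tmp/ps.sample <<'EOF'
jenkins  1201  java -jar slave.jar    (launched by new-jenkins)
jenkins  3377  java -jar slave.jar    (launched by zombie-jenkins)
EOF

# More than one agent process means more than one master is driving
# the slave -- jobs from the two masters can then clobber each other.
agents=$(grep -c 'slave.jar' /tmp/ps.sample)
echo "agent processes: $agents"
```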

For now, Michael has stopped the iWeb hypervisor and zombie-jenkins.
This should stop any more random comments in Gerrit and failures in Jenkins.

I'll get Michael (once he's back on Monday) to figure out why
zombie-jenkins restarted,
and write up a proper postmortem about the issues.

>>> ~kaushal
>>> _______________________________________________
>>> Gluster-infra mailing list
>>> Gluster-infra at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-infra
