[Gluster-devel] [Gluster-infra] rebal-all-nodes-migrate.t always fails now
amukherj at redhat.com
Thu Apr 4 15:55:39 UTC 2019
Thanks misc. I have always seen a pattern where, on a reattempt (recheck
centos), the same builder is picked up many times, even though builders
are supposed to be picked in a round-robin manner.
On Thu, Apr 4, 2019 at 7:24 PM Michael Scherer <mscherer at redhat.com> wrote:
> Le jeudi 04 avril 2019 à 15:19 +0200, Michael Scherer a écrit :
> > Le jeudi 04 avril 2019 à 13:53 +0200, Michael Scherer a écrit :
> > > Le jeudi 04 avril 2019 à 16:13 +0530, Atin Mukherjee a écrit :
> > > > Based on what I have seen, any multi-node test case will fail,
> > > > and the above one is picked first from that group. If I am
> > > > correct, none of the code fixes will go through regression until
> > > > this is fixed. I suspect it to be an infra issue again. If we
> > > > look at https://review.gluster.org/#/c/glusterfs/+/22501/ &
> > > > https://build.gluster.org/job/centos7-regression/5382/, peer
> > > > handshaking is stuck because 127.1.1.1 is unable to receive a
> > > > response back. Did we end up having firewall and other n/w
> > > > settings screwed up? The test never fails locally.
> > >
> > > The firewall didn't change, and has had the line
> > > "-A INPUT -i lo -j ACCEPT" since the start, so all traffic on the
> > > loopback interface is accepted. (I am not even sure that netfilter
> > > does anything meaningful on the loopback interface, but maybe I am
> > > wrong, and I am not keen on digging through kernel code to check.)
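[Editor's sketch: the kind of check described above. The rule text is taken
from the message; the sample ruleset below is a stand-in for what
`iptables -S INPUT` would print live on the builder (an assumption, not an
actual capture from builder203).]

```shell
# Hypothetical captured output of `iptables -S INPUT` on a builder; the
# loopback rule is quoted from the message, the ssh rule is illustrative.
rules='-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp --dport 22 -j ACCEPT'

# If the loopback accept rule is present, 127.x.x.x traffic between local
# daemons is not filtered, so a firewall cause can be ruled out.
if printf '%s\n' "$rules" | grep -q -- '-i lo -j ACCEPT'; then
  echo "loopback accept rule present"
else
  echo "loopback accept rule missing"
fi
```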
> > >
> > >
> > > Ping seems to work fine as well, so we can exclude a routing issue.
> > >
> > > Maybe we should look at the socket: does it listen on a specific
> > > address or not?
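[Editor's sketch: one way to answer that question is `ss -tln` on the
builder. Both the port (24007, glusterd's usual management port) and the
sample line below are assumptions for illustration, not captures from the
failing run.]

```shell
# Hypothetical line from `ss -tln` output; live, one would run something
# like `ss -tln | grep :24007` on the builder instead.
sample='LISTEN 0 128 *:24007 *:*'

# Field 4 of `ss -tln` output is the local address:port the socket is
# bound to.
addr=$(printf '%s\n' "$sample" | awk '{print $4}')

case "$addr" in
  '*:'*|'0.0.0.0:'*|'[::]:'*) echo "wildcard bind: $addr" ;;
  *)                          echo "specific bind: $addr" ;;
esac
```

A wildcard bind would rule out "listening on the wrong address" as the
cause; a specific bind would point at the daemon's own configuration.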
> > So, I did look at the first 20 failures, removed all not related to
> > rebal-all-nodes-migrate.t, and saw that all were run on builder203,
> > which was freshly reinstalled. As Deepshika noticed today, this one
> > had an issue with IPv6, the 2nd issue we were tracking.
> > Summary: the rpcbind.socket systemd unit listens on IPv6 despite
> > IPv6 being disabled, and the fix is to reload systemd. We have so
> > far no idea why it happens, but suspect it might be related to the
> > network issue we did identify, as it happens only after a reboot,
> > which happens only if a build is cancelled/crashed/aborted.
> > I applied the workaround on builder203, so if the culprit is that
> > specific issue, I guess that's fixed.
> > I started a test to see how it goes:
> > https://build.gluster.org/job/centos7-regression/5383/
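[Editor's sketch of the diagnosis and workaround described above. The `ss`
sample line is hypothetical; the message only says "reload systemd", so the
two `systemctl` commands at the end are an assumed spelling of that fix and
are left commented out since they need root on the builder.]

```shell
# Hypothetical captured line from `ss -lnt` on an affected builder:
# the portmapper port (111) bound on IPv6 even though IPv6 is disabled.
sample='LISTEN 0 128 [::]:111 [::]:*'

if printf '%s\n' "$sample" | grep -q '\[::\]:111'; then
  echo "rpcbind.socket is still bound on IPv6"
  # Workaround per the message above (assumed commands, run as root):
  #   systemctl daemon-reload
  #   systemctl restart rpcbind.socket
fi
```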
> The test did just pass, so I would assume the problem was local to
> builder203. Not sure why it was always selected, except that it was
> the only one failing, so it was always free to pick up new jobs.
> Maybe we should increase the number of builders so this doesn't
> happen, as I guess the other builders were busy at that time?
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS