[Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

Deepshikha Khandelwal dkhandel at redhat.com
Wed Jun 5 06:57:21 UTC 2019


I recently added 3 builders, builder208, builder209 and builder210, to the
regression pool. Networking on these new builders did not come up because,
on reboot, it was looking for a non-existent Ethernet interface, eth0, and
therefore failed. I'll reconnect them and post an update here once I fix
the issue today.
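A minimal pre-reboot check for this kind of failure, sketched in Python: it compares the interface names the network configuration is expected to use (here just eth0, taken from the report above) with the entries the kernel exposes under /sys/class/net. The function name missing_interfaces is hypothetical, and the sketch does not parse any network configuration files; the list of expected names has to be supplied by hand.

```python
import os


def missing_interfaces(expected, sys_class_net="/sys/class/net"):
    """Return the expected interface names that do not exist on this host.

    On Linux, /sys/class/net holds one entry per network interface, so a
    configured name that is absent there cannot be brought up.
    """
    try:
        present = set(os.listdir(sys_class_net))
    except FileNotFoundError:
        # No sysfs directory at the given path: treat every name as missing.
        present = set()
    return sorted(name for name in expected if name not in present)


if __name__ == "__main__":
    # Hypothetical usage: list the configured names that have no matching NIC.
    print(missing_interfaces(["eth0"]))
```

An empty list means every expected name exists; anything returned is an interface the network scripts would wait for in vain.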

Sorry for the inconvenience.


On Tue, Jun 4, 2019 at 7:07 PM Yaniv Kaul <ykaul at redhat.com> wrote:

> What was the result of this investigation? I suspect I'm seeing the same
> issue on builder209[1].
> Y.
>
> [1] https://build.gluster.org/job/centos7-regression/6302/consoleFull
>
> On Fri, Apr 5, 2019 at 5:40 PM Michael Scherer <mscherer at redhat.com>
> wrote:
>
>> On Friday 05 April 2019 at 16:55 +0530, Nithya Balachandran wrote:
>> > On Fri, 5 Apr 2019 at 12:16, Michael Scherer <mscherer at redhat.com>
>> > wrote:
>> >
>> > > On Thursday 04 April 2019 at 18:24 +0200, Michael Scherer wrote:
>> > > > On Thursday 04 April 2019 at 19:10 +0300, Yaniv Kaul wrote:
>> > > > > I'm not convinced this is solved. I just had what I believe is a
>> > > > > similar failure:
>> > > > >
>> > > > > *00:12:02.532* A dependency job for rpc-statd.service failed. See 'journalctl -xe' for details.
>> > > > > *00:12:02.532* mount.nfs: rpc.statd is not running but is required for remote locking.
>> > > > > *00:12:02.532* mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
>> > > > > *00:12:02.532* mount.nfs: an incorrect mount option was specified
>> > > > >
>> > > > > (of course, it can always be my patch!)
>> > > > >
>> > > > > https://build.gluster.org/job/centos7-regression/5384/console
>> > > >
>> > > > Same issue, different builder (206). I will check them all, as the
>> > > > issue is more widespread than I expected (or it has popped up
>> > > > since the last time I checked).
>> > >
>> > > Deepshikha noticed that the issue came back on one server
>> > > (builder202) after a reboot, so the rpcbind issue is not related to
>> > > the network initscript one, and the RCA continues.
>> > >
>> > > We are looking for another workaround involving fiddling with the
>> > > socket (until we find out why it uses IPv6 at boot but not
>> > > afterwards, when IPv6 is disabled).
>> > >
>> >
>> > Could this be relevant?
>> > https://access.redhat.com/solutions/2798411
>>
>> Good catch.
>>
>> So, we already do that; Nigel took care of it (after 2 days of
>> research). But I didn't know the exact symptoms, so I decided to double
>> check just in case.
>>
>> And... there is no sysctl.conf in the initrd. Running dracut -v -f does
>> not change anything.
>>
>> Running "dracut -v -f -H" takes care of that (and fixes the problem),
>> but:
>> - our ansible script already runs that
>> - -H is hostonly, which is already the default on EL7 according to the
>> documentation.
>>
>> However, if dracut-config-generic is installed, dracut doesn't build a
>> hostonly initrd, and so does not include the sysctl.conf file (which
>> breaks rpcbind, which breaks the test suite).
>>
>> And for some reason, it is installed in the EC2 image (likely by
>> default), but not by default on the builders.
>>
>> So what happens is that after a kernel upgrade, dracut rebuilds a
>> generic initrd instead of a hostonly one, which breaks things. And the
>> kernel was likely upgraded recently (upgrades happen nightly, for some
>> value of "night"), so we didn't see this earlier, nor on a fresh system.
>>
>>
>> So now, we have several solutions:
>> - be explicit about using hostonly in dracut, so this doesn't happen
>> again (or not for this reason)
>>
>> - disable IPv6 in rpcbind in a cleaner way (to be tested)
>>
>> - get the test suite to work with IPv6
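A minimal sketch of why the first option works, in Python, under assumptions not verified here: that dracut-config-generic ships a drop-in named 02-generic-image.conf setting hostonly="no", that drop-ins from /usr/lib/dracut/dracut.conf.d and /etc/dracut.conf.d are applied in lexical order of their file names, and that a same-named file in /etc/dracut.conf.d replaces the one in /usr/lib. Under those assumptions, a later drop-in such as the hypothetical /etc/dracut.conf.d/99-hostonly.conf containing hostonly="yes" wins over the generic one. The helpers effective_hostonly and load_conf_dirs are hypothetical, and the sketch ignores /etc/dracut.conf itself and command-line flags such as -H.

```python
import re
from pathlib import Path

# Matches simple assignments such as hostonly="yes" or hostonly=no,
# optionally followed by a trailing comment.
HOSTONLY_RE = re.compile(r'^\s*hostonly\s*=\s*"?(yes|no)"?\s*(?:#.*)?$')


def effective_hostonly(conf_files, default="yes"):
    """Return the hostonly value left after applying all drop-ins.

    conf_files maps a drop-in basename (e.g. "02-generic-image.conf") to its
    text. Files are applied in lexical order of their names (an assumption
    about dracut's ordering), so a later assignment overrides an earlier one.
    """
    value = default
    for name in sorted(conf_files):
        for line in conf_files[name].splitlines():
            match = HOSTONLY_RE.match(line)
            if match:
                value = match.group(1)
    return value


def load_conf_dirs(dirs):
    """Collect *.conf drop-ins from dirs, in the order given.

    A file in a later directory replaces a same-named file from an earlier
    one (assumed /etc-over-/usr/lib rule).
    """
    files = {}
    for d in dirs:
        path = Path(d)
        if path.is_dir():
            for p in path.glob("*.conf"):
                files[p.name] = p.read_text()
    return files


if __name__ == "__main__":
    dirs = ["/usr/lib/dracut/dracut.conf.d", "/etc/dracut.conf.d"]
    print("hostonly =", effective_hostonly(load_conf_dirs(dirs)))
```

Once an override is in place this should report hostonly = yes; listing the rebuilt image with lsinitrd remains the real check that sysctl.conf made it into the initrd.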
>>
>> In the long term, I also want to monitor the processes, but for that I
>> need a VPN between the Nagios server and EC2, and that project got
>> blocked by several issues (like EC2 not supporting ECDSA keys, which we
>> use for ansible, so we have to go back to RSA for fully automated
>> deployment; and OpenVPN requires certificates, so I need a newer Python
>> OpenSSL library to do what I want, and RHEL 7 is too old, etc., etc.).
>>
>> As the weekend approaches for me, I just rebuilt the initrd for the
>> time being. I guess forcing hostonly is the safest fix for now, but
>> that will be for Monday.
>> --
>> Michael Scherer
>> Sysadmin, Community Infrastructure and Platform, OSAS
>>
>>
>> _______________________________________________
>
> Community Meeting Calendar:
>
> APAC Schedule -
> Every 2nd and 4th Tuesday at 11:30 AM IST
> Bridge: https://bluejeans.com/836554017
>
> NA/EMEA Schedule -
> Every 1st and 3rd Tuesday at 01:00 PM EDT
> Bridge: https://bluejeans.com/486278655
>
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>

