[Gluster-devel] [Gluster-infra] rebal-all-nodes-migrate.t always fails now

Fri Apr 5 14:40:08 UTC 2019

On Friday, 5 April 2019 at 16:55 +0530, Nithya Balachandran wrote:
> On Fri, 5 Apr 2019 at 12:16, Michael Scherer <mscherer at redhat.com>
> wrote:
> 
> > On Thursday, 4 April 2019 at 18:24 +0200, Michael Scherer wrote:
> > > On Thursday, 4 April 2019 at 19:10 +0300, Yaniv Kaul wrote:
> > > > I'm not convinced this is solved. Just had what I believe is a
> > > > similar failure:
> > > > 
> > > > 00:12:02.532 A dependency job for rpc-statd.service failed. See 'journalctl -xe' for details.
> > > > 00:12:02.532 mount.nfs: rpc.statd is not running but is required for remote locking.
> > > > 00:12:02.532 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> > > > 00:12:02.532 mount.nfs: an incorrect mount option was specified
> > > > 
> > > > (of course, it can always be my patch!)
> > > > 
> > > > https://build.gluster.org/job/centos7-regression/5384/console
> > > 
> > > Same issue, different builder (206). I will check them all, as
> > > the issue is more widespread than I expected (or it popped up
> > > since the last time I checked).
> > 
> > Deepshika noticed that the issue came back on one server
> > (builder202) after a reboot, so the rpcbind issue is not related to
> > the network initscript one, and the RCA continues.
> > 
> > We are looking for another workaround involving fiddling with the
> > socket (until we find out why it uses IPv6 at boot, but not
> > afterwards, when IPv6 is disabled).
> > 
> 
> Could this be relevant?
> https://access.redhat.com/solutions/2798411

Good catch.

So, we already do that; Nigel took care of it (after 2 days of
research). But I didn't know the exact symptoms, and decided to
double-check just in case.

And... there is no sysctl.conf in the initrd. Running "dracut -v -f"
does not change anything.

Running "dracut -v -f -H" takes care of that (and this fixes the
problem), but:
- our ansible script already runs that
- -H means hostonly, which is already the default on EL7 according to
  the doc.

However, if dracut-config-generic is installed, dracut doesn't build a
hostonly initrd, and so does not include the sysctl.conf file (which
breaks rpcbind, which in turn breaks the test suite).
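
To confirm whether a given initrd actually carries the file, one can
inspect the listing printed by lsinitrd. Below is a minimal Python
sketch, not something from our ansible scripts: the helper names
(initrd_has_file, list_initrd) are hypothetical, and it assumes the
listing is in the usual "ls -l" style, with the path as the last field.

```python
import subprocess


def initrd_has_file(listing, path):
    """Return True if `path` appears as an entry in an lsinitrd listing.

    Assumption: each entry is an `ls -l` style line whose last field is
    the path, with symlinks rendered as `name -> target`.
    """
    wanted = path.lstrip("/")
    for line in listing.splitlines():
        # Drop the symlink target, if any, before taking the last field.
        entry = line.split(" -> ", 1)[0]
        fields = entry.split()
        if fields and fields[-1].lstrip("/") == wanted:
            return True
    return False


def list_initrd(image=None):
    """Run lsinitrd and return its output.

    Hypothetical wrapper: with no image given, lsinitrd inspects the
    initramfs of the running kernel.
    """
    cmd = ["lsinitrd"] + ([image] if image else [])
    return subprocess.run(
        cmd, check=True, capture_output=True, text=True
    ).stdout
```

On a builder, initrd_has_file(list_initrd(), "etc/sysctl.conf") would
then tell whether the rebuilt image includes the sysctl settings.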

And for some reason, it is installed in the EC2 image (likely by
default), but not by default on the builders.

So what happens is that after a kernel upgrade, dracut rebuilds a
generic initrd instead of a hostonly one, which breaks things. The
kernel was likely upgraded recently (upgrades happen nightly, for some
value of "night"), so we did not see this earlier, nor on a fresh
system.

So now, we have several solutions:
- be explicit about using hostonly in dracut, so this doesn't happen
  again (or not for this reason)

- disable IPv6 in rpcbind in a cleaner way (to be tested)

- get the test suite to work with IPv6
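
For the first option, it helps to see which hostonly value ends up in
effect once all the dracut configuration snippets are read. The
following is only a rough Python sketch under stated assumptions: the
function name is hypothetical, the caller passes the snippet contents
in the order dracut reads them, and the parsing is much simpler than
what dracut really does (it sources these files as shell). It also
assumes, as I understand it, that dracut-config-generic ships a snippet
setting hostonly="no".

```python
import re

# Matches a plain `hostonly=...` assignment, quoted or not.
_ASSIGN = re.compile(r'hostonly=["\']?([A-Za-z]+)["\']?')


def last_hostonly_setting(conf_texts):
    """Return the last hostonly value found, or None if never set.

    `conf_texts` is a list of configuration file contents, already in
    the order they are applied (assumption: later assignments win).
    """
    result = None
    for text in conf_texts:
        for line in text.splitlines():
            # Commented-out lines start with '#' and will not match.
            match = _ASSIGN.match(line.strip())
            if match:
                result = match.group(1).lower()
    return result
```

A drop-in that is read after the generic one and contains
hostonly="yes" would make this return "yes", which is the idea behind
being explicit in the dracut configuration.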

In the long term, I also want to monitor the processes, but for that, I
need a VPN between the nagios server and EC2, and that project got
blocked by several issues (like EC2 not supporting ECDSA keys, which we
use for ansible, so we have to go back to RSA for fully automated
deployment; and openvpn requires certificates, so I need a newer
python openssl to do what I want, and RHEL 7 is too old, etc, etc).

As the weekend is approaching for me, I just rebuilt the initrd for the
time being. I guess forcing hostonly is the safest fix for now, but
this will be for Monday.
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
