[Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

Atin Mukherjee amukherj at redhat.com
Mon Apr 22 17:27:57 UTC 2019


Is this back again? The recent patches are failing regression :-\ .

On Wed, 3 Apr 2019 at 19:26, Michael Scherer <mscherer at redhat.com> wrote:

> On Wednesday 03 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <jthottan at redhat.com>
> > wrote:
> >
> > > Hi,
> > >
> > > is_nfs_export_available is just a wrapper around the "showmount"
> > > command, AFAIR.
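> > >
> > > Roughly something like this (a sketch from memory; the exact code in
> > > tests/nfs.rc may differ a bit):
> > >
> > >     function is_nfs_export_available ()
> > >     {
> > >         # Default to the test volume if no name is passed.
> > >         local vol=${1:-$V0}
> > >         # showmount -e lists what the NFS server currently exports;
> > >         # count the entries matching the volume, so 0 means
> > >         # "not exported yet".
> > >         showmount -e 127.0.0.1 | grep "$vol" | wc -l
> > >     }
> > >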
> > > I saw the following messages in the console output:
> > >  mount.nfs: rpc.statd is not running but is required for remote locking.
> > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > >
> > > To me it looks like rpcbind may not be running on the machine.
> > > Usually rpcbind starts automatically on these machines; I don't know
> > > whether it can fail to come up or not.
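> > >
> > > If so, that should be visible directly on the builder, e.g.
> > > (assuming a systemd-based machine):
> > >
> > >     # Are rpcbind and statd running at all?
> > >     systemctl status rpcbind.socket rpcbind.service rpc-statd
> > >     # Is anything registered with the portmapper?
> > >     rpcinfo -p localhost
> > >     # Does the NFS server actually advertise its exports?
> > >     showmount -e localhost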
> > >
> >
> > That's precisely the question: why are we suddenly seeing this
> > happen so frequently? Today I saw at least 4 to 5 such failures
> > already.
> >
> > Deepshika - Can you please help in inspecting this?
>
> So we think (we are not sure) that the issue is a bit complex.
>
> What we were investigating was a nightly run failure on AWS. When the
> build crashes, the builder is restarted, since that's the easiest way
> to clean everything up (even with a perfect test suite that cleaned up
> after itself, we could always end up in a corrupt state on the system,
> WRT mounts, filesystems, etc.).
>
> In turn, this seems to cause trouble on AWS, since cloud-init or
> something renames the eth0 interface to ens5 without cleaning up the
> network configuration.
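>
> The mismatch is easy to spot on an affected builder, with something
> like this (assuming the usual EL-style ifcfg network scripts; paths
> may differ):
>
>     # The kernel now only knows about ens5...
>     ip -o link show
>     # ...but the image still carries a configuration for eth0.
>     ls /etc/sysconfig/network-scripts/ifcfg-*
>     grep DEVICE /etc/sysconfig/network-scripts/ifcfg-eth0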
>
> So the network init script fails (because the image says "start eth0"
> and that interface is not present), but it fails in a weird way. The
> network is initialised and working (we can connect), but the dhclient
> process is not in the right cgroup, and network.service is in a failed
> state. Restarting the network didn't work. In turn, this means that
> rpc-statd refuses to start (due to systemd dependencies), which seems
> to impact various NFS tests.
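>
> The state is visible with standard systemd tooling, roughly (a sketch,
> not the exact commands we ran):
>
>     # network.service failed, so units ordered after it stay down.
>     systemctl status network.service rpc-statd
>     systemctl list-dependencies --reverse network.service
>     # The dhclient that keeps the network alive sits outside the
>     # unit's cgroup, which is why systemd considers the service dead.
>     pidof dhclient
>     cat /proc/$(pidof dhclient)/cgroup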
>
> We have also seen that on some builders rpcbind picks up some IPv6
> autoconfiguration, but we can't reproduce that, and there is no IPv6
> set up anywhere. I suspect the network.service failure is somehow
> involved, but I fail to see how. In turn, rpcbind.socket not starting
> could cause NFS test troubles.
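>
> For that part, what we can check is what rpcbind.socket is told to
> listen on versus what is actually bound (again just a sketch):
>
>     # What the socket unit is configured to listen on...
>     systemctl cat rpcbind.socket
>     # ...versus what is actually bound on port 111.
>     ss -tulnp | grep ':111 '
>     journalctl -u rpcbind.socket -u rpcbind.service --since today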
>
> Our current stop-gap fix was to repair all the builders one by one:
> remove the stale config, kill the rogue dhclient, and restart the
> network service.
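>
> Concretely, that per-builder cleanup amounts to roughly this (from
> memory; the exact file and interface names are per builder):
>
>     # Drop the stale eth0 config that no longer matches the interface.
>     rm -f /etc/sysconfig/network-scripts/ifcfg-eth0
>     # Kill the dhclient that escaped systemd's control.
>     pkill dhclient
>     # Bring networking and the NFS helpers back under systemd.
>     systemctl restart network
>     systemctl restart rpcbind.socket rpcbind.service rpc-statd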
>
> However, we can't be sure this is going to fix the problem long term,
> since it only manifests after a crash of the test suite, and that
> doesn't happen so often. (Plus, it was working until some day in the
> past, when something made this start failing, and I do not know
> whether that was a system upgrade, a test change, or both.)
>
> So we are still looking at it to get a complete understanding of the
> issue, but so far we have hacked our way into making it work (or so I
> think).
>
> Deepshika is working on the long-term fix, addressing the eth0/ens5
> issue with a new base image.
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
--
- Atin (atinm)

