[Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often?

Tue May 7 16:11:29 UTC 2019

Le mardi 07 mai 2019 à 20:04 +0530, Sanju Rakonde a écrit :
> Looks like is_nfs_export_available started failing again in recent
> centos-regressions.
> 
> Michael, can you please check?

I will try but I am leaving for vacation tonight, so if I find nothing,
until I leave, I guess Deepshika will have to look.

> On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul <ykaul at redhat.com> wrote:
> 
> > 
> > 
> > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer <
> > mscherer at redhat.com>
> > wrote:
> > 
> > > Le lundi 22 avril 2019 à 22:57 +0530, Atin Mukherjee a écrit :
> > > > Is this back again? The recent patches are failing regression
> > > > :-\ .
> > > 
> > > So, on builder206, it took me a while to find that the issue is
> > > that
> > > nfs (the service) was running.
> > > 
> > > ./tests/basic/afr/tarissue.t failed, because the nfs
> > > initialisation
> > > failed with a rather cryptic message:
> > > 
> > > [2019-04-23 13:17:05.371733] I
> > > [socket.c:991:__socket_server_bind] 0-
> > > socket.nfs-server: process started listening on port (38465)
> > > [2019-04-23 13:17:05.385819] E
> > > [socket.c:972:__socket_server_bind] 0-
> > > socket.nfs-server: binding to  failed: Address already in use
> > > [2019-04-23 13:17:05.385843] E
> > > [socket.c:974:__socket_server_bind] 0-
> > > socket.nfs-server: Port is already in use
> > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
> > > socket.nfs-server: __socket_server_bind failed;closing socket 14
> > > 
> > > I found where this came from, but a few stuff did surprised me:
> > > 
> > > - the order of print is different that the order in the code
> > > 
> > 
> > Indeed strange...
> > 
> > > - the message on "started listening" didn't take in account the
> > > fact
> > > that bind failed on:
> > > 
> > 
> > Shouldn't it bail out if it failed to bind?
> > Some missing 'goto out' around line 975/976?
> > Y.
> > 
> > > 
> > > 
> > > 
> > > 
https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
> > > 
> > > The message about port 38465 also threw me off the track. The
> > > real
> > > issue is that the service nfs was already running, and I couldn't
> > > find
> > > anything listening on port 38465
> > > 
> > > once I do service nfs stop, it no longer failed.
> > > 
> > > So far, I do know why nfs.service was activated.
> > > 
> > > But at least, 206 should be fixed, and we know a bit more on what
> > > would
> > > be causing some failure.
> > > 
> > > 
> > > 
> > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <
> > > > mscherer at redhat.com>
> > > > wrote:
> > > > 
> > > > > Le mercredi 03 avril 2019 à 16:30 +0530, Atin Mukherjee a
> > > > > écrit :
> > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <
> > > > > > jthottan at redhat.com>
> > > > > > wrote:
> > > > > > 
> > > > > > > Hi,
> > > > > > > 
> > > > > > > is_nfs_export_available is just a wrapper around
> > > > > > > "showmount"
> > > > > > > command AFAIR.
> > > > > > > I saw following messages in console output.
> > > > > > >  mount.nfs: rpc.statd is not running but is required for
> > > > > > > remote
> > > > > > > locking.
> > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks
> > > > > > > local,
> > > > > > > or
> > > > > > > start
> > > > > > > statd.
> > > > > > > 05:06:55 mount.nfs: an incorrect mount option was
> > > > > > > specified
> > > > > > > 
> > > > > > > For me it looks rpcbind may not be running on the
> > > > > > > machine.
> > > > > > > Usually rpcbind starts automatically on machines, don't
> > > > > > > know
> > > > > > > whether it
> > > > > > > can happen or not.
> > > > > > > 
> > > > > > 
> > > > > > That's precisely what the question is. Why suddenly we're
> > > > > > seeing
> > > > > > this
> > > > > > happening too frequently. Today I saw atleast 4 to 5 such
> > > > > > failures
> > > > > > already.
> > > > > > 
> > > > > > Deepshika - Can you please help in inspecting this?
> > > > > 
> > > > > So we think (we are not sure) that the issue is a bit
> > > > > complex.
> > > > > 
> > > > > What we were investigating was nightly run fail on aws. When
> > > > > the
> > > > > build
> > > > > crash, the builder is restarted, since that's the easiest way
> > > > > to
> > > > > clean
> > > > > everything (since even with a perfect test suite that would
> > > > > clean
> > > > > itself, we could always end in a corrupt state on the system,
> > > > > WRT
> > > > > mount, fs, etc).
> > > > > 
> > > > > In turn, this seems to cause trouble on aws, since cloud-init 
> > > > > or
> > > > > something rename eth0 interface to ens5, without cleaning to
> > > > > the
> > > > > network configuration.
> > > > > 
> > > > > So the network init script fail (because the image say "start
> > > > > eth0"
> > > > > and
> > > > > that's not present), but fail in a weird way. Network is
> > > > > initialised
> > > > > and working (we can connect), but the dhclient process is not
> > > > > in
> > > > > the
> > > > > right cgroup, and network.service is in failed state.
> > > > > Restarting
> > > > > network didn't work. In turn, this mean that rpc-statd refuse
> > > > > to
> > > > > start
> > > > > (due to systemd dependencies), which seems to impact various
> > > > > NFS
> > > > > tests.
> > > > > 
> > > > > We have also seen that on some builders, rpcbind pick some IP
> > > > > v6
> > > > > autoconfiguration, but we can't reproduce that, and there is
> > > > > no ip
> > > > > v6
> > > > > set up anywhere. I suspect the network.service failure is
> > > > > somehow
> > > > > involved, but fail to see how. In turn, rpcbind.socket not
> > > > > starting
> > > > > could cause NFS test troubles.
> > > > > 
> > > > > Our current stop gap fix was to fix all the builders one by
> > > > > one.
> > > > > Remove
> > > > > the config, kill the rogue dhclient, restart network service.
> > > > > 
> > > > > However, we can't be sure this is going to fix the problem
> > > > > long
> > > > > term
> > > > > since this only manifest after a crash of the test suite, and
> > > > > it
> > > > > doesn't happen so often. (plus, it was working before some
> > > > > day in
> > > > > the
> > > > > past, when something did make this fail, and I do not know if
> > > > > that's a
> > > > > system upgrade, or a test change, or both).
> > > > > 
> > > > > So we are still looking at it to have a complete
> > > > > understanding of
> > > > > the
> > > > > issue, but so far, we hacked our way to make it work (or so
> > > > > do I
> > > > > think).
> > > > > 
> > > > > Deepshika is working to fix it long term, by fixing the issue
> > > > > regarding
> > > > > eth0/ens5 with a new base image.
> > > > > --
> > > > > Michael Scherer
> > > > > Sysadmin, Community Infrastructure and Platform, OSAS
> > > > > 
> > > > > 
> > > > > --
> > > > 
> > > > - Atin (atinm)
> > > 
> > > --
> > > Michael Scherer
> > > Sysadmin, Community Infrastructure
> > > 
> > > 
> > > 
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel at gluster.org
> > > https://lists.gluster.org/mailman/listinfo/gluster-devel
> > 
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-devel
> 
> 
> 
-- 
Michael Scherer
Sysadmin, Community Infrastructure

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20190507/08b070db/attachment-0001.sig>