[Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often?

Tue May 7 14:34:52 UTC 2019

Looks like is_nfs_export_available started failing again in recent
centos-regressions.

Michael, can you please check?

On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul <ykaul at redhat.com> wrote:

>
>
> On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer <mscherer at redhat.com>
> wrote:
>
>> On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
>> > Is this back again? The recent patches are failing regression :-\ .
>>
>> So, on builder206, it took me a while to find that the issue is that
>> nfs (the service) was running.
>>
>> ./tests/basic/afr/tarissue.t failed, because the nfs initialisation
>> failed with a rather cryptic message:
>>
>> [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-socket.nfs-server: process started listening on port (38465)
>> [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-socket.nfs-server: binding to  failed: Address already in use
>> [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-socket.nfs-server: Port is already in use
>> [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-socket.nfs-server: __socket_server_bind failed;closing socket 14
>>
>> I found where this came from, but a few things surprised me:
>>
>> - the order of the printed messages is different from the order in the code
>>
>
> Indeed strange...
>
>> - the "started listening" message didn't take into account the fact
>> that bind failed at:
>>
>
> Shouldn't it bail out if it failed to bind?
> Some missing 'goto out' around line 975/976?
> Y.
>
>>
>>
>>
>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
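
On the bail-out question above: here is a minimal Python sketch (not the
glusterfs C code) of the pattern being discussed, where the "started
listening" message is only logged once bind has actually succeeded. The
helper name try_listen and the log wording are made up for illustration.

```python
import errno
import socket


def try_listen(host, port, log):
    """Bind and listen on (host, port); return the socket, or None on failure.

    Hypothetical helper for illustration only; it is not the glusterfs code.
    Messages are appended to the list passed in as `log`.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        # Bail out here so that no success message is logged after a failure.
        if exc.errno == errno.EADDRINUSE:
            log.append("E: port %d is already in use" % port)
        else:
            log.append("E: bind failed: %s" % exc)
        sock.close()
        return None
    sock.listen(8)
    # Only reached when bind and listen both succeeded.
    log.append("I: started listening on port %d" % sock.getsockname()[1])
    return sock
```

Calling try_listen("127.0.0.1", 0, messages) lets the kernel pick a free
port; a second call with that same port number should return None and log
only the error, never the "started listening" line.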
>>
>> The message about port 38465 also threw me off the track. The real
>> issue is that the service nfs was already running, and I couldn't find
>> anything listening on port 38465.
>>
>> Once I ran "service nfs stop", the test no longer failed.
>>
>> So far, I do not know why nfs.service was activated.
>>
>> But at least 206 should be fixed, and we know a bit more about what
>> could be causing some of the failures.
>>
>>
>>
>> > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <mscherer at redhat.com>
>> > wrote:
>> >
>> > > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
>> > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <jthottan at redhat.com> wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > is_nfs_export_available is just a wrapper around the "showmount"
>> > > > > command, AFAIR.
>> > > > > I saw the following messages in the console output:
>> > > > >  mount.nfs: rpc.statd is not running but is required for remote locking.
>> > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
>> > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
>> > > > >
>> > > > > To me it looks like rpcbind may not be running on the machine.
>> > > > > Usually rpcbind starts automatically, so I don't know whether
>> > > > > that can happen or not.
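
A quick way to test the "rpcbind may not be running" theory on a builder is
to check whether anything accepts TCP connections on the rpcbind port (111).
This is only a rough sketch: the helper name port_is_open is made up, and a
successful connect shows that something is listening, not that rpcbind is
answering RPC calls correctly.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration; refused connections, timeouts
    and unreachable hosts are all reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_is_open("127.0.0.1", 111) on a builder would indicate
whether a listener is present on the rpcbind port before the NFS tests run.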
>> > > > >
>> > > >
>> > > > That's precisely the question: why are we suddenly seeing this
>> > > > happen so frequently? Today I have already seen at least 4 to 5
>> > > > such failures.
>> > > >
>> > > > Deepshika - Can you please help in inspecting this?
>> > >
>> > > So we think (we are not sure) that the issue is a bit complex.
>> > >
>> > > What we were investigating was the nightly run failing on AWS. When
>> > > the build crashes, the builder is restarted, since that's the easiest
>> > > way to clean everything (even with a perfect test suite that cleans
>> > > up after itself, we could always end up with the system in a corrupt
>> > > state with regard to mounts, filesystems, etc.).
>> > >
>> > > In turn, this seems to cause trouble on AWS, since cloud-init or
>> > > something else renames the eth0 interface to ens5 without cleaning
>> > > up the network configuration.
>> > >
>> > > So the network init script fails (because the image says "start
>> > > eth0" and that interface is not present), but it fails in a weird
>> > > way. The network is initialised and working (we can connect), but the
>> > > dhclient process is not in the right cgroup, and network.service is
>> > > in a failed state. Restarting the network didn't work. In turn, this
>> > > means that rpc-statd refuses to start (due to systemd dependencies),
>> > > which seems to impact various NFS tests.
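
The eth0/ens5 mismatch described above could be spotted with a check along
these lines. It is a rough sketch that assumes CentOS-style ifcfg-<name>
files; the helper name stale_interface_configs is made up, and the two
directories are passed in as parameters rather than hard-coded.

```python
import os


def stale_interface_configs(config_dir, sys_net_dir):
    """Return interface names that have an ifcfg-<name> file in config_dir
    but no matching entry in sys_net_dir, sorted alphabetically.

    Hypothetical helper for illustration only.
    """
    present = set(os.listdir(sys_net_dir))
    stale = []
    for entry in os.listdir(config_dir):
        # Skip everything that is not an interface config file.
        if not entry.startswith("ifcfg-"):
            continue
        name = entry[len("ifcfg-"):]
        if name not in present:
            stale.append(name)
    return sorted(stale)
```

On a builder this would typically be called as
stale_interface_configs("/etc/sysconfig/network-scripts", "/sys/class/net").
Note that backup files such as ifcfg-eth0.bak would also be reported, so
the result needs a human look before anything is deleted.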
>> > >
>> > > We have also seen that on some builders, rpcbind picks up some IPv6
>> > > autoconfiguration, but we can't reproduce that, and there is no IPv6
>> > > set up anywhere. I suspect the network.service failure is somehow
>> > > involved, but I fail to see how. In turn, rpcbind.socket not starting
>> > > could cause trouble for the NFS tests.
>> > >
>> > > Our current stopgap was to fix all the builders one by one: remove
>> > > the config, kill the rogue dhclient, restart the network service.
>> > >
>> > > However, we can't be sure this is going to fix the problem long
>> > > term, since this only manifests after a crash of the test suite, and
>> > > that doesn't happen so often. (Plus, it was working at some point in
>> > > the past, until something made it fail, and I do not know whether
>> > > that was a system upgrade, a test change, or both.)
>> > >
>> > > So we are still looking at it to get a complete understanding of
>> > > the issue, but so far we have hacked our way to making it work (or
>> > > so I think).
>> > >
>> > > Deepshika is working on a long-term fix by addressing the
>> > > eth0/ens5 issue with a new base image.
>> > > --
>> > > Michael Scherer
>> > > Sysadmin, Community Infrastructure and Platform, OSAS
>> > >
>> > >
>> > > --
>> >
>> > - Atin (atinm)
>> --
>> Michael Scherer
>> Sysadmin, Community Infrastructure
>>
>>
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>

-- 
Thanks,
Sanju