<div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer <<a href="mailto:mscherer@redhat.com">mscherer@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:<br>
> Is this back again? The recent patches are failing regression :-\ .<br>
<br>
So, on builder206, it took me a while to find that the issue is that<br>
nfs (the service) was running.<br>
<br>
./tests/basic/afr/tarissue.t failed, because the nfs initialisation<br>
failed with a rather cryptic message:<br>
<br>
[2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-<br>
socket.nfs-server: process started listening on port (38465)<br>
[2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-<br>
socket.nfs-server: binding to failed: Address already in use<br>
[2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-<br>
socket.nfs-server: Port is already in use<br>
[2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-<br>
socket.nfs-server: __socket_server_bind failed;closing socket 14<br>
<br>
I found where this came from, but a few things surprised me:<br>
<br>
- the order of the printed messages is different from the order in the code<br></blockquote><div><br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Indeed strange...</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
- the message about "started listening" didn't take into account the fact<br>
that the bind failed:<br></blockquote><div><br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Shouldn't it bail out if it failed to bind?</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Some missing 'goto out' around line 975/976?<br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Y.</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
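</blockquote><div><br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">To illustrate what I mean, here is a small standalone sketch (my own guess at the pattern, not the actual socket.c code): the error branch logs the bind failure, but without a 'goto out' the function keeps going and still prints the "started listening" message.</div><pre>
/*
 * Standalone sketch of the suspected flow, NOT the actual glusterfs
 * socket.c code: the bind error is logged, but without a short-circuit
 * the function falls through and still reports "started listening".
 * Compile with: cc -Wall sketch.c
 */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int
server_bind_sketch(int sock, int port)
{
    struct sockaddr_in addr;
    int ret;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    ret = bind(sock, (struct sockaddr *)&addr, sizeof(addr));
    if (ret == -1) {
        int err = errno;
        fprintf(stderr, "E: binding to failed: %s\n", strerror(err));
        if (err == EADDRINUSE)
            fprintf(stderr, "E: Port is already in use\n");
        /*
         * Suspected missing short-circuit here, e.g. "goto out;" or an
         * early return. Without it we fall through to the success
         * message below even though the bind failed.
         */
    }

    fprintf(stderr, "I: process started listening on port (%d)\n", port);
    return listen(sock, 10);
}

int
main(void)
{
    int s1 = socket(AF_INET, SOCK_STREAM, 0);
    int s2 = socket(AF_INET, SOCK_STREAM, 0);

    server_bind_sketch(s1, 38465);      /* first bind succeeds */
    server_bind_sketch(s2, 38465);      /* bind fails, yet still "listening" */

    close(s1);
    close(s2);
    return 0;
}
</pre><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">If that is what is going on, an early 'goto out' (or return) in the error branch would both abort the setup and keep the misleading "started listening" line out of the log when the bind fails.</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">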
<br>
<br>
<a href="https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967" rel="noreferrer" target="_blank">https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967</a><br>
<br>
The message about port 38465 also threw me off track. The real<br>
issue was that the nfs service was already running, and I couldn't find<br>
anything listening on port 38465.<br>
<br>
Once I ran "service nfs stop", it no longer failed.<br>
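</blockquote><div><br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">One general detail that might explain part of the confusion (I don't know if it is what happened on builder206): bind() fails with "Address already in use" as soon as any socket is bound to that address/port, even if nothing is actually listening on it, so tools that only show listening sockets won't reveal the culprit. A tiny demo, unrelated to the gluster code:</div><pre>
/*
 * Demo: a socket that is bound but never listens does not show up as a
 * listener (ss/netstat -l), yet it still makes a second bind() on the
 * same port fail with "Address already in use".
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(38465);

    /* First socket: bound to the port, but listen() is never called. */
    int holder = socket(AF_INET, SOCK_STREAM, 0);
    if (bind(holder, (struct sockaddr *)&addr, sizeof(addr)) == -1)
        perror("first bind");

    /* Second socket: bind fails even though nothing is listening. */
    int second = socket(AF_INET, SOCK_STREAM, 0);
    if (bind(second, (struct sockaddr *)&addr, sizeof(addr)) == -1)
        perror("second bind");          /* Address already in use */

    close(second);
    close(holder);
    return 0;
}
</pre><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Sockets lingering in TIME_WAIT (without SO_REUSEADDR) can have the same effect.</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">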
<br>
So far, I do not know why nfs.service was activated.<br>
<br>
But at least builder206 should be fixed, and we know a bit more about<br>
what could be causing some of the failures.<br>
<br>
<br>
<br>
> On Wed, 3 Apr 2019 at 19:26, Michael Scherer <<a href="mailto:mscherer@redhat.com" target="_blank">mscherer@redhat.com</a>><br>
> wrote:<br>
> <br>
> > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:<br>
> > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <<br>
> > > <a href="mailto:jthottan@redhat.com" target="_blank">jthottan@redhat.com</a>><br>
> > > wrote:<br>
> > > <br>
> > > > Hi,<br>
> > > > <br>
> > > > is_nfs_export_available is just a wrapper around the "showmount"<br>
> > > > command AFAIR.<br>
> > > > I saw the following messages in the console output:<br>
> > > > mount.nfs: rpc.statd is not running but is required for remote<br>
> > > > locking.<br>
> > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local,<br>
> > > > or<br>
> > > > start<br>
> > > > statd.<br>
> > > > 05:06:55 mount.nfs: an incorrect mount option was specified<br>
> > > > <br>
> > > > To me it looks like rpcbind may not be running on the machine.<br>
> > > > Usually rpcbind starts automatically on machines; I don't know<br>
> > > > whether that can happen or not.<br>
> > > > <br>
> > > <br>
> > > That's precisely the question: why are we suddenly seeing this<br>
> > > happening so frequently? Today I saw at least 4 to 5 such failures<br>
> > > already.<br>
> > > <br>
> > > Deepshika - Can you please help in inspecting this?<br>
> > <br>
> > So we think (we are not sure) that the issue is a bit complex.<br>
> > <br>
> > What we were investigating was a nightly run failure on AWS. When the<br>
> > build crashes, the builder is restarted, since that's the easiest way<br>
> > to clean everything (even with a perfect test suite that cleaned up<br>
> > after itself, we could always end up with the system in a corrupt<br>
> > state, WRT mounts, fs, etc.).<br>
> > <br>
> > In turn, this seems to cause trouble on AWS, since cloud-init or<br>
> > something renames the eth0 interface to ens5 without cleaning up the<br>
> > network configuration.<br>
> > <br>
> > So the network init script fails (because the image says "start eth0"<br>
> > and that interface is not present), but it fails in a weird way. The<br>
> > network is initialised and working (we can connect), but the dhclient<br>
> > process is not in the right cgroup, and network.service is in a failed<br>
> > state. Restarting the network didn't work. In turn, this means that<br>
> > rpc-statd refuses to start (due to systemd dependencies), which seems<br>
> > to impact various NFS tests.<br>
> > <br>
> > We have also seen that on some builders, rpcbind picks up some IPv6<br>
> > autoconfiguration, but we can't reproduce that, and there is no IPv6<br>
> > set up anywhere. I suspect the network.service failure is somehow<br>
> > involved, but I fail to see how. In turn, rpcbind.socket not starting<br>
> > could cause NFS test troubles.<br>
> > <br>
> > Our current stopgap fix was to fix all the builders one by one:<br>
> > remove the config, kill the rogue dhclient, restart the network<br>
> > service.<br>
> > <br>
> > However, we can't be sure this is going to fix the problem long term,<br>
> > since it only manifests after a crash of the test suite, and that<br>
> > doesn't happen very often. (Plus, it was working at some point in the<br>
> > past, until something made it start failing, and I do not know if that<br>
> > was a system upgrade, or a test change, or both.)<br>
> > <br>
> > So we are still looking at it to get a complete understanding of the<br>
> > issue, but so far we have hacked our way into making it work (or so I<br>
> > think).<br>
> > <br>
> > Deepshika is working on a long-term fix, addressing the eth0/ens5<br>
> > issue with a new base image.<br>
> > --<br>
> > Michael Scherer<br>
> > Sysadmin, Community Infrastructure and Platform, OSAS<br>
> > <br>
> > <br>
> > --<br>
> <br>
> - Atin (atinm)<br>
-- <br>
Michael Scherer<br>
Sysadmin, Community Infrastructure<br>
<br>
<br>
<br>
_______________________________________________<br>
Gluster-devel mailing list<br>
<a href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a><br>
<a href="https://lists.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-devel</a></blockquote></div></div>