<div><div dir="ltr"><div>I took a quick look at the builders and noticed both have the same error of &#39;Cannot allocate memory&#39; which comes up every time when the builder is rebooted after a build abort. It is happening in the same pattern. Though there&#39;s no such memory consumption on the builders. </div><div dir="auto"><br></div><div dir="auto">I’m investigating more on this.</div></div></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, May 9, 2019 at 10:02 AM Atin Mukherjee &lt;<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee &lt;<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">builder204 needs to be fixed, too many failures, mostly none of the patches are passing regression.<br></div></blockquote><div><br></div><div>And with that builder201 joins the pool, <a href="https://build.gluster.org/job/centos7-regression/5943/consoleFull" target="_blank">https://build.gluster.org/job/centos7-regression/5943/consoleFull</a><br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee &lt;<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde &lt;<a href="mailto:srakonde@redhat.com" target="_blank">srakonde@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">Deepshikha,<div><br></div><div>I see the failure here[1] which ran on builder206. So, we are good.</div></div></div></blockquote><div><br></div><div>Not really,  <a href="https://build.gluster.org/job/centos7-regression/5909/consoleFull" target="_blank">https://build.gluster.org/job/centos7-regression/5909/consoleFull</a> failed on builder204 for similar reasons I believe?</div><div><br></div><div>I am bit more worried on this issue being resurfacing more often these days. 
What can we do to fix this permanently?</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div><br></div><div>[1] <a href="https://build.gluster.org/job/centos7-regression/5901/consoleFull" target="_blank">https://build.gluster.org/job/centos7-regression/5901/consoleFull</a></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal &lt;<a href="mailto:dkhandel@redhat.com" target="_blank">dkhandel@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Sanju, can you please give us more info about the failures. <br></div><div><br></div><div>I see the failures occurring on just one of the builder (builder206). I&#39;m taking it back offline for now. <br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, May 7, 2019 at 9:42 PM Michael Scherer &lt;<a href="mailto:mscherer@redhat.com" target="_blank">mscherer@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Le mardi 07 mai 2019 à 20:04 +0530, Sanju Rakonde a écrit :<br>
> Looks like is_nfs_export_available started failing again in recent
> centos-regressions.
>
> Michael, can you please check?

I will try, but I am leaving for vacation tonight, so if I find nothing
before I leave, I guess Deepshika will have to look.
<br>
&gt; On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul &lt;<a href="mailto:ykaul@redhat.com" target="_blank">ykaul@redhat.com</a>&gt; wrote:<br>
&gt; <br>
&gt; &gt; <br>
&gt; &gt; <br>
&gt; &gt; On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer &lt;<br>
&gt; &gt; <a href="mailto:mscherer@redhat.com" target="_blank">mscherer@redhat.com</a>&gt;<br>
&gt; &gt; wrote:<br>
&gt; &gt; <br>
> > > On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
> > > > Is this back again? The recent patches are failing regression :-\
> > >
> > > So, on builder206, it took me a while to find that the issue is that
> > > nfs (the service) was running.
> > >
> > > ./tests/basic/afr/tarissue.t failed because the nfs initialisation
> > > failed with a rather cryptic message:
> > >
> > > [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-socket.nfs-server: process started listening on port (38465)
> > > [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-socket.nfs-server: binding to  failed: Address already in use
> > > [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-socket.nfs-server: Port is already in use
> > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-socket.nfs-server: __socket_server_bind failed;closing socket 14
> > >
> > > I found where this came from, but a few things surprised me:
> > >
> > > - the order of the messages is different from the order in the code
> > >
> >
> > Indeed strange...
> >
> > > - the message about "started listening" didn't take into account the
> > > fact that bind failed on:
> > >
> >
> > Shouldn't it bail out if it failed to bind?
> > Some missing 'goto out' around line 975/976?
> > Y.
> >
> > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
> > >
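For readers following along, here is a minimal sketch of the early-exit
pattern Yaniv is suggesting. The function name, logging, and structure are
invented for illustration and this is not the actual
__socket_server_bind()/socket_listen() code; the point is only that without
a "goto out" after a failed bind(), execution falls through and the
"started listening" message can still be emitted for a socket that never
bound.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Minimal sketch only -- names and messages are invented, this is not the
 * real socket.c code.  The point: bail out as soon as bind() fails, so the
 * success message can never be printed for a socket that never bound. */
int
server_bind_and_listen(int sock, struct sockaddr *addr, socklen_t len, int port)
{
    int ret = bind(sock, addr, len);
    if (ret == -1) {
        fprintf(stderr, "binding to port %d failed: %s\n", port, strerror(errno));
        goto out;   /* without this, execution falls through to the success log */
    }

    ret = listen(sock, 10);
    if (ret == -1) {
        fprintf(stderr, "listen on port %d failed: %s\n", port, strerror(errno));
        goto out;
    }

    fprintf(stderr, "process started listening on port (%d)\n", port);

out:
    return ret;
}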
> > > The message about port 38465 also threw me off the track. The real
> > > issue is that the nfs service was already running, and I couldn't
> > > find anything listening on port 38465.
> > >
> > > Once I ran "service nfs stop", it no longer failed.
> > >
> > > So far, I do not know why nfs.service was activated.
> > >
> > > But at least 206 should be fixed, and we know a bit more about what
> > > could be causing some of the failures.
> > >
> > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <mscherer@redhat.com> wrote:
> > > >
> > > > > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <jthottan@redhat.com> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > is_nfs_export_available is just a wrapper around the "showmount"
> > > > > > > command, AFAIR. I saw the following messages in the console output:
> > > > > > >
> > > > > > > mount.nfs: rpc.statd is not running but is required for remote locking.
> > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> > > > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > > > > > >
> > > > > > > To me it looks like rpcbind may not be running on the machine.
> > > > > > > Usually rpcbind starts automatically on machines; I don't know
> > > > > > > whether this can happen or not.
> > > > > > >
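In case it helps anyone reproducing this, below is a small, purely
illustrative C check of whether rpcbind answers locally, using the classic
Sun RPC client API (on current distros this typically needs libtirpc, e.g.
gcc check_rpcbind.c -I/usr/include/tirpc -ltirpc). The regression helper
itself just shells out to showmount; this is only one way to confirm the
"rpcbind is not running" suspicion.

#include <stdio.h>
#include <rpc/rpc.h>
#include <rpc/pmap_prot.h>

/* Illustrative check only: try to create an RPC client handle to the local
 * portmapper/rpcbind service.  If this fails, rpcbind is most likely not
 * running (or not reachable), which would explain the mount.nfs/statd errors. */
int
main(void)
{
    CLIENT *clnt = clnt_create("localhost", PMAPPROG, PMAPVERS, "tcp");
    if (clnt == NULL) {
        clnt_pcreateerror("rpcbind does not appear to be running");
        return 1;
    }
    printf("rpcbind is answering on localhost\n");
    clnt_destroy(clnt);
    return 0;
}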
> > > > > >
> > > > > > That's precisely the question. Why are we suddenly seeing this
> > > > > > happen so frequently? Today I have already seen at least 4 to 5
> > > > > > such failures.
> > > > > >
> > > > > > Deepshika - can you please help in inspecting this?
> > > > >
> > > > > So we think (we are not sure) that the issue is a bit complex.
> > > > >
> > > > > What we were investigating was a nightly run failure on AWS. When
> > > > > the build crashes, the builder is restarted, since that's the
> > > > > easiest way to clean everything (even with a perfect test suite
> > > > > that cleaned up after itself, we could always end up in a corrupt
> > > > > state on the system, WRT mounts, fs, etc.).
> > > > >
> > > > > In turn, this seems to cause trouble on AWS, since cloud-init or
> > > > > something renames the eth0 interface to ens5 without cleaning up
> > > > > the network configuration.
> > > > >
> > > > > So the network init script fails (because the image says "start
> > > > > eth0" and that interface is not present), but it fails in a weird
> > > > > way. The network is initialised and working (we can connect), but
> > > > > the dhclient process is not in the right cgroup, and
> > > > > network.service is in a failed state. Restarting the network
> > > > > didn't work. In turn, this means that rpc-statd refuses to start
> > > > > (due to systemd dependencies), which seems to impact various NFS
> > > > > tests.
> > > > >
> > > > > We have also seen that on some builders rpcbind picks up some
> > > > > IPv6 autoconfiguration, but we can't reproduce that, and there is
> > > > > no IPv6 set up anywhere. I suspect the network.service failure is
> > > > > somehow involved, but I fail to see how. In turn, rpcbind.socket
> > > > > not starting could cause NFS test troubles.
> > > > >
> > > > > Our current stop-gap fix was to fix all the builders one by one:
> > > > > remove the config, kill the rogue dhclient, restart the network
> > > > > service.
> > > > >
> > > > > However, we can't be sure this is going to fix the problem long
> > > > > term, since it only manifests after a crash of the test suite, and
> > > > > that doesn't happen so often. (Plus, it was working at some point
> > > > > in the past, then something made it start failing, and I do not
> > > > > know whether that was a system upgrade, a test change, or both.)
> > > > >
> > > > > So we are still looking at it to get a complete understanding of
> > > > > the issue, but so far we have hacked our way to make it work (or
> > > > > so I think).
> > > > >
> > > > > Deepshika is working to fix it long term, by fixing the issue
> > > > > regarding eth0/ens5 with a new base image.
> > > > > --
> > > > > Michael Scherer
> > > > > Sysadmin, Community Infrastructure and Platform, OSAS
> > > >
> > > > - Atin (atinm)
> > >
> > > --
> > > Michael Scherer
> > > Sysadmin, Community Infrastructure
> > >
>

--
Michael Scherer
Sysadmin, Community Infrastructure

--
Thanks,
Sanju

_______________________________________________

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/836554017

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/486278655

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel