[Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t

Thu Jul 2 15:49:05 UTC 2015

This happens because, when bind-insecure is turned on (which is now the default), a brick may be
unable to bind to the port assigned to it by Glusterd (for example 49192-49195...).
It seems to occur because the rpc_clnt connections are binding to ports in the same range,
so the brick fails to bind to a port that is already in use by another process.
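A minimal C sketch of the clash, assuming Linux and IPv4 loopback. A first socket (standing in for an rpc_clnt connection) lets the kernel pick a port. A second socket (standing in for the brick) then tries to bind that same port and gets EADDRINUSE. The helpers bind_loopback() and local_port() are hypothetical names for illustration, not GlusterFS functions. On recent Linux kernels the default ephemeral range (net.ipv4.ip_local_port_range) is typically 32768-60999, so kernel-chosen ports can land in the 49152+ range used for bricks.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind a new TCP socket to 127.0.0.1:port
 * (port 0 lets the kernel pick an ephemeral port). Returns the fd,
 * or -1 with errno from the failing call (EADDRINUSE if another
 * socket already holds the port). */
static int bind_loopback(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        int saved = errno;
        close(fd);          /* close() may overwrite errno */
        errno = saved;
        return -1;
    }
    return fd;
}

/* Hypothetical helper: return the local port a bound socket holds,
 * or 0 on error. */
static unsigned short local_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return 0;
    return ntohs(addr.sin_port);
}
```

Usage: call bind_loopback(0) for the "client", read its port with local_port(), then call bind_loopback() with that port for the "brick"; the second call returns -1 with errno set to EADDRINUSE for as long as the first socket stays open.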

This bug already existed before http://review.gluster.org/#/c/11039/ when rdma was used: even
previously, rdma binds to a port >= 1024 if it cannot find a free port < 1024,
even when bind-insecure is turned off (see commit '0e3fd04e').
Since we don't have tests for rdma, we did not discover this issue earlier.
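To make the rdma behaviour above concrete, here is a hedged C sketch of a "privileged port first, then fall back" bind, assuming Linux and IPv4 loopback. It scans down from 1023 for a free privileged port and, if none can be bound (all in use, or EACCES because the process lacks the privilege), lets the kernel pick a port instead. That fallback is how a connection can end up on a port >= 1024 even with bind-insecure off. try_bind() and bind_secure_or_fallback() are hypothetical names; this is not the actual rdma or rpc_clnt code.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: try to bind fd to 127.0.0.1:port.
 * Returns 0 on success, -1 with errno set otherwise. */
static int try_bind(int fd, unsigned short port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    return bind(fd, (struct sockaddr *)&addr, sizeof(addr));
}

/* Hypothetical helper: prefer a privileged port, scanning down from 1023.
 * If none can be bound, fall back to a kernel-chosen ephemeral port
 * (normally >= 1024). Returns the bound port, or -1 on error. */
static int bind_secure_or_fallback(int fd)
{
    for (int port = 1023; port >= 1; port--) {
        if (try_bind(fd, (unsigned short)port) == 0)
            return port;
        if (errno != EADDRINUSE)
            break;  /* e.g. EACCES when not privileged: stop scanning */
    }

    /* Fallback: port 0 asks the kernel for an ephemeral port. */
    if (try_bind(fd, 0) != 0)
        return -1;

    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) != 0)
        return -1;
    return ntohs(addr.sin_port);
}
```

Run as root, this normally returns 1023; run as an unprivileged user on a default Linux system, the first bind() fails with EACCES and the port comes from the ephemeral range.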

http://review.gluster.org/#/c/11039/ exposed the bug we encountered; it can now be fixed by
http://review.gluster.org/#/c/11512/, which makes rpc_clnt take port numbers in descending
order starting from 65535. This minimizes port clashes, and it fixes the rdma issue as well.
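A minimal sketch of the descending allocation described above, assuming Linux, IPv4 loopback and the hypothetical helpers try_bind() and bind_from_top(); it is not the code from http://review.gluster.org/#/c/11512/. Since brick ports are presumably handed out upward from 49152, taking client ports from 65535 downward keeps the two allocations at opposite ends of the range, so they should only meet when the range is nearly exhausted.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: try to bind fd to 127.0.0.1:port.
 * Returns 0 on success, -1 with errno set otherwise. */
static int try_bind(int fd, unsigned short port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    return bind(fd, (struct sockaddr *)&addr, sizeof(addr));
}

/* Hypothetical helper: bind fd to the highest free port, scanning down
 * from 65535 and giving up below `floor` (port 0 is never tried, since
 * it would mean "any port"). Returns the port, or -1 on failure. */
static int bind_from_top(int fd, unsigned short floor)
{
    for (int port = 65535; port > 0 && port >= (int)floor; port--) {
        if (try_bind(fd, (unsigned short)port) == 0)
            return port;
        if (errno != EADDRINUSE)
            return -1;  /* unexpected error: stop scanning */
    }
    errno = EADDRINUSE; /* every port in [floor, 65535] was taken */
    return -1;
}
```

Each successful call on a fresh socket returns a lower port than the previous one, because the higher ports are already held.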

Thanks to Raghavendra Talur for helping to discover the real cause.

Regards,
Prasanna Kalever

----- Original Message -----
From: "Raghavendra Talur" <raghavendra.talur at gmail.com>
To: "Krishnan Parthasarathi" <kparthas at redhat.com>
Cc: "Gluster Devel" <gluster-devel at gluster.org>
Sent: Thursday, July 2, 2015 6:45:17 PM
Subject: Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t

On Thu, Jul 2, 2015 at 4:40 PM, Raghavendra Talur <raghavendra.talur at gmail.com> wrote:

On Thu, Jul 2, 2015 at 10:52 AM, Krishnan Parthasarathi <kparthas at redhat.com> wrote:

> > 
> > A port assigned by Glusterd to a brick is found to be already in use by
> > the brick. Have there been any recent changes in Glusterd that could cause this?
> > 
> > Or is it a test infra problem? 

This issue is likely caused by the following patch:
http://review.gluster.org/11039
It changes the port allocation for rpc_clnt based connections.
Previously, the allocated ports were < 1024. With this change, these
connections (typically the mount process, gluster-nfs server processes,
etc.) could end up using ports that are being assigned to bricks.

IIUC, the intention of the patch was to make server processes accept
inbound messages from ports > 1024. If we don't need to use ports > 1024,
we could leave the port allocation for rpc_clnt connections as it was before.
Alternatively, we could reserve the range of ports starting from 49152 for bricks
by setting net.ipv4.ip_local_reserved_ports with sysctl(8). This is specific to Linux,
though; I'm not aware of how this could be done on NetBSD, for instance.
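A hedged sketch for the reservation idea, assuming Linux. The reservation itself would look something like "sysctl -w net.ipv4.ip_local_reserved_ports=49152-49251" (the upper bound here is an arbitrary example). The kernel documentation states that reserved ports are excluded only from automatic port assignment (bind() or connect() with port 0) and that explicit binds are unaffected, so this would protect bricks only from connections that let the kernel choose their port. The C function below, port_is_reserved() (a hypothetical helper), checks whether a port is covered by the value of /proc/sys/net/ipv4/ip_local_reserved_ports, which is a comma-separated list of ports and inclusive ranges.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: return true if `port` is covered by `list`, a string
 * in the format of /proc/sys/net/ipv4/ip_local_reserved_ports, e.g.
 * "49152-49251,50000\n". Assumes well-formed input. */
static bool port_is_reserved(const char *list, unsigned long port)
{
    const char *p = list;

    while (*p != '\0') {
        char *end;
        unsigned long lo = strtoul(p, &end, 10);
        if (end == p) {         /* no number here (e.g. trailing newline) */
            p++;
            continue;
        }

        unsigned long hi = lo;  /* a single port is a range of one */
        if (*end == '-') {
            char *range_end;
            unsigned long v = strtoul(end + 1, &range_end, 10);
            if (range_end != end + 1) {
                hi = v;
                end = range_end;
            }
        }

        if (port >= lo && port <= hi)
            return true;

        p = end;
        if (*p == ',')
            p++;
    }
    return false;
}
```

For example, with the list "49152-49251,50000\n", ports 49152, 49251 and 50000 are reported as reserved, while 49252 is not.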

It seems this is exactly what's happening.

I have a question. I get the following data from netstat and grep:

tcp 0 0 f6be17c0fbf5:1023 f6be17c0fbf5:24007 ESTABLISHED 31516/glusterfsd 
tcp 0 0 f6be17c0fbf5:49152 f6be17c0fbf5:490 ESTABLISHED 31516/glusterfsd 
unix 3 [ ] STREAM CONNECTED 988353 31516/glusterfsd /var/run/gluster/4878d6e905c5f6032140a00cc584df8a.socket 

Here 31516 is the brick pid. 

Looking at the data, line 2 is clear: it shows the connection between the brick and a glusterfs client.
The unix socket on line 3 is also clear: it is the unix socket connection that glusterd and the brick process use to communicate.

I am not able to understand line 1: which part of the brick process established a TCP connection to glusterd using port 1023?
Note: this data is from a build that does not have the above-mentioned patch.

The patch that exposed this bug is being reverted until the underlying bug is fixed.
You can monitor the revert patches here:
master: http://review.gluster.org/11507 
3.7 branch: http://review.gluster.org/11508 

Please rebase your patches after the above patches are merged to ensure that your patches pass regression.

-- 
Raghavendra Talur 

_______________________________________________
Gluster-devel mailing list
Gluster-devel at gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel