[Gluster-users] BUG: After stop and start wrong port is advertised

Jo Goossens jo.goossens at hosted-power.com
Fri Sep 22 15:28:26 UTC 2017


Hi Darrell,

 
 
Thanks. For us it's really easy to reproduce at the moment: every restart or stop/start triggers the issue over here.

 
Fortunately, Atin will look into it on Monday :)



Regards

Jo

 
 

 
-----Original message-----
From:Darrell Budic <budic at onholyground.com>
Sent:Fri 22-09-2017 17:24
Subject:Re: [Gluster-users] BUG: After stop and start wrong port is advertised
To:Atin Mukherjee <amukherj at redhat.com>; 
CC:Jo Goossens <jo.goossens at hosted-power.com>; gluster-users at gluster.org; 
I encountered this once in the past; an additional symptom was that peers were in a disconnected state on the nodes that were NOT using the wrong ports. The disconnected peers were how I detected it in the first place.

It happened to me after rebooting. I fixed it, but wasn’t able to stop and gather debugging info at the time.

The problem seemed to be that the brick files in /var/lib/glusterd/vols/<vol-name>/bricks/<server name>\:-v0-<vol name>-brick0 were not updated to reflect the new port numbers after the restart (the port numbers had changed due to adding and deleting volumes since the last start). I stopped glusterd, killed any remaining glusterfsd processes, hand-edited the files to reflect the ports the bricks thought they were running on (from volume info I think, maybe log files), and restarted glusterd. Then everything was happy again.
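That manual fix can be sketched as a script. This is a best-guess reconstruction: the "listen-port=" key matches the brick-file layout on stock glusterd installs, but verify it on your own nodes before editing anything. The demo below edits a throwaway sample file rather than the live /var/lib/glusterd tree.

```shell
# Workaround sketch: correct the advertised port in a brick file while
# glusterd is down. Shown here on a sample copy; on a real node you
# would edit the file under /var/lib/glusterd/vols/<vol-name>/bricks/.

BRICKFILE=$(mktemp)
cat > "$BRICKFILE" <<'EOF'
hostname=192.168.140.43
path=/gluster/public
listen-port=49154
EOF

# On the real node, stop everything first:
#   systemctl stop glusterd && pkill glusterfsd

# Point listen-port at the port the brick process is actually bound to
# (49152 here, taken from netstat on the affected node).
sed -i 's/^listen-port=.*/listen-port=49152/' "$BRICKFILE"

# Then: systemctl start glusterd
grep '^listen-port=' "$BRICKFILE"   # -> listen-port=49152
```

The key point is that the edit happens while glusterd is fully stopped, so the daemon rebuilds its portmap from the corrected files on the next start.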

Hope this helps. It sounds like a bug to me too if others are seeing it.

  -Darrell


> On Sep 22, 2017, at 8:10 AM, Atin Mukherjee <amukherj at redhat.com> wrote:
> 
> I've already replied to your earlier email. In case you've not seen it in your mailbox, here it is:
> 
> This looks like a bug to me. For some reason glusterd's portmap is referring to a stale port (IMO), whereas the brick is still listening on the correct port. But ideally, when the glusterd service is restarted, the whole in-memory portmap is rebuilt. I'd request the following details from you so we can start analysing it:
> 
> 1. glusterd statedump output from 192.168.140.43. You can use kill -SIGUSR2 <pid of glusterd> to request a statedump; the file will be available in /var/run/gluster
> 2. glusterd, brick logfile for 192.168.140.43:/gluster/public from 192.168.140.43
> 3. cmd_history logfile from all the nodes.
> 4. Content of /var/lib/glusterd/vols/public/
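The four items above could be collected along these lines. A sketch only: the paths assume a stock glusterd layout on Debian, the statedump file name pattern and the brick-log naming convention (brick path with '/' replaced by '-') are assumptions to verify on the node before use.

```shell
#!/bin/sh
# Hypothetical helper gathering the debug info requested above.
# Defined only, not invoked; check every path on the node first.
collect_glusterd_debug() {
    dest=/tmp/glusterd-debug
    mkdir -p "$dest"

    # 1. statedump: SIGUSR2 makes glusterd write a dump under /var/run/gluster
    kill -SIGUSR2 "$(pidof glusterd)"
    sleep 1
    cp /var/run/gluster/glusterdump.* "$dest"/

    # 2. glusterd log plus the brick log for /gluster/public
    #    (brick log name is the brick path with '/' replaced by '-')
    cp /var/log/glusterfs/glusterd.log \
       /var/log/glusterfs/bricks/gluster-public.log "$dest"/

    # 3. command history (repeat this on every node)
    cp /var/log/glusterfs/cmd_history.log "$dest"/cmd_history-$(hostname).log

    # 4. volume configuration
    tar czf "$dest"/public-vol.tar.gz /var/lib/glusterd/vols/public/
}
```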
> 
> 
> On Thu, Sep 21, 2017 at 2:08 PM, Jo Goossens <jo.goossens at hosted-power.com> wrote:
> Hi,
> 
>  
>  
> We use glusterfs 3.10.5 on Debian 9.
> 
>  
> When we stop or restart the service, e.g.: service glusterfs-server restart
> 
>  
> We see that the wrong port gets advertised afterwards. For example:
> 
>  
> Before restart:
> 
>  
> Status of volume: public
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
> Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
> Brick 192.168.140.43:/gluster/public        49152     0          Y       5913
> Self-heal Daemon on localhost               N/A       N/A        Y       5932
> Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       13084
> Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       15499
>  
> Task Status of Volume public
> ------------------------------------------------------------------------------
> There are no active volume tasks
>  
>  
> After restart of the service on one of the nodes (192.168.140.43) the port seems to have changed (but it didn't):
>  
> root at app3:/var/log/glusterfs#  gluster volume status
> Status of volume: public
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
> Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
> Brick 192.168.140.43:/gluster/public        49154     0          Y       5913
> Self-heal Daemon on localhost               N/A       N/A        Y       4628
> Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       3077
> Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       28777
>  
> Task Status of Volume public
> ------------------------------------------------------------------------------
> There are no active volume tasks
>  
>  
> However, the active process STILL has the same PID AND is still listening on the old port:
>  
> root at 192.168.140.43:/var/log/glusterfs# netstat -tapn | grep gluster
> tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      5913/glusterfsd
>  
>  
> The other nodes' logs fill up with errors because they can't reach the daemon anymore. They try to reach it on the "new" port instead of the old one:
>  
> [2017-09-21 08:33:25.225006] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
> [2017-09-21 08:33:29.226633] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:29.227490] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
> [2017-09-21 08:33:33.225849] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:33.236395] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
> [2017-09-21 08:33:37.225095] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:37.225628] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
> [2017-09-21 08:33:41.225805] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:41.226440] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>  
> So they now try 49154 instead of the old 49152.
>  
> Is this also by design? We had a lot of issues because of this recently. We don't understand why it starts advertising a completely wrong port after stop/start.
>  
>  
>  
>  
>  
> Regards
> 
> Jo Goossens
> 
>  
> 
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
> 
