[Gluster-users] BUG: After stop and start wrong port is advertised
Diego Remolina
dijuremo at gmail.com
Fri Sep 22 15:32:07 UTC 2017
I've noticed this as well on the official 3.8.4 gluster packages from Red Hat:
# gluster v status
Status of volume: aevmstorage
Gluster process                              TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ae-vmstore01-rbe:/bricks/brick1/brick  49153     0          Y       2659
Brick ae-vmstore02-rbe:/bricks/brick1/brick  49152     0          Y       2651
Brick ae-vmstore03-rbe:/bricks/brick1/brick  49153     0          Y       72876
NFS Server on localhost                      2049      0          Y       3389
Self-heal Daemon on localhost                N/A       N/A        Y       3398
NFS Server on ae-vmstore02-rbe               2049      0          Y       2675
Self-heal Daemon on ae-vmstore02-rbe         N/A       N/A        Y       2848
NFS Server on ae-vmstore03-rbe               2049      0          Y       156988
Self-heal Daemon on ae-vmstore03-rbe         N/A       N/A        Y       156996

Task Status of Volume aevmstorage
------------------------------------------------------------------------------
There are no active volume tasks
Diego
On Fri, Sep 22, 2017 at 11:28 AM, Jo Goossens
<jo.goossens at hosted-power.com> wrote:
> Hi Darrell,
>
> Thanks. For us it's really easy to reproduce at the moment: every restart or
> stop/start triggers the issue here.
>
> Atin will look into it on Monday, fortunately :)
>
> Regards
>
> Jo
>
> -----Original message-----
> From: Darrell Budic <budic at onholyground.com>
> Sent: Fri 22-09-2017 17:24
> Subject: Re: [Gluster-users] BUG: After stop and start wrong port is
> advertised
> To: Atin Mukherjee <amukherj at redhat.com>;
> CC: Jo Goossens <jo.goossens at hosted-power.com>; gluster-users at gluster.org;
> I encountered this once in the past. An additional symptom was that peers
> showed as disconnected on the peers that were NOT using the wrong ports; the
> disconnected peers are how I detected it in the first place.
>
> It happened to me after rebooting, and I fixed it but wasn't able to stop
> and gather debugging info at the time.
>
> The problem seemed to be that the volume files in
> /var/lib/glusterd/vols/<vol-name>/bricks/<server name>\:-v0-<vol name>-brick0
> were not updated to reflect a new port # after the restart (the port numbers
> had changed due to adding and deleting volumes since the last start). I
> stopped glusterd, killed any remaining glusterfsd processes, hand-edited the
> files to reflect the new ports the bricks were actually running on (taken
> from vol info I think, maybe the log files), and restarted glusterd; then
> everything was happy again.
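Before hand-editing anything, it may help to list the port glusterd has stored for each brick and compare it with the port the brick process is really listening on. The Python sketch below only reads the files. It assumes the brick files under /var/lib/glusterd/vols/<vol-name>/bricks/ are plain "key=value" text with the port stored under a "listen-port" key; that key name is an assumption to verify on your GlusterFS version. Any actual edits should still be made with glusterd stopped, as described above.

```python
import os
from typing import Dict, Optional


def parse_brick_file(text: str) -> Dict[str, str]:
    """Parse "key=value" lines into a dict; lines without "=" are skipped."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info


def stored_ports(vol_dir: str) -> Dict[str, Optional[int]]:
    """Map each file in <vol_dir>/bricks to its stored port (None if absent).

    ASSUMPTION: glusterd stores the brick port under the key "listen-port".
    """
    bricks_dir = os.path.join(vol_dir, "bricks")
    ports: Dict[str, Optional[int]] = {}
    for name in sorted(os.listdir(bricks_dir)):
        with open(os.path.join(bricks_dir, name)) as fh:
            raw = parse_brick_file(fh.read()).get("listen-port", "")
        ports[name] = int(raw) if raw.isdigit() else None
    return ports


if __name__ == "__main__":
    # "public" is the volume name used in this thread; adjust as needed.
    vol_dir = "/var/lib/glusterd/vols/public"
    if os.path.isdir(os.path.join(vol_dir, "bricks")):
        for brick, port in stored_ports(vol_dir).items():
            print(f"{brick}: listen-port={port}")
```

A stored port that differs from what netstat shows for the matching glusterfsd PID would be consistent with the stale-file situation Darrell describes.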
>
> Hope it helps. It sounds like it may be a bug to me too, if others are
> seeing it.
>
> -Darrell
>
>
>> On Sep 22, 2017, at 8:10 AM, Atin Mukherjee <amukherj at redhat.com> wrote:
>>
>> I've already replied to your earlier email. In case you've not seen it in
>> your mailbox here it goes:
>>
>> This looks like a bug to me. For some reason glusterd's portmap is
>> referring to a stale port (IMO), whereas the brick is still listening on the
>> correct port. Ideally, though, the whole in-memory portmap is rebuilt when
>> the glusterd service is restarted. I'd request the following details from
>> you so we can start analysing it:
>>
>> 1. glusterd statedump output from 192.168.140.43. You can use
>>    kill -SIGUSR2 <pid of glusterd> to request a statedump; the file will be
>>    available in /var/run/gluster.
>> 2. glusterd and brick log files for 192.168.140.43:/gluster/public from
>>    192.168.140.43.
>> 3. cmd_history log file from all the nodes.
>> 4. Content of /var/lib/glusterd/vols/public/
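Step 1 can be scripted. This Python sketch sends SIGUSR2 to a glusterd PID you supply (for example the one reported by pidof glusterd, run as root) and waits for a new file to appear in /var/run/gluster. The dump directory default and the polling approach are assumptions, not documented glusterd behaviour; the returned file may also still be in the middle of being written when it is first seen.

```python
import os
import signal
import time
from typing import Optional


def request_statedump(pid: int, dump_dir: str = "/var/run/gluster",
                      timeout: float = 10.0) -> Optional[str]:
    """Send SIGUSR2 to `pid` and return the first new file in `dump_dir`.

    ASSUMPTION: the dump is written into `dump_dir`. Returns None if nothing
    new appears within `timeout` seconds. The file may still be incomplete.
    """
    before = set(os.listdir(dump_dir))
    os.kill(pid, signal.SIGUSR2)  # the signal suggested above for glusterd
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_files = sorted(set(os.listdir(dump_dir)) - before)
        if new_files:
            return os.path.join(dump_dir, new_files[0])
        time.sleep(0.2)
    return None
```

Snapshotting the directory before sending the signal avoids picking up an older dump left over from a previous request.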
>>
>>
>> On Thu, Sep 21, 2017 at 2:08 PM, Jo Goossens
>> <jo.goossens at hosted-power.com> wrote:
>> Hi,
>>
>> We use glusterfs 3.10.5 on Debian 9.
>>
>> When we stop or restart the service, e.g.: service glusterfs-server
>> restart
>>
>> we see that the wrong port gets advertised afterwards. For example:
>>
>> Before restart:
>>
>> Status of volume: public
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
>> Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
>> Brick 192.168.140.43:/gluster/public        49152     0          Y       5913
>> Self-heal Daemon on localhost               N/A       N/A        Y       5932
>> Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       13084
>> Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       15499
>>
>> Task Status of Volume public
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>>
>> After restarting the service on one of the nodes (192.168.140.43), the port
>> appears to have changed (but it didn't):
>>
>> root at app3:/var/log/glusterfs# gluster volume status
>> Status of volume: public
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
>> Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
>> Brick 192.168.140.43:/gluster/public        49154     0          Y       5913
>> Self-heal Daemon on localhost               N/A       N/A        Y       4628
>> Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       3077
>> Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       28777
>>
>> Task Status of Volume public
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>>
>> However, the active process still has the same PID AND is STILL listening
>> on the old port:
>>
>> root at 192.168.140.43:/var/log/glusterfs# netstat -tapn | grep gluster
>> tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      5913/glusterfsd
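The mismatch shown above (advertised 49154, while PID 5913 is actually listening on 49152) can be checked mechanically. This is a minimal Python sketch, assuming you feed it the text of "gluster volume status" and "netstat -tapn" captured on the same node, and that local_host is spelled exactly the way the bricks are named in the volume. The regular expressions are written against the output format quoted in this thread and may need adjusting for other versions.

```python
import re
from typing import Dict, List, Set, Tuple

# "Brick HOST:/path  TCP  RDMA  Online  Pid" as printed by gluster volume status.
BRICK_RE = re.compile(
    r"^Brick\s+(\S+)\s+(\d+|N/A)\s+(?:\d+|N/A)\s+[YN]\s+(\d+|N/A)\s*$")
# "tcp  0  0  ADDR:PORT  PEER  LISTEN  PID/PROGRAM" as printed by netstat -tapn.
LISTEN_RE = re.compile(
    r"^tcp6?\s+\d+\s+\d+\s+\S+:(\d+)\s+\S+\s+LISTEN\s+(\d+)/(\S+)")


def listening_ports(netstat_text: str) -> Dict[int, Set[int]]:
    """Map each glusterfsd PID to the TCP ports it is listening on."""
    ports: Dict[int, Set[int]] = {}
    for line in netstat_text.splitlines():
        m = LISTEN_RE.match(line.strip())
        if m and m.group(3) == "glusterfsd":
            ports.setdefault(int(m.group(2)), set()).add(int(m.group(1)))
    return ports


def find_port_mismatches(status_text: str, netstat_text: str,
                         local_host: str) -> List[Tuple[str, int, List[int]]]:
    """Return (brick, advertised_port, actual_ports) for local bricks whose
    advertised port is not one that their PID is listening on."""
    actual = listening_ports(netstat_text)
    mismatches = []
    for line in status_text.splitlines():
        m = BRICK_RE.match(line.strip())
        if not m or "N/A" in (m.group(2), m.group(3)):
            continue
        brick, port, pid = m.group(1), int(m.group(2)), int(m.group(3))
        # netstat only sees this node, so skip bricks hosted elsewhere.
        if brick.split(":", 1)[0] != local_host:
            continue
        if port not in actual.get(pid, set()):
            mismatches.append((brick, port, sorted(actual.get(pid, set()))))
    return mismatches
```

Given the "after restart" status and the netstat line from this message with local_host set to "192.168.140.43", it reports the 192.168.140.43 brick as advertised on 49154 while its PID listens on 49152. Filtering by host matters because PIDs from other nodes could coincide with local ones.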
>>
>>
>> The other nodes' logs fill up with errors because they can't reach the
>> daemon anymore. They try to reach it on the "new" port instead of the old
>> one:
>>
>> [2017-09-21 08:33:25.225006] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>> [2017-09-21 08:33:29.226633] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
>> [2017-09-21 08:33:29.227490] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>> [2017-09-21 08:33:33.225849] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
>> [2017-09-21 08:33:33.236395] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>> [2017-09-21 08:33:37.225095] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
>> [2017-09-21 08:33:37.225628] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>> [2017-09-21 08:33:41.225805] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
>> [2017-09-21 08:33:41.226440] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>>
>> So they now try 49154 instead of the old 49152.
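To see which ports the clients are actually being redirected to, the client log entries above can be tallied. This is a rough Python sketch; the regular expressions are written against the message text quoted here ("connection to HOST:PORT failed (Connection refused)" and "changing port to N (from M)") and may need adjusting for other GlusterFS versions or log formats.

```python
import re
from collections import Counter
from typing import Tuple

# Patterns follow the client log messages quoted above; \s+ also tolerates
# entries that were wrapped across lines (e.g. when pasted into email).
REFUSED_RE = re.compile(
    r"connection\s+to\s+(\S+):(\d+)\s+failed\s+\(Connection\s+refused\)")
RECONFIG_RE = re.compile(r"changing\s+port\s+to\s+(\d+)\s+\(from\s+(\d+)\)")


def summarize_client_log(text: str) -> Tuple[Counter, Counter]:
    """Count refused (host, port) targets and the ports clients switched to."""
    refused = Counter(
        (host, int(port)) for host, port in REFUSED_RE.findall(text))
    switched_to = Counter(int(new) for new, _old in RECONFIG_RE.findall(text))
    return refused, switched_to
```

If the refused port matches the port glusterd advertises, but not the one netstat shows the brick listening on, that points at the stale portmap entry rather than at the brick itself.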
>>
>> Is this also by design? We had a lot of issues because of this recently.
>> We don't understand why it starts advertising a completely wrong port after
>> stop/start.
>>
>> Regards
>>
>> Jo Goossens
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>