[Gluster-users] BUG: After stop and start wrong port is advertised

Atin Mukherjee amukherj at redhat.com
Tue Jan 23 08:36:07 UTC 2018


3.10 doesn't have this regression, so you're safe.

On Tue, Jan 23, 2018 at 1:28 PM, Jo Goossens <jo.goossens at hosted-power.com>
wrote:

> Hello,
>
>
>
>
>
> Will we also suffer from this regression in any of the (previously) fixed
> 3.10 releases? We have stayed on 3.10 and hope to remain stable :/
>
>
>
> Regards
>
> Jo
>
>
>
>
>
> -----Original message-----
> *From:* Atin Mukherjee <amukherj at redhat.com>
> *Sent:* Tue 23-01-2018 05:15
> *Subject:* Re: [Gluster-users] BUG: After stop and start wrong port is
> advertised
> *To:* Alan Orth <alan.orth at gmail.com>;
> *CC:* Jo Goossens <jo.goossens at hosted-power.com>;
> gluster-users at gluster.org;
> So from the logs it looks to be a regression caused by commit 635c1c3,
> and the good news is that this is now fixed in the release-3.12 branch and
> should be part of 3.12.5.
>
> Commit which fixes this issue:
>
> COMMIT: https://review.gluster.org/19146 committed in release-3.12 by "Atin
> Mukherjee" <amukherj at redhat.com> with the commit message:
>
>     glusterd: connect to an existing brick process when quorum status is
>     NOT_APPLICABLE_QUORUM
>
>     First of all, this patch reverts commit 635c1c3 as the same is causing a
>     regression with bricks not coming up on time when a node is rebooted.
>     This patch tries to fix the problem in a different way by just trying to
>     connect to an existing running brick when quorum status is not
>     applicable.
>
>     >mainline patch : https://review.gluster.org/#/c/19134/
>
>     Change-Id: I0efb5901832824b1c15dcac529bffac85173e097
>     BUG: 1511301
>     Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
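
For context, a minimal sketch of the idea behind that fix. This is illustrative
only; the type and function names below are hypothetical and not glusterd's
actual code. The gist: when server quorum is not applicable, attach to the
brick process that is already running (keeping its existing port) instead of
spawning a new one, so the port glusterd advertises stays in sync with the port
the brick actually listens on.

/* Illustrative sketch only -- hypothetical names, not the glusterd sources. */
#include <stdio.h>

typedef enum { QUORUM_MET, QUORUM_NOT_MET, NOT_APPLICABLE_QUORUM } quorum_status_t;

typedef struct {
    const char *path;        /* brick path, e.g. /gluster/public           */
    int         listen_port; /* port the already-running brick listens on  */
    int         pid;         /* 0 if no brick process is running           */
} brick_t;

/* Stubs standing in for the real process and RPC handling. */
static int brick_process_alive(const brick_t *b) { return b->pid > 0; }

static int connect_to_brick(const brick_t *b)
{
    printf("reconnecting to brick %s on existing port %d\n", b->path, b->listen_port);
    return 0;
}

static int spawn_brick_process(const brick_t *b)
{
    printf("spawning a new brick process for %s\n", b->path);
    return 0;
}

/* The gist of the fix: with quorum not applicable, reuse the running brick
 * (and its port) rather than starting a fresh process on a new port. */
static int start_or_reconnect_brick(const brick_t *brick, quorum_status_t status)
{
    int ret;

    if (status == NOT_APPLICABLE_QUORUM && brick_process_alive(brick)) {
        ret = connect_to_brick(brick);
        goto out;
    }
    ret = spawn_brick_process(brick);
out:
    return ret;
}

int main(void)
{
    brick_t b = { "/gluster/public", 49152, 5913 };
    return start_or_reconnect_brick(&b, NOT_APPLICABLE_QUORUM);
}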
>
>
>
>
> On Mon, Jan 22, 2018 at 3:15 PM, Alan Orth <alan.orth at gmail.com> wrote:
>
> Ouch! Yes, I see two port-related fixes in the GlusterFS 3.12.3 release
> notes[0][1][2]. I've attached a tarball of all of yesterday's logs from
> /var/log/glusterd on one of the affected nodes (called "wingu3"). I hope
> that's what you need.
>
> [0] https://github.com/gluster/glusterfs/blob/release-3.12/doc/release-notes/3.12.3.md
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1507747
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1507748
>
> Thanks,
>
>
> On Mon, Jan 22, 2018 at 6:34 AM Atin Mukherjee <amukherj at redhat.com>
> wrote:
>
> The patch was definitely there in 3.12.3. Do you have the glusterd and
> brick logs from when this happened handy?
>
> On Sun, Jan 21, 2018 at 10:21 PM, Alan Orth <alan.orth at gmail.com> wrote:
>
> For what it's worth, I just updated some CentOS 7 servers from GlusterFS
> 3.12.1 to 3.12.4 and hit this bug. Did the patch make it into 3.12.4? I had
> to use Mike Hulsman's script to check the daemon port against the port in
> the volume's brick info, update the port, and restart glusterd on each
> node. Luckily I only have four servers! Hoping I don't have to do this
> every time I reboot!
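
I don't have Mike Hulsman's script itself, so below is a rough stand-alone
sketch (in C, to match the rest of the thread) of the kind of check it
performs. It assumes the advertised port is stored as a "listen-port=" entry in
the brick info file under /var/lib/glusterd/vols/<volname>/bricks/ - please
verify that path and key on your own installation - and takes the port the
brick actually listens on (e.g. from netstat -tapn | grep glusterfsd) as a
second argument.

/* check_brick_port.c -- sketch of a port-mismatch check; verify the
 * brickinfo file layout on your own systems before relying on this.
 * Usage: ./check_brick_port <brickinfo-file> <actual-listening-port>  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <brickinfo-file> <actual-port>\n", argv[0]);
        return 2;
    }

    FILE *fp = fopen(argv[1], "r");
    if (!fp) {
        perror("fopen");
        return 2;
    }

    long advertised = -1;
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        /* assumed key, e.g. "listen-port=49152" */
        if (strncmp(line, "listen-port=", 12) == 0) {
            advertised = strtol(line + 12, NULL, 10);
            break;
        }
    }
    fclose(fp);

    long actual = strtol(argv[2], NULL, 10);
    if (advertised == actual) {
        printf("OK: advertised port %ld matches the listening port\n", advertised);
        return 0;
    }
    printf("MISMATCH: glusterd advertises %ld but the brick listens on %ld\n",
           advertised, actual);
    return 1; /* a mismatch is when one would update the file and restart glusterd */
}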
>
> Regards,
>
> On Sat, Dec 2, 2017 at 5:23 PM Atin Mukherjee <amukherj at redhat.com> wrote:
>
> On Sat, 2 Dec 2017 at 19:29, Jo Goossens <jo.goossens at hosted-power.com>
> wrote:
>
> Hello Atin,
>
>
>
>
>
> Could you confirm this should have been fixed in 3.10.8? If so we'll test
> it for sure!
>
>
> The fix should be part of 3.10.8, which is awaiting its release announcement.
>
>
>
>
>
>
> Regards
>
> Jo
>
>
>
>
>
>
> -----Original message-----
> *From:* Atin Mukherjee <amukherj at redhat.com>
>
> *Sent:* Mon 30-10-2017 17:40
> *Subject:* Re: [Gluster-users] BUG: After stop and start wrong port is
> advertised
> *To:* Jo Goossens <jo.goossens at hosted-power.com>;
> *CC:* gluster-users at gluster.org;
>
> On Sat, 28 Oct 2017 at 02:36, Jo Goossens <jo.goossens at hosted-power.com>
> wrote:
>
> Hello Atin,
>
>
>
>
>
> I just read it and I'm very happy you found the issue. We really hope this
> will be fixed in the upcoming 3.10.7 version!
>
>
> 3.10.7 - no, I'm afraid not, as the patch is still in review and 3.10.7 is
> getting tagged today. You'll get this fix in 3.10.8.
>
>
>
>
>
>
>
>
> PS: Wow, nice - all that C code and those "goto out" statements (not always
> considered clean, but often the best way, I think). It reminds me of the days
> when I wrote kernel drivers in C myself :)
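
For readers who haven't come across the idiom Jo mentions, here is a minimal,
generic example of the "goto out" cleanup pattern (my own illustration, not
taken from the GlusterFS sources): every error path jumps to a single label
where resources are released, so cleanup lives in one place.

#include <stdio.h>
#include <stdlib.h>

/* Read the first line of a file; on success the caller owns *line_out. */
int read_first_line(const char *path, char **line_out)
{
    int   ret = -1;
    FILE *fp  = NULL;
    char *buf = NULL;

    fp = fopen(path, "r");
    if (!fp)
        goto out;              /* nothing allocated yet */

    buf = malloc(256);
    if (!buf)
        goto out;              /* fp closed below */

    if (!fgets(buf, 256, fp))
        goto out;              /* buf freed and fp closed below */

    *line_out = buf;
    buf = NULL;                /* ownership transferred to the caller */
    ret = 0;
out:
    free(buf);
    if (fp)
        fclose(fp);
    return ret;
}

int main(void)
{
    char *line = NULL;
    if (read_first_line("/etc/hostname", &line) == 0) {
        printf("%s", line);
        free(line);
    }
    return 0;
}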
>
>
>
>
>
> Regards
>
> Jo Goossens
>
>
>
>
>
>
>
>
> -----Original message-----
> *From:* Atin Mukherjee <amukherj at redhat.com>
> *Sent:* Fri 27-10-2017 21:01
> *Subject:* Re: [Gluster-users] BUG: After stop and start wrong port is
> advertised
> *To:* Jo Goossens <jo.goossens at hosted-power.com>;
> *CC:* gluster-users at gluster.org;
>
> We (finally) figured out the root cause, Jo!
>
> Patch https://review.gluster.org/#/c/18579 posted upstream for review.
>
> On Thu, Sep 21, 2017 at 2:08 PM, Jo Goossens <jo.goossens at hosted-power.com>
> wrote:
>
> Hi,
>
>
>
>
>
> We use glusterfs 3.10.5 on Debian 9.
>
>
>
> When we stop or restart the service, e.g.: service glusterfs-server restart
>
>
>
> We see that the wrong port gets advertised afterwards. For example:
>
>
>
> Before restart:
>
>
> Status of volume: public
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
> Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
> Brick 192.168.140.43:/gluster/public        49152     0          Y       5913
> Self-heal Daemon on localhost               N/A       N/A        Y       5932
> Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       13084
> Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       15499
>
> Task Status of Volume public
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
> After restarting the service on one of the nodes (192.168.140.43), the
> advertised port seems to have changed (even though the actual port didn't):
>
> root@app3:/var/log/glusterfs# gluster volume status
> Status of volume: public
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
> Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
> Brick 192.168.140.43:/gluster/public        49154     0          Y       5913
> Self-heal Daemon on localhost               N/A       N/A        Y       4628
> Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       3077
> Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       28777
>
> Task Status of Volume public
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
> However, the active process is STILL the same PID AND is still listening on
> the old port:
>
> root@192.168.140.43:/var/log/glusterfs# netstat -tapn | grep gluster
> tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      5913/glusterfsd
>
>
> The other nodes' logs fill up with errors because they can't reach the
> daemon anymore. They try to reach it on the "new" port instead of the old
> one:
>
> [2017-09-21 08:33:25.225006] E [socket.c:2327:socket_connect_finish]
> 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection
> refused); disconnecting socket
> [2017-09-21 08:33:29.226633] I [rpc-clnt.c:2000:rpc_clnt_reconfig]
> 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:29.227490] E [socket.c:2327:socket_connect_finish]
> 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection
> refused); disconnecting socket
> [2017-09-21 08:33:33.225849] I [rpc-clnt.c:2000:rpc_clnt_reconfig]
> 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:33.236395] E [socket.c:2327:socket_connect_finish]
> 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection
> refused); disconnecting socket
> [2017-09-21 08:33:37.225095] I [rpc-clnt.c:2000:rpc_clnt_reconfig]
> 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:37.225628] E [socket.c:2327:socket_connect_finish]
> 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection
> refused); disconnecting socket
> [2017-09-21 08:33:41.225805] I [rpc-clnt.c:2000:rpc_clnt_reconfig]
> 0-public-client-2: changing port to 49154 (from 0)
> [2017-09-21 08:33:41.226440] E [socket.c:2327:socket_connect_finish]
> 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection
> refused); disconnecting socket
>
> So they now try 49154 instead of the old 49152.
>
> Is this also by design? We had a lot of issues because of this recently.
> We don't understand why it starts advertising a completely wrong port after
> stop/start.
>
>
>
>
>
>
>
> Regards
>
> Jo Goossens
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>
> --
> - Atin (atinm)
>
> --
> - Atin (atinm)
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>
>
>
> --
>
> Alan Orth
> alan.orth at gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
>
>
>
> --
>
> Alan Orth
> alan.orth at gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>