[Gluster-users] BUG: After stop and start wrong port is advertised

Atin Mukherjee amukherj at redhat.com
Tue Jan 23 04:14:52 UTC 2018


So from the logs this looks to be a regression caused by commit 635c1c3,
and the good news is that it is now fixed in the release-3.12 branch and
should be part of 3.12.5.

Commit which fixes this issue:

COMMIT: https://review.gluster.org/19146 committed in release-3.12 by
"Atin Mukherjee" <amukherj at redhat.com> with a commit message:
glusterd: connect to an existing brick process when quorum status is
NOT_APPLICABLE_QUORUM

First of all, this patch reverts commit 635c1c3 as the same is causing a
regression with bricks not coming up on time when a node is rebooted.
This patch tries to fix the problem in a different way by just trying to
connect to an existing running brick when quorum status is not
applicable.
> mainline patch: https://review.gluster.org/#/c/19134/

Change-Id: I0efb5901832824b1c15dcac529bffac85173e097
BUG: 1511301
Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
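
For reference, a quick, rough way to check whether a node already carries the
fix is simply to look at the installed version (the commands below are
illustrative; the fix is expected in 3.12.5 and later on the release-3.12
branch):

  # Print the running GlusterFS version
  gluster --version | head -n1
  # On RPM-based systems the installed package can also be queried
  rpm -q glusterfs-server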




On Mon, Jan 22, 2018 at 3:15 PM, Alan Orth <alan.orth at gmail.com> wrote:

> Ouch! Yes, I see two port-related fixes in the GlusterFS 3.12.3 release
> notes[0][1][2]. I've attached a tarball of all yesterday's logs from
> /var/log/glusterd on one of the affected nodes (called "wingu3"). I hope
> that's what you need.
>
> [0] https://github.com/gluster/glusterfs/blob/release-3.12/doc/release-notes/3.12.3.md
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1507747
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1507748
>
> Thanks,
>
>
> On Mon, Jan 22, 2018 at 6:34 AM Atin Mukherjee <amukherj at redhat.com>
> wrote:
>
>> The patch was definitely there in 3.12.3. Do you have the glusterd and
>> brick logs handy from when this happened?
>>
>> On Sun, Jan 21, 2018 at 10:21 PM, Alan Orth <alan.orth at gmail.com> wrote:
>>
>>> For what it's worth, I just updated some CentOS 7 servers from GlusterFS
>>> 3.12.1 to 3.12.4 and hit this bug. Did the patch make it into 3.12.4? I had
>>> to use Mike Hulsman's script to check the daemon port against the port in
>>> the volume's brick info, update the port, and restart glusterd on each
>>> node. Luckily I only have four servers! Hoping I don't have to do this
>>> every time I reboot!
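>>>
>>> (Roughly, the kind of check that script automates, sketched here rather than
>>> quoted from it, and assuming the default glusterd state directory and the
>>> listen-port key used by 3.x releases, is:)
>>>
>>>   VOL=public                       # example volume name
>>>   for f in /var/lib/glusterd/vols/$VOL/bricks/*; do
>>>       # port glusterd has recorded for each brick
>>>       echo "$f -> $(grep '^listen-port=' "$f")"
>>>   done
>>>   # port the brick process is really listening on
>>>   ss -ltnp | grep glusterfsd
>>>   # If the recorded port differs from the real one, fix listen-port= in the
>>>   # brick file and restart glusterd on that node.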
>>>
>>> Regards,
>>>
>>> On Sat, Dec 2, 2017 at 5:23 PM Atin Mukherjee <amukherj at redhat.com>
>>> wrote:
>>>
>>>> On Sat, 2 Dec 2017 at 19:29, Jo Goossens <jo.goossens at hosted-power.com>
>>>> wrote:
>>>>
>>>>> Hello Atin,
>>>>>
>>>>> Could you confirm this should have been fixed in 3.10.8? If so we'll
>>>>> test it for sure!
>>>>>
>>>>
>>>> The fix should be part of 3.10.8, which is awaiting its release announcement.
>>>>
>>>>
>>>>>
>>>>> Regards
>>>>>
>>>>> Jo
>>>>>
>>>>>
>>>>> -----Original message-----
>>>>> From: Atin Mukherjee <amukherj at redhat.com>
>>>>> Sent: Mon 30-10-2017 17:40
>>>>> Subject: Re: [Gluster-users] BUG: After stop and start wrong port is advertised
>>>>> To: Jo Goossens <jo.goossens at hosted-power.com>
>>>>> CC: gluster-users at gluster.org
>>>>>
>>>>> On Sat, 28 Oct 2017 at 02:36, Jo Goossens <
>>>>> jo.goossens at hosted-power.com> wrote:
>>>>>
>>>>> Hello Atin,
>>>>>
>>>>> I just read it and I'm very happy you found the issue. We really hope this
>>>>> will be fixed in the upcoming 3.10.7 release!
>>>>>
>>>>>
>>>>> 3.10.7 - no, I'm afraid not, as the patch is still in review and 3.10.7 is
>>>>> getting tagged today. You'll get this fix in 3.10.8.
>>>>>
>>>>> PS: Wow, nice to see all that C code and those "goto out" statements (not
>>>>> always considered clean, but often the best way, I think). I can remember
>>>>> the days when I wrote kernel drivers in C myself :)
>>>>>
>>>>> Regards
>>>>>
>>>>> Jo Goossens
>>>>>
>>>>> -----Original message-----
>>>>> From: Atin Mukherjee <amukherj at redhat.com>
>>>>> Sent: Fri 27-10-2017 21:01
>>>>> Subject: Re: [Gluster-users] BUG: After stop and start wrong port is advertised
>>>>> To: Jo Goossens <jo.goossens at hosted-power.com>
>>>>> CC: gluster-users at gluster.org
>>>>>
>>>>> We (finally) figured out the root cause, Jo!
>>>>>
>>>>> Patch https://review.gluster.org/#/c/18579 posted upstream for review.
>>>>>
>>>>> On Thu, Sep 21, 2017 at 2:08 PM, Jo Goossens <
>>>>> jo.goossens at hosted-power.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We use glusterfs 3.10.5 on Debian 9.
>>>>>
>>>>>
>>>>>
>>>>> When we stop or restart the service, e.g.: service glusterfs-server
>>>>> restart
>>>>>
>>>>>
>>>>>
>>>>> We see that the wrong port gets advertised afterwards. For example:
>>>>>
>>>>>
>>>>>
>>>>> Before restart:
>>>>>
>>>>>
>>>>> Status of volume: public
>>>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>>>> ------------------------------------------------------------------------------
>>>>> Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
>>>>> Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
>>>>> Brick 192.168.140.43:/gluster/public        49152     0          Y       5913
>>>>> Self-heal Daemon on localhost               N/A       N/A        Y       5932
>>>>> Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       13084
>>>>> Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       15499
>>>>>
>>>>> Task Status of Volume public
>>>>> ------------------------------------------------------------------------------
>>>>> There are no active volume tasks
>>>>>
>>>>>
>>>>> After restart of the service on one of the nodes (192.168.140.43) the
>>>>> port seems to have changed (but it didn't):
>>>>>
>>>>> root@app3:/var/log/glusterfs# gluster volume status
>>>>> Status of volume: public
>>>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>>>> ------------------------------------------------------------------------------
>>>>> Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
>>>>> Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
>>>>> Brick 192.168.140.43:/gluster/public        49154     0          Y       5913
>>>>> Self-heal Daemon on localhost               N/A       N/A        Y       4628
>>>>> Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       3077
>>>>> Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       28777
>>>>>
>>>>> Task Status of Volume public
>>>>> ------------------------------------------------------------------------------
>>>>> There are no active volume tasks
>>>>>
>>>>>
>>>>> However, the active process STILL has the same pid AND is still listening
>>>>> on the old port:
>>>>>
>>>>> root@192.168.140.43:/var/log/glusterfs# netstat -tapn | grep gluster
>>>>> tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      5913/glusterfsd
>>>>>
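>>>>> (A quick way to see the two side by side, just a sketch using standard
>>>>> commands, is something like:)
>>>>>
>>>>>   # port glusterd advertises for this brick
>>>>>   gluster volume status public 192.168.140.43:/gluster/public | grep Brick
>>>>>   # port the brick process is actually bound to
>>>>>   netstat -tlpn | grep glusterfsd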
>>>>>
>>>>> The other nodes' logs fill up with errors because they can't reach the
>>>>> daemon anymore. They try to reach it on the "new" port instead of the old
>>>>> one:
>>>>>
>>>>> [2017-09-21 08:33:25.225006] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>>>>> [2017-09-21 08:33:29.226633] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
>>>>> [2017-09-21 08:33:29.227490] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>>>>> [2017-09-21 08:33:33.225849] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
>>>>> [2017-09-21 08:33:33.236395] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>>>>> [2017-09-21 08:33:37.225095] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
>>>>> [2017-09-21 08:33:37.225628] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>>>>> [2017-09-21 08:33:41.225805] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
>>>>> [2017-09-21 08:33:41.226440] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
>>>>>
>>>>> So they now try port 49154 instead of the old 49152.
>>>>>
>>>>> Is this also by design? We had a lot of issues because of this
>>>>> recently. We don't understand why it starts advertising a completely wrong
>>>>> port after stop/start.
>>>>>
>>>>> Regards
>>>>>
>>>>> Jo Goossens
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>
>>>>> --
>>>>> - Atin (atinm)
>>>>>
>>>>> --
>>>> - Atin (atinm)
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
>>>
>>> --
>>>
>>> Alan Orth
>>> alan.orth at gmail.com
>>> https://picturingjordan.com
>>> https://englishbulgaria.net
>>> https://mjanja.ch
>>>
>>
>>
>
> --
>
> Alan Orth
> alan.orth at gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
>