[Gluster-users] Error "Failed to find host nfs1.lightspeed.ca" when adding a new node to the cluster.
Atin Mukherjee
atin.mukherjee83 at gmail.com
Sat Apr 9 04:27:22 UTC 2016
-Atin
Sent from one plus one
On 08-Apr-2016 10:40 pm, "Ernie Dunbar" <maillist at lightspeed.ca> wrote:
>
> On 2016-04-07 09:16, Atin Mukherjee wrote:
>>
>> -Atin
>> Sent from one plus one
>> On 07-Apr-2016 9:32 pm, "Ernie Dunbar" <maillist at lightspeed.ca> wrote:
>>>
>>>
>>> On 2016-04-06 21:20, Atin Mukherjee wrote:
>>>>
>>>>
>>>> On 04/07/2016 04:04 AM, Ernie Dunbar wrote:
>>>>>
>>>>>
>>>>> On 2016-04-06 11:42, Ernie Dunbar wrote:
>>>>>>
>>>>>>
>>>>>> I've already successfully created a Gluster cluster, but when I try to
>>>>>> add a new node, gluster on the new node claims it can't find the
>>>>>> hostname of the first node in the cluster.
>>>>>>
>>>>>> I've added the hostname nfs1.lightspeed.ca to /etc/hosts like this:
>>>>>>
>>>>>> root at nfs3:/home/ernied# cat /etc/hosts
>>>>>> 127.0.0.1 localhost
>>>>>> 192.168.1.31 nfs1.lightspeed.ca nfs1
>>>>>> 192.168.1.32 nfs2.lightspeed.ca nfs2
>>>>>> 127.0.1.1 nfs3.lightspeed.ca nfs3
>>>>>>
>>>>>> # The following lines are desirable for IPv6 capable hosts
>>>>>> ::1 localhost ip6-localhost ip6-loopback
>>>>>> ff02::1 ip6-allnodes
>>>>>> ff02::2 ip6-allrouters
>>>>>>
>>>>>> I can ping the hostname:
>>>>>>
>>>>>> root at nfs3:/home/ernied# ping -c 3 nfs1
>>>>>> PING nfs1.lightspeed.ca (192.168.1.31) 56(84) bytes of data.
>>>>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=1 ttl=64 time=0.148 ms
>>>>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=2 ttl=64 time=0.126 ms
>>>>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=3 ttl=64 time=0.133 ms
>>>>>>
>>>>>> --- nfs1.lightspeed.ca ping statistics ---
>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 1998ms
>>>>>> rtt min/avg/max/mdev = 0.126/0.135/0.148/0.016 ms
>>>>>>
>>>>>> I can get gluster to probe the hostname:
>>>>>>
>>>>>> root at nfs3:/home/ernied# gluster peer probe nfs1
>>>>>> peer probe: success. Host nfs1 port 24007 already in peer list
>>>>>>
>>>>>> But if I try to create the brick on the new node, it says that the
>>>>>> host can't be found? Um...
>>>>>>
>>>>>> root at nfs3:/home/ernied# gluster volume create gv2 replica 3
>>>>>> nfs1.lightspeed.ca:/brick1/gv2/ nfs2.lightspeed.ca:/brick1/gv2/
>>>>>> nfs3.lightspeed.ca:/brick1/gv2
>>>>>> volume create: gv2: failed: Failed to find host nfs1.lightspeed.ca
>>>>>>
>>>>>>
>>>>>> Our logs from /var/log/glusterfs/etc-glusterfs-glusterd.vol.log:
>>>>>>
>>>>>> [2016-04-06 18:19:18.107459] E [MSGID: 106452]
>>>>>> [glusterd-utils.c:5825:glusterd_new_brick_validate] 0-management:
>>>>>> Failed to find host nfs1.lightspeed.ca
>>>>>>
>>>>>> [2016-04-06 18:19:18.107496] E [MSGID: 106536]
>>>>>> [glusterd-volume-ops.c:1364:glusterd_op_stage_create_volume]
>>>>>> 0-management: Failed to find host nfs1.lightspeed.ca
>>>>>>
>>>>>> [2016-04-06 18:19:18.107516] E [MSGID: 106301]
>>>>>> [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of
>>>>>> operation 'Volume Create' failed on localhost : Failed to find host
>>>>>> nfs1.lightspeed.ca
>>>>>>
>>>>>> [2016-04-06 18:19:18.231864] E [MSGID: 106170]
>>>>>> [glusterd-handshake.c:1051:gd_validate_mgmt_hndsk_req] 0-management:
>>>>>> Request from peer 192.168.1.31:65530 has an entry in peerinfo, but
>>>>>> uuid does not match
>>>>
>>>>
>>>> We have introduced a new check to reject a peer if the request is coming
>>>> from a node where the hostname matches but UUID is different. This can
>>>> happen if a node goes through a re-installation and its
>>>> /var/lib/glusterd/* content is wiped off. Look at [1] for more details.
>>>>
>>>> [1] http://review.gluster.org/13519
>>>>
>>>> Do confirm if that's the case.
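One way to confirm it is to compare the UUIDs directly, assuming the default
/var/lib/glusterd layout (a rough sketch; the peer file name below is a
placeholder for whichever UUID-named file mentions nfs3):

# On nfs3: the UUID this glusterd instance currently identifies itself with
cat /var/lib/glusterd/glusterd.info

# On nfs1 or nfs2: the UUID recorded for nfs3 when it was first probed
# (grep for its IP instead if it was probed by address)
grep -l nfs3 /var/lib/glusterd/peers/*
cat /var/lib/glusterd/peers/<uuid-named-file-matching-nfs3>

# If the uuid= line in that peer file differs from the UUID on nfs3,
# this is exactly the rejected-handshake case described above.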
>>>
>>>
>>>
>>>
>>> I couldn't say if that's *exactly* the case, but it's pretty close.
>>> I don't recall ever removing /var/lib/glusterd/* or any of its
>>> contents, but the operating system isn't exactly the way it was when I
>>> first tried to add this node to the cluster.
>>>
>>> What should I do to *fix* the problem though, so I can add this node
>>> to the cluster? This bug report doesn't appear to provide a solution.
>>> I've tried removing the node from the cluster, and that failed too.
>>> Things seem to be in a very screwy state right now.
>>
>> I should have given the workaround earlier. Find the peer file for
>> the faulty node in /var/lib/glusterd/peers/ and delete it from all
>> the nodes except the faulty node, then restart the glusterd instance
>> on all of those nodes. On the faulty node, ensure the /var/lib/glusterd/
>> content is empty, restart glusterd, and then peer probe this node from
>> any of the nodes in the existing cluster. You should also bump up the
>> op-version once the cluster is stable.
>>
>
> This mostly solved the problem, but it seems you were missing one step:
>
> # gluster peer detach <wonky node>
Not really. If you had cleared the peer file from the backend on all of the
nodes, then after restarting the glusterd instances the cluster shouldn't
have detected this faulty node.
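For reference, the whole workaround as shell commands, assuming nfs3 is the
faulty node and a systemd-managed service (named glusterd or glusterfs-server
depending on the distribution); treat this as a sketch, not a script:

# On every node except nfs3: drop the stale peer entry for nfs3 and restart
grep -l nfs3 /var/lib/glusterd/peers/*   # identify the UUID-named peer file
rm /var/lib/glusterd/peers/<stale-nfs3-peer-file>
systemctl restart glusterfs-server

# On nfs3 only: wipe the glusterd state and restart
systemctl stop glusterfs-server
rm -rf /var/lib/glusterd/*
systemctl start glusterfs-server

# From any existing member of the cluster: probe nfs3 back in
gluster peer probe nfs3
gluster peer status

# Once the cluster is stable, bump the op-version
gluster volume set all cluster.op-version 30710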
>
> After probing the new node again, I was able to add it to the cluster.
> Without doing this step, attempting to add the new node to the cluster
> just resulted in this error message:
>
> volume create: gv0: failed: Host 192.168.1.33 is not in 'Peer in Cluster'
> state
>
>
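As a quick sanity check before any volume create or add-brick, every peer
should report 'Peer in Cluster (Connected)', for example (output trimmed):

gluster peer status

# Hostname: nfs3.lightspeed.ca
# Uuid: <peer-uuid>
# State: Peer in Cluster (Connected)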
>>>
>>>
>>>>
>>>>>> [2016-04-06 18:19:18.231919] E [MSGID: 106170]
>>>>>> [glusterd-handshake.c:1060:gd_validate_mgmt_hndsk_req] 0-management:
>>>>>> Rejecting management handshake request from unknown peer
>>>>>> 192.168.1.31:65530
>>>>>>
>>>>>>
>>>>>> That error about the entry in peerinfo doesn't match anything in
>>>>>> Google besides the source code for Gluster. My guess is that my
>>>>>> earlier unsuccessful attempts to add this node before v3.7.10 have
>>>>>> created a conflict that needs to be cleared.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> More interesting is what happens when I try to add the third server to
>>>>> the brick from the first gluster server:
>>>>>
>>>>> root at nfs1:/home/ernied# gluster volume add-brick gv2 replica 3
>>>>> nfs3:/brick1/gv2
>>>>> volume add-brick: failed: One or more nodes do not support the required
>>>>> op-version. Cluster op-version must atleast be 30600.
>>>>>
>>>>> Yet, when I view the operating version in /var/lib/glusterd/glusterd.info:
>>>>>
>>>>> root at nfs1:/home/ernied# cat /var/lib/glusterd/glusterd.info
>>>>> UUID=1207917a-23bc-4bae-8238-cd691b7082c7
>>>>> operating-version=30501
>>>>>
>>>>> root at nfs2:/home/ernied# cat /var/lib/glusterd/glusterd.info
>>>>> UUID=e394fcec-41da-482a-9b30-089f717c5c06
>>>>> operating-version=30501
>>>>>
>>>>> root at nfs3:/home/ernied# cat /var/lib/glusterd/glusterd.info
>>>>> UUID=ae191e96-9cd6-4e2b-acae-18f2cc45e6ed
>>>>> operating-version=30501
>>>>>
>>>>> I see that the operating version is the same on all nodes!
>>>>
>>>>
>>>> Here the cluster op-version is pretty old. You need to make sure that you
>>>> bump up the op-version with 'gluster volume set all cluster.op-version
>>>> 30710'. The add-brick code path has a check that your cluster op-version
>>>> has to be at least 30600 if you are on gluster version >= 3.6, which is
>>>> the case here.
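For example, a minimal sketch of bumping it and confirming the change (the
same glusterd.info file shown above should then carry the new value on every
node):

# Run once, from any node in the cluster
gluster volume set all cluster.op-version 30710

# Confirm on each node
cat /var/lib/glusterd/glusterd.info
# operating-version=30710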