[Gluster-users] Error "Failed to find host nfs1.lightspeed.ca" when adding a new node to the cluster.

Ernie Dunbar maillist at lightspeed.ca
Fri Apr 8 17:10:29 UTC 2016


On 2016-04-07 09:16, Atin Mukherjee wrote:
> -Atin
> Sent from one plus one
> On 07-Apr-2016 9:32 pm, "Ernie Dunbar" <maillist at lightspeed.ca> wrote:
>> 
>> On 2016-04-06 21:20, Atin Mukherjee wrote:
>>> 
>>> On 04/07/2016 04:04 AM, Ernie Dunbar wrote:
>>>> 
>>>> On 2016-04-06 11:42, Ernie Dunbar wrote:
>>>>> 
>>>>> I've already successfully created a Gluster cluster, but when I try to
>>>>> add a new node, gluster on the new node claims it can't find the
>>>>> hostname of the first node in the cluster.
>>>>> 
>>>>> I've added the hostname nfs1.lightspeed.ca to /etc/hosts like this:
>>>>> 
>>>>> root@nfs3:/home/ernied# cat /etc/hosts
>>>>> 127.0.0.1    localhost
>>>>> 192.168.1.31    nfs1.lightspeed.ca      nfs1
>>>>> 192.168.1.32    nfs2.lightspeed.ca      nfs2
>>>>> 127.0.1.1    nfs3.lightspeed.ca    nfs3
>>>>> 
>>>>> 
>>>>> # The following lines are desirable for IPv6 capable hosts
>>>>> ::1     localhost ip6-localhost ip6-loopback
>>>>> ff02::1 ip6-allnodes
>>>>> ff02::2 ip6-allrouters
>>>>> 
>>>>> I can ping the hostname:
>>>>> 
>>>>> root@nfs3:/home/ernied# ping -c 3 nfs1
>>>>> PING nfs1.lightspeed.ca (192.168.1.31) 56(84) bytes of data.
>>>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=1 ttl=64 time=0.148 ms
>>>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=2 ttl=64 time=0.126 ms
>>>>> 64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=3 ttl=64 time=0.133 ms
>>>>> 
>>>>> --- nfs1.lightspeed.ca ping statistics ---
>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 1998ms
>>>>> rtt min/avg/max/mdev = 0.126/0.135/0.148/0.016 ms
>>>>> 
>>>>> I can get gluster to probe the hostname:
>>>>> 
>>>>> root@nfs3:/home/ernied# gluster peer probe nfs1
>>>>> peer probe: success. Host nfs1 port 24007 already in peer list
>>>>> 
>>>>> But if I try to create the brick on the new node, it says that the
>>>>> host can't be found? Um...
>>>>> 
>>>>> root@nfs3:/home/ernied# gluster volume create gv2 replica 3
>>>>> nfs1.lightspeed.ca:/brick1/gv2/ nfs2.lightspeed.ca:/brick1/gv2/
>>>>> nfs3.lightspeed.ca:/brick1/gv2
>>>>> volume create: gv2: failed: Failed to find host nfs1.lightspeed.ca
>>>>> 
>>>>> Our logs from /var/log/glusterfs/etc-glusterfs-glusterd.vol.log:
>>>>> 
>>>>> [2016-04-06 18:19:18.107459] E [MSGID: 106452]
>>>>> [glusterd-utils.c:5825:glusterd_new_brick_validate] 0-management:
>>>>> Failed to find host nfs1.lightspeed.ca [1]
>>>>> [2016-04-06 18:19:18.107496] E [MSGID: 106536]
>>>>> [glusterd-volume-ops.c:1364:glusterd_op_stage_create_volume]
>>>>> 0-management: Failed to find host nfs1.lightspeed.ca [1]
>>>>> [2016-04-06 18:19:18.107516] E [MSGID: 106301]
>>>>> [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging
> of
>>>>> operation 'Volume Create' failed on localhost : Failed to find
> host
>>>>> nfs1.lightspeed.ca [1]
>>>>> [2016-04-06 18:19:18.231864] E [MSGID: 106170]
>>>>> [glusterd-handshake.c:1051:gd_validate_mgmt_hndsk_req]
> 0-management:
>>>>> Request from peer 192.168.1.31:65530 [4] has an entry in
> peerinfo, but
>>>>> uuid does not match
>>> 
>>> We have introduced a new check to reject a peer if the request is coming
>>> from a node where the hostname matches but UUID is different. This can
>>> happen if a node goes through a re-installation and its
>>> /var/lib/glusterd/* content is wiped off. Look at [1] for more details.
>>> 
>>> [1] http://review.gluster.org/13519
>>> 
>>> Do confirm if that's the case.
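
(For anyone who finds this thread later: one way to confirm is to compare the
UUID the rejected node presents with what the rest of the cluster has on file
for it. The commands below are a sketch, assuming the stock /var/lib/glusterd
layout and our hostnames.)

# On the node being rejected (nfs3 here): the UUID it presents now
root@nfs3:/home/ernied# grep UUID /var/lib/glusterd/glusterd.info

# On an existing node: what the cluster has recorded for each peer
root@nfs1:/home/ernied# grep -H . /var/lib/glusterd/peers/*

If the uuid= stored in the peer file that names the rejected host differs from
that host's own glusterd.info UUID, this is exactly the check being tripped.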
>> 
>> I couldn't say if that's *exactly* the case, but it's pretty close.
>> I don't recall ever removing /var/lib/glusterd/* or any of its
>> contents, but the operating system isn't exactly the way it was when I
>> first tried to add this node to the cluster.
>> 
>> What should I do to *fix* the problem though, so I can add this node
>> to the cluster? This bug report doesn't appear to provide a solution.
>> I've tried removing the node from the cluster, and that failed too.
>> Things seem to be in a very screwy state right now.
> 
> I should have given the workaround earlier. Find the peer file for
> the faulty node in /var/lib/glusterd/peers/ and delete it from all
> the nodes except the faulty node, then restart the glusterd instance
> on all of those nodes. On the faulty node, ensure the /var/lib/glusterd/
> content is empty and restart glusterd, then peer probe that node from
> any of the nodes in the existing cluster. You should also bump up the
> op-version once the cluster is stable.
> 

This mostly solved the problem, but it seems you were missing one step:

# gluster peer detach <wonky node>

After detaching and probing the new node again, I was able to add it to
the cluster. Without this step, attempting to add the new node just
resulted in this error message:

volume create: gv0: failed: Host 192.168.1.33 is not in 'Peer in
Cluster' state
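
For the archives, the combined recipe (your workaround plus the detach)
looked roughly like the following on our boxes. The peer-file name, the
service name (glusterfs-server here; glusterd on RPM-based distros) and the
hostnames are specific to our setup, so treat this as a sketch rather than
gospel:

# On nfs1 and nfs2: find and remove the stale peer file for nfs3 (it is
# named after nfs3's old UUID), then restart glusterd
root@nfs1:/home/ernied# grep -H hostname1 /var/lib/glusterd/peers/*
root@nfs1:/home/ernied# rm /var/lib/glusterd/peers/<file for nfs3>
root@nfs1:/home/ernied# service glusterfs-server restart

# On nfs3: wipe the stale glusterd state so it comes back with a fresh UUID
root@nfs3:/home/ernied# rm -rf /var/lib/glusterd/*
root@nfs3:/home/ernied# service glusterfs-server restart

# The missing step: if nfs3 still shows up in anything other than
# 'Peer in Cluster' state, detach it before probing again
root@nfs1:/home/ernied# gluster peer status
root@nfs1:/home/ernied# gluster peer detach nfs3

# Probe again from an existing member, then retry the volume create
root@nfs1:/home/ernied# gluster peer probe nfs3
root@nfs1:/home/ernied# gluster volume create gv2 replica 3 \
      nfs1.lightspeed.ca:/brick1/gv2 nfs2.lightspeed.ca:/brick1/gv2 \
      nfs3.lightspeed.ca:/brick1/gv2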


>> 
>> 
>>> 
>>>>> [2016-04-06 18:19:18.231919] E [MSGID: 106170]
>>>>> [glusterd-handshake.c:1060:gd_validate_mgmt_hndsk_req]
> 0-management:
>>>>> Rejecting management handshake request from unknown peer
>>>>> 192.168.1.31:65530 [4]
>>>>> 
>>>>> That error about the entry in peerinfo doesn't match anything in
>>>>> Google besides the source code for Gluster. My guess is that my
>>>>> earlier unsuccessful attempts to add this node before v3.7.10
> have
>>>>> created a conflict that needs to be cleared.
>>>> 
>>>> 
>>>> 
>>>> More interesting is what happens when I try to add the third server's
>>>> brick to the volume from the first gluster server:
>>>> 
>>>> root@nfs1:/home/ernied# gluster volume add-brick gv2 replica 3
>>>> nfs3:/brick1/gv2
>>>> volume add-brick: failed: One or more nodes do not support the required
>>>> op-version. Cluster op-version must atleast be 30600.
>>>> 
>>>> Yet, when I view the operating version in /var/lib/glusterd/glusterd.info:
>>>> 
>>>> root@nfs1:/home/ernied# cat /var/lib/glusterd/glusterd.info
>>>> UUID=1207917a-23bc-4bae-8238-cd691b7082c7
>>>> operating-version=30501
>>>> 
>>>> root@nfs2:/home/ernied# cat /var/lib/glusterd/glusterd.info
>>>> UUID=e394fcec-41da-482a-9b30-089f717c5c06
>>>> operating-version=30501
>>>> 
>>>> root@nfs3:/home/ernied# cat /var/lib/glusterd/glusterd.info
>>>> UUID=ae191e96-9cd6-4e2b-acae-18f2cc45e6ed
>>>> operating-version=30501
>>>> 
>>>> I see that the operating version is the same on all nodes!
>>> 
>>> Here the cluster op-version is pretty old. You need to make sure that you
>>> bump up the op-version with 'gluster volume set all cluster.op-version
>>> 30710'. The add-brick code path has a check that your cluster op-version
>>> has to be at least 30600 if you are on gluster version >= 3.6, which is
>>> the case here.
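
(Following up for the archive: the bump is a single command run on any one
node. 30710 matches the 3.7.10 packages we're running; pick the value that
corresponds to your installed version.)

root@nfs1:/home/ernied# gluster volume set all cluster.op-version 30710
# afterwards, every node's glusterd.info should show the new value:
root@nfs1:/home/ernied# grep operating-version /var/lib/glusterd/glusterd.info
operating-version=30710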

