[Gluster-users] Gluster 3.7.6 add new node state Peer Rejected (Connected)

Mon Feb 29 19:28:24 UTC 2016

I set quota-version=1 on the two new nodes, and they were able to join the
cluster. I also rebooted the two new nodes and everything came up correctly.

Then I triggered a rebalance fix-layout, and glusterd crashed on one of the
original cluster members (node gluster03). I restarted glusterd and the node
reconnected, but after a few minutes I'm left with:

# gluster peer status
Number of Peers: 5

Hostname: 10.0.231.51
Uuid: b01de59a-4428-486b-af49-cb486ab44a07
State: Peer in Cluster (Connected)

Hostname: 10.0.231.52
Uuid: 75143760-52a3-4583-82bb-a9920b283dac
*State: Peer Rejected (Connected)*

Hostname: 10.0.231.53
Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
State: Peer in Cluster (Connected)

Hostname: 10.0.231.54
Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
State: Peer in Cluster (Connected)

Hostname: 10.0.231.55
Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3
State: Peer in Cluster (Connected)
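
For anyone who wants to script this check, the peer status text above can be scanned for rejected peers. This is only a minimal Python sketch, assuming the output keeps the "Key: value" layout shown above; parse_peer_status and rejected_peers are hypothetical helper names, not part of Gluster.

```python
def parse_peer_status(text):
    """Split `gluster peer status` text into one dict per peer block."""
    peers = []
    current = {}
    for raw in text.splitlines():
        # The list archive wraps emphasized lines in '*'; drop those markers.
        line = raw.strip().strip("*")
        if not line:
            # A blank line closes the current block.
            if current:
                peers.append(current)
                current = {}
            continue
        if ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        peers.append(current)
    # The "Number of Peers" header forms a block without a Hostname; skip it.
    return [p for p in peers if "Hostname" in p]


def rejected_peers(text):
    """Return the hostnames whose State starts with 'Peer Rejected'."""
    return [
        p["Hostname"]
        for p in parse_peer_status(text)
        if p.get("State", "").startswith("Peer Rejected")
    ]
```

The input text could be captured from the command itself or pasted from a terminal; for the output above, the only rejected peer is 10.0.231.52.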

I see in the logs (attached) there is now a cksum error:

[2016-02-29 19:16:42.082256] E [MSGID: 106010]
[glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management:
Version of Cksums storage differ. local cksum = 50348222, remote cksum =
50348735 on peer 10.0.231.55
[2016-02-29 19:16:42.082298] I [MSGID: 106493]
[glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd:
Responded to 10.0.231.55 (0), ret: 0
[2016-02-29 19:16:42.092535] I [MSGID: 106493]
[glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT
from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411, host: 10.0.231.53, port: 0
[2016-02-29 19:16:42.096036] I [MSGID: 106143]
[glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick
/mnt/lv-export-domain-storage/export-domain-storage on port 49153
[2016-02-29 19:16:42.097296] I [MSGID: 106143]
[glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick
/mnt/lv-vm-storage/vm-storage on port 49155
[2016-02-29 19:16:42.100727] I [MSGID: 106163]
[glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack]
0-management: using the op-version 30700
[2016-02-29 19:16:42.108495] I [MSGID: 106490]
[glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd:
Received probe from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
[2016-02-29 19:16:42.109295] E [MSGID: 106010]
[glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management:
Version of Cksums storage differ. local cksum = 50348222, remote cksum =
50348735 on peer 10.0.231.53
[2016-02-29 19:16:42.109338] I [MSGID: 106493]
[glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd:
Responded to 10.0.231.53 (0), ret: 0
[2016-02-29 19:16:42.119521] I [MSGID: 106143]
[glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick
/mnt/lv-env-modules/env-modules on port 49157
[2016-02-29 19:16:42.122856] I [MSGID: 106143]
[glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick
/mnt/raid6-storage/storage on port 49156
[2016-02-29 19:16:42.508104] I [MSGID: 106493]
[glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT
from uuid: b01de59a-4428-486b-af49-cb486ab44a07, host: 10.0.231.51, port: 0
[2016-02-29 19:16:42.519403] I [MSGID: 106163]
[glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack]
0-management: using the op-version 30700
[2016-02-29 19:16:42.524353] I [MSGID: 106490]
[glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd:
Received probe from uuid: b01de59a-4428-486b-af49-cb486ab44a07
[2016-02-29 19:16:42.524999] E [MSGID: 106010]
[glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management:
Version of Cksums storage differ. local cksum = 50348222, remote cksum =
50348735 on peer 10.0.231.51
[2016-02-29 19:16:42.525038] I [MSGID: 106493]
[glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd:
Responded to 10.0.231.51 (0), ret: 0
[2016-02-29 19:16:42.592523] I [MSGID: 106493]
[glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT
from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c, host: 10.0.231.54, port: 0
[2016-02-29 19:16:42.599518] I [MSGID: 106163]
[glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack]
0-management: using the op-version 30700
[2016-02-29 19:16:42.604821] I [MSGID: 106490]
[glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd:
Received probe from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
[2016-02-29 19:16:42.605458] E [MSGID: 106010]
[glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management:
Version of Cksums storage differ. local cksum = 50348222, remote cksum =
50348735 on peer 10.0.231.54
[2016-02-29 19:16:42.605492] I [MSGID: 106493]
[glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd:
Responded to 10.0.231.54 (0), ret: 0
[2016-02-29 19:16:42.621943] I [MSGID: 106163]
[glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack]
0-management: using the op-version 30700
[2016-02-29 19:16:42.628443] I [MSGID: 106490]
[glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd:
Received probe from uuid: a965e782-39e2-41cc-a0d1-b32ecccdcd2f
[2016-02-29 19:16:42.629079] E [MSGID: 106010]
[glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management:
Version of Cksums storage differ. local cksum = 50348222, remote cksum =
50348735 on peer 10.0.231.50
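
To summarize the checksum errors in a log like the one above, the mismatch lines can be pulled out with a regular expression. This is a rough Python sketch, assuming the message wording stays exactly as in this excerpt and that the word after "Cksums" (here "storage") is the volume name; cksum_mismatches is a hypothetical helper name.

```python
import re

# Matches the MSGID 106010 message text as it appears in the excerpt above.
CKSUM_RE = re.compile(
    r"Version of Cksums (\S+) differ\. "
    r"local cksum = (\d+), remote cksum = (\d+) on peer (\S+)"
)


def cksum_mismatches(log_text):
    """Return (volume, peer, local_cksum, remote_cksum) for each mismatch."""
    # Log entries are wrapped across lines here, so collapse all whitespace
    # into single spaces before matching.
    flat = " ".join(log_text.split())
    return [
        (volume, peer, int(local), int(remote))
        for volume, local, remote, peer in CKSUM_RE.findall(flat)
    ]
```

Applied to the excerpt above, every entry carries the same local value (50348222) and the same remote value (50348735), which points at the local node as the odd one out.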

On gluster01/02/04/05
/var/lib/glusterd/vols/storage/cksum info=998305000

On gluster03
/var/lib/glusterd/vols/storage/cksum info=998305001

How do I recover from this? Can I just stop glusterd on gluster03 and
change the cksum value?
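
Before touching the cksum file by hand, it may be worth checking whether the info file on gluster03 actually differs from the one on the other nodes, as was suggested earlier in the thread. This is a small Python sketch, assuming the key=value layout of the info files quoted below; parse_info and diff_info are hypothetical helper names. It compares keys regardless of their order in the file, so only real value differences (such as quota-version) are reported.

```python
def parse_info(text):
    """Parse key=value lines from a volume info file into a dict."""
    result = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key] = value
    return result


def diff_info(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ.

    A key missing on one side is reported with None for that side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

An empty result means the two files hold the same settings, even if the lines are ordered differently.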

On Thu, Feb 25, 2016 at 12:49 PM, Mohammed Rafi K C <rkavunga at redhat.com>
wrote:

>
>
> On 02/26/2016 01:53 AM, Mohammed Rafi K C wrote:
>
>
>
> On 02/26/2016 01:32 AM, Steve Dainard wrote:
>
> I haven't done anything more than peer thus far, so I'm a bit confused as
> to how the volume info fits in, can you expand on this a bit?
>
> Failed commits? Is this split brain on the replica volumes? I don't get
> any return from 'gluster volume heal <volname> info' on all the replica
> volumes, but if I try a gluster volume heal <volname> full I get:
> 'Launching heal operation to perform full self heal on volume <volname> has
> been unsuccessful'.
>
>
> Forget about this; it is not for metadata self-heal.
>
>
> I have 5 volumes total.
>
> 'Replica 3' volumes running on gluster01/02/03:
> vm-storage
> iso-storage
> export-domain-storage
> env-modules
>
> And one distributed only volume 'storage' info shown below:
>
> *From existing host gluster01/02:*
> type=0
> count=4
> status=1
> sub_count=0
> stripe_count=1
> replica_count=1
> disperse_count=0
> redundancy_count=0
> version=25
> transport-type=0
> volume-id=26d355cb-c486-481f-ac16-e25390e73775
> username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
> password=
> op-version=3
> client-op-version=3
> quota-version=1
> parent_volname=N/A
> restored_from_snap=00000000-0000-0000-0000-000000000000
> snap-max-hard-limit=256
> features.quota-deem-statfs=on
> features.inode-quota=on
> diagnostics.brick-log-level=WARNING
> features.quota=on
> performance.readdir-ahead=on
> performance.cache-size=1GB
> performance.stat-prefetch=on
> brick-0=10.0.231.50:-mnt-raid6-storage-storage
> brick-1=10.0.231.51:-mnt-raid6-storage-storage
> brick-2=10.0.231.52:-mnt-raid6-storage-storage
> brick-3=10.0.231.53:-mnt-raid6-storage-storage
>
> *From existing host gluster03/04:*
> type=0
> count=4
> status=1
> sub_count=0
> stripe_count=1
> replica_count=1
> disperse_count=0
> redundancy_count=0
> version=25
> transport-type=0
> volume-id=26d355cb-c486-481f-ac16-e25390e73775
> username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
> password=
> op-version=3
> client-op-version=3
> quota-version=1
> parent_volname=N/A
> restored_from_snap=00000000-0000-0000-0000-000000000000
> snap-max-hard-limit=256
> features.quota-deem-statfs=on
> features.inode-quota=on
> performance.stat-prefetch=on
> performance.cache-size=1GB
> performance.readdir-ahead=on
> features.quota=on
> diagnostics.brick-log-level=WARNING
> brick-0=10.0.231.50:-mnt-raid6-storage-storage
> brick-1=10.0.231.51:-mnt-raid6-storage-storage
> brick-2=10.0.231.52:-mnt-raid6-storage-storage
> brick-3=10.0.231.53:-mnt-raid6-storage-storage
>
> So far between gluster01/02 and gluster03/04 the configs are the same,
> although the ordering is different for some of the features.
>
> On gluster05/06 the ordering is different again, and the quota-version=0
> instead of 1.
>
>
> This is why the peer shows as rejected. Can you check the op-version of
> every glusterd, including the one that is in the rejected state? You can
> find the op-version in /var/lib/glusterd/glusterd.info
>
>
> If all the op-versions are the same and 3.7.6, then to work around the
> issue you can manually set quota-version=1; restarting glusterd will then
> solve the problem. But I would strongly recommend that you figure out the
> RCA. Maybe you can file a bug for this.
>
> Rafi
>
>
>
> Rafi KC
>
>
> *From new hosts gluster05/gluster06:*
> type=0
> count=4
> status=1
> sub_count=0
> stripe_count=1
> replica_count=1
> disperse_count=0
> redundancy_count=0
> version=25
> transport-type=0
> volume-id=26d355cb-c486-481f-ac16-e25390e73775
> username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
> password=
> op-version=3
> client-op-version=3
> quota-version=0
> parent_volname=N/A
> restored_from_snap=00000000-0000-0000-0000-000000000000
> snap-max-hard-limit=256
> performance.stat-prefetch=on
> performance.cache-size=1GB
> performance.readdir-ahead=on
> features.quota=on
> diagnostics.brick-log-level=WARNING
> features.inode-quota=on
> features.quota-deem-statfs=on
> brick-0=10.0.231.50:-mnt-raid6-storage-storage
> brick-1=10.0.231.51:-mnt-raid6-storage-storage
> brick-2=10.0.231.52:-mnt-raid6-storage-storage
> brick-3=10.0.231.53:-mnt-raid6-storage-storage
>
> Also, I forgot to mention that when I initially peer'd the two new hosts,
> glusterd crashed on gluster03 and had to be restarted (log attached) but
> has been fine since.
>
> Thanks,
> Steve
>
> On Thu, Feb 25, 2016 at 11:27 AM, Mohammed Rafi K C <rkavunga at redhat.com>
> wrote:
>
>>
>>
>> On 02/25/2016 11:45 PM, Steve Dainard wrote:
>>
>> Hello,
>>
>> I upgraded from 3.6.6 to 3.7.6 a couple weeks ago. I just peered 2 new
>> nodes to a 4 node cluster and gluster peer status is:
>>
>> # gluster peer status *<-- from node gluster01*
>> Number of Peers: 5
>>
>> Hostname: 10.0.231.51
>> Uuid: b01de59a-4428-486b-af49-cb486ab44a07
>> State: Peer in Cluster (Connected)
>>
>> Hostname: 10.0.231.52
>> Uuid: 75143760-52a3-4583-82bb-a9920b283dac
>> State: Peer in Cluster (Connected)
>>
>> Hostname: 10.0.231.53
>> Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411
>> State: Peer in Cluster (Connected)
>>
>> Hostname: 10.0.231.54 *<-- new node gluster05*
>> Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c
>> *State: Peer Rejected (Connected)*
>>
>> Hostname: 10.0.231.55 *<-- new node gluster06*
>> Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3
>> *State: Peer Rejected (Connected)*
>>
>>
>> Looks like your configuration files are mismatched, i.e. the checksum
>> calculated on these two nodes differs from the one on the others.
>>
>> Did you have any failed commits?
>>
>> Compare the /var/lib/glusterd/vols/<volname>/info of the failed node
>> against a good one; most likely you will see some difference.
>>
>> Can you paste the /var/lib/glusterd/vols/<volname>/info?
>>
>> Regards
>> Rafi KC
>>
>>
>>
>> I followed the write-up here:
>> http://www.gluster.org/community/documentation/index.php/Resolving_Peer_Rejected
>> and the two new nodes peer'd properly but after a reboot of the two new
>> nodes I'm seeing the same Peer Rejected (Connected) State.
>>
>> I've attached logs from an existing node, and the two new nodes.
>>
>> Thanks for any suggestions,
>> Steve
-------------- next part --------------
A non-text attachment was scrubbed...
Name: etc-glusterfs-glusterd.vol.log.gluster03
Type: application/octet-stream
Size: 56421 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160229/3dc57fbd/attachment.obj>