[Gluster-users] "gluster peer status" messed up
Brian Candler
B.Candler at pobox.com
Mon Dec 3 13:44:47 UTC 2012
I have three machines, all Ubuntu 12.04 running gluster 3.3.1.
storage1 192.168.6.70 on 10G, 192.168.5.70 on 1G
storage2 192.168.6.71 on 10G, 192.168.5.71 on 1G
storage3 192.168.6.72 on 10G, 192.168.5.72 on 1G
Each machine has two NICs; on every host, /etc/hosts maps each machine's
hostname to its 10G address.
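In other words, every node's /etc/hosts carries entries along these lines,
pinning the hostnames to the 10G addresses:

192.168.6.70    storage1
192.168.6.71    storage2
192.168.6.72    storage3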
storage1 and storage3 were taken away for hardware changes, which included
swapping the boot disks, and they had the O/S reinstalled.
Somehow I have gotten into a state where "gluster peer status" is broken.
[on storage1]
# gluster peer status
(Just hangs here until I press ^C)
[on storage2]
# gluster peer status
Number of Peers: 2

Hostname: 192.168.6.70
Uuid: bf320f69-2713-4b57-9003-a721a8101bc6
State: Peer in Cluster (Connected)

Hostname: storage3
Uuid: 1b058f9f-c116-496f-8b50-fb581f9625f0
State: Peer Rejected (Connected)    << note "Rejected"
[on storage3]
# gluster peer status
Number of Peers: 2

Hostname: 192.168.6.70
Uuid: 698ee46d-ab8c-45f6-a6b6-7af998430a37
State: Peer in Cluster (Connected)

Hostname: storage2
Uuid: 2c0670f4-c3ba-46e0-92a8-108e71832b59
State: Peer Rejected (Connected)    << note "Rejected"
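In hindsight, tailing the glusterd log on storage1 while the CLI hung might
have shown why; on this Ubuntu install it lives at the path below:

tail -f /var/log/glusterfs/etc-glusterfs-glusterd.vol.log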
Poking around the filesystem a bit:
[on storage1]
root at storage1:~# cat /var/lib/glusterd/glusterd.info
UUID=bf320f69-2713-4b57-9003-a721a8101bc6
root at storage1:~# ls /var/lib/glusterd/peers/
2c0670f4-c3ba-46e0-92a8-108e71832b59
root at storage1:~# head /var/lib/glusterd/peers/*
uuid=2c0670f4-c3ba-46e0-92a8-108e71832b59
state=4
hostname1=storage2
[on storage2]
# cat /var/lib/glusterd/glusterd.info
UUID=2c0670f4-c3ba-46e0-92a8-108e71832b59
# head /var/lib/glusterd/peers/*
==> /var/lib/glusterd/peers/1b058f9f-c116-496f-8b50-fb581f9625f0 <==
uuid=1b058f9f-c116-496f-8b50-fb581f9625f0
state=6
hostname1=storage3

==> /var/lib/glusterd/peers/698ee46d-ab8c-45f6-a6b6-7af998430a37 <==
uuid=bf320f69-2713-4b57-9003-a721a8101bc6
state=3
hostname1=192.168.6.70
[on storage3]
# cat /var/lib/glusterd/glusterd.info
UUID=1b058f9f-c116-496f-8b50-fb581f9625f0
# head /var/lib/glusterd/peers/*
==> /var/lib/glusterd/peers/2c0670f4-c3ba-46e0-92a8-108e71832b59 <==
uuid=2c0670f4-c3ba-46e0-92a8-108e71832b59
state=6
hostname1=storage2

==> /var/lib/glusterd/peers/698ee46d-ab8c-45f6-a6b6-7af998430a37 <==
uuid=698ee46d-ab8c-45f6-a6b6-7af998430a37
state=3
hostname1=192.168.6.70
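Matching those state numbers against the "gluster peer status" output above,
state=3 appears to line up with "Peer in Cluster" and state=6 with "Peer
Rejected" (I can't place state=4); that's guesswork from comparing outputs,
not from documentation. To pull just those lines on each node:

grep -H '^state=' /var/lib/glusterd/peers/*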
Obvious problems:
- storage1 is known to its peers by IP address, not by hostname
- storage3 has the wrong UUID for storage1
- storage2 and storage3 are failing to peer, showing "Peer Rejected"
(whatever that means), even though clients are still accessing data on a
volume on storage2 and a volume on storage3; see the one-pass check below
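That one-pass check (a sketch only: it assumes passwordless root ssh between
the nodes, which may not apply in your setup):

for h in storage1 storage2 storage3; do
  echo "== $h =="
  ssh root@$h 'cat /var/lib/glusterd/glusterd.info; grep -H . /var/lib/glusterd/peers/*'
done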
On storage1, typing "gluster peer detach storage2" or "gluster peer detach
storage3" just hangs.
Detaching storage1 from the other side fails:
root at storage2:~# gluster peer detach storage1
One of the peers is probably down. Check with 'peer status'.
root at storage2:~# gluster peer detach 192.168.6.70
One of the peers is probably down. Check with 'peer status'.
Then I found something very suspicious on storage1:
root at storage1:~# tail /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
[2012-12-03 12:50:36.208029] I [glusterd-op-sm.c:2653:glusterd_op_txn_complete] 0-glusterd: Cleared local lock
[2012-12-03 12:51:05.023553] I [glusterd-handler.c:1168:glusterd_handle_sync_volume] 0-glusterd: Received volume sync req for volume all
[2012-12-03 12:51:05.023741] I [glusterd-utils.c:285:glusterd_lock] 0-glusterd: Cluster lock held by bf320f69-2713-4b57-9003-a721a8101bc6
[2012-12-03 12:51:05.023761] I [glusterd-handler.c:463:glusterd_op_txn_begin] 0-management: Acquired local lock
[2012-12-03 12:51:05.024176] I [glusterd-rpc-ops.c:548:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: 2c0670f4-c3ba-46e0-92a8-108e71832b59
[2012-12-03 12:51:05.024214] C [glusterd-op-sm.c:1946:glusterd_op_build_payload] 0-management: volname is not present in operation ctx
pending frames:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
root at storage1:~# ps auxwww | grep gluster
root 1584 0.0 0.1 230516 10668 ? Ssl 11:36 0:01 /usr/sbin/glusterd -p /var/run/glusterd.pid
root 6466 0.0 0.0 9392 920 pts/0 S+ 13:35 0:00 grep --color=auto gluster
Hmm... so as you can see, glusterd took a SEGV, yet the process was still
running. But after stopping and starting it, I was able to run "gluster
peer status" again:
root at storage1:~# service glusterfs-server stop
glusterfs-server stop/waiting
root at storage1:~# ps auxwww | grep gluster
root 6478 0.0 0.0 9388 920 pts/0 S+ 13:36 0:00 grep --color=auto gluster
root at storage1:~# service glusterfs-server start
glusterfs-server start/running, process 6485
root at storage1:~# gluster peer status
Number of Peers: 1
Hostname: storage2
Uuid: 2c0670f4-c3ba-46e0-92a8-108e71832b59
State: Peer in Cluster (Connected)
root at storage1:~#
But I still cannot detach from either side. From the storage1 side:
root at storage1:~# gluster peer status
Number of Peers: 1
Hostname: storage2
Uuid: 2c0670f4-c3ba-46e0-92a8-108e71832b59
State: Peer in Cluster (Connected)
root at storage1:~# gluster peer detach storage2
Brick(s) with the peer storage2 exist in cluster
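Presumably that means some volume still has bricks on storage2. A quick
filter over "gluster volume info" shows which volumes reference which bricks:

gluster volume info | grep -E '^(Volume Name|Brick[0-9]+):'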
From the storage2 side:
root at storage2:~# gluster peer status
Number of Peers: 2

Hostname: 192.168.6.70
Uuid: bf320f69-2713-4b57-9003-a721a8101bc6
State: Peer in Cluster (Connected)

Hostname: storage3
Uuid: 1b058f9f-c116-496f-8b50-fb581f9625f0
State: Peer Rejected (Connected)
root at storage2:~# gluster peer detach 192.168.6.70
One of the peers is probably down. Check with 'peer status'.
root at storage2:~# gluster peer detach storage1
One of the peers is probably down. Check with 'peer status'.
So this all looks broken, and as I can't find any gluster documentation
saying what these various states mean, I'm not sure how to proceed. Any
suggestions?
Note: I have no replicated volumes, only distributed ones.
Thanks,
Brian.