[Gluster-users] Replacing a downed brick
Reinis Rozitis
r at roze.lv
Mon Jul 25 18:35:53 UTC 2011
Hello,
while playing around with the new elastic glusterfs system (via 'glusterd'; previously I have been using glusterfs with a static
configuration) I have stumbled upon the following problem:
1. I have a test system with 12 nodes in a distributed-replicated setup (replica count 3):
Volume Name: storage
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
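For reference, a volume of this shape would have been created roughly like this (the hostnames and brick paths below are made up,
not my actual ones; each consecutive group of 3 bricks forms one replica set):

gluster volume create storage replica 3 transport tcp \
  10.0.0.141:/data/brick 10.0.0.142:/data/brick 10.0.0.143:/data/brick \
  10.0.0.144:/data/brick 10.0.0.145:/data/brick 10.0.0.146:/data/brick \
  10.0.0.147:/data/brick 10.0.0.148:/data/brick 10.0.0.149:/data/brick \
  10.0.0.150:/data/brick 10.0.0.151:/data/brick 10.0.0.152:/data/brick
gluster volume start storage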
2. One of the brick servers had a simulated hardware failure (its disks were wiped) and was reinstalled from scratch.
3. When the server ('glusterd') came back up, the rest of the bricks logged something like:
Jul 25 17:10:45 snode182 GlusterFS[3371]: [2011-07-25 17:10:45.435786] C [glusterd-rpc-ops.c:748:glusterd3_1_cluster_lock_cbk] 0-:
Lock response received from unknown peer: 4ecec354-1d02-4709-8f1e-607a735dbe62
Obviously the peer UUID in glusterd.info (because of the full "crash/reinstall") is different from the UUID the rest of the
cluster has in its configuration.
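Just to illustrate where the two UUIDs live (paths are for the 3.1/3.2 glusterd working directory, /etc/glusterd; newer releases
keep this under /var/lib/glusterd):

# on the reinstalled node: the UUID glusterd generated on its first start
cat /etc/glusterd/glusterd.info

# on any healthy node: the UUID the rest of the cluster lists for that host
gluster peer status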
Peer status shows:
Hostname: 10.0.0.149
Uuid: f9ea651e-68da-40fa-80d9-6bee7779aa97
State: Peer Rejected (Connected)
4. While the info commands work fine, anything that involves changing the volume settings returns that the volume doesn't exist
(judging from the logs this seems to come from the reinstalled node):
[2011-07-25 17:08:54.579631] E [glusterd-op-sm.c:1237:glusterd_op_stage_set_volume] 0-: Volume storage does not exist
[2011-07-25 17:08:54.579769] E [glusterd-op-sm.c:7107:glusterd_op_ac_stage_op] 0-: Validate failed: -1
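For example, a plain set command like this one fails with the errors above, even though 'gluster volume info storage' still
answers normally (the option itself is just an arbitrary example I tested with):

gluster volume set storage network.ping-timeout 20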
So my question is how to correctly reintroduce the box to the glusterfs cluster since:
1. I can't run 'peer probe 10.0.0.149' as gluster says the peer is already in the cluster.
2. I can't remove the peer because it is part of a volume.
3. I can't remove the brick from the volume because gluster asks me to remove 3 bricks (i.e. a whole replica set, which would also
mean data loss).
4. I imagine that replace-brick won't work even if I fake the new node with a different ip/hostname (since the source brick will
be down), or will it replicate from the alive ones? (The command forms I mean in 1-4 are sketched below.)
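To be explicit, these are roughly the command forms I mean in 1-4 above (the brick paths and the 10.0.0.153 replacement host are
placeholders, not my real ones):

# (1) refused: the host is already part of the cluster
gluster peer probe 10.0.0.149
# (2) refused: the peer still holds bricks of the 'storage' volume
gluster peer detach 10.0.0.149
# (3) would have to remove a whole replica set of 3 bricks, i.e. data loss
gluster volume remove-brick storage 10.0.0.149:/data/brick
# (4) the source brick is down, so presumably a no-go either
gluster volume replace-brick storage 10.0.0.149:/data/brick 10.0.0.153:/data/brick start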
I tried just manually changing the UUID back to the one which is listed on the rest of the nodes (peer status), but apparently that
was not enough (the node itself didn't see any other servers and wasn't able to sync volume information from the remote brick(s),
complaining that it is not its friend).
Then, when I manually copied all the peers/* files from a running brick node and restarted glusterd, the node reverted to the
'Peer in Cluster' state.
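In other words, roughly what I did on the reinstalled node (just a sketch; it assumes the 3.1/3.2 working directory /etc/glusterd,
an init script named /etc/init.d/glusterd, and 10.0.0.148 standing in for one of the healthy peers):

# stop the management daemon first
/etc/init.d/glusterd stop

# put back the old UUID, i.e. the one the other nodes list for this host
# in 'gluster peer status'
vi /etc/glusterd/glusterd.info        # UUID=<old uuid>

# copy the peer definitions over from a healthy node
scp 10.0.0.148:/etc/glusterd/peers/* /etc/glusterd/peers/
# (note: this copy includes a file for this node's own UUID and none for
# 10.0.0.148 itself, so it may need some hand-editing)

# start glusterd again; the node then shows up as 'Peer in Cluster'
/etc/init.d/glusterd start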
Is this the way?
Or am I doing something totally wrong?
wbr
rr