[Gluster-users] poor OS reinstalls with 3.7
Mike Stump
mikestump at comcast.net
Fri Feb 12 03:24:12 UTC 2016
So, I lost one of my servers and the OS was reinstalled. The gluster data is on another disk that survives OS reinstalls; /var/lib/glusterd, however, does not.
I was following the bring-it-back-up directions, but before I did that, I think a peer probe was done with the new UUID. This caused the node to be dropped from the cluster entirely.
I edited the UUID back to what it was, but now the node is no longer in the cluster. The web site didn’t seem to have any help on how to undo the drop. It was part of a replica 2 pair, and I would merely like it to come up and be a part of the cluster again. It has all the data (as I run with quorum, and all of the replica 2 pair’s contents are R/O until this server comes back). I don’t mind letting it refresh from the other member of the replica pair, even though the data is already on disk.
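For the record, the UUID edit was roughly the following. This is a sketch from memory; the paths are the stock /var/lib/glusterd layout, and <old-uuid> is a placeholder for the UUID the cluster still has on record.

On a surviving peer (machine00 here), each file under /var/lib/glusterd/peers/ is named after a peer’s UUID, so the file whose contents mention machine04 gives you machine04’s old UUID:

# grep -l machine04 /var/lib/glusterd/peers/*

On the reinstalled machine04, stop glusterd (or use systemctl, depending on the distro), put the old UUID back into glusterd.info, and start it again:

# service glusterd stop
# sed -i 's/^UUID=.*/UUID=<old-uuid>/' /var/lib/glusterd/glusterd.info
# service glusterd start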
I tried:
# gluster volume replace-brick g2home machine04:/.g/g2home machine04:/.g/g2home-new commit force
volume replace-brick: failed: Host machine04 is not in 'Peer in Cluster' state
to try and let it resync into the cluster, but it won’t let me replace the brick. I can’t do:
# gluster peer detach machine04
peer detach: failed: Brick(s) with the peer machine04 exist in cluster
either. What I wanted: when the node connected to the cluster for the first time with the new UUID, the cluster should inform it that it might have filesystems on it (it comes in with a hostname already in the peer list), and the node should get the brick information from the cluster and check it out. If it has those bricks, it should just notice that the UUID is wrong, fix it, make itself part of the cluster again, spin it all up, and continue on.
I tried:
# gluster volume add-brick g2home replica 2 machine04:/.g/g2home-new
volume add-brick: failed: Volume g2home does not exist
and it didn’t work on machine04 (above), nor on one of the other peers:
# gluster volume add-brick g2home replica 2 machine04:/.g/g2home-new
volume add-brick: failed: Operation failed
So, to try and fix the Peer in Cluster issue, I stopped and restarted glusterd many times, and eventually almost all of the peers reconnected and came up into the Peer in Cluster state. All except for one that was endlessly confused. So, if the network works, glusterd should wipe the peer state and just retry the entire state machine to get back into the right state. I had to stop glusterd on the two machines, manually edit the state to be 3, and then restart them. It then at least showed the right state on both.
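In case it helps anyone, the state edit was roughly this; a sketch, where <stuck-peer-uuid> is a placeholder for the name of the peer file that was stuck, and state=3 appears to be what glusterd writes for a peer in the Peer in Cluster state. Each file under /var/lib/glusterd/peers/ is named after a peer’s UUID and holds its uuid=, state= and hostname lines. With glusterd stopped on both machines, on each one:

# sed -i 's/^state=.*/state=3/' /var/lib/glusterd/peers/<stuck-peer-uuid>
# service glusterd start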
Next, let’s try and sync up the bricks:
root@machine04:/# gluster volume sync machine00 all
Sync volume may make data inaccessible while the sync is in progress. Do you want to continue? (y/n) y
volume sync: success
root@machine04:/# gluster vol info
No volumes present
root@machine02:/# gluster volume heal g2home full
Staging failed on machine04. Error: Volume g2home does not exist
Think about that. This is a replica 2 setup; the entire point is to fix up the array when one of the machines is screwy. heal seemed like the command to fix it up.
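As far as I can tell, heal was being refused because machine04 simply had no volume definition left at all. A quick check on each node makes that obvious; /var/lib/glusterd/vols/ has a directory per volume on the healthy peers and was empty on the reinstalled one:

# ls /var/lib/glusterd/vols/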
So, now that it is connected, let’s try this again:
# gluster volume replace-brick g2home machine04:/.g/g2home machine04:/.g/g2home-new commit force
volume replace-brick: failed: Pre Validation failed on machine04. volume: g2home does not exist
Nope, that won’t work. So, let’s try removing:
# gluster vol remove-brick g2home replica 2 machine04:/.g/g2home machine05:/.g/g2home start
volume remove-brick start: failed: Staging failed on machine04. Please check log file for details.
Nope, that won’t work either. What’s the point of remove-brick, if it won’t work?
OK, fine, let’s go for a bigger hammer:
# gluster peer detach machine04 force
peer detach: failed: Brick(s) with the peer machine04 exist in cluster
Doh. I know that, but it is a replica!
[ more googling ]
Someone said to just copy the entire vols directory over from a good peer. [ cross fingers ] Copy vols.
OK, I can now do a gluster volume status g2home detail, which I could not do before. Files seem to be R/W on the array now. I think that might have worked.
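For completeness, the copy-vols step was roughly this (a sketch; rsync from whichever peer still has the volume definitions, machine00 in my case). On machine04, with glusterd stopped:

# service glusterd stop
# rsync -a machine00:/var/lib/glusterd/vols/ /var/lib/glusterd/vols/
# service glusterd start

and then, from any node that sees the volume:

# gluster volume heal g2home full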
So, why can’t gluster copy vols by itself, if indeed that is the right thing to do?
Why can’t the documentation just say: edit the state variable and copy vols to get it going again?
Why can’t probe figure out that the node was already part of a cluster, notice that its brains have been wiped, grab that info from the cluster, and bring the node back up? It could even run heal on the data to ensure that nothing messed with it and that it matches the other replica.