[Gluster-users] issues replacing a failed node

John Jolet jjolet at drillinginfo.com
Thu May 17 01:47:19 UTC 2012


I had a two-node distributed-replicated volume spread across server1:/bricks/1, server2:/bricks/1, server1:/bricks/2, and server2:/bricks/2.  I powered down server2 to re-rack it and make room for server3.  server2 then failed to come back up, for reasons that have nothing to do with Gluster.  So I decided to go ahead and bring up server3 and move server2's bricks to it.  I found conflicting information on how to do that when the old node is completely dead and the new node has a different name.  Basically, I did a peer probe server3, then volume replace-brick share name server2:/bricks/1 server3:/bricks/1.  Then I did a volume replace-brick <blah> commit force.
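For anyone trying to follow along, the sequence described above can be sketched roughly as the dry-run script below. It only prints the commands; it does not run them. The volume name `myvol` and the helper names `gluster_dry` and `describe_replacement` are hypothetical, and the `start` sub-command on the first replace-brick call is an assumption, since the post does not say which sub-command was used.

```shell
#!/bin/sh
# Dry run: print the gluster commands instead of executing them.
# VOLNAME is a hypothetical placeholder; the post does not give the real name.
VOLNAME="${VOLNAME:-myvol}"

gluster_dry() {
    # Only echoes the command line; nothing is sent to glusterd.
    echo "gluster $*"
}

describe_replacement() {
    old_brick="$1"
    new_brick="$2"
    # Add the new server to the trusted pool (host part of the new brick).
    gluster_dry peer probe "${new_brick%%:*}"
    # Assumption: the first invocation used "start"; the post does not say.
    gluster_dry volume replace-brick "$VOLNAME" "$old_brick" "$new_brick" start
    # Forced commit, as described in the post.
    gluster_dry volume replace-brick "$VOLNAME" "$old_brick" "$new_brick" commit force
}

describe_replacement server2:/bricks/1 server3:/bricks/1
```

Printing rather than executing keeps the sketch safe to paste anywhere; the same function could be called again with the bricks/2 pair to show the step that is currently refused.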

This was probably a bad thing.  I then tried to do the replace-brick for the second pair of bricks.  It fails to start, saying replace-brick is already running on the volume.  Now I'm stuck.  The data in bricks/1 DOES appear on the new node, but I can't do anything with bricks/2.

If I try to do a commit, it says bricks/1 isn't on server2, and if I try to do anything else it says replace-brick is running.  I did a rebalance, hoping that would fix it, but it has not.  I attempted to stop the volume, but it said I couldn't until the replace-brick was committed or aborted.  I cannot abort; it says replace-brick abort failed.  Now what?  Mind, this is a temporary setup which has a complex directory structure but no data as yet.  We are looking to use this for production VERY soon, and (a) I'm not sure I have time to rebuild everything, and (b) more importantly, I need to be able to demonstrate to management that "look, a node failed and we replaced it with no data loss".
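In case it helps anyone answering, the state of the cluster could be captured with something like the dry-run sketch below, which only prints the inspection commands. The volume name `myvol` and the function name `show_state_cmds` are hypothetical; the output of these commands is not included here because it was not part of the original post.

```shell
#!/bin/sh
# Dry run: print read-only inspection commands for the stuck replace-brick.
# VOLNAME is a hypothetical placeholder for the real volume name.
VOLNAME="${VOLNAME:-myvol}"

show_state_cmds() {
    old_brick="$1"
    new_brick="$2"
    # Which peers the pool knows about and whether they are connected.
    echo "gluster peer status"
    # Current brick list for the volume.
    echo "gluster volume info $VOLNAME"
    # Progress of the replace-brick operation the volume claims is running.
    echo "gluster volume replace-brick $VOLNAME $old_brick $new_brick status"
}

show_state_cmds server2:/bricks/1 server3:/bricks/1
```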

So, what's my next step to get this mess untangled and the data safely onto my new node?

