[Gluster-users] Need to replace a brick on a failed first Gluster node

Greg Scott GregScott at Infrasupport.com
Sun Jan 22 11:00:17 UTC 2012


Hello -

I am using GlusterFS 3.2.5-2.  I have one very small replicated volume
with 2 bricks, as follows:

[root@lme-fw2 ~]# gluster volume info

Volume Name: firewall-scripts
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.253.1:/gluster-fw1
Brick2: 192.168.253.2:/gluster-fw2

The application is a small active/standby HA appliance, and I use the
Gluster volume to hold its configuration data.  The Gluster nodes are
also the clients, and there are no other clients.  Fortunately for me,
nothing is in production yet.

 
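For reference, I originally created and mounted the volume more or less
like this (reconstructed from the volume info above; the
/firewall-scripts mount point is just an example):

# run once from either node
gluster volume create firewall-scripts replica 2 transport tcp \
    192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2
gluster volume start firewall-scripts

# each node then mounts the volume locally as a client
mount -t glusterfs localhost:/firewall-scripts /firewall-scripts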

My challenge is that the hard drive at 192.168.253.1 failed.  This was
the first Gluster node when I set everything up.  I replaced its hard
drive and am rebuilding it.  I have a good copy of everything I care
about in the 192.168.253.2 brick.  My thought was that I could simply
remove the old 192.168.253.1 brick from the replica, then peer the
rebuilt node and add the brick back again.

 
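Roughly, the sequence I had in mind was the one below.  I am not at all
sure the add-brick syntax is right for 3.2.5, which may be part of my
problem, and the rebuild step in angle brackets is obviously not a real
command:

# on fw2: drop the dead brick and forget the old peer
gluster volume remove-brick firewall-scripts 192.168.253.1:/gluster-fw1
gluster peer detach 192.168.253.1

# <rebuild fw1 and recreate an empty /gluster-fw1 directory>

# on fw2: bring the rebuilt node back in and re-add its brick
gluster peer probe 192.168.253.1
gluster volume add-brick firewall-scripts 192.168.253.1:/gluster-fw1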

But apparently not so simple:



[root@lme-fw2 ~]# gluster volume remove-brick firewall-scripts 192.168.253.1:/gluster-fw1
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Incorrect brick 192.168.253.1:/gluster-fw1 for volume firewall-scripts

Not a particularly helpful diagnostic.  I also played around with
gluster peer detach and probe, but now I think I may have created a
mess:

[root@lme-fw2 ~]# gluster peer probe 192.168.253.1
^C
[root@lme-fw2 ~]# gluster peer status
Number of Peers: 1

Hostname: 192.168.253.1
Uuid: 00000000-0000-0000-0000-000000000000
State: Establishing Connection (Disconnected)
[root@lme-fw2 ~]#

 

Trying again:

 

[root@lme-fw2 ~]# gluster peer detach 192.168.253.1
Detach successful
[root@lme-fw2 ~]# gluster peer status
No peers present
[root@lme-fw2 ~]# gluster volume info

Volume Name: firewall-scripts
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.253.1:/gluster-fw1
Brick2: 192.168.253.2:/gluster-fw2
[root@lme-fw2 ~]# gluster volume remove-brick firewall-scripts 192.168.253.1:/gluster-fw1
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Incorrect brick 192.168.253.1:/gluster-fw1 for volume firewall-scripts
[root@lme-fw2 ~]#

This should be simple, and maybe I am missing something.  On the fw2
Gluster node, I want to remove all traces of the old fw1 and then set
up a new fw1 as a new replica.  How do I get there from here?  Also,
once this goes into production, I will not have the luxury of taking
everything offline and rebuilding it.  What is the best way to recover
from a hard drive failure on either node?

 
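For the production case, is gluster volume replace-brick the right tool
for a failed drive?  My guess from the docs is something along these
lines, where /gluster-fw1-new is just a made-up path for a brick
directory on the replacement drive:

gluster volume replace-brick firewall-scripts \
    192.168.253.1:/gluster-fw1 192.168.253.1:/gluster-fw1-new start
gluster volume replace-brick firewall-scripts \
    192.168.253.1:/gluster-fw1 192.168.253.1:/gluster-fw1-new status
gluster volume replace-brick firewall-scripts \
    192.168.253.1:/gluster-fw1 192.168.253.1:/gluster-fw1-new commit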

Thanks

 

- Greg Scott

 
