[Gluster-users] Advice on recovering from a bad replica please

Tue Jun 24 22:59:54 UTC 2014

Hi All,

We're using Gluster as the storage for our virtualization. This consists
of 2 servers with a single brick each configured as a replica pair. We
also have a geo-replica on one of those two servers.

For reasons that don't really matter, last weekend we had a situation
which caused one server to reboot a number of times, which in turn
resulted in a lot of heal-failed and split-brain errors. Because at the
same time VMs were being migrated across hosts we ended up with many
crashed VMs.

Due to the need to get the VMs up and running as quickly as possible,
we decided to shut down one Gluster replica and use the "primary" one
alone. As the geo-replica is also on the node we shut down, that leaves
us with just a single copy, which makes us rather nervous.

As we have decided to treat the files on the currently running node as
"correct", I'd appreciate advice on the best way to get the other node
back into the replication. Should we simply bring it back online and
try to correct the errors, which I expect will be many, or should we
treat it as a failed server and bring it back with an empty brick
rather than what is currently in the existing brick? The volume/bricks
are 5TB, of which we're currently using around 2TB, and the servers are
on a 10Gb network, so I imagine it shouldn't take too long to rebuild,
and this would all be done out of hours anyway.
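
As a rough sanity check on the "shouldn't take too long" estimate, here
is a back-of-the-envelope calculation. It is only a sketch: it assumes
decimal units (2TB = 2 * 10^12 bytes, 10Gb = 10 * 10^9 bits per second)
and ignores disk speed, small-file overhead and self-heal overhead. The
function name and the 50% efficiency figure are illustrative
assumptions, not measurements.

```python
def transfer_time_seconds(data_bytes, link_bits_per_second, efficiency=1.0):
    """Time to move data_bytes over a link, given a usable fraction of line rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return data_bytes * 8 / (link_bits_per_second * efficiency)


USED_BYTES = 2 * 10**12   # ~2TB in use on the brick (decimal units assumed)
LINK_BPS = 10 * 10**9     # 10Gb network, taken as 10 gigabits per second

# Theoretical lower bound: the link running flat out with no overhead.
best_case = transfer_time_seconds(USED_BYTES, LINK_BPS)

# Hypothetical case: only half the line rate is achieved (assumed figure).
half_rate = transfer_time_seconds(USED_BYTES, LINK_BPS, efficiency=0.5)

print(f"best case: {best_case / 60:.1f} minutes")
print(f"at 50% of line rate: {half_rate / 60:.1f} minutes")
```

So the network alone puts a floor of a little under half an hour on a
full rebuild. The real figure will depend on the disks and on how the
heal behaves, so I'd expect it to be longer.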

regards,
John