[Bugs] [Bug 1322145] Glusterd fails to restart after replacing a failed GlusterFS node and a volume has a snapshot

bugzilla at redhat.com bugzilla at redhat.com
Tue Jun 21 17:50:45 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1322145

Ben Werthmann <ben at apcera.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|CLOSED                      |ASSIGNED
         Resolution|NOTABUG                     |---
           Keywords|                            |Reopened



--- Comment #8 from Ben Werthmann <ben at apcera.com> ---
Yes, the replacement peer (1 of 3) and new brick have a new IP. In our
case, DNS is not available. When the failed brick is removed via
'gluster volume replace-brick $vol $failed_peer:$failed_brick
$new_peer_ip:$new_brick commit force', why are there lingering
references to the old peer/brick? Is there a reason that
'replace-brick' does not fix all of the references to the old
peer/brick? If 'replace-brick' has been issued, is it safe to drop the
snapshot references to the old peer/brick?
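
As far as we can tell, the lingering references live in glusterd's
on-disk state. A rough way to spot them (the $failed_peer variable and
the assumption that the relevant state sits under /var/lib/glusterd
are ours, not confirmed internals):

  # On a surviving node: any hits are references that replace-brick
  # did not rewrite, e.g. snapshot brick entries.
  grep -r "$failed_peer" /var/lib/glusterd/vols/ /var/lib/glusterd/snaps/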

I suspect the suggested fix of using the hostname will fail if any
snapshots exist, because in the new peer/new brick case the new peer
will not have the LVM snapshots needed to resolve the snapshot
references.
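
To illustrate the mismatch (hypothetical; assumes the bricks sit on
thinly provisioned LVM, which volume snapshots require):

  # On the replacement peer, list thin LVs and their origins. Only the
  # new brick's LV shows up; the snapshot LVs the old peer carried are
  # gone for good, so snapshot brick paths pointing here can never
  # resolve.
  lvs -o lv_name,pool_lv,origin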

Put another way: why is the following set of operations not valid? (A
command-level sketch of these steps follows the list.)
- deploy gluster with three servers, one brick each, one volume replicated
across all 3
- create a snapshot
- lose one server
- add a replacement peer and new brick with a new IP address
- replace-brick the missing brick onto the new server (wait for replication to
finish)
- force remove the old server
- verify everything is working as expected
- restart _any_ server in the cluster, without failure
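
A command-level sketch of that sequence (volume name, addresses, and
brick paths are illustrative placeholders; snapshots assume
thin-LVM-backed bricks):

  # Three peers, one replica-3 volume, one snapshot.
  gluster peer probe 10.0.0.2
  gluster peer probe 10.0.0.3
  gluster volume create vol0 replica 3 \
      10.0.0.1:/bricks/vol0 10.0.0.2:/bricks/vol0 10.0.0.3:/bricks/vol0
  gluster volume start vol0
  gluster snapshot create snap1 vol0

  # 10.0.0.3 is lost; its replacement comes up at a new address.
  gluster peer probe 10.0.0.4
  gluster volume replace-brick vol0 10.0.0.3:/bricks/vol0 \
      10.0.0.4:/bricks/vol0 commit force
  gluster volume heal vol0 info     # wait for heal to finish
  gluster peer detach 10.0.0.3 force

  # Expected: glusterd restarts cleanly on any remaining peer.
  # Observed: the restart fails while resolving snap1's brick on the
  # old peer.
  systemctl restart glusterd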
