[Gluster-users] Strange file corruption

Wed Dec 9 16:17:48 UTC 2015

> A-1) shut down node #1 (the first that is about to be upgraded)
> A-2) remove node #1 from the Proxmox cluster (pvecm delnode "metal1")
> A-3) remove node #1 from the Gluster volume/cluster (gluster volume 
> remove-brick ... && gluster peer detach "metal1")
> A-4) install Debian Jessie on node #1, overwriting all data on the HDD, 
> *with the same network settings and hostname as before*
> A-5) install Proxmox 4.0 on node #1 
> <https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_Jessie>
> A-6) install Gluster on node #1 and add it back to the Gluster volume 
> (gluster volume add-brick ...) => shared storage will be complete 
> again (spanning 3.4 and 4.0 nodes)
> A-7) configure the Gluster volume as shared storage in Proxmox 4 (node #1)
> A-8) configure the external Backup storage on node #1 (Proxmox 4)

Was the data on the gluster brick deleted as part of step A-4? When you 
remove the brick, gluster will no longer track pending changes for that 
brick. If you add it back in with stale data but matching gfids, you 
would have two clean bricks with mismatching data. Did you have to use 
"add-brick...force"?

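One way to check for the "two clean bricks with mismatching data" case is to 
checksum the same files directly on two bricks and compare. Below is a rough 
sketch in Python, not anything Gluster ships. It assumes a replicated volume 
(every brick holds a full copy of each file), that no writes are in flight 
while it runs, and that you can read both brick directories from one place. 
The function names and the brick paths in the usage note are made up for 
illustration. Gluster's internal .glusterfs directory at the brick root is 
skipped.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_data_files(brick_root):
    """Map relative path -> absolute path for every file on a brick.

    Directories named .glusterfs (Gluster's internal metadata) are pruned.
    """
    found = {}
    for dirpath, dirnames, filenames in os.walk(brick_root):
        dirnames[:] = [d for d in dirnames if d != ".glusterfs"]
        for name in filenames:
            full = os.path.join(dirpath, name)
            found[os.path.relpath(full, brick_root)] = full
    return found


def compare_bricks(brick_a, brick_b):
    """Return (mismatched, only_on_a, only_on_b) as sorted lists of relative paths."""
    files_a = list_data_files(brick_a)
    files_b = list_data_files(brick_b)
    mismatched = sorted(
        rel
        for rel in files_a.keys() & files_b.keys()
        if file_sha256(files_a[rel]) != file_sha256(files_b[rel])
    )
    only_on_a = sorted(files_a.keys() - files_b.keys())
    only_on_b = sorted(files_b.keys() - files_a.keys())
    return mismatched, only_on_a, only_on_b
```

Usage would look like compare_bricks("/mnt/brick-metal1", "/mnt/brick-metal2") 
with both (hypothetical) brick directories reachable from the machine running 
the script. Any path in the first returned list has different content on the 
two bricks, and that is worth investigating if Gluster reports no pending heals.
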
On 12/09/2015 06:53 AM, Udo Giacomozzi wrote:
> On 09.12.2015 at 14:39, Lindsay Mathieson wrote:
>>
>> Udo, it occurs to me that if your VMs were running on #2 & #3 and 
>> you live migrated them to #1 prior to rebooting #2/3, then you would 
>> indeed rapidly get progressive VM corruption.
>>
>> However it wouldn't be due to the heal process, but rather the live 
>> migration with "performance.stat-prefetch" on. This always leads to 
>> qcow2 files becoming corrupted and unusable.
>
> Nope. All VMs were running on #1, no exception.
> Nodes #2 and #3 never had a VM running on them, so they were 
> practically idle since their installation.
>
> Basically I set up node #1, including all VMs.
> Then I installed nodes #2 and #3, configured the Proxmox and Gluster 
> clusters, and waited quite some time until Gluster had synced up 
> nodes #2 and #3 (healing).
> Since then I have rebooted nodes #2 and #3, but in theory these nodes 
> never had to do any writes to the Gluster volume at all.
>
> If you're interested, you can read about my upgrade strategy in this 
> Proxmox forum post: 
> http://forum.proxmox.com/threads/24990-Upgrade-3-4-HA-cluster-to-4-0-via-reinstallation-with-minimal-downtime?p=125040#post125040
>
> Also, it seems rather strange to me that practically all ~15 VMs (!) 
> suffered from data corruption. It's as if Gluster considered node #2 
> or #3 to be ahead and "healed" in the wrong direction. I don't know...
>
> BTW, once I understood what was going on, /with the problematic 
> "healing" still in progress/, I was able to overwrite the bad images 
> (still active on #1) by using standard Proxmox backup-restore and 
> Gluster handled it correctly.
>
>
> Anyway, I really love the simplicity of Gluster (setting up and 
> maintaining a cluster is extremely easy), but these healing issues are 
> giving me some headaches... ;-)
>
> Udo
