[Gluster-users] Strange file corruption

Mon Dec 7 11:03:21 UTC 2015

Hi all,

Yesterday I had a strange situation where Gluster healing corrupted
*all* my VM images.

In detail:
I had about 15 VMs running (in Proxmox 4.0), totaling about 600 GB of
qcow2 images. Gluster is used as storage for those images in a replica 3
setup (i.e. 3 physical servers replicating all data).
All VMs were running on machine #1 - the two other machines (#2 and #3) 
were *idle*.
Gluster was fully operational (no healing in progress) when I first
rebooted machine #2.
For other reasons I had to reboot machines #2 and #3 a few times, but
since all VMs were running on machine #1 and nothing on the other
machines was accessing Gluster files, I was confident that this wouldn't
disturb Gluster.
In hindsight, though, this means that I rebooted Gluster nodes while a
healing process was running.

After a few minutes, Gluster files began showing corruption - up to the 
point that the qcow2 files became unreadable and all VMs stopped working.

I was forced to restore VM backups (losing a few hours of data), which
means that the corrupt files were left as-is by Proxmox and new qcow2
files were created to be used for the VMs.

This means that Gluster could continue healing its files all night
long. Today, many files seem to be intact, but I think they have been
replaced with older versions (it's a bit difficult to tell exactly).

Please note that node #1 was up at all times. It seems to me that the 
healing process corrupted the files. How can that be?
Does anybody have an explanation? Is there any way to avoid such a
situation (other than checking whether Gluster is healing before
rebooting)?
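
To make clear what I mean by "checking if Gluster is healing": a rough
sketch of such a pre-reboot check is below. It assumes that
"gluster volume heal <VOLNAME> info" prints one "Number of entries: N"
line per brick; I have not verified that format against every Gluster
version. The function names and the volume name in the usage comment are
made up for illustration.

```python
import re
import subprocess


def pending_heal_entries(heal_info_output: str) -> int:
    """Sum the 'Number of entries: N' lines found in heal-info output.

    Assumption: each brick section contains exactly one such line.
    Raises ValueError if no such line is present, so that unexpected
    output is never mistaken for "nothing left to heal".
    """
    counts = re.findall(
        r"^Number of entries:\s*(\d+)\s*$",
        heal_info_output,
        flags=re.MULTILINE,
    )
    if not counts:
        raise ValueError("no 'Number of entries' lines in heal info output")
    return sum(int(c) for c in counts)


def safe_to_reboot(volume: str) -> bool:
    """Return True only if heal info reports zero pending entries."""
    result = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        check=True,
        capture_output=True,
        text=True,
    )
    return pending_heal_entries(result.stdout) == 0


# Usage (hypothetical volume name):
#   if not safe_to_reboot("vmstore"):
#       print("heal still pending, do not reboot yet")
```

This only tells me whether entries are still pending at the moment of the
check. It does not explain why an interrupted heal would damage files
that node #1 held intact the whole time.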

Setup details:
- Proxmox 4.0 cluster (not yet in HA mode) = Debian 8 Jessie
- redundant Gbit LAN (bonding)
- Gluster 3.5.2 (most current Proxmox package)
- two volumes, both "replicate" type, 1 x 3 = 3 bricks
- cluster.server-quorum-ratio: 51%
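
For reference, this is how I read the 51% server quorum ratio with three
servers. It is only a sketch, assuming the required number of servers is
the ratio applied to the server count and rounded up; the function name
is mine, not something Gluster provides.

```python
def servers_needed(total_servers: int, ratio_percent: int) -> int:
    """Smallest server count that reaches ratio_percent of the total.

    Assumption: the requirement is ceil(total * ratio / 100), computed
    here with integer arithmetic to avoid floating-point rounding.
    """
    return (total_servers * ratio_percent + 99) // 100


needed = servers_needed(3, 51)
for servers_up in range(4):
    status = "met" if servers_up >= needed else "NOT met"
    print(f"{servers_up} of 3 servers up: quorum {status}")
```

Under that assumption two of the three servers must be up, so a single
reboot of #2 or #3 keeps quorum, while any moment with both #2 and #3
down at once would not.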

Thanks,
Udo