[Gluster-users] VM disks corruption on 3.7.11
lemonnierk at ulrar.net
Fri May 20 00:25:53 UTC 2016
The I/O errors are happening after, not during the heal.
As described, I just rebooted a node, waited for the heal to finish,
rebooted another, waited for the heal to finish then rebooted the third.
From that point, the VM shows a lot of I/O errors whenever I use the disk
heavily (importing big MySQL dumps). The VM "screen" on the console tab of
Proxmox just spams I/O errors from that point, which it didn't do before rebooting
the gluster nodes. I tried powering off the VM and forcing full heals, but I didn't find
a way to fix the problem short of deleting the VM disk and restoring it from a backup.
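For reference, the rolling-reboot procedure described above can be checked between reboots with the self-heal CLI. A minimal sketch, assuming Gluster 3.7's standard heal commands; `VOLNAME` is a placeholder for the actual volume name:

```shell
# List entries still pending heal; wait for this to be empty
# before rebooting the next node
gluster volume heal VOLNAME info

# Show files the self-heal daemon could not reconcile (split-brain)
gluster volume heal VOLNAME info split-brain

# Trigger a full heal manually instead of waiting for the
# self-heal daemon's periodic crawl
gluster volume heal VOLNAME full
```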
I have 3 other servers on 3.7.6 where that problem isn't happening, so it might be a 3.7.11 bug,
but since the RAID card on one of the nodes failed recently, I'm not really sure some other
piece of hardware isn't at fault. Unfortunately I don't have the hardware to test that.
The only way to be sure would be to upgrade the 3.7.6 nodes to 3.7.11 and repeat the same tests,
but those nodes are in production, and the VM freezes during heals last month already
caused huge problems for our clients. I really can't afford any other problems there,
so testing on them isn't an option.
To sum up, I have 3 nodes on 3.7.6 with no corruption happening but huge freezes during heals,
and 3 other nodes on 3.7.11 with no freezes during heals but corruption. qemu-img doesn't see the
corruption; it only shows on the VM's screen and seems mostly harmless, but sometimes the VM
does switch to read-only mode saying it had too many I/O errors.
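The qemu-img check mentioned above only validates the image's own metadata, which is why it can pass even when the guest sees I/O errors. A sketch, run with the VM powered off; `vm-disk.qcow2` and `backup.qcow2` are hypothetical file names:

```shell
# Checks qcow2 metadata consistency (refcounts, cluster allocation);
# it does not verify guest-visible data, so corruption inside the
# image payload can still go unnoticed
qemu-img check vm-disk.qcow2

# Compare the image's contents against a known-good backup instead
qemu-img compare vm-disk.qcow2 backup.qcow2
```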
Would the bitrot detection daemon detect a hardware problem? I did enable it, but it didn't
detect anything, although I don't know how to force a check, and I have no idea whether it ran a scrub
since the corruption happened.
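As far as checking whether a scrub has run: a sketch, assuming Gluster 3.7's bitrot CLI (`VOLNAME` is a placeholder). Note the scrubber only checksums data at rest, so it can catch silent on-disk corruption but not every kind of hardware fault:

```shell
# Show when the scrubber last ran and whether it flagged any files
gluster volume bitrot VOLNAME scrub status

# Make the scrubber run more often (default frequency is biweekly;
# accepted values include hourly, daily, weekly, biweekly, monthly)
gluster volume bitrot VOLNAME scrub-frequency daily

# Pause and resume scrubbing around maintenance windows
gluster volume bitrot VOLNAME scrub pause
gluster volume bitrot VOLNAME scrub resume
```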
On Thu, May 19, 2016 at 04:04:49PM -0400, Alastair Neil wrote:
> I am slightly confused: you say you have image file corruption, but then you
> say that qemu-img check reports no corruption. If what you mean is
> that you see I/O errors during a heal, this is likely due to I/O
> starvation, which is a well-known issue.
> There is work happening to improve this in version 3.8:
> On 19 May 2016 at 09:58, Kevin Lemonnier <lemonnierk at ulrar.net> wrote:
> That's a different problem then, I have corruption without removing or
> adding bricks,
> as mentioned. Might be two separate issues.
> On Thu, May 19, 2016 at 11:25:34PM +1000, Lindsay Mathieson wrote:
> > On 19/05/2016 12:17 AM, Lindsay Mathieson wrote:
> >   One thought - since the VM's are active while the brick is
> >   removed/re-added, could it be the shards that are written while the
> >   brick is added that are the reverse healing shards?
> > I tested by:
> > - removing brick 3
> > - erasing brick 3
> > - closing down all VM's
> > - adding new brick 3
> > - waiting until the heal number reached its max and started dropping.
> >   There were no reverse heals.
> > - starting the VM's back up. No real issues there, though one showed
> >   errors, presumably due to shards being locked as they were healed
> > - the VM's started ok, no reverse heals were noted, and eventually
> >   brick 3 was fully healed. The VM's do not appear to be corrupted.
> > So it would appear the problem is adding a brick while the volume
> > is being written to.
> > Cheers,
> > --
> > Lindsay Mathieson
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-users
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111