[Gluster-users] VM disks corruption on 3.7.11
David Gossage
dgossage at carouselchecks.com
Fri May 20 00:54:33 UTC 2016
*David Gossage*
*Carousel Checks Inc. | System Administrator*
*Office* 708.613.2284
On Thu, May 19, 2016 at 7:25 PM, Kevin Lemonnier <lemonnierk at ulrar.net>
wrote:
> The I/O errors are happening after the heal, not during it.
> As described, I just rebooted a node, waited for the heal to finish,
> rebooted another, waited for the heal to finish, then rebooted the third.
> From that point, the VM just has a lot of I/O errors showing whenever I
> use the disk a lot (importing big MySQL dumps). The VM "screen" on the
> console tab of Proxmox just spams I/O errors from that point, which it
> didn't before rebooting the gluster nodes. I tried to power off the VM
> and force full heals, but I didn't find a way to fix the problem short
> of deleting the VM disk and restoring it from a backup.
>
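For what it's worth, before rebooting the next node I always confirm the
previous heal has actually finished from one of the remaining nodes. A
minimal check, assuming a replica 3 volume named "gv0" (substitute your
own volume name):

    # list files/shards still pending heal on each brick
    gluster volume heal gv0 info

    # per-brick count of entries still needing heal
    gluster volume heal gv0 statistics heal-count

Both should be at zero entries on every brick before the next node goes
down, otherwise the last good copy can be taken offline mid-heal.
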
> I have 3 other servers on 3.7.6 where that problem isn't happening, so it
> might be a 3.7.11 bug, but since the RAID card on one of the nodes failed
> recently I'm not really sure some other piece of hardware isn't at
> fault... Unfortunately I don't have the hardware to test that.
> The only way to be sure would be to upgrade the 3.7.6 nodes to 3.7.11 and
> repeat the same tests, but those nodes are in production, and the VM
> freezes during heals last month already caused huge problems for our
> clients. We really can't afford any other problems there, so testing on
> them isn't an option.
>
>
Are the 3.7.11 nodes in production? Could they be downgraded to 3.7.6 to
see if the problem still occurs?
> To sum up, I have 3 nodes on 3.7.6 with no corruption happening but huge
> freezes during heals, and 3 other nodes on 3.7.11 with no freezes during
> heals but corruption. qemu-img doesn't see the corruption; it only shows
> on the VM's screen and seems mostly harmless, but sometimes the VM does
> switch to read-only mode saying it had too many I/O errors.
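Just to rule out the obvious on the qemu-img side, this is roughly what
I'd run against the image on the FUSE mount, assuming a qcow2 disk at a
path like /mnt/pve/gluster/images/100/vm-100-disk-1.qcow2 (path made up,
adjust to yours):

    # consistency check of the qcow2 metadata (refcounts, leaks, clusters)
    qemu-img check /mnt/pve/gluster/images/100/vm-100-disk-1.qcow2

    # format, virtual/actual size, backing file
    qemu-img info /mnt/pve/gluster/images/100/vm-100-disk-1.qcow2

Keep in mind qemu-img check only validates the qcow2 metadata; it won't
notice silently corrupted guest data inside otherwise valid clusters,
which would be consistent with what you're describing.
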
>
> Would the bitrot detection daemon detect a hardware problem? I did enable
> it, but it didn't detect anything, although I don't know how to force a
> check on it; no idea if it ran a scrub since the corruption happened.
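On the bitrot side, I don't think it would help here anyway: the daemon
signs files at rest and the scrubber only flags data that later changes
on disk behind gluster's back, so anything written wrong through the
client gets signed as-is. To at least see whether a scrub has run since
the corruption, a rough sketch, assuming a volume named "gv0" (made-up
name):

    # last scrub time, files scrubbed and corrupted objects per node
    gluster volume bitrot gv0 scrub status

    # make scrubbing run more often / with more I/O if needed
    gluster volume bitrot gv0 scrub-frequency daily
    gluster volume bitrot gv0 scrub-throttle normal

I'm not aware of a way to force an immediate scrub on 3.7; raising the
frequency is the closest thing I know of.
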
>
>
> On Thu, May 19, 2016 at 04:04:49PM -0400, Alastair Neil wrote:
> > I am slightly confused: you say you have image file corruption, but
> > then you say the qemu-img check says there is no corruption. If what
> > you mean is that you see I/O errors during a heal, this is likely to
> > be due to I/O starvation, something that is a well-known issue.
> > There is work happening to improve this in version 3.8:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1269461
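Until that lands, the usual mitigation suggested for heal-related freezes
with VM images is to stop the client mounts from doing the heals
themselves and leave it to the self-heal daemon. A hedged sketch, not
something I've verified on 3.7.11, assuming a volume named "gv0":

    # let the self-heal daemon handle healing instead of the client I/O path
    gluster volume set gv0 cluster.data-self-heal off
    gluster volume set gv0 cluster.metadata-self-heal off
    gluster volume set gv0 cluster.entry-self-heal off

    # limit how many background heals a client kicks off at once
    gluster volume set gv0 cluster.background-self-heal-count 4

No guarantee it helps in your case; it just reduces how much healing work
gets dragged into the VMs' own I/O path during a heal.
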
> > On 19 May 2016 at 09:58, Kevin Lemonnier <lemonnierk at ulrar.net> wrote:
> >
> > That's a different problem then; I have corruption without removing or
> > adding bricks, as mentioned. Might be two separate issues.
> >
> > On Thu, May 19, 2016 at 11:25:34PM +1000, Lindsay Mathieson wrote:
> > > On 19/05/2016 12:17 AM, Lindsay Mathieson wrote:
> > >
> > > One thought - since the VMs are active while the brick is
> > > removed/re-added, could it be the shards that are written while the
> > > brick is added that are the reverse healing shards?
> > >
> > > I tested by:
> > >
> > > - removing brick 3
> > >
> > > - erasing brick 3
> > >
> > > - closing down all VMs
> > >
> > > - adding new brick 3
> > >
> > > - waiting until the heal number reached its max and started decreasing
> > >
> > > There were no reverse heals.
> > >
> > > - Started the VMs back up. No real issues there, though one showed I/O
> > > errors, presumably due to shards being locked as they were healed.
> > >
> > > - VMs started OK, no reverse heals were noted, and eventually brick 3
> > > was fully healed. The VMs do not appear to be corrupted.
> > >
> > > So it would appear the problem is adding a brick while the volume is
> > > being written to.
> > >
> > > Cheers,
> > >
> > > --
> > > Lindsay Mathieson
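For reference, Lindsay's remove/erase/re-add sequence would look roughly
like this on the CLI; a sketch only, assuming a replica 3 volume "gv0"
with the third brick at node3:/data/brick (names made up):

    # drop the brick, shrinking the volume to replica 2
    gluster volume remove-brick gv0 replica 2 node3:/data/brick force

    # wipe the old brick directory on node3, then add it back as replica 3
    gluster volume add-brick gv0 replica 3 node3:/data/brick

    # kick off a full heal onto the new brick and watch it drain
    gluster volume heal gv0 full
    gluster volume heal gv0 info

That matches his conclusion: the risky window seems to be the brick being
added back while the VMs keep writing to the volume.
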
> >
> >
> > --
> > Kevin Lemonnier
> > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
>