[Gluster-users] VM disks corruption on 3.7.11
David Gossage
dgossage at carouselchecks.com
Fri May 20 00:54:33 UTC 2016
*David Gossage*
*Carousel Checks Inc. | System Administrator*
*Office* 708.613.2284
On Thu, May 19, 2016 at 7:25 PM, Kevin Lemonnier <lemonnierk at ulrar.net>
wrote:
> The I/O errors are happening after the heal, not during it.
> As described, I just rebooted a node, waited for the heal to finish,
> rebooted another, waited for the heal to finish, then rebooted the third.
> From that point, the VM just has a lot of I/O errors showing whenever I
> use the disk a lot (importing big MySQL dumps). The VM "screen" on the
> console tab of Proxmox just spams I/O errors from that point, which it
> didn't before rebooting the gluster nodes. I tried to power off the VM
> and force full heals, but I didn't find a way to fix the problem short
> of deleting the VM disk and restoring it from a backup.
>
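For what it's worth, before rebooting the next node I always confirm the
previous heal has actually finished from one of the remaining nodes. A
minimal check, assuming a replica 3 volume named "gv0" (substitute your
own volume name):

    # list files/shards still pending heal on each brick
    gluster volume heal gv0 info

    # per-brick count of entries still needing heal
    gluster volume heal gv0 statistics heal-count

Both should be at zero entries on every brick before the next node goes
down, otherwise the last good copy can be taken offline mid-heal.
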
> I have 3 other servers on 3.7.6 where that problem isn't happening, so it
> might be a 3.7.11 bug, but since the RAID card on one of the nodes failed
> recently I'm not really sure some other piece of hardware isn't at
> fault... Unfortunately I don't have the hardware to test that.
> The only way to be sure would be to upgrade the 3.7.6 nodes to 3.7.11 and
> repeat the same tests, but those nodes are in production, and the VM
> freezes during heals last month already caused huge problems for our
> clients. We really can't afford any other problems there, so testing on
> them isn't an option.
>
>
Are the 3.7.11 nodes in production? Could they be downgraded to 3.7.6 to
see if the problem still occurs?
> To sum up, I have 3 nodes on 3.7.6 with no corruption happening but huge
> freezes during heals, and 3 other nodes on 3.7.11 with no freezes during
> heals but corruption. qemu-img doesn't see the corruption; it only shows
> on the VM's screen and seems mostly harmless, but sometimes the VM does
> switch to read-only mode saying it had too many I/O errors.
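Just to rule out the obvious on the qemu-img side, this is roughly what
I'd run against the image on the FUSE mount, assuming a qcow2 disk at a
path like /mnt/pve/gluster/images/100/vm-100-disk-1.qcow2 (path made up,
adjust to yours):

    # consistency check of the qcow2 metadata (refcounts, leaks, clusters)
    qemu-img check /mnt/pve/gluster/images/100/vm-100-disk-1.qcow2

    # format, virtual/actual size, backing file
    qemu-img info /mnt/pve/gluster/images/100/vm-100-disk-1.qcow2

Keep in mind qemu-img check only validates the qcow2 metadata; it won't
notice silently corrupted guest data inside otherwise valid clusters,
which would be consistent with what you're describing.
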
>
> Would the bitrot detection daemon detect a hardware problem? I did enable
> it, but it didn't detect anything, although I don't know how to force a
> check on it; no idea if it ran a scrub since the corruption happened.
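On the bitrot side, I don't think it would help here anyway: the daemon
signs files at rest and the scrubber only flags data that later changes
on disk behind gluster's back, so anything written wrong through the
client gets signed as-is. To at least see whether a scrub has run since
the corruption, a rough sketch, assuming a volume named "gv0" (made-up
name):

    # last scrub time, files scrubbed and corrupted objects per node
    gluster volume bitrot gv0 scrub status

    # make scrubbing run more often / with more I/O if needed
    gluster volume bitrot gv0 scrub-frequency daily
    gluster volume bitrot gv0 scrub-throttle normal

I'm not aware of a way to force an immediate scrub on 3.7; raising the
frequency is the closest thing I know of.
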
>
>
> On Thu, May 19, 2016 at 04:04:49PM -0400, Alastair Neil wrote:
> > I am slightly confused: you say you have image file corruption, but
> > then you say the qemu-img check says there is no corruption. If what
> > you mean is that you see I/O errors during a heal, this is likely to
> > be due to I/O starvation, something that is a well-known issue.
> > There is work happening to improve this in version 3.8:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1269461
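Until that lands, the usual mitigation suggested for heal-related freezes
with VM images is to stop the client mounts from doing the heals
themselves and leave it to the self-heal daemon. A hedged sketch, not
something I've verified on 3.7.11, assuming a volume named "gv0":

    # let the self-heal daemon handle healing instead of the client I/O path
    gluster volume set gv0 cluster.data-self-heal off
    gluster volume set gv0 cluster.metadata-self-heal off
    gluster volume set gv0 cluster.entry-self-heal off

    # limit how many background heals a client kicks off at once
    gluster volume set gv0 cluster.background-self-heal-count 4

No guarantee it helps in your case; it just reduces how much healing work
gets dragged into the VMs' own I/O path during a heal.
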
> > On 19 May 2016 at 09:58, Kevin Lemonnier <lemonnierk at ulrar.net> wrote:
> >
> > That's a different problem then; I have corruption without removing or
> > adding bricks, as mentioned. Might be two separate issues.
> >
> > On Thu, May 19, 2016 at 11:25:34PM +1000, Lindsay Mathieson wrote:
> > > On 19/05/2016 12:17 AM, Lindsay Mathieson wrote:
> > >
> > > One thought - since the VMs are active while the brick is
> > > removed/re-added, could it be the shards that are written while the
> > > brick is added that are the reverse healing shards?
> > >
> > > I tested by:
> > >
> > > - removing brick 3
> > >
> > > - erasing brick 3
> > >
> > > - closing down all VMs
> > >
> > > - adding new brick 3
> > >
> > > - waiting until the heal number reached its max and started decreasing
> > >
> > > There were no reverse heals.
> > >
> > > - Started the VMs back up. No real issues there, though one showed I/O
> > > errors, presumably due to shards being locked as they were healed.
> > >
> > > - VMs started OK, no reverse heals were noted, and eventually brick 3
> > > was fully healed. The VMs do not appear to be corrupted.
> > >
> > > So it would appear the problem is adding a brick while the volume is
> > > being written to.
> > >
> > > Cheers,
> > >
> > > --
> > > Lindsay Mathieson
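For reference, Lindsay's remove/erase/re-add sequence would look roughly
like this on the CLI; a sketch only, assuming a replica 3 volume "gv0"
with the third brick at node3:/data/brick (names made up):

    # drop the brick, shrinking the volume to replica 2
    gluster volume remove-brick gv0 replica 2 node3:/data/brick force

    # wipe the old brick directory on node3, then add it back as replica 3
    gluster volume add-brick gv0 replica 3 node3:/data/brick

    # kick off a full heal onto the new brick and watch it drain
    gluster volume heal gv0 full
    gluster volume heal gv0 info

That matches his conclusion: the risky window seems to be the brick being
added back while the VMs keep writing to the volume.
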
> >
> >
> > --
> > Kevin Lemonnier
> > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
>