[Bugs] [Bug 1149118] New: Spurious failure on disperse tests (bad file size on brick)

Fri Oct 3 09:14:51 UTC 2014

https://bugzilla.redhat.com/show_bug.cgi?id=1149118

            Bug ID: 1149118
           Summary: Spurious failure on disperse tests (bad file size on
                    brick)
           Product: GlusterFS
           Version: 3.6.0
         Component: disperse
          Assignee: gluster-bugs at redhat.com
          Reporter: xhernandez at datalab.es
                CC: bugs at gluster.org

+++ This bug was initially created as a clone of Bug #1144108 +++

Description of problem:

Sometimes, especially on NetBSD, the ec test scripts fail because a file on one
of the bricks has an incorrect size.

Version-Release number of selected component (if applicable): master

Additional info:

This is caused by a side effect of reading a file through the mount point
while a brick is down. The read may update the file's access time, and the
brick that is down misses that update, leaving it out of sync. Self-heal
repairs this correctly, but the script did not give self-heal enough time to
complete before checking the bricks.
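
The general remedy for this kind of timing problem is to poll for a condition
instead of assuming it already holds. The following is only a minimal sketch
of that idea in Python, not the code used in the actual test scripts: the
helper wait_until and the fake check make_fake_heal_check are hypothetical
names introduced here for illustration, and the fake check stands in for
whatever a real test would query to learn that self-heal has finished.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float,
               interval: float = 0.1) -> bool:
    """Poll check() until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def make_fake_heal_check(ready_after: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a 'self-heal finished' query.

    The returned function reports False until it has been called
    `ready_after` times, simulating a heal that takes a while.
    """
    calls = {"n": 0}

    def heal_finished() -> bool:
        calls["n"] += 1
        return calls["n"] >= ready_after

    return heal_finished


if __name__ == "__main__":
    heal_finished = make_fake_heal_check(ready_after=3)
    # Only inspect brick contents once the (simulated) heal has completed.
    if wait_until(heal_finished, timeout=2.0, interval=0.01):
        print("heal finished, safe to inspect bricks")
    else:
        print("timed out waiting for heal")
```

A test written this way fails only when the timeout expires, rather than
whenever self-heal happens to be slower than a fixed delay.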

--- Additional comment from Anand Avati on 2014-09-18 18:47:42 CEST ---

REVIEW: http://review.gluster.org/8771 (test/ec: Let self-heal repair files
before accessing bricks) posted (#1) for review on master by Xavier Hernandez
(xhernandez at datalab.es)

--- Additional comment from Anand Avati on 2014-09-30 18:46:39 CEST ---

REVIEW: http://review.gluster.org/8892 (test/ec: Fix spurious failures caused
by self-heal) posted (#1) for review on master by Xavier Hernandez
(xhernandez at datalab.es)

--- Additional comment from Anand Avati on 2014-10-01 11:27:44 CEST ---

REVIEW: http://review.gluster.org/8892 (test/ec: Fix spurious failures caused
by self-heal) posted (#2) for review on master by Xavier Hernandez
(xhernandez at datalab.es)

--- Additional comment from Anand Avati on 2014-10-03 11:01:30 CEST ---

COMMIT: http://review.gluster.org/8892 committed in master by Vijay Bellur
(vbellur at redhat.com) 
------
commit a97ad9b69bb17f2351c59512fa9c6cb25d82b4da
Author: Xavier Hernandez <xhernandez at datalab.es>
Date:   Thu Sep 18 18:42:34 2014 +0200

    test/ec: Fix spurious failures caused by self-heal

    The sha1sum of a file may update the access time of that file.
    If this happens while a brick is down, as it is forced in the
    test, that brick doesn't get the update, getting out of sync.

    When the brick is restarted, self-heal repairs the file, but
    the test shouldn't access brick contents until self-heal finishes.
    If this is combined with a kill of another brick before self-heal
    has finished repairing the file, the volume could become inaccessible.

    Since the purpose of these tests is only to check ec functionality
    (there is another test that checks self-heal), the test that corrupts
    the file has been removed.

    Additional checks to validate the state of the volume have been added
    to avoid some timing issues.

    BUG: 1144108
    Change-Id: Ibd9288de519914663998a1fbc4321ec92ed6082c
    Signed-off-by: Xavier Hernandez <xhernandez at datalab.es>
    Reviewed-on: http://review.gluster.org/8892
    Reviewed-by: Emmanuel Dreyfus <manu at netbsd.org>
    Tested-by: Emmanuel Dreyfus <manu at netbsd.org>
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Dan Lambright <dlambrig at redhat.com>
    Reviewed-by: Vijay Bellur <vbellur at redhat.com>
