[Bugs] [Bug 1408171] New: VM pauses due to storage I/O error, when one of the data brick is down with arbiter/replica volume

Thu Dec 22 11:08:11 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1408171

            Bug ID: 1408171
           Summary: VM pauses due to storage I/O error, when one of the
                    data brick is down with arbiter/replica volume
           Product: GlusterFS
           Version: 3.9
         Component: replicate
          Assignee: bugs at gluster.org
          Reporter: ravishankar at redhat.com
                CC: bugs at gluster.org
        Depends On: 1404982, 1406224



+++ This bug was initially created as a clone of Bug #1406224 +++

+++ This bug was initially created as a clone of Bug #1404982 +++

Description of problem:
In an arbiter volume, when one of the data bricks is killed while I/O is in
progress, the VM goes to paused state and the following is seen in the mount logs.

[2016-12-15 09:47:16.357700] E [MSGID: 108008] [afr-transaction.c:2557:afr_write_txn_refresh_done] 0-data-replicate-0: Failing FXATTROP on gfid 883a5c0a-e16e-4937-83b5-5d90df1ec956: split-brain observed.
[2016-12-15 09:47:16.357724] E [MSGID: 133016] [shard.c:631:shard_update_file_size_cbk] 0-data-shard: Update to file size xattr failed on 883a5c0a-e16e-4937-83b5-5d90df1ec956 [Input/output error]
[2016-12-15 09:47:16.357998] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 15170: WRITE => -1 gfid=883a5c0a-e16e-4937-83b5-5d90df1ec956 fd=0x7fd0f000f0f8 (Input/output error)


Version-Release number of selected component (if applicable):
glusterfs-3.8.4-8.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install HC with three nodes.
2. Inside the VM, mount the disk and start writing I/O.
3. While I/O is going on, kill one of the data bricks.

Actual results:
The VM goes to paused state with an Input/output error.

Expected results:
The VM should not go to paused state, as only one of the data bricks is down.

Additional info:

Following is seen in the mount logs:
===========================================
[2016-12-15 09:47:16.357700] E [MSGID: 108008] [afr-transaction.c:2557:afr_write_txn_refresh_done] 0-data-replicate-0: Failing FXATTROP on gfid 883a5c0a-e16e-4937-83b5-5d90df1ec956: split-brain observed.
[2016-12-15 09:47:16.357724] E [MSGID: 133016] [shard.c:631:shard_update_file_size_cbk] 0-data-shard: Update to file size xattr failed on 883a5c0a-e16e-4937-83b5-5d90df1ec956 [Input/output error]
[2016-12-15 09:47:16.357998] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 15170: WRITE => -1 gfid=883a5c0a-e16e-4937-83b5-5d90df1ec956 fd=0x7fd0f000f0f8 (Input/output error)



--- Additional comment from Ravishankar N on 2016-12-19 11:25:20 EST ---

Thanks Kasturi for providing the setup for testing and thanks Satheesaran for
providing virsh based commands for re-creating the issue.

The issue is due to a race between inode_refresh_done() and
__afr_set_in_flight_sb_status() that occurs when I/O is going on and a brick is
brought down or up. When the brick goes down or comes up, inode refresh is
triggered in the write transaction, and inode_refresh_done() sets the correct
data/metadata readable and event_generation. But before the transaction can
proceed to the write FOP, __afr_set_in_flight_sb_status() from another writev
cbk resets the event_generation. When the first write (the one that follows the
inode refresh) gets the event_gen in afr_inode_get_readable(), it gets zero,
because of which the write fails with EIO.
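
The interleaving described above can be replayed in a small single-threaded
model. This is only a sketch: the struct and every model_* function are
hypothetical stand-ins, not the actual AFR code, and the only behaviour assumed
is what the comment states (refresh sets event_gen, the in-flight callback
resets it to zero, and a zero event_gen after refresh is treated as EIO).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for AFR's per-inode context. */
struct model_inode_ctx {
    int event_gen; /* 0 is taken to mean "not valid, refresh needed" */
};

/* Models inode_refresh_done(): records the current event generation. */
static void model_refresh_done(struct model_inode_ctx *ctx, int current_gen)
{
    assert(ctx != NULL && current_gen > 0);
    ctx->event_gen = current_gen;
}

/* Models __afr_set_in_flight_sb_status() called from another writev
 * callback: it resets the event generation. */
static void model_in_flight_reset(struct model_inode_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->event_gen = 0;
}

/* Models the post-refresh check made before the fix: a zero event
 * generation is treated the same as "no readable subvolume". */
static int model_write_after_refresh(const struct model_inode_ctx *ctx)
{
    assert(ctx != NULL);
    return ctx->event_gen == 0 ? -EIO : 0;
}

/* Replays the sequence; a non-zero `racing` injects the reset between
 * the refresh and the write that follows it. */
static int model_replay(int racing)
{
    struct model_inode_ctx ctx = { 0 };

    model_refresh_done(&ctx, 2);
    if (racing)
        model_in_flight_reset(&ctx);
    return model_write_after_refresh(&ctx);
}
```

In this model, model_replay(0) returns 0 while model_replay(1) returns -EIO,
which corresponds to the write failure seen in the mount logs.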

While ignoring event_generation seems to fix the issue
-----------------------------------------------------------

diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 60bae18..2f32e44 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -1089,7 +1089,7 @@ afr_txn_refresh_done (call_frame_t *frame, xlator_t *this, int err)
                                       &event_generation,
                                       local->transaction.type);

-        if (ret == -EIO || !event_generation) {
+        if (ret == -EIO){
                 /* No readable subvolume even after refresh ==> splitbrain.*/
                 if (!priv->fav_child_policy) {
                         err = -EIO;
-----------------------------------------------------------
I need to convince myself that ignoring event_gen in afr_txn_refresh_done() for
reads does not have any repercussions (there is no problem in ignoring it for
writes).

--- Additional comment from Ravishankar N on 2016-12-19 20:34:53 EST ---

Before http://review.gluster.org/#/c/15673/, after inode refresh, we failed
read txns in case of EIO or event_generation being zero. For write
transactions, the check was only for EIO. 15673 re-factored the code to fail
both read and write when event_generation=0. This seems to have caused a
regression as explained in the description above.
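
The difference between the two behaviours can be written down as a pair of
predicates. This is a sketch under the assumptions stated in this comment only:
the enum and both function names are hypothetical and do not appear in the AFR
sources, and `ret` is assumed to be 0 or a negative errno value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical transaction kinds, for illustration only. */
enum model_txn_kind { MODEL_TXN_READ, MODEL_TXN_WRITE };

/* Behaviour described for the code before change 15673: reads fail on
 * EIO or a zero event_generation, writes fail on EIO only. */
static bool model_fail_before_15673(enum model_txn_kind kind, int ret,
                                    int event_generation)
{
    assert(ret <= 0);
    if (ret == -EIO)
        return true;
    return kind == MODEL_TXN_READ && event_generation == 0;
}

/* Behaviour described for the code after change 15673: both kinds
 * fail on EIO or a zero event_generation. */
static bool model_fail_after_15673(enum model_txn_kind kind, int ret,
                                   int event_generation)
{
    (void)kind; /* the transaction kind no longer matters here */
    assert(ret <= 0);
    return ret == -EIO || event_generation == 0;
}
```

The only input on which the two predicates disagree is a write transaction with
ret == 0 and event_generation == 0, which is the case the race produces.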

While we could restore the above behaviour, it seems we don't need to check the
event_gen value for read transactions either, because event_gen could very well
be set to zero after we checked it (post inode refresh) to be non-zero but just
before we did a stack wind for that read txn.

Sent a patch to see if this breaks any upstream regression test.

--- Additional comment from Worker Ant on 2016-12-19 20:36:14 EST ---

REVIEW: http://review.gluster.org/16205 (afr: Ignore event_generation checks
post inode refresh) posted (#1) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2016-12-21 09:01:46 EST ---

REVIEW: http://review.gluster.org/16205 (afr: Ignore event_generation checks
post inode refresh) posted (#2) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2016-12-22 01:03:12 EST ---

REVIEW: http://review.gluster.org/16205 (afr: Ignore event_generation checks
post inode refresh) posted (#3) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2016-12-22 01:13:43 EST ---

REVIEW: http://review.gluster.org/16205 (afr: Ignore event_generation checks
post inode refresh for write txns) posted (#4) for review on master by
Ravishankar N (ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2016-12-22 06:06:35 EST ---

COMMIT: http://review.gluster.org/16205 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com) 
------
commit 7ee998b9041d594d93a4e2ef369892c185e80def
Author: Ravishankar N <ravishankar at redhat.com>
Date:   Tue Dec 20 07:05:02 2016 +0530

    afr: Ignore event_generation checks post inode refresh for write txns

    Before http://review.gluster.org/#/c/15673/, after inode refresh, we
    failed read txns in case of EIO or event_generation being zero. For
    write transactions, the check was only for EIO. 15673 re-factored the
    code to fail both read and write when event_generation=0. This seems to
    have caused a regression as explained in the BZ.

    This patch restores that behaviour in afr_txn_refresh_done().

    Change-Id: Ib8e116506badce6f58b55827dbe403d95069d744
    BUG: 1406224
    Signed-off-by: Ravishankar N <ravishankar at redhat.com>
    Reviewed-on: http://review.gluster.org/16205
    Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1404982
[Bug 1404982] VM pauses due to storage I/O error, when one of the data
brick is down with arbiter volume/replica volume
https://bugzilla.redhat.com/show_bug.cgi?id=1406224
[Bug 1406224] VM pauses due to storage I/O error, when one of the data
brick is down with arbiter/replica volume