[Bugs] [Bug 1245934] New: [RHEV-RHGS] App VMs paused due to IO error caused by split-brain, after initiating remove-brick operation
bugzilla at redhat.com
Thu Jul 23 07:13:26 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1245934
Bug ID: 1245934
Summary: [RHEV-RHGS] App VMs paused due to IO error caused by
split-brain, after initiating remove-brick operation
Product: GlusterFS
Version: 3.7.3
Component: distribute
Keywords: Triaged
Severity: urgent
Assignee: bugs at gluster.org
Reporter: ravishankar at redhat.com
CC: bugs at gluster.org, gluster-bugs at redhat.com,
nbalacha at redhat.com, ravishankar at redhat.com,
rcyriac at redhat.com, rgowdapp at redhat.com,
sasundar at redhat.com, ssampat at redhat.com
Depends On: 1243542, 1244165
Description of problem:
------------------------
With an 8x2 distributed-replicate volume, initiated a remove-brick operation with data
migration. After a few minutes, all the application VMs with their disk images on
that gluster volume went into a paused state.
Noticed split-brain error messages in the fuse mount log.
Version
--------
RHEL 6.7 as hypervisor
RHGS 3.1 based on RHEL 7.1
How reproducible:
-----------------
Tried only once
Steps to Reproduce:
-------------------
1. Create a 2x2 distributed-replicate volume
2. Use this gluster volume as the 'Data Domain' for RHEV
3. Create a few App VMs and install the OS
4. Remove the bricks where the disk images of the App VMs reside
Actual results:
----------------
App VMs went into a **paused** state
Expected results:
-----------------
App VMs should remain healthy
--- Additional comment from SATHEESARAN on 2015-07-15 14:17:43 EDT ---
Following error messages are seen in the fuse mount logs:
[2015-07-15 17:49:42.709088] E [MSGID: 114031]
[client-rpc-fops.c:1673:client3_3_finodelk_cbk] 6-vol1-client-0: remote
operation failed [Transport endpoint is not connected]
[2015-07-15 17:49:42.710849] W [MSGID: 114031]
[client-rpc-fops.c:1028:client3_3_fsync_cbk] 6-vol1-client-0: remote operation
failed [Transport endpoint is not connected]
[2015-07-15 17:49:42.710874] W [MSGID: 108035]
[afr-transaction.c:1614:afr_changelog_fsync_cbk] 6-vol1-replicate-0:
fsync(b7d21675-6fd8-472a-b7d9-71d7436c614d) failed on subvolume vol1-client-0.
Transaction was WRITE [Transport endpoint is not connected]
[2015-07-15 17:49:42.710897] W [MSGID: 108001]
[afr-transaction.c:686:afr_handle_quorum] 6-vol1-replicate-0:
b7d21675-6fd8-472a-b7d9-71d7436c614d: Failing WRITE as quorum is not met
[2015-07-15 18:12:15.544061] E [MSGID: 108008]
[afr-transaction.c:1984:afr_transaction] 12-vol1-replicate-5: Failing WRITE on
gfid b7d21675-6fd8-472a-b7d9-71d7436c614d: split-brain observed. [Input/output
error]
[2015-07-15 18:12:15.737906] W [fuse-bridge.c:2273:fuse_writev_cbk]
0-glusterfs-fuse: 293197: WRITE => -1 (Input/output error)
[2015-07-15 18:12:17.022070] W [MSGID: 114031]
[client-rpc-fops.c:2971:client3_3_lookup_cbk] 12-vol1-client-5: remote
operation failed. Path:
/c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b
(d83a3f9a-7625-4872-b61f-0e4b63922a75) [No such file or directory]
[2015-07-15 18:12:17.022073] W [MSGID: 114031]
[client-rpc-fops.c:2971:client3_3_lookup_cbk] 12-vol1-client-4: remote
operation failed. Path:
/c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b
(d83a3f9a-7625-4872-b61f-0e4b63922a75) [No such file or directory]
[2015-07-15 18:12:22.952290] W [fuse-bridge.c:2273:fuse_writev_cbk]
0-glusterfs-fuse: 293304: WRITE => -1 (Input/output error)
[2015-07-15 18:12:22.952550] W [fuse-bridge.c:2273:fuse_writev_cbk]
0-glusterfs-fuse: 293306: WRITE => -1 (Input/output error)
--- Additional comment from Ravishankar N on 2015-07-16 07:33:59 EDT ---
Able to reproduce the issue by running a continuous `dd` into a file from a
fuse mount on a 2x2 volume and reducing it to 1x2, making sure to remove the
replica pair in which the file resides. dd terminated with EIO.
[root at vm2 fuse_mnt]# dd if=/dev/urandom of=file
dd: writing to ‘file’: Input/output error
dd: closing output file ‘file’: Input/output error
[root at vm2 fuse_mnt]#
The EIO is returned by afr_transaction(), which is not able to find a readable
subvolume for the inode. I need to debug further to see why.
FWIW, there was no data corruption/loss and the migration completed
successfully. New reads/writes to the file were successful.
[root at vm2 fuse_mnt]# echo append>>file
[root at vm2 fuse_mnt]# echo $?
0
[root at vm2 fuse_mnt]# tail -1 file
��_�d�!��aappend
[root at vm2 fuse_mnt]#
[root at vm2 fuse_mnt]# echo $?
0
--- Additional comment from Anand Avati on 2015-07-17 06:59:43 EDT ---
REVIEW: http://review.gluster.org/11713 (dht: send lookup even for fd based
operations during rebalance) posted (#1) for review on master by Ravishankar N
(ravishankar at redhat.com)
--- Additional comment from Anand Avati on 2015-07-17 13:04:54 EDT ---
REVIEW: http://review.gluster.org/11713 (dht: send lookup even for fd based
operations during rebalance) posted (#2) for review on master by Ravishankar N
(ravishankar at redhat.com)
--- Additional comment from Anand Avati on 2015-07-19 05:24:44 EDT ---
REVIEW: http://review.gluster.org/11713 (dht: send lookup even for fd based
operations during rebalance) posted (#3) for review on master by Ravishankar N
(ravishankar at redhat.com)
--- Additional comment from Anand Avati on 2015-07-23 02:45:22 EDT ---
COMMIT: http://review.gluster.org/11713 committed in master by Raghavendra G
(rgowdapp at redhat.com)
------
commit 94372373ee355e42dfe1660a50315adb4f019d64
Author: Ravishankar N <ravishankar at redhat.com>
Date: Fri Jul 17 16:04:01 2015 +0530
dht: send lookup even for fd based operations during rebalance
Problem:
dht_rebalance_inprogress_task() was not sending lookups to the
destination subvolume for a file undergoing writes during rebalance. Due to
this, afr was not able to populate the read_subvol and failed the write
with EIO.
Fix:
Send lookup for fd based operations as well.
Thanks to Raghavendra G for helping with the RCA.
Change-Id: I638c203abfaa45b29aa5902ffd76e692a8212a19
BUG: 1244165
Signed-off-by: Ravishankar N <ravishankar at redhat.com>
Reviewed-on: http://review.gluster.org/11713
Tested-by: Gluster Build System <jenkins at build.gluster.com>
Reviewed-by: N Balachandran <nbalacha at redhat.com>
Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1243542
[Bug 1243542] [RHEV-RHGS] App VMs paused due to IO error caused by
split-brain, after initiating remove-brick operation
https://bugzilla.redhat.com/show_bug.cgi?id=1244165
[Bug 1244165] [RHEV-RHGS] App VMs paused due to IO error caused by
split-brain, after initiating remove-brick operation
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
More information about the Bugs mailing list