[Bugs] [Bug 1749305] New: Failures in remove-brick due to [Input/output error] errors
bugzilla at redhat.com
Thu Sep 5 10:26:28 UTC 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1749305
Bug ID: 1749305
Summary: Failures in remove-brick due to [Input/output error] errors
Product: GlusterFS
Version: 7
Status: NEW
Component: replicate
Severity: high
Assignee: bugs at gluster.org
Reporter: rkavunga at redhat.com
CC: bugs at gluster.org, ksubrahm at redhat.com,
nchilaka at redhat.com, rhs-bugs at redhat.com,
rkavunga at redhat.com, saraut at redhat.com,
spalai at redhat.com, storage-qa-internal at redhat.com
Depends On: 1726673, 1728770
Target Milestone: ---
Classification: Community
+++ This bug was initially created as a clone of Bug #1728770 +++
+++ This bug was initially created as a clone of Bug #1726673 +++
Description of problem:
While performing remove-brick to convert a 3X3 volume into a 2X3 volume, the
remove-brick rebalance reported failures such as:
E [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_opendir_cbk] 0-vol4-client-8: remote operation failed. Path: /dir1/thread0/level03/level13/level23/level33/level43 (69e97af3-d2d7-450a-881e-0c4ef6ac1355) [Input/output error]
Version-Release number of selected component (if applicable):
6.0.7
How reproducible:
1/1
Steps to Reproduce:
1. Create a 1X3 volume.
2. Fuse-mount the volume and start I/O on it.
3. Convert it into a 2X3 volume and trigger rebalance.
4. Let the rebalance complete, then convert it into a 3X3 volume and trigger
rebalance again.
5. After that rebalance completes, start a remove-brick operation to convert
the volume back into a 2X3 volume.
6. Check the remove-brick status.
Actual results:
There are failures in remove-brick rebalance.
Errors from rebalance logs:
E [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_opendir_cbk]
0-vol4-client-2: remote operation failed. Path:
/dir1/thread0/level03/level13/level23/level33/level43
(69e97af3-d2d7-450a-881e-0c4ef6ac1355) [Input/output error]
E [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_opendir_cbk]
0-vol4-client-8: remote operation failed. Path:
/dir1/thread0/level03/level13/level23/level33/level43
(69e97af3-d2d7-450a-881e-0c4ef6ac1355) [Input/output error]
W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
0-vol4-client-8: remote operation failed. Path:
/dir1/thread0/level03/level13/level23/level33/level43/level53/5d1b1579%%P3TRO7PG35
(558423e2-478e-40e9-9958-31c710e50b89) [Input/output error]
W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
0-vol4-client-2: remote operation failed. Path:
/dir1/thread0/level03/level13/level23/level33/level43
(69e97af3-d2d7-450a-881e-0c4ef6ac1355) [Input/output error]
Expected results:
Remove-brick should complete successfully.
Remove-brick rebalance status:
==============================
# gluster v remove-brick vol4 replica 3 10.70.47.88:/bricks/brick2/vol4-b2
10.70.47.190:/bricks/brick2/vol4-b2 10.70.47.5:/bricks/brick2/vol4-b2 status
Node            Rebalanced-files    size     scanned    failures    skipped    status       run time in h:m:s
---------       ----------------    -----    -------    --------    -------    ---------    -----------------
10.70.47.190                3463    3.5MB      18425          23          0    completed              0:37:14
10.70.47.5                  3308    3.7MB      21920         136          0    completed              0:32:59
localhost                   3397    3.3MB      21977         138          0    completed              0:33:35
Checking the volume status showed that two bricks are down:
=================================================================
# gluster v status vol4
Status of volume: vol4
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.47.88:/bricks/brick2/vol4-b1 49159 0 Y 30394
Brick 10.70.47.190:/bricks/brick2/vol4-b1 49159 0 Y 29191
Brick 10.70.47.5:/bricks/brick2/vol4-b1 N/A N/A N N/A
Brick 10.70.46.246:/bricks/brick2/vol4-b1 49158 0 Y 22598
Brick 10.70.47.188:/bricks/brick2/vol4-b1 49158 0 Y 22865
Brick 10.70.46.63:/bricks/brick2/vol4-b1 49158 0 Y 21036
Brick 10.70.47.88:/bricks/brick2/vol4-b2 49160 0 Y 5938
Brick 10.70.47.190:/bricks/brick2/vol4-b2 49160 0 Y 4825
Brick 10.70.47.5:/bricks/brick2/vol4-b2 N/A N/A N N/A
Self-heal Daemon on localhost N/A N/A Y 6330
Self-heal Daemon on 10.70.46.246 N/A N/A Y 5672
Self-heal Daemon on 10.70.47.5 N/A N/A Y 5600
Self-heal Daemon on 10.70.46.63 N/A N/A Y 4593
Self-heal Daemon on 10.70.47.188 N/A N/A Y 4501
Self-heal Daemon on 10.70.47.190 N/A N/A Y 5352
Task Status of Volume vol4
------------------------------------------------------------------------------
Task : Remove brick
ID : 273f04c3-b8bb-4613-a403-0c655de86ca3
Removed bricks:
10.70.47.88:/bricks/brick2/vol4-b2
10.70.47.190:/bricks/brick2/vol4-b2
10.70.47.5:/bricks/brick2/vol4-b2
Status : completed
dmesg:
=====
[161039.214245] XFS (dm-66): Metadata CRC error detected at xfs_dir3_block_read_verify+0x5e/0x110 [xfs], xfs_dir3_block block 0x1dd8568
[161039.214912] XFS (dm-66): Unmount and run xfs_repair
[161039.215126] XFS (dm-66): First 64 bytes of corrupted metadata buffer:
[161039.215426] ffffbb1db27a6000: 20 20 20 20 20 23 20 51 75 69 63 6b 20 4d 61 69       # Quick Mai
[161039.215729] ffffbb1db27a6010: 6c 20 54 72 61 6e 73 66 65 72 20 50 72 6f 74 6f  l Transfer Proto
[161039.216110] ffffbb1db27a6020: 63 6f 6c 0a 71 6d 74 70 20 20 20 20 20 20 20 20  col.qmtp
[161039.216527] ffffbb1db27a6030: 20 20 20 20 32 30 39 2f 75 64 70 20 20 20 20 20      209/udp
[161039.217200] XFS (dm-66): metadata I/O error: block 0x1dd8568 ("xfs_trans_read_buf_map") error 74 numblks 16
[161039.217937] XFS (dm-66): xfs_do_force_shutdown(0x1) called from line 370 of file fs/xfs/xfs_trans_buf.c. Return address = 0xffffffffc057de9a
[161039.344196] XFS (dm-66): I/O Error Detected. Shutting down filesystem
[161039.344495] XFS (dm-66): Please umount the filesystem and rectify the problem(s)
---> Due to the disk issue, one brick is down in two of the replica pairs of
the volume, but since this is a distributed-replicated volume, rebalance should
still not see failures.
Failure reason:
"[2019-07-02 08:32:01.514139] W [MSGID: 109023]
[dht-rebalance.c:626:__is_file_migratable] 0-vol4-dht: Mi
grate file
failed:/dir1/thread0/level04/level14/level24/level34/level44/level54/level64/level74/level84/
symlink_to_files/5d1b15ed%%XS3OMQKQBN: Unable to get lock count for file
"
The key GLUSTERFS_POSIXLK_COUNT is used to fetch the lock count from the
posix-locks translator. This information is used to decide whether or not to
migrate the file.
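To make that decision path concrete, here is a minimal, self-contained sketch
(plain C; the struct, key string, and function names are invented stand-ins for
illustration, not the real GlusterFS dict_t/DHT code) of how the rebalance
check behaves when the lock-count key is requested in lookup: a missing key is
treated as a hard failure, which matches the "Unable to get lock count for
file" message in the rebalance log above.
<code>
/* Toy model of the rebalance-side lock-count check; NOT the real GlusterFS
 * code.  A tiny string->int "dict" stands in for dict_t, and
 * check_migratable() stands in for the check in __is_file_migratable(). */
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 8
#define LOCK_COUNT_KEY "posixlk-count" /* stand-in for GLUSTERFS_POSIXLK_COUNT */

struct toy_dict {
    const char *keys[MAX_KEYS];
    int vals[MAX_KEYS];
    int n;
};

static void dict_set(struct toy_dict *d, const char *key, int val)
{
    d->keys[d->n] = key;
    d->vals[d->n] = val;
    d->n++;
}

/* Returns 0 and fills *val if the key exists, -1 otherwise. */
static int dict_get(const struct toy_dict *d, const char *key, int *val)
{
    for (int i = 0; i < d->n; i++) {
        if (strcmp(d->keys[i], key) == 0) {
            *val = d->vals[i];
            return 0;
        }
    }
    return -1;
}

/* Migration is allowed only when the lookup reply carries a lock count of
 * zero.  A missing key is a failure, matching the rebalance log message. */
static int check_migratable(const struct toy_dict *lookup_reply, const char *path)
{
    int locks = 0;

    if (dict_get(lookup_reply, LOCK_COUNT_KEY, &locks) != 0) {
        fprintf(stderr, "Migrate file failed: %s: Unable to get lock count for file\n", path);
        return -1;
    }
    if (locks > 0) {
        fprintf(stderr, "Migrate file failed: %s: file has %d active locks\n", path, locks);
        return -1;
    }
    return 0;
}

int main(void)
{
    struct toy_dict good_reply = {{0}, {0}, 0};
    struct toy_dict healed_reply = {{0}, {0}, 0}; /* key never populated: the bug here */

    dict_set(&good_reply, LOCK_COUNT_KEY, 0);

    check_migratable(&good_reply, "/dir1/file-a");   /* migratable */
    check_migratable(&healed_reply, "/dir1/file-b"); /* fails: key missing */
    return 0;
}
</code>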
In the current scenario, as Sayalee mentioned, one disk is corrupted on server
*.5, rendering both participating bricks from that server unresponsive (all
operations lead to I/O errors). Given that only one brick from each of the two
replica sets was down, DHT should still have received a valid response. In
fact, the key was missing from the response dictionary altogether.
Moving to AFR component for analysis.
Adding a needinfo on Rafi, as he had done some investigation on the same.
--- Additional comment from Mohammed Rafi KC on 2019-07-10 16:08:22 UTC ---
RCA:
As mentioned in comment 6, the operation failed because the lookup could not
return the lock count requested through GLUSTERFS_POSIXLK_COUNT. While
processing afr_lookup_cbk, if a name heal is required, we perform it in
afr_lookup_selfheal_wrap, wiping all of the current lookup replies, and once
the heal finishes we return fresh data from a new lookup. But that fresh lookup
is issued without the original xdata_req, so posix is never asked to populate
the lock count.
<code>
int
afr_lookup_selfheal_wrap(void *opaque)
{
    int ret = 0;
    call_frame_t *frame = opaque;
    afr_local_t *local = NULL;
    xlator_t *this = NULL;
    inode_t *inode = NULL;
    uuid_t pargfid = {
        0,
    };

    local = frame->local;
    this = frame->this;
    loc_pargfid(&local->loc, pargfid);

    ret = afr_selfheal_name(frame->this, pargfid, local->loc.name,
                            &local->cont.lookup.gfid_req, local->xattr_req);
    if (ret == -EIO)
        goto unwind;

    afr_local_replies_wipe(local, this->private);

    /* The fresh lookup after the name heal is issued with a NULL xdata
     * request, so the keys from the original lookup's xattr_req (such as
     * GLUSTERFS_POSIXLK_COUNT) are never requested from the bricks. */
    inode = afr_selfheal_unlocked_lookup_on(frame, local->loc.parent,
                                            local->loc.name, local->replies,
                                            local->child_up, NULL);
    if (inode)
        inode_unref(inode);

    afr_lookup_metadata_heal_check(frame, this);
    return 0;

unwind:
    AFR_STACK_UNWIND(lookup, frame, -1, EIO, NULL, NULL, NULL, NULL);
    return 0;
}
</code>
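A rough, self-contained sketch of the consequence of that missing xdata_req,
and of what forwarding it achieves (again plain C with invented names; a toy
model only, not the AFR code or the actual patch): the lookup re-issued after
the name heal only gets the lock-count key populated if the caller's request
keys are forwarded to it.
<code>
/* Toy model of the lookup-after-name-heal path; NOT the real AFR code.
 * A "keyset" stands in for xattr_req/xdata, and the reply is the set of
 * keys the (simulated) brick populated in response. */
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 8
#define LOCK_COUNT_KEY "posixlk-count" /* stand-in for GLUSTERFS_POSIXLK_COUNT */

struct keyset {
    const char *keys[MAX_KEYS];
    int n;
};

static void keyset_add(struct keyset *s, const char *key)
{
    s->keys[s->n++] = key;
}

static int keyset_has(const struct keyset *s, const char *key)
{
    for (int i = 0; i < s->n; i++)
        if (strcmp(s->keys[i], key) == 0)
            return 1;
    return 0;
}

/* Simulated brick-side lookup: it only populates keys it was asked for. */
static struct keyset brick_lookup(const struct keyset *request)
{
    struct keyset reply = {{0}, 0};

    if (request && keyset_has(request, LOCK_COUNT_KEY))
        keyset_add(&reply, LOCK_COUNT_KEY);
    return reply;
}

/* Lookup re-issued after a name heal.  With forward_xdata == 0 the request
 * keys are dropped (the bug); with 1 they are forwarded (the fix). */
static struct keyset lookup_after_name_heal(const struct keyset *orig_request,
                                            int forward_xdata)
{
    return brick_lookup(forward_xdata ? orig_request : NULL);
}

int main(void)
{
    struct keyset request = {{0}, 0};
    struct keyset buggy;
    struct keyset fixed;

    keyset_add(&request, LOCK_COUNT_KEY); /* what DHT asked for in the lookup */

    buggy = lookup_after_name_heal(&request, 0);
    fixed = lookup_after_name_heal(&request, 1);

    printf("buggy path reply carries lock count: %s\n",
           keyset_has(&buggy, LOCK_COUNT_KEY) ? "yes" : "no");
    printf("fixed path reply carries lock count: %s\n",
           keyset_has(&fixed, LOCK_COUNT_KEY) ? "yes" : "no");
    return 0;
}
</code>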
--- Additional comment from Worker Ant on 2019-07-10 16:22:14 UTC ---
REVIEW: https://review.gluster.org/23024 (afr/lookup: Pass xattr_req in while
doing a selfheal in lookup) posted (#1) for review on master by mohammed rafi
kc
--- Additional comment from Worker Ant on 2019-09-05 09:53:57 UTC ---
REVIEW: https://review.gluster.org/23024 (afr/lookup: Pass xattr_req in while
doing a selfheal in lookup) merged (#15) on master by Ravishankar N
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1726673
[Bug 1726673] Failures in remove-brick due to [Input/output error] errors
https://bugzilla.redhat.com/show_bug.cgi?id=1728770
[Bug 1728770] Failures in remove-brick due to [Input/output error] errors