[Bugs] [Bug 1236065] New: Disperse volume: FUSE I/O error after self healing the failed disk files
bugzilla at redhat.com
Fri Jun 26 13:02:21 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1236065
Bug ID: 1236065
Summary: Disperse volume: FUSE I/O error after self healing the
failed disk files
Product: GlusterFS
Version: mainline
Component: disperse
Severity: high
Assignee: bugs at gluster.org
Reporter: xhernandez at datalab.es
CC: bugs at gluster.org, gluster-bugs at redhat.com,
mdfakkeer at gmail.com
Depends On: 1235964
+++ This bug was initially created as a clone of Bug #1235964 +++
Description of problem:
In a 3 x (4 + 2) = 18 distributed disperse volume, there are
input/output errors on some files on the FUSE mount after simulating the
following scenario:
1. Simulate a disk failure by killing the brick pid and adding the same
disk again after formatting the drive
2. Try to read the recovered or healed file after 2 bricks/nodes were
brought down
Version-Release number of selected component (if applicable):
glusterfs 3.7.2 built on Jun 19 2015 16:33:27
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU
General Public License
How reproducible:
Steps to Reproduce:
1. create a 3x(4+2) disperse volume across nodes
2. FUSE mount on the client and start creating files/directories with mkdir and
rsync/dd
3. simulate a disk failure by killing the brick pid of one disk on one node,
then add the same disk again after formatting the drive
4. start the volume with force
5. self-heal creates the file names with 0 bytes on the newly formatted drive
6. wait for self-healing to finish; it does not happen and the files stay at
0 bytes
7. try to read the same file from the client; the 0-byte file now gets
recovered and recovery completes. Get the md5sum of the file with all
bricks/nodes up and the result is correct
8. now bring down 2 of the nodes
9. now try to get the md5sum of the same recovered file; the client throws an
I/O error (a rough command sketch of steps 3-9 follows the list)
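A rough sketch of steps 3-9 as shell commands; the volume name and brick paths
are taken from the volume info below, <brick-pid>, <brick-device> and the
/mnt/gluster mount point are placeholders, and this is a reconstruction rather
than the exact commands used:
# on the node that owns the failed disk: find the brick PID and kill it
gluster volume status vaulttest21      # note the PID of the brick to fail
kill -9 <brick-pid>
# wipe and re-create the brick filesystem, then mount it again
umount /media/disk2
mkfs.xfs -f /dev/<brick-device>
mount /dev/<brick-device> /media/disk2
# bring the replaced brick back
gluster volume start vaulttest21 force
# from the FUSE client: reading the file triggers the heal and md5sum succeeds
# while all nodes are up
md5sum /mnt/gluster/up1
# take two of the six nodes offline (e.g. node5 and node6) and read again
md5sum /mnt/gluster/up1                # now fails with Input/output error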
Actual results:
I/O error on the recovered file
Expected results:
There should not be any I/O error
Additional info:
admin at node001:~$ sudo gluster volume info
Volume Name: vaulttest21
Type: Distributed-Disperse
Volume ID: ac6a374d-a0a2-405c-823d-0672fd92f0af
Status: Started
Number of Bricks: 3 x (4 + 2) = 18
Transport-type: tcp
Bricks:
Brick1: 10.1.2.1:/media/disk1
Brick2: 10.1.2.2:/media/disk1
Brick3: 10.1.2.3:/media/disk1
Brick4: 10.1.2.4:/media/disk1
Brick5: 10.1.2.5:/media/disk1
Brick6: 10.1.2.6:/media/disk1
Brick7: 10.1.2.1:/media/disk2
Brick8: 10.1.2.2:/media/disk2
Brick9: 10.1.2.3:/media/disk2
Brick10: 10.1.2.4:/media/disk2
Brick11: 10.1.2.5:/media/disk2
Brick12: 10.1.2.6:/media/disk2
Brick13: 10.1.2.1:/media/disk3
Brick14: 10.1.2.2:/media/disk3
Brick15: 10.1.2.3:/media/disk3
Brick16: 10.1.2.4:/media/disk3
Brick17: 10.1.2.5:/media/disk3
Brick18: 10.1.2.6:/media/disk3
Options Reconfigured:
performance.readdir-ahead: on
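For reference, a layout like the one above would typically come from a create
command along these lines (a reconstruction from the brick list above, not the
command actually used to create this volume):
gluster volume create vaulttest21 disperse 6 redundancy 2 \
    10.1.2.1:/media/disk1 10.1.2.2:/media/disk1 10.1.2.3:/media/disk1 \
    10.1.2.4:/media/disk1 10.1.2.5:/media/disk1 10.1.2.6:/media/disk1 \
    10.1.2.1:/media/disk2 10.1.2.2:/media/disk2 10.1.2.3:/media/disk2 \
    10.1.2.4:/media/disk2 10.1.2.5:/media/disk2 10.1.2.6:/media/disk2 \
    10.1.2.1:/media/disk3 10.1.2.2:/media/disk3 10.1.2.3:/media/disk3 \
    10.1.2.4:/media/disk3 10.1.2.5:/media/disk3 10.1.2.6:/media/disk3
gluster volume start vaulttest21
With disperse 6 redundancy 2 and 18 bricks this gives the 3 x (4 + 2)
distributed-disperse layout shown above, where any 2 bricks per subvolume can
be lost without data loss.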
*_After simulating the disk failure (node3 - disk2) and adding the disk again
after formatting the drive_*
admin at node003:~$ date
Thu Jun 25 *16:21:58* IST 2015
admin at node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root 22 Jun 25 16:18 1
*-rw-r--r-- 2 root root 0 Jun 25 16:17 up1*
*-rw-r--r-- 2 root root 0 Jun 25 16:17 up2*
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
--
admin at node003:~$ date
Thu Jun 25 *16:25:09* IST 2015
admin at node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root 22 Jun 25 16:18 1
*-rw-r--r-- 2 root root 0 Jun 25 16:17 up1*
*-rw-r--r-- 2 root root 0 Jun 25 16:17 up2*
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
admin at node003:~$ date
Thu Jun 25 *16:41:25* IST 2015
admin at node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root 22 Jun 25 16:18 1
-rw-r--r-- 2 root root 0 Jun 25 16:17 up1
-rw-r--r-- 2 root root 0 Jun 25 16:17 up2
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
*after waiting nearly 20 minutes, self-healing still has not recovered the full
data chunk. Then try to read the file using md5sum*
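(As an aside, not part of the original report: the pending heal could
presumably also have been checked or re-triggered from any server node with
the standard heal commands, assuming they apply to disperse volumes in this
release:
gluster volume heal vaulttest21 info       # list entries still pending heal
gluster volume heal vaulttest21 full       # explicitly trigger a full self-heal
In the run below, the md5sum read from the client is what ends up triggering
the heal instead.)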
root at mas03:/mnt/gluster# time md5sum up1
4650543ade404ed5a1171726e76f8b7c up1
real 1m58.010s
user 0m6.243s
sys 0m0.778s
*the corrupted chunk starts growing*
admin at node003:~$ ls -l -h /media/disk2
total 2.6G
drwxr-xr-x 3 root root 22 Jun 25 16:18 1
-rw-r--r-- 2 root root 797M Jun 25 15:57 up1
-rw-r--r-- 2 root root 0 Jun 25 16:17 up2
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
*_To verify the healed file after two nodes (5 & 6) were taken offline_*
root at mas03:/mnt/gluster# time md5sum up1
md5sum: up1:*Input/output error*
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1235964
[Bug 1235964] Disperse volume: FUSE I/O error after self healing the failed
disk files
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.