[Bugs] [Bug 1235964] New: Disperse volume: FUSE I/O error after self healing the disk failure files

bugzilla at redhat.com bugzilla at redhat.com
Fri Jun 26 08:32:14 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1235964

            Bug ID: 1235964
           Summary: Disperse volume: FUSE I/O error after self healing the
                    disk failure files
           Product: GlusterFS
           Version: 3.7.2
         Component: disperse
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: mdfakkeer at gmail.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com



Description of problem:
In a 3 x (4 + 2) = 18 distributed disperse volume, some files return
input/output errors on the FUSE mount after simulating the following
scenario:

1.   Simulate a disk failure by killing the brick pid, then add the
same disk back after formatting the drive
2.   Try to read the recovered (healed) file after 2 bricks/nodes are
brought down

Version-Release number of selected component (if applicable):

glusterfs 3.7.2 built on Jun 19 2015 16:33:27
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU
General Public License

How reproducible:

Steps to Reproduce:

1. Create a 3 x (4 + 2) disperse volume across the nodes
2. FUSE mount on a client and start creating files/directories with mkdir and
rsync/dd
3. Simulate a disk failure by killing the brick pid of any disk on one node,
then add the same disk back after formatting the drive
4. Start the volume with force
5. Self-heal recreates the affected file names with 0 bytes on the newly
formatted drive
6. Wait for self-heal to finish; it never progresses and the files stay at
0 bytes
7. Read the same file from the client; this triggers recovery of the 0-byte
file and the recovery completes. Getting the md5sum of the file with all
bricks up succeeds
8. Now bring down 2 of the nodes
9. Try to get the md5sum of the same recovered file; the client throws an I/O
error
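
For concreteness, a minimal sketch of steps 3-4, assuming the volume name
vaulttest21 and brick path /media/disk2 from the report (the device name
/dev/sdX is a placeholder, not taken from the report):

# find the pid of the brick process serving /media/disk2 on this node
gluster volume status vaulttest21
kill -9 <brick-pid>             # simulate the disk failure

# recreate an empty brick, as a freshly formatted drive would be
umount /media/disk2
mkfs.xfs -f /dev/sdX            # placeholder device
mount /dev/sdX /media/disk2

# bring the brick back and let self-heal run; depending on the version,
# the trusted.glusterfs.volume-id xattr may need to be restored on the
# new brick directory before the brick process will start
gluster volume start vaulttest21 force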


Actual results:
I/O error on the recovered file

Expected results:
There should not be any I/O error

Additional info:

admin at node001:~$ sudo gluster volume info

Volume Name: vaulttest21
Type: Distributed-Disperse
Volume ID: ac6a374d-a0a2-405c-823d-0672fd92f0af
Status: Started
Number of Bricks: 3 x (4 + 2) = 18
Transport-type: tcp
Bricks:
Brick1: 10.1.2.1:/media/disk1
Brick2: 10.1.2.2:/media/disk1
Brick3: 10.1.2.3:/media/disk1
Brick4: 10.1.2.4:/media/disk1
Brick5: 10.1.2.5:/media/disk1
Brick6: 10.1.2.6:/media/disk1
Brick7: 10.1.2.1:/media/disk2
Brick8: 10.1.2.2:/media/disk2
Brick9: 10.1.2.3:/media/disk2
Brick10: 10.1.2.4:/media/disk2
Brick11: 10.1.2.5:/media/disk2
Brick12: 10.1.2.6:/media/disk2
Brick13: 10.1.2.1:/media/disk3
Brick14: 10.1.2.2:/media/disk3
Brick15: 10.1.2.3:/media/disk3
Brick16: 10.1.2.4:/media/disk3
Brick17: 10.1.2.5:/media/disk3
Brick18: 10.1.2.6:/media/disk3
Options Reconfigured:
performance.readdir-ahead: on
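
For reference, a layout like the one above could be created with a command of
this form (bricks taken from the list above; a sketch, not necessarily the
reporter's exact command):

gluster volume create vaulttest21 disperse 6 redundancy 2 \
    10.1.2.{1..6}:/media/disk1 \
    10.1.2.{1..6}:/media/disk2 \
    10.1.2.{1..6}:/media/disk3
gluster volume start vaulttest21

With redundancy 2, each 6-brick subvolume tolerates the loss of any 2 bricks,
which is why the final md5sum is expected to succeed even with two nodes down.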

*After simulating the disk failure (node3, disk2) and adding the disk back
after formatting the drive*

admin at node003:~$ date
Thu Jun 25 *16:21:58* IST 2015

admin at node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root   22 Jun 25 16:18 1
*-rw-r--r-- 2 root root    0 Jun 25 16:17 up1*
*-rw-r--r-- 2 root root    0 Jun 25 16:17 up2*
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4

admin at node003:~$ date
Thu Jun 25 *16:25:09* IST 2015

admin at node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root   22 Jun 25 16:18 1
*-rw-r--r-- 2 root root    0 Jun 25 16:17 up1*
*-rw-r--r-- 2 root root    0 Jun 25 16:17 up2*
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4

admin at node003:~$ date
Thu Jun 25 *16:41:25* IST 2015

admin at node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root   22 Jun 25 16:18 1
-rw-r--r-- 2 root root    0 Jun 25 16:17 up1
-rw-r--r-- 2 root root    0 Jun 25 16:17 up2
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4


*After waiting nearly 20 minutes, self-heal has still not recovered the full
data chunk. Then try to read the file using md5sum:*
root at mas03:/mnt/gluster# time md5sum up1
4650543ade404ed5a1171726e76f8b7c  up1

real    1m58.010s
user    0m6.243s
sys     0m0.778s

*The corrupted chunk starts growing after the client read*

admin at node003:~$ ls -l -h  /media/disk2
total 2.6G
drwxr-xr-x 3 root root   22 Jun 25 16:18 1
-rw-r--r-- 2 root root 797M Jun 25 15:57 up1
-rw-r--r-- 2 root root    0 Jun 25 16:17 up2
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
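
At this point, one way to confirm whether the heal of up1 really completed
before taking nodes down would be (output omitted; heal info is supported for
disperse volumes in 3.7):

admin at node001:~$ sudo gluster volume heal vaulttest21 info
# lists, per brick, the entries still pending heal; up1 should no longer
# appear once the rebuild triggered by the client read has finished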

*To verify the healed file after two nodes (5 & 6) are taken offline*

root at mas03:/mnt/gluster# time md5sum up1
md5sum: up1:*Input/output error*
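
To narrow down why the rebuilt fragment is unusable, it may help to compare
the disperse metadata of the healed copy against a good brick; the ec xlator
keeps its state in trusted.ec.* extended attributes (brick paths reused from
above; that the mismatch lies here is an assumption, not a confirmed finding):

admin at node003:~$ sudo getfattr -d -m trusted.ec -e hex /media/disk2/up1
admin at node004:~$ sudo getfattr -d -m trusted.ec -e hex /media/disk2/up1
# trusted.ec.version and trusted.ec.size on the healed brick (node003)
# should match the other bricks of the same subvolume; a mismatch would
# explain the I/O error once redundancy is exhausted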
