[Bugs] [Bug 1236065] New: Disperse volume: FUSE I/O error after self healing the failed disk files
bugzilla at redhat.com
Fri Jun 26 13:02:21 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1236065
Bug ID: 1236065
Summary: Disperse volume: FUSE I/O error after self healing the
failed disk files
Product: GlusterFS
Version: mainline
Component: disperse
Severity: high
Assignee: bugs at gluster.org
Reporter: xhernandez at datalab.es
CC: bugs at gluster.org, gluster-bugs at redhat.com,
mdfakkeer at gmail.com
Depends On: 1235964
+++ This bug was initially created as a clone of Bug #1235964 +++
Description of problem:
In a 3 x (4 + 2) = 18 distributed disperse volume, there are
input/output errors on some files on the FUSE mount after simulating the
following scenario:
1. Simulate a disk failure by killing the brick pid and adding the same
disk again after formatting the drive
2. Try to read the recovered or healed file after 2 bricks/nodes were
brought down
Version-Release number of selected component (if applicable):
glusterfs 3.7.2 built on Jun 19 2015 16:33:27
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU
General Public License
How reproducible:
Steps to Reproduce:
1. create a 3x(4+2) disperse volume across nodes
2. FUSE mount on the client and start creating files/directories with mkdir and
rsync/dd
3. simulate a disk failure by killing the brick pid of one disk on one node,
then add the same disk again after formatting the drive
4. start the volume with force
5. self-heal creates the file names with 0 bytes on the newly formatted drive
6. wait for self-healing to finish; it does not happen and the files stay at
0 bytes
7. try to read the same file from the client; the 0-byte file now gets
recovered and recovery completes. Get the md5sum of the file with all
bricks/nodes up and the result is correct
8. now bring down 2 of the nodes
9. now try to get the md5sum of the same recovered file; the client throws an
I/O error (a rough command sketch of steps 3-9 follows the list)
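A rough sketch of steps 3-9 as shell commands; the volume name and brick paths
are taken from the volume info below, <brick-pid>, <brick-device> and the
/mnt/gluster mount point are placeholders, and this is a reconstruction rather
than the exact commands used:
# on the node that owns the failed disk: find the brick PID and kill it
gluster volume status vaulttest21      # note the PID of the brick to fail
kill -9 <brick-pid>
# wipe and re-create the brick filesystem, then mount it again
umount /media/disk2
mkfs.xfs -f /dev/<brick-device>
mount /dev/<brick-device> /media/disk2
# bring the replaced brick back
gluster volume start vaulttest21 force
# from the FUSE client: reading the file triggers the heal and md5sum succeeds
# while all nodes are up
md5sum /mnt/gluster/up1
# take two of the six nodes offline (e.g. node5 and node6) and read again
md5sum /mnt/gluster/up1                # now fails with Input/output error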
Actual results:
I/O error on the recovered file
Expected results:
There should not be any I/O error
Additional info:
admin at node001:~$ sudo gluster volume info
Volume Name: vaulttest21
Type: Distributed-Disperse
Volume ID: ac6a374d-a0a2-405c-823d-0672fd92f0af
Status: Started
Number of Bricks: 3 x (4 + 2) = 18
Transport-type: tcp
Bricks:
Brick1: 10.1.2.1:/media/disk1
Brick2: 10.1.2.2:/media/disk1
Brick3: 10.1.2.3:/media/disk1
Brick4: 10.1.2.4:/media/disk1
Brick5: 10.1.2.5:/media/disk1
Brick6: 10.1.2.6:/media/disk1
Brick7: 10.1.2.1:/media/disk2
Brick8: 10.1.2.2:/media/disk2
Brick9: 10.1.2.3:/media/disk2
Brick10: 10.1.2.4:/media/disk2
Brick11: 10.1.2.5:/media/disk2
Brick12: 10.1.2.6:/media/disk2
Brick13: 10.1.2.1:/media/disk3
Brick14: 10.1.2.2:/media/disk3
Brick15: 10.1.2.3:/media/disk3
Brick16: 10.1.2.4:/media/disk3
Brick17: 10.1.2.5:/media/disk3
Brick18: 10.1.2.6:/media/disk3
Options Reconfigured:
performance.readdir-ahead: on
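For reference, a layout like the one above would typically come from a create
command along these lines (a reconstruction from the brick list above, not the
command actually used to create this volume):
gluster volume create vaulttest21 disperse 6 redundancy 2 \
    10.1.2.1:/media/disk1 10.1.2.2:/media/disk1 10.1.2.3:/media/disk1 \
    10.1.2.4:/media/disk1 10.1.2.5:/media/disk1 10.1.2.6:/media/disk1 \
    10.1.2.1:/media/disk2 10.1.2.2:/media/disk2 10.1.2.3:/media/disk2 \
    10.1.2.4:/media/disk2 10.1.2.5:/media/disk2 10.1.2.6:/media/disk2 \
    10.1.2.1:/media/disk3 10.1.2.2:/media/disk3 10.1.2.3:/media/disk3 \
    10.1.2.4:/media/disk3 10.1.2.5:/media/disk3 10.1.2.6:/media/disk3
gluster volume start vaulttest21
With disperse 6 redundancy 2 and 18 bricks this gives the 3 x (4 + 2)
distributed-disperse layout shown above, where any 2 bricks per subvolume can
be lost without data loss.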
*_After simulating the disk failure (node3 - disk2) and adding the disk again
after formatting the drive_*
admin at node003:~$ date
Thu Jun 25 *16:21:58* IST 2015
admin at node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root 22 Jun 25 16:18 1
*-rw-r--r-- 2 root root 0 Jun 25 16:17 up1*
*-rw-r--r-- 2 root root 0 Jun 25 16:17 up2*
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
--
admin at node003:~$ date
Thu Jun 25 *16:25:09* IST 2015
admin at node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root 22 Jun 25 16:18 1
*-rw-r--r-- 2 root root 0 Jun 25 16:17 up1*
*-rw-r--r-- 2 root root 0 Jun 25 16:17 up2*
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
admin at node003:~$ date
Thu Jun 25 *16:41:25* IST 2015
admin at node003:~$ ls -l -h /media/disk2
total 1.6G
drwxr-xr-x 3 root root 22 Jun 25 16:18 1
-rw-r--r-- 2 root root 0 Jun 25 16:17 up1
-rw-r--r-- 2 root root 0 Jun 25 16:17 up2
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
*after waiting nearly 20 minutes, self-healing still has not recovered the full
data chunk. Then try to read the file using md5sum*
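(As an aside, not part of the original report: the pending heal could
presumably also have been checked or re-triggered from any server node with
the standard heal commands, assuming they apply to disperse volumes in this
release:
gluster volume heal vaulttest21 info       # list entries still pending heal
gluster volume heal vaulttest21 full       # explicitly trigger a full self-heal
In the run below, the md5sum read from the client is what ends up triggering
the heal instead.)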
root at mas03:/mnt/gluster# time md5sum up1
4650543ade404ed5a1171726e76f8b7c up1
real 1m58.010s
user 0m6.243s
sys 0m0.778s
*the corrupted chunk starts growing*
admin at node003:~$ ls -l -h /media/disk2
total 2.6G
drwxr-xr-x 3 root root 22 Jun 25 16:18 1
-rw-r--r-- 2 root root 797M Jun 25 15:57 up1
-rw-r--r-- 2 root root 0 Jun 25 16:17 up2
-rw-r--r-- 2 root root 797M Jun 25 16:03 up3
-rw-r--r-- 2 root root 797M Jun 25 16:04 up4
*_To verify the healed file after two nodes (5 & 6) were taken offline_*
root at mas03:/mnt/gluster# time md5sum up1
md5sum: up1:*Input/output error*
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1235964
[Bug 1235964] Disperse volume: FUSE I/O error after self healing the failed
disk files
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.