[Bugs] [Bug 1805054] New: Disperse volume : data corruption with ftruncate data in 4+2 config

Thu Feb 20 07:41:17 UTC 2020

https://bugzilla.redhat.com/show_bug.cgi?id=1805054

            Bug ID: 1805054
           Summary: Disperse volume : data corruption with ftruncate data
                    in 4+2 config
           Product: GlusterFS
           Version: 5
            Status: NEW
         Component: disperse
          Keywords: Reopened
          Assignee: bugs at gluster.org
          Reporter: pkarampu at redhat.com
                CC: aspandey at redhat.com, bugs at gluster.org,
                    jahernan at redhat.com, kinglongmee at gmail.com,
                    pkarampu at redhat.com
        Depends On: 1727081
            Blocks: 1805051, 1730914, 1732772, 1732774, 1732778, 1732792,
                    1739424, 1739449
  Target Milestone: ---
    Classification: Community

+++ This bug was initially created as a clone of Bug #1727081 +++

Description of problem:

The LTP ftestxx tests report data corruption on a 4+2 disperse volume.

<<<test_output>>>
ftest05     1  TFAIL  :  ftest05.c:395:         Test[0] bad verify @ 0x3800 for val 2 count 487 xfr 2048 file_max 0xfa000.
ftest05     0  TINFO  :         Test[0]: last_trunc = 0x4d800
ftest05     0  TINFO  :         Stat: size=fa000, ino=120399ba
ftest05     0  TINFO  :         Buf:
ftest05     0  TINFO  :         64*0,
ftest05     0  TINFO  :         2,
ftest05     0  TINFO  :         2,
ftest05     0  TINFO  :         2,
ftest05     0  TINFO  :         2,
ftest05     0  TINFO  :         2,
ftest05     0  TINFO  :         2,
ftest05     0  TINFO  :         2,
ftest05     0  TINFO  :         2,
ftest05     0  TINFO  :         2,
ftest05     0  TINFO  :         2,
ftest05     0  TINFO  :          ... more
ftest05     0  TINFO  :         Bits array:
ftest05     0  TINFO  :         0:
ftest05     0  TINFO  :         0:
ftest05     0  TINFO  :         ddx
ftest05     0  TINFO  :         8:
ftest05     0  TINFO  :         ecx
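The numbers in the failure line can be checked with a little arithmetic. The sketch below assumes the usual reading of the LTP fields: `@` is the byte offset of the failed verify, `xfr` is the transfer size, `file_max` is the maximum file size, and `64*0` in the buffer dump means 64 bytes of value 0. Those meanings are an assumption and are not stated in this report.

```python
# Values copied from the ftest05 output above.
BAD_OFFSET = 0x3800
LAST_TRUNC = 0x4D800
FILE_MAX = 0xFA000
XFR = 2048
ZERO_RUN = 64  # "64*0" in the buffer dump (assumed: 64 bytes of value 0)


def describe_failure(bad_offset, last_trunc, file_max, xfr, zero_run):
    """Express the reported offsets in units of the transfer size."""
    return {
        "bad_block": bad_offset // xfr,
        "bad_block_aligned": bad_offset % xfr == 0,
        "trunc_block": last_trunc // xfr,
        "total_blocks": file_max // xfr,
        "below_last_trunc": bad_offset < last_trunc,
        "zeroed_range": (bad_offset, bad_offset + zero_run),
    }


if __name__ == "__main__":
    info = describe_failure(BAD_OFFSET, LAST_TRUNC, FILE_MAX, XFR, ZERO_RUN)
    for key, value in info.items():
        print(f"{key}: {value}")
```

Under those assumptions, the failed verify starts at transfer block 7 of 500, which lies below the last truncate offset (block 155), and the dump shows the first 64 bytes of that block as zero, followed by the expected value 2.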

--- Additional comment from Worker Ant on 2019-07-05 00:06:33 UTC ---

REVIEW: https://review.gluster.org/22999 (cluster/ec: inherit right mask from
top parent) posted (#1) for review on master by Kinglong Mee

--- Additional comment from Worker Ant on 2019-07-08 13:27:01 UTC ---

REVIEW: https://review.gluster.org/23010 (cluster/ec: inherit healing from lock
which has info) posted (#1) for review on master by Kinglong Mee

--- Additional comment from Pranith Kumar K on 2019-07-10 10:30:39 UTC ---

(In reply to Kinglong Mee from comment #0)

When I try to run this test, it chooses /tmp as the directory where the
file is created. How can I change it to the mount directory?
root at localhost - /mnt/ec2 
15:11:08 :( ⚡ /opt/ltp/testcases/bin/ftest05 
ftest05     1  TPASS  :  Test passed.

--- Additional comment from Kinglong Mee on 2019-07-10 12:47:01 UTC ---

(In reply to Pranith Kumar K from comment #3)

You can run it as:
./runltp -p -l /tmp/resut.log -o /tmp/output.log -C /tmp/failed.log -d /mnt/nfs/ -f casefilename-under-runtest

While the test runs on the NFS client, a bash script reboots one node (the
cluster node that ganesha.nfsd is not running on) every 600s.
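The reboot script itself was not posted. As a rough illustration of that fault-injection loop, here is a minimal Python sketch (the original was a bash script). The host name, the use of passwordless root ssh, and the number of rounds are all assumptions made up for this example.

```python
import subprocess
import time


def ssh_reboot(host):
    """Reboot a node over ssh (hypothetical: assumes passwordless root ssh)."""
    subprocess.run(["ssh", f"root@{host}", "reboot"], check=False)


def reboot_loop(host, rounds, interval=600, reboot=ssh_reboot, sleep=time.sleep):
    """Wait `interval` seconds, reboot `host`, and repeat `rounds` times.

    `reboot` and `sleep` are injectable so the loop can be dry-run
    without touching a real node. Returns the number of reboots issued.
    """
    issued = 0
    for _ in range(rounds):
        sleep(interval)
        reboot(host)
        issued += 1
    return issued


# Example (not executed here; "node2" is a made-up host name):
#     reboot_loop("node2", rounds=10)
```

Running this next to runltp keeps taking one brick node down while ftruncate and write traffic is in flight, so the volume is repeatedly degraded and then healed during the test.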

--- Additional comment from Kinglong Mee on 2019-07-11 10:32:22 UTC ---

valgrind reports a possible memory leak:

==7925== 300 bytes in 6 blocks are possibly lost in loss record 880 of 1,436
==7925==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==7925==    by 0x71828BF: __gf_default_malloc (mem-pool.h:112)
==7925==    by 0x7183182: __gf_malloc (mem-pool.c:131)
==7925==    by 0x713FB65: gf_strndup (mem-pool.h:189)
==7925==    by 0x713FBD5: gf_strdup (mem-pool.h:206)
==7925==    by 0x7144465: loc_copy (xlator.c:1276)
==7925==    by 0x18EDBF1C: ec_loc_from_loc (ec-helpers.c:760)
==7925==    by 0x18F02FE5: ec_manager_open (ec-inode-read.c:778)
==7925==    by 0x18EE4905: __ec_manager (ec-common.c:3094)
==7925==    by 0x18EE4A0F: ec_manager (ec-common.c:3112)
==7925==    by 0x18F037F3: ec_open (ec-inode-read.c:929)
==7925==    by 0x18ED5E85: ec_gf_open (ec.c:1146)

--- Additional comment from Worker Ant on 2019-07-11 11:05:58 UTC ---

REVIEW: https://review.gluster.org/23029 (cluster/ec: do loc_copy from ctx->loc
in fd->lock) posted (#1) for review on master by Kinglong Mee

--- Additional comment from Kinglong Mee on 2019-07-12 00:47:27 UTC ---

ganesha.nfsd crashes when healing a name:

Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f
/etc/ganesha/ganesha.conf -N N'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f0d5ae8c5a9 in ec_heal_name (frame=0x7f0d57c6ca28, 
    ec=0x7f0d5b62d280, parent=0x0, name=0x7f0d57537d31 "b", 
    participants=0x7f0d0dfffe30 "\001\001\001") at ec-heal.c:1685
1685        loc.inode = inode_new(parent->table);
(gdb) bt
#0  0x00007f0d5ae8c5a9 in ec_heal_name (frame=0x7f0d57c6ca28, 
    ec=0x7f0d5b62d280, parent=0x0, name=0x7f0d57537d31 "b", 
    participants=0x7f0d0dfffe30 "\001\001\001") at ec-heal.c:1685
#1  0x00007f0d5ae93cae in ec_heal_do (this=0x7f0d5b65ac00, 
    data=0x7f0d24e3c028, loc=0x7f0d24e3c358, partial=0) at ec-heal.c:3050
#2  0x00007f0d5ae94455 in ec_synctask_heal_wrap (opaque=0x7f0d24e3c028)
    at ec-heal.c:3139
#3  0x00007f0d6d1268c9 in synctask_wrap () at syncop.c:369
#4  0x00007f0d6c6bf010 in ?? () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) frame 1
#1  0x00007f0d5ae93cae in ec_heal_do (this=0x7f0d5b65ac00, 
    data=0x7f0d24e3c028, loc=0x7f0d24e3c358, partial=0) at ec-heal.c:3050
3050            ret = ec_heal_name(frame, ec, loc->parent, (char *)loc->name,
(gdb) p loc
$1 = (loc_t *) 0x7f0d24e3c358
(gdb) p *loc
$2 = {
  path = 0x7f0d57537d00 "/nfsshare/ltp-eZQlnozjnX/ftegVRmbT/ftest05.20436/b", 
  name = 0x7f0d57537d31 "b", inode = 0x7f0d24255b28, parent = 0x0, 
  gfid = "\263\341\223\031\301\245I\260\234\334\017\to%\305^", 
  pargfid = '\000' <repeats 15 times>}
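gdb prints the gfid as a C string with octal escapes, which is hard to match against brick contents. The sketch below converts the bytes shown above into canonical UUID text and builds the corresponding handle path, assuming the standard `.glusterfs/xx/yy/<gfid>` brick layout. The brick root is a made-up placeholder.

```python
import uuid

# The gfid exactly as gdb printed it (octal escapes for non-printable bytes).
GDB_GFID = b"\263\341\223\031\301\245I\260\234\334\017\to%\305^"


def gfid_to_uuid(raw: bytes) -> str:
    """Convert a raw 16-byte gfid into canonical UUID text."""
    if len(raw) != 16:
        raise ValueError(f"gfid must be 16 bytes, got {len(raw)}")
    return str(uuid.UUID(bytes=raw))


def backend_path(brick_root: str, gfid: str) -> str:
    """Build the .glusterfs handle path for a gfid under a brick root."""
    return f"{brick_root}/.glusterfs/{gfid[0:2]}/{gfid[2:4]}/{gfid}"


if __name__ == "__main__":
    gfid = gfid_to_uuid(GDB_GFID)
    print(gfid)
    # "/bricks/brick1" is a hypothetical brick root, for illustration only.
    print(backend_path("/bricks/brick1", gfid))
```

The gfid decodes to b3e19319-c1a5-49b0-9cdc-0f096f25c55e. The all-zero pargfid is consistent with parent = 0x0 in the same loc, which is the pointer dereferenced at ec-heal.c:1685.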

--- Additional comment from Xavi Hernandez on 2019-07-13 14:09:15 UTC ---

Please, don't use the same bug for different issues.

--- Additional comment from Worker Ant on 2019-07-14 13:03:20 UTC ---

REVISION POSTED: https://review.gluster.org/23029 (cluster/ec: do loc_copy from
ctx->loc in fd->lock) posted (#2) for review on master by Kinglong Mee

--- Additional comment from Worker Ant on 2019-07-16 17:54:25 UTC ---

REVIEW: https://review.gluster.org/23010 (cluster/ec: inherit healing from lock
when it has info) merged (#4) on master by Amar Tumballi

--- Additional comment from Ashish Pandey on 2019-07-17 05:27:44 UTC ---

There are two patches associated with this BZ - 

https://review.gluster.org/#/c/glusterfs/+/22999/ - Not merged, under review
https://review.gluster.org/#/c/glusterfs/+/23010/ - Merged 

I would like to keep this bug open till both the patches get merged.

--
Ashish

--- Additional comment from Worker Ant on 2019-07-18 07:28:12 UTC ---

REVIEW: https://review.gluster.org/23069 ((WIP)cluster/ec: Always read from
good-mask) posted (#1) for review on master by Pranith Kumar Karampuri

--- Additional comment from Worker Ant on 2019-07-23 06:20:08 UTC ---

REVIEW: https://review.gluster.org/23073 (cluster/ec: fix data corruption)
posted (#4) for review on master by Pranith Kumar Karampuri

--- Additional comment from Worker Ant on 2019-07-26 07:11:59 UTC ---

REVIEW: https://review.gluster.org/23069 (cluster/ec: Always read from
good-mask) merged (#6) on master by Pranith Kumar Karampuri

--- Additional comment from Pranith Kumar K on 2019-08-02 07:35:34 UTC ---

Found one case which needs to be fixed.

--- Additional comment from Worker Ant on 2019-08-02 07:38:12 UTC ---

REVIEW: https://review.gluster.org/23147 (cluster/ec: Update lock->good_mask on
parent fop failure) posted (#1) for review on master by Pranith Kumar Karampuri

--- Additional comment from Worker Ant on 2019-08-07 06:15:15 UTC ---

REVIEW: https://review.gluster.org/23147 (cluster/ec: Update lock->good_mask on
parent fop failure) merged (#2) on master by Pranith Kumar Karampuri

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1727081
[Bug 1727081] Disperse volume : data corruption with ftruncate data in 4+2
config
https://bugzilla.redhat.com/show_bug.cgi?id=1730914
[Bug 1730914] [GSS] Sometimes truncate and discard could cause data corruption
when executed while self-heal is running
https://bugzilla.redhat.com/show_bug.cgi?id=1732772
[Bug 1732772] Disperse volume : data corruption with ftruncate data in 4+2
config
https://bugzilla.redhat.com/show_bug.cgi?id=1732774
[Bug 1732774] Disperse volume : data corruption with ftruncate data in 4+2
config
https://bugzilla.redhat.com/show_bug.cgi?id=1732778
[Bug 1732778] [GSS] Sometimes truncate and discard could cause data corruption
when executed while self-heal is running
https://bugzilla.redhat.com/show_bug.cgi?id=1732792
[Bug 1732792] Disperse volume : data corruption with ftruncate data in 4+2
config
https://bugzilla.redhat.com/show_bug.cgi?id=1739424
[Bug 1739424] Disperse volume : data corruption with ftruncate data in 4+2
config
https://bugzilla.redhat.com/show_bug.cgi?id=1739449
[Bug 1739449] Disperse volume : data corruption with ftruncate data in 4+2
config
https://bugzilla.redhat.com/show_bug.cgi?id=1805051
[Bug 1805051] Disperse volume : data corruption with ftruncate data in 4+2
config