[Bugs] [Bug 1805052] New: Disperse volume : Ganesha crash with IO in 4+2 config when one glusterfsd restart every 600s

Thu Feb 20 07:38:33 UTC 2020

https://bugzilla.redhat.com/show_bug.cgi?id=1805052

            Bug ID: 1805052
           Summary: Disperse volume : Ganesha crash with IO in 4+2 config
                    when one glusterfsd restart every 600s
           Product: GlusterFS
           Version: 5
            Status: NEW
         Component: disperse
          Assignee: bugs at gluster.org
          Reporter: pkarampu at redhat.com
                CC: aspandey at redhat.com, bugs at gluster.org,
                    kinglongmee at gmail.com
        Depends On: 1729772
  Target Milestone: ---
    Classification: Community

+++ This bug was initially created as a clone of Bug #1729772 +++

Description of problem:

LTP ftestxx tests at a 4+2 disperse volume.
When running the test at nfs client, a bash scripts running which reboot one
node(the cluster node Ganesha.nfsd is not running on) every 600s.

ganesha.nfsd crash when healing name,

Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f
/etc/ganesha/ganesha.conf -N N'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f0d5ae8c5a9 in ec_heal_name (frame=0x7f0d57c6ca28, 
    ec=0x7f0d5b62d280, parent=0x0, name=0x7f0d57537d31 "b", 
    participants=0x7f0d0dfffe30 "\001\001\001") at ec-heal.c:1685
1685        loc.inode = inode_new(parent->table);
Missing separate debuginfos, use: debuginfo-install
bzip2-libs-1.0.6-13.el7.x86_64 dbus-libs-1.10.24-12.el7.x86_64
elfutils-libelf-0.172-2.el7.x86_64 elfutils-libs-0.172-2.el7.x86_64
glibc-2.17-260.el7.x86_64 gssproxy-0.7.0-21.el7.x86_64
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-34.el7.x86_64
libacl-2.2.51-14.el7.x86_64 libattr-2.4.46-13.el7.x86_64
libblkid-2.23.2-59.el7.x86_64 libcap-2.22-9.el7.x86_64
libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7.x86_64
libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64
libnfsidmap-0.25-19.el7.x86_64 libselinux-2.5-14.1.el7.x86_64
libuuid-2.23.2-59.el7.x86_64 lz4-1.7.5-2.el7.x86_64
openssl-libs-1.0.2k-16.el7.x86_64 pcre-8.32-17.el7.x86_64
systemd-libs-219-62.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64
zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007f0d5ae8c5a9 in ec_heal_name (frame=0x7f0d57c6ca28, 
    ec=0x7f0d5b62d280, parent=0x0, name=0x7f0d57537d31 "b", 
    participants=0x7f0d0dfffe30 "\001\001\001") at ec-heal.c:1685
#1  0x00007f0d5ae93cae in ec_heal_do (this=0x7f0d5b65ac00, 
    data=0x7f0d24e3c028, loc=0x7f0d24e3c358, partial=0) at ec-heal.c:3050
#2  0x00007f0d5ae94455 in ec_synctask_heal_wrap (opaque=0x7f0d24e3c028)
    at ec-heal.c:3139
#3  0x00007f0d6d1268c9 in synctask_wrap () at syncop.c:369
#4  0x00007f0d6c6bf010 in ?? () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) frame 1
#1  0x00007f0d5ae93cae in ec_heal_do (this=0x7f0d5b65ac00, 
    data=0x7f0d24e3c028, loc=0x7f0d24e3c358, partial=0) at ec-heal.c:3050
3050            ret = ec_heal_name(frame, ec, loc->parent, (char *)loc->name,
(gdb) p loc
$1 = (loc_t *) 0x7f0d24e3c358
(gdb) p *loc
$2 = {
  path = 0x7f0d57537d00 "/nfsshare/ltp-eZQlnozjnX/ftegVRmbT/ftest05.20436/b", 
  name = 0x7f0d57537d31 "b", inode = 0x7f0d24255b28, parent = 0x0, 
  gfid = "\263\341\223\031\301\245I\260\234\334\017\to%\305^", 
  pargfid = '\000' <repeats 15 times>}

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Worker Ant on 2019-07-14 13:03:24 UTC ---

REVIEW: https://review.gluster.org/23029 (cluster/ec: do loc_copy from ctx->loc
in fd->lock) posted (#2) for review on master by Kinglong Mee

--- Additional comment from Worker Ant on 2019-07-17 16:24:24 UTC ---

REVIEW: https://review.gluster.org/23029 (cluster/ec: skip updating ctx->loc
again when ec_fix_open/opendir) merged (#4) on master by Kinglong Mee

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1729772
[Bug 1729772] Disperse volume : Ganesha crash with IO in 4+2 config when one
glusterfsd restart every 600s
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.