[Bugs] [Bug 1559352] New: [Ganesha] : Ganesha crashes while cluster enters failover/failback mode

bugzilla at redhat.com bugzilla at redhat.com
Thu Mar 22 11:27:36 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1559352

            Bug ID: 1559352
           Summary: [Ganesha] : Ganesha crashes while cluster enters
                    failover/failback mode
           Product: GlusterFS
           Version: 3.10
         Component: libgfapi
          Keywords: Triaged
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: jthottan at redhat.com
        QA Contact: bugs at gluster.org
                CC: amukherj at redhat.com, asoman at redhat.com,
                    bturner at redhat.com, bugs at gluster.org, dang at redhat.com,
                    ffilz at redhat.com, jthottan at redhat.com,
                    kkeithle at redhat.com, mbenjamin at redhat.com,
                    msaini at redhat.com, pasik at iki.fi, rhinduja at redhat.com,
                    rhs-bugs at redhat.com, skoduri at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1460514
            Blocks: 1457183, 1477994



+++ This bug was initially created as a clone of Bug #1460514 +++

+++ This bug was initially created as a clone of Bug #1457183 +++

Description of problem:
-----------------------


4-node cluster, 4 clients accessing the export via NFSv4.

Kill NFS-Ganesha on any node to simulate failover.

Ganesha crashes within a minute, just as failover is about to complete.

Restart Ganesha to simulate failback.

Ganesha dumps core again with the same backtrace.


(gdb) bt
#0  0x00007fb042dac8a0 in glusterfs_normalize_dentry () from /lib64/libglusterfs.so.0
#1  0x00007fb04307eac8 in glfs_resolve_at () from /lib64/libgfapi.so.0
#2  0x00007fb0430805c4 in glfs_h_lookupat () from /lib64/libgfapi.so.0
#3  0x00007fb04349d6df in lookup (parent=0x7faf74003fa8, path=0x55a9dec3a482 "..", handle=0x7fb019db0af8,
    attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/FSAL_GLUSTER/handle.c:113
#4  0x000055a9dec28b7f in mdc_get_parent (export=export@entry=0x55a9e066bbf0, entry=0x7faf74004260)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:287
#5  0x000055a9dec257f5 in mdcache_create_handle (exp_hdl=0x55a9e066bbf0, hdl_desc=<optimized out>,
    handle=0x7fb019db0be8, attrs_out=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1768
#6  0x000055a9deb91daa in nfs4_mds_putfh (data=data@entry=0x7fb019db1180)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_putfh.c:211
#7  0x000055a9deb922c0 in nfs4_op_putfh (op=0x7faf880017e0, data=0x7fb019db1180, resp=0x7faf74000ad0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_putfh.c:281
#8  0x000055a9deb81bbd in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7faf740009c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#9  0x000055a9deb72d6c in nfs_rpc_execute (reqdata=reqdata@entry=0x7faf880008c0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#10 0x000055a9deb743ca in worker_run (ctx=0x55a9e0758ec0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#11 0x000055a9debfd999 in fridgethr_start_routine (arg=0x55a9e0758ec0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#12 0x00007fb046401e25 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fb045acf34d in clone () from /lib64/libc.so.6



Version-Release number of selected component (if applicable):
--------------------------------------------------------------


[root at gqas013 tmp]# rpm -qa|grep ganesha
glusterfs-ganesha-3.8.4-26.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.4-6.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.4-6.el7rhgs.x86_64
nfs-ganesha-2.4.4-6.el7rhgs.x86_64
[root at gqas013 tmp]# 
[root at gqas013 tmp]# 
[root at gqas013 tmp]# rpm -qa|grep libnti
libntirpc-1.4.3-1.el7rhgs.x86_64
libntirpc-devel-1.4.3-1.el7rhgs.x86_64
[root at gqas013 tmp]# 

[root at gqas013 tmp]# rpm -qa|grep pacem
pacemaker-cluster-libs-1.1.16-10.el7.x86_64
pacemaker-cli-1.1.16-10.el7.x86_64
pacemaker-1.1.16-10.el7.x86_64
pacemaker-libs-1.1.16-10.el7.x86_64
[root at gqas013 tmp]# 

[root at gqas013 tmp]# rpm -qa|grep coros
corosynclib-2.4.0-9.el7.x86_64
corosync-2.4.0-9.el7.x86_64
[root at gqas013 tmp]# 

[root at gqas013 tmp]# rpm -qa|grep resource-ag
resource-agents-3.9.5-100.el7.x86_64

[root at gqas013 tmp]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 Beta (Maipo)
[root at gqas013 tmp]# 


How reproducible:
-----------------
3/3

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-05-31 ---

Clearer BT:


#0  glusterfs_normalize_dentry (parent=parent@entry=0x7fc99fb96778, component=component@entry=0x7fc99fb96780,
    dentry_name=dentry_name@entry=0x7fc99fb96840 "") at inode.c:2646
#1  0x00007fca36f12ac8 in priv_glfs_resolve_at (fs=fs@entry=0x55f5f8b01060, subvol=subvol@entry=0x7fca1c01fe00,
    at=at@entry=0x7fc734020090, origpath=origpath@entry=0x55f5f6f1b482 "..", loc=loc@entry=0x7fc99fb978d0,
    iatt=iatt@entry=0x7fc99fb97910, follow=follow@entry=0, reval=reval@entry=0) at glfs-resolve.c:412
#2  0x00007fca36f145c4 in pub_glfs_h_lookupat (fs=0x55f5f8b01060, parent=<optimized out>,
    path=path@entry=0x55f5f6f1b482 "..", stat=stat@entry=0x7fc99fb979f0, follow=follow@entry=0)
    at glfs-handleops.c:102
#3  0x00007fca36f146a8 in pub_glfs_h_lookupat34 (fs=<optimized out>, parent=<optimized out>,
    path=path@entry=0x55f5f6f1b482 "..", stat=stat@entry=0x7fc99fb979f0) at glfs-handleops.c:133
#4  0x00007fca373316df in lookup (parent=0x7fc734031fa8, path=0x55f5f6f1b482 "..", handle=0x7fc99fb97af8,
    attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/FSAL_GLUSTER/handle.c:113
#5  0x000055f5f6f09b7f in mdc_get_parent (export=export@entry=0x55f5f8b00bf0, entry=0x7fc734033ee0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:287
#6  0x000055f5f6f067f5 in mdcache_create_handle (exp_hdl=0x55f5f8b00bf0, hdl_desc=<optimized out>,
    handle=0x7fc99fb97be8, attrs_out=0x0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1768
#7  0x000055f5f6e72daa in nfs4_mds_putfh (data=data@entry=0x7fc99fb98180)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_putfh.c:211
#8  0x000055f5f6e732c0 in nfs4_op_putfh (op=0x7fc97c243bd0, data=0x7fc99fb98180, resp=0x7fc734031bd0)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_op_putfh.c:281
#9  0x000055f5f6e62bbd in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7fc734029050)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/Protocols/NFS/nfs4_Compound.c:734
#10 0x000055f5f6e53d6c in nfs_rpc_execute (reqdata=reqdata@entry=0x7fc97c243400)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1281
#11 0x000055f5f6e553ca in worker_run (ctx=0x55f5f8c2d870)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/MainNFSD/nfs_worker_thread.c:1548
#12 0x000055f5f6ede999 in fridgethr_start_routine (arg=0x55f5f8c2d870)
    at /usr/src/debug/nfs-ganesha-2.4.4/src/support/fridgethr.c:550
#13 0x00007fca3a295e25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fca3996334d in clone () from /lib64/libc.so.6
(gdb)


--- Additional comment from Manisha Saini on 2017-05-31 11:18:30 EDT ---

The crash is not specific to RHEL 7.4.

It is also observed on RHEL 7.3: after a node reboot, the node is unable to
come up.

bt:

(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/gssproxy/proxymech.so
Reading symbols from /lib64/libgssrpc.so.4...Reading symbols from
/lib64/libgssrpc.so.4...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libgssrpc.so.4
0x00007f5fe2aceef7 in pthread_join () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install nfs-ganesha-2.4.4-6.el7rhgs.x86_64
(gdb) c
Continuing.
[New Thread 0x7f5f002bb700 (LWP 14671)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f5f81bbe700 (LWP 11392)]
0x00007f5fdf2758a0 in glusterfs_normalize_dentry () from /lib64/libglusterfs.so.0
(gdb) bt
#0  0x00007f5fdf2758a0 in glusterfs_normalize_dentry () from /lib64/libglusterfs.so.0
#1  0x00007f5fdf547ac8 in glfs_resolve_at () from /lib64/libgfapi.so.0
#2  0x00007f5fdf5495c4 in glfs_h_lookupat () from /lib64/libgfapi.so.0
#3  0x00007f5fdf9666df in lookup () from /usr/lib64/ganesha/libfsalgluster.so
#4  0x00007f5fe4617b7f in mdc_get_parent ()
#5  0x00007f5fe46147f5 in mdcache_create_handle ()
#6  0x00007f5fe4580daa in nfs4_mds_putfh ()
#7  0x00007f5fe45812c0 in nfs4_op_putfh ()
#8  0x00007f5fe4570bbd in nfs4_Compound ()
#9  0x00007f5fe4561d6c in nfs_rpc_execute ()
#10 0x00007f5fe45633ca in worker_run ()
#11 0x00007f5fe45ec999 in fridgethr_start_routine ()
#12 0x00007f5fe2acddc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f5fe219c76d in clone () from /lib64/libc.so.6


Version:
nfs-ganesha-2.4.4-6.el7rhgs.x86_64


--- Additional comment from Jiffin on 2017-06-02 09:33:52 EDT ---

RCA:

In the above case, the following is what happened. A Linux kernel untar was
running against four different servers, in four different directories. Say
client1 was creating the file a/b/c/file when failover happened (server1 got
killed), so it now sends its requests to server2. In the NFS world, all
communication usually happens via file handles. server2 first tries to create
a handle for that file from the given gfid, but it has no context about a/b/c
yet. After creating the handle, it tries to look up the parent using "..".

During that parent resolution, glusterfs_normalize_dentry() tries to replace
".." with the parent's real name, and crashes, because the function assumes
the parent inode is already linked into the inode table. In this case the
parent inode was never linked, so the dereference faults, eventually killing
every ganesha server the cluster fails over to.

There are two possible solutions: [1] handle inode_parent() failures inside
glusterfs_normalize_dentry(), or [2] revert the changes made to
glfs_resolve_component() in https://review.gluster.org/#/c/17177 and call
glusterfs_normalize_dentry() after it.

Also, thanks to Soumya & Rafi for help in debugging the issue.
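
To make the broken assumption concrete, here is a minimal, self-contained C
model of the crash and of solution [1]. The types and helpers are
illustrative stand-ins, not the actual libglusterfs inode code:

#include <stdio.h>
#include <string.h>

/* Toy model of the situation above: an inode created from a bare gfid
 * (nameless lookup) was never linked into the inode table, so it has
 * no parent pointer. */
typedef struct toy_inode {
        const char       *name;   /* dentry name, if linked */
        struct toy_inode *parent; /* NULL when never linked */
} toy_inode_t;

static toy_inode_t *toy_inode_parent(toy_inode_t *in)
{
        return in ? in->parent : NULL;
}

/* Pre-fix behaviour: assumes the parent is always linked. */
static const char *normalize_unsafe(toy_inode_t *in, const char *comp)
{
        if (strcmp(comp, "..") == 0)
                return toy_inode_parent(in)->name; /* SIGSEGV if NULL */
        return comp;
}

/* Solution [1]: tolerate a missing parent and keep ".." as-is. */
static const char *normalize_safe(toy_inode_t *in, const char *comp)
{
        if (strcmp(comp, "..") == 0) {
                toy_inode_t *p = toy_inode_parent(in);
                return p ? p->name : comp;         /* fall back to ".." */
        }
        return comp;
}

int main(void)
{
        toy_inode_t unlinked = { .name = "c", .parent = NULL };
        printf("%s\n", normalize_safe(&unlinked, ".."));  /* prints ".." */
        /* normalize_unsafe(&unlinked, "..") would crash here,
         * exactly like the backtraces above. */
        return 0;
}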

--- Additional comment from Worker Ant on 2017-06-11 13:15:40 EDT ---

REVIEW: https://review.gluster.org/17502 (gfapi : Resolve "." and ".." only for
named lookups) posted (#1) for review on master by jiffin tony Thottan
(jthottan at redhat.com)

--- Additional comment from Worker Ant on 2017-06-16 01:43:19 EDT ---

REVIEW: https://review.gluster.org/17502 (gfapi : Resolve "." and ".." only for
named lookups) posted (#2) for review on master by jiffin tony Thottan
(jthottan at redhat.com)

--- Additional comment from Worker Ant on 2017-06-16 03:22:20 EDT ---

REVIEW: https://review.gluster.org/17502 (gfapi : Resolve "." and ".." only for
named lookups) posted (#3) for review on master by jiffin tony Thottan
(jthottan at redhat.com)

--- Additional comment from Worker Ant on 2017-06-20 05:25:57 EDT ---

REVIEW: https://review.gluster.org/17502 (gfapi : Resolve "." and ".." only for
named lookups) posted (#4) for review on master by jiffin tony Thottan
(jthottan at redhat.com)

--- Additional comment from Worker Ant on 2017-06-20 08:32:24 EDT ---

COMMIT: https://review.gluster.org/17502 committed in master by Jeff Darcy
(jeff at pl.atyp.us) 
------
commit a052b413242783f39cb3312a6a02bdd025b10f0c
Author: Jiffin Tony Thottan <jthottan at redhat.com>
Date:   Sun Jun 11 07:33:52 2017 +0530

    gfapi : Resolve "." and ".." only for named lookups

    The patch https://review.gluster.org/#/c/17177 resolves "." and ".."
    to the corresponding inodes and names before sending the request to
    the backend server. But this only works if the inode and its parent
    are linked properly. In case of a nameless lookup (applications like
    ganesha), the inode of the parent can be NULL (only the gfid is
    sent). So this patch resolves "." and ".." only if a proper parent
    is available.

    Change-Id: I4c50258b0d896dabf000a547ab180b57df308a0b
    BUG: 1460514
    Signed-off-by: Jiffin Tony Thottan <jthottan at redhat.com>
    Reviewed-on: https://review.gluster.org/17502
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    Reviewed-by: Poornima G <pgurusid at redhat.com>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: soumya k <skoduri at redhat.com>
    Reviewed-by: Jeff Darcy <jeff at pl.atyp.us>
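
The rule this commit states (normalize "." and ".." only when a proper parent
is available) can be illustrated with a small self-contained check. This is a
hedged sketch of the idea, not the verbatim glfs-resolve.c change; the type
below is an illustrative stand-in:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative inode stand-in: 'linked' says whether the inode was
 * linked into the inode table with a name (named lookup) or created
 * from a bare gfid (nameless lookup, as Ganesha does via PUTFH). */
typedef struct {
        bool linked;
} toy_inode_t;

/* Normalize "." / ".." only for named lookups with a linked parent. */
static bool should_normalize(const toy_inode_t *parent, const char *comp)
{
        if (strcmp(comp, ".") != 0 && strcmp(comp, "..") != 0)
                return false;             /* ordinary path component */
        return parent != NULL && parent->linked;
}

int main(void)
{
        toy_inode_t named    = { .linked = true  };
        toy_inode_t nameless = { .linked = false };

        printf("named '..':    %s\n",
               should_normalize(&named, "..") ? "normalize" : "skip");
        printf("nameless '..': %s\n",
               should_normalize(&nameless, "..") ? "normalize" : "skip");
        return 0;
}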

--- Additional comment from Worker Ant on 2017-07-21 02:48:35 EDT ---

REVIEW: https://review.gluster.org/17844 (tests/gfapi : add test case for the
commit a052b4) posted (#1) for review on master by jiffin tony Thottan
(jthottan at redhat.com)

--- Additional comment from Worker Ant on 2017-07-24 05:52:09 EDT ---

REVIEW: https://review.gluster.org/17844 (tests/gfapi : add test case for
nameless lookups in glfs_resolve_component()) posted (#2) for review on master
by jiffin tony Thottan (jthottan at redhat.com)

--- Additional comment from Worker Ant on 2017-08-02 08:09:58 EDT ---

COMMIT: https://review.gluster.org/17844 committed in master by soumya k
(skoduri at redhat.com) 
------
commit 5c433f8f5834a4cae62d0375bfdb273242630f01
Author: Jiffin Tony Thottan <jthottan at redhat.com>
Date:   Fri Jul 21 12:14:25 2017 +0530

    tests/gfapi : add test case for nameless lookups in
    glfs_resolve_component()

    Also addresses a pending review comment: add a check for the empty
    entry "" in glfs_resolve_component().

    Change-Id: I6063f776ce1cd76cb4c1b1f621b064f3dcc91e5c
    BUG: 1460514
    Signed-off-by: Jiffin Tony Thottan <jthottan at redhat.com>
    Reviewed-on: https://review.gluster.org/17844
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Niels de Vos <ndevos at redhat.com>
    Reviewed-by: soumya k <skoduri at redhat.com>
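
The committed test exercises this nameless-lookup path through gfapi. For
reference, a rough standalone reproducer along the same lines (the volume
name "testvol", the "localhost" server, and the directory layout are
assumptions for illustration, not details of the upstream test file):

/* build: gcc repro.c -lgfapi -o repro */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>
#include <glusterfs/api/glfs-handles.h>

int main(void)
{
        unsigned char handle[GFAPI_HANDLE_LENGTH];
        struct stat st;

        /* First mount: create a/b/c/file and save its handle, the way
         * an NFS client keeps a file handle across failover. */
        glfs_t *fs = glfs_new("testvol");
        if (!fs || glfs_set_volfile_server(fs, "tcp", "localhost", 24007) ||
            glfs_init(fs))
                return 1;

        glfs_object_t *root = glfs_h_lookupat(fs, NULL, "/", &st, 0);
        glfs_object_t *a = glfs_h_mkdir(fs, root, "a", 0755, &st);
        glfs_object_t *b = glfs_h_mkdir(fs, a, "b", 0755, &st);
        glfs_object_t *c = glfs_h_mkdir(fs, b, "c", 0755, &st);
        glfs_object_t *f = glfs_h_creat(fs, c, "file", O_CREAT, 0644, &st);
        if (!f || glfs_h_extract_handle(f, handle, GFAPI_HANDLE_LENGTH) < 0)
                return 1;
        glfs_fini(fs);          /* throw away every linked inode */

        /* Second mount: like the takeover server, it knows only the
         * gfid inside the handle; a/b/c was never linked here. */
        fs = glfs_new("testvol");
        if (!fs || glfs_set_volfile_server(fs, "tcp", "localhost", 24007) ||
            glfs_init(fs))
                return 1;

        glfs_object_t *obj =
            glfs_h_create_from_handle(fs, handle, GFAPI_HANDLE_LENGTH, &st);

        /* Before the fix, this nameless ".." lookup segfaulted inside
         * glusterfs_normalize_dentry(); after it, it succeeds. */
        glfs_object_t *parent = glfs_h_lookupat(fs, obj, "..", &st, 0);
        printf("'..' lookup %s\n", parent ? "succeeded" : "failed");

        glfs_fini(fs);
        return 0;
}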

--- Additional comment from Shyamsundar on 2017-09-05 13:33:35 EDT ---

This bug is getting closed because a release has been made available that
should address the reported issue. In case the problem is still not fixed with
glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages
for several distributions should become available in the near future. Keep an
eye on the Gluster Users mailing list [2] and the update infrastructure for
your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

--- Additional comment from Shyamsundar on 2017-12-08 12:33:42 EST ---

This bug is getting closed because a release has been made available that
should address the reported issue. In case the problem is still not fixed with
glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages
for several distributions should become available in the near future. Keep an
eye on the Gluster Users mailing list [2] and the update infrastructure for
your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1457183
[Bug 1457183] [Ganesha] : Ganesha crashes while cluster enters
failover/failback mode and during basic IO with the same BT.
https://bugzilla.redhat.com/show_bug.cgi?id=1460514
[Bug 1460514] [Ganesha] : Ganesha crashes while cluster enters
failover/failback mode
https://bugzilla.redhat.com/show_bug.cgi?id=1477994
[Bug 1477994] [Ganesha] : Ganesha crashes while cluster enters
failover/failback mode
-- 
You are receiving this mail because:
You are the QA Contact for the bug.
You are on the CC list for the bug.
You are the assignee for the bug.

