[Bugs] [Bug 1454569] New: [geo-rep + nl]: Multiple crashes observed on slave with "nlc_lookup_cbk"

bugzilla at redhat.com bugzilla at redhat.com
Tue May 23 04:59:27 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1454569

            Bug ID: 1454569
           Summary: [geo-rep + nl]: Multiple crashes observed on slave
                    with "nlc_lookup_cbk"
           Product: GlusterFS
           Version: 3.11
         Component: unclassified
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: pgurusid at redhat.com
                CC: bugs at gluster.org, csaba at redhat.com, rallan at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com, vbellur at redhat.com
        Depends On: 1450904, 1451588



+++ This bug was initially created as a clone of Bug #1451588 +++

+++ This bug was initially created as a clone of Bug #1450904 +++

Description of problem:
=========================
Found 218 cores on a single slave node while running a sanity check of
geo-replication with nl (negative lookup cache) enabled.


# gdb glusterfsd /core.20805
    GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
    Copyright (C) 2013 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-redhat-linux-gnu".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /usr/sbin/glusterfsd...Reading symbols from
/usr/lib/debug/usr/sbin/glusterfsd.debug...done.
    done.

    warning: core file may not match specified executable file.
    [New LWP 20820]
    [New LWP 20806]
    [New LWP 20807]
    [New LWP 20810]
    [New LWP 20819]
    [New LWP 20823]
    [New LWP 20805]
    [New LWP 20818]
    [New LWP 20809]
    [New LWP 20811]
    [New LWP 20822]
    [New LWP 20808]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    Core was generated by `/usr/sbin/glusterfs --aux-gfid-mount --acl
--log-file=/var/log/glusterfs/geo-re'.
    Program terminated with signal 11, Segmentation fault.
    #0  nlc_dir_add_ne (this=0x7fac34021160, inode=0x0, name=0x0) at
nl-cache-helper.c:823
    823             if (inode->ia_type != IA_IFDIR) {
    Missing separate debuginfos, use: debuginfo-install
glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.14.1-27.el7_3.x86_64 libcom_err-1.42.9-9.el7.x86_64
libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64
libuuid-2.23.2-33.el7_3.2.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64
pcre-8.32-15.el7_2.1.x86_64 sssd-client-1.14.0-43.el7_3.14.x86_64
zlib-1.2.7-17.el7.x86_64
    (gdb) bt
    #0  nlc_dir_add_ne (this=0x7fac34021160, inode=0x0, name=0x0) at
nl-cache-helper.c:823
    #1  0x00007fac3a554a3e in nlc_lookup_cbk (frame=0x7fac3000ff20,
cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=2, inode=0x0,
buf=0x7fac30001ac8, xdata=0x0, postparent=0x7fac30001cf8)
        at nl-cache.c:203
    #2  0x00007fac3a969cb3 in qr_lookup_cbk (frame=frame at entry=0x7fac30007be0,
cookie=<optimized out>, this=<optimized out>, op_ret=op_ret at entry=-1,
op_errno=op_errno at entry=2, inode_ret=inode_ret at entry=0x0,
        buf=buf at entry=0x7fac30001ac8, xdata=xdata at entry=0x0,
postparent=postparent at entry=0x7fac30001cf8) at quick-read.c:446
    #3  0x00007fac3ab75cee in ioc_lookup_cbk (frame=0x7fac300109a0,
cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>,
op_errno=<optimized out>, inode=0x0, stbuf=0x7fac30001ac8, xdata=0x0,
        postparent=0x7fac30001cf8) at io-cache.c:255
    #4  0x00007fac3b3dd507 in dht_discover_complete
(this=this at entry=0x7fac34016fe0,
discover_frame=discover_frame at entry=0x7fac300175d0) at dht-common.c:572
    #5  0x00007fac3b3de29b in dht_discover_cbk (frame=0x7fac300175d0,
cookie=<optimized out>, this=0x7fac34016fe0, op_ret=<optimized out>,
op_errno=2, inode=0x7fac1c000f30, stbuf=0x7fac300073d0, xattr=0x0,
        postparent=0x7fac30007440) at dht-common.c:701
    #6  0x00007fac3b68a2f0 in afr_discover_done (this=<optimized out>,
frame=0x7fac30003270) at afr-common.c:2615
    #7  afr_discover_cbk (frame=frame at entry=0x7fac30003270, cookie=<optimized
out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>,
inode=inode at entry=0x7fac1c000f30,
        buf=buf at entry=0x7fac2bffe940, xdata=0x0,
postparent=postparent at entry=0x7fac2bffe9b0) at afr-common.c:2660
    #8  0x00007fac3b8c72c7 in client3_3_lookup_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7fac30008720) at
client-rpc-fops.c:2947
    #9  0x00007fac49139840 in rpc_clnt_handle_reply
(clnt=clnt at entry=0x7fac340767d0, pollin=pollin at entry=0x7fac24004950) at
rpc-clnt.c:794
    #10 0x00007fac49139b27 in rpc_clnt_notify (trans=<optimized out>,
mydata=0x7fac34076800, event=<optimized out>, data=0x7fac24004950) at
rpc-clnt.c:987
    #11 0x00007fac491359e3 in rpc_transport_notify
(this=this at entry=0x7fac340769d0, event=event at entry=RPC_TRANSPORT_MSG_RECEIVED,
data=data at entry=0x7fac24004950) at rpc-transport.c:538
    #12 0x00007fac3ddb03b4 in socket_event_poll_in
(this=this at entry=0x7fac340769d0) at socket.c:2275
    #13 0x00007fac3ddb2895 in socket_event_handler (fd=<optimized out>, idx=2,
data=0x7fac340769d0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2411
    #14 0x00007fac493c9e00 in event_dispatch_epoll_handler
(event=0x7fac2bffee80, event_pool=0x7fac4a524730) at event-epoll.c:572
    #15 event_dispatch_epoll_worker (data=0x7fac34076500) at event-epoll.c:675
    #16 0x00007fac481cfdc5 in start_thread () from /lib64/libpthread.so.0
    #17 0x00007fac47b1473d in clone () from /lib64/libc.so.6
    (gdb) list 1
    1       /*
    2        *   Copyright (c) 2017 Red Hat, Inc. <http://www.redhat.com>
    3        *   This file is part of GlusterFS.
    4        *
    5        *   This file is licensed to you under your choice of the GNU
Lesser
    6        *   General Public License, version 3 or any later version (LGPLv3
or
    7        *   later), or the GNU General Public License, version 2 (GPLv2),
in all
    8        *   cases as published by the Free Software Foundation.
    9        */
    10     
    (gdb) list nl_lookup_cbk
    Function "nl_lookup_cbk" not defined.
    (gdb) list nlc_lookup_cbk
    188    
    189     static int32_t
    190     nlc_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
    191                     int32_t op_ret, int32_t op_errno, inode_t *inode,
    192                     struct iatt *buf, dict_t *xdata, struct iatt
*postparent)
    193     {
    194             nlc_local_t *local = NULL;
    195             nlc_conf_t  *conf  = NULL;
    196    
    197             local = frame->local;
    (gdb)
    198             conf = this->private;
    199    
    200             /* Donot add to pe, this may lead to duplicate entry and
    201              * requires search before adding if list of strings */
    202             if (op_ret < 0 && op_errno == ENOENT) {
    203                     nlc_dir_add_ne (this, local->loc.parent,
local->loc.name);
    204                     GF_ATOMIC_INC (conf->nlc_counter.nlc_miss);
    205             }
    206    
    207             NLC_STACK_UNWIND (lookup, frame, op_ret, op_errno, inode,
buf, xdata,
    (gdb)
    208                              postparent);
    209             return 0;
    210     }
    211    
    212    
    213     static int32_t
    214     nlc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t
*xdata)
    215     {
    216             nlc_local_t *local = NULL;
    217             nlc_conf_t  *conf  = NULL;
    (gdb) f 1
    #1  0x00007fac3a554a3e in nlc_lookup_cbk (frame=0x7fac3000ff20,
cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=2, inode=0x0,
buf=0x7fac30001ac8, xdata=0x0, postparent=0x7fac30001cf8)
        at nl-cache.c:203
    203                     nlc_dir_add_ne (this, local->loc.parent,
local->loc.name);
    (gdb) p *local->loc
    Structure has no component named operator*.
    (gdb) p local->loc
    $1 = {path = 0x0, name = 0x0, inode = 0x7fac1c000f30, parent = 0x0, gfid =
"̲ǝ\f1Aڜ\353iq61Y)", pargfid = '\000' <repeats 15 times>}
    (gdb) quit
    [root at dhcp37-82 ~]#


In general, the steps were as follows:
1. Create the Master and Slave clusters and volumes
2. Start them
3. Enable the nl options on master and slave
4. Mount the Master volume via a cifs mount
5. Run the following fops on the master volume:
{create,chmod,chown,chgrp,rename,hardlink,symlink,truncate,remove}

I am still trying the fops one by one to isolate the specific one that triggers
the crash. However, the following shows that the core comes from the aux mount
of geo-replication:

[root at dhcp37-82 ~]# file /core.9948
/core.9948: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from
'/usr/sbin/glusterfs --aux-gfid-mount --acl
--log-file=/var/log/glusterfs/geo-re', real uid: 0, effective uid: 0, real gid:
0, effective gid: 0, execfn: '/usr/sbin/glusterfs', platform: 'x86_64'
[root at dhcp37-82 ~]#

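The issue is with nameless lookups and the negative lookup cache. The crash
happened because negative lookup caching is enabled on the slave. In a nameless
lookup, loc.parent and loc.name are NULL (see frame 1 above), and
nlc_dir_add_ne dereferences the NULL parent inode. As discussed with Poornima,
negative lookup caching is not possible with nameless lookups, so a fix is
required.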
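
The failure mode described above can be illustrated with a small standalone C
sketch. It is not the patch under review: the sketch_* types and functions are
hypothetical stand-ins for loc_t, inode_t and nlc_lookup_cbk, reduced to the
fields visible in the core (loc.name, loc.inode, loc.parent). It assumes a
lookup counts as nameless when either the parent inode or the name is missing,
and it skips the negative-entry cache in that case instead of dereferencing the
NULL parent.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the GlusterFS types.
 * These are NOT the real definitions. */
typedef struct sketch_inode {
        int is_dir;            /* stands in for ia_type == IA_IFDIR */
        int negative_entries;  /* number of cached "does not exist" names */
} sketch_inode_t;

typedef struct sketch_loc {
        const char     *name;
        sketch_inode_t *inode;
        sketch_inode_t *parent;
} sketch_loc_t;

/* Assumption: a lookup is "nameless" when it has no parent/name pair,
 * as in the core above (parent = 0x0, name = 0x0, inode set). */
static int
sketch_loc_is_nameless (const sketch_loc_t *loc)
{
        return (loc->parent == NULL || loc->name == NULL);
}

/* Callback-side guard. Returns 1 if a negative entry was recorded on the
 * parent directory, 0 if caching was skipped. */
static int
sketch_lookup_cbk (sketch_loc_t *loc, int op_ret, int op_errno)
{
        assert (loc != NULL);

        if (op_ret < 0 && op_errno == ENOENT) {
                /* No parent directory to attach the entry to: do not cache. */
                if (sketch_loc_is_nameless (loc))
                        return 0;

                /* Negative entries only make sense on directories. */
                if (!loc->parent->is_dir)
                        return 0;

                loc->parent->negative_entries++;
                return 1;
        }

        return 0;
}
```

With a loc shaped like the one in the core (inode set, parent and name NULL),
sketch_lookup_cbk returns 0 without touching the parent. A named lookup that
fails with ENOENT still records a negative entry on its parent directory.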

--- Additional comment from Worker Ant on 2017-05-17 03:07:32 EDT ---

REVIEW: https://review.gluster.org/17316 (nl-cache: In case of nameless
operations do not cache) posted (#1) for review on master by Poornima G
(pgurusid at redhat.com)

--- Additional comment from Worker Ant on 2017-05-18 02:11:05 EDT ---

REVIEW: https://review.gluster.org/17316 (nl-cache: In case of nameless
operations do not cache) posted (#2) for review on master by Poornima G
(pgurusid at redhat.com)

--- Additional comment from Worker Ant on 2017-05-22 08:40:08 EDT ---

COMMIT: https://review.gluster.org/17316 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com) 
------
commit 284cd8851bfe60984d2f11b5c52fe3204ff43b06
Author: Poornima G <pgurusid at redhat.com>
Date:   Tue May 16 19:25:20 2017 +0530

    nl-cache: In case of nameless operations do not cache

    Issue:
    In nameless lookup/other fops, parent inode will be NULL, when we try
    to add the cache to the NULL inode, it causes a crash.

    Hence handle the scenario of nameless fops, and do not cache/serve
    the nameless fops.

    Change-Id: I3b90f882ac89e6aaf3419db89e6f890797f37700
    BUG: 1451588
    Signed-off-by: Poornima G <pgurusid at redhat.com>
    Reviewed-on: https://review.gluster.org/17316
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
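
As a rough illustration of the "do not cache/serve" wording in the commit
message, the request path might bypass the cache for nameless lookups along the
lines below. This is a sketch under stated assumptions, not the committed code:
sketch_dir_t, sketch_req_loc_t and sketch_lookup are hypothetical names, and the
fixed-size array of negative entries is only a placeholder for whatever
structure nl-cache really keeps per directory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_NE 4

/* Hypothetical per-directory cache of names known not to exist. */
typedef struct sketch_dir {
        const char *ne[SKETCH_MAX_NE];
        size_t      ne_count;
} sketch_dir_t;

/* Hypothetical, reduced stand-in for loc_t on the request path. */
typedef struct sketch_req_loc {
        const char   *name;
        sketch_dir_t *parent;
} sketch_req_loc_t;

typedef enum {
        SKETCH_WIND         = 0,  /* pass the lookup down to the child */
        SKETCH_SERVE_ENOENT = 1   /* answer ENOENT from the cache */
} sketch_action_t;

static sketch_action_t
sketch_lookup (const sketch_req_loc_t *loc)
{
        size_t i;

        assert (loc != NULL);

        /* Nameless lookup: there is no parent directory to consult,
         * so never serve it from the cache. */
        if (loc->parent == NULL || loc->name == NULL)
                return SKETCH_WIND;

        for (i = 0; i < loc->parent->ne_count; i++) {
                if (strcmp (loc->parent->ne[i], loc->name) == 0)
                        return SKETCH_SERVE_ENOENT;
        }

        return SKETCH_WIND;
}
```

The early return keeps nameless lookups out of the cache entirely, so they are
always passed down to the child, while named lookups keep the benefit of cached
negative entries.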


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1450904
[Bug 1450904] [geo-rep + nl]: Multiple crashes observed on slave with
"nlc_lookup_cbk"
https://bugzilla.redhat.com/show_bug.cgi?id=1451588
[Bug 1451588] [geo-rep + nl]: Multiple crashes observed on slave with
"nlc_lookup_cbk"

