[Bugs] [Bug 1397386] New: [Ganesha] : Ganesha crashes intermittently during nfs-ganesha restarts.

Tue Nov 22 12:27:25 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1397386

            Bug ID: 1397386
           Summary: [Ganesha] : Ganesha crashes intermittently during
                    nfs-ganesha restarts.
           Product: GlusterFS
           Version: 3.8
         Component: io-threads
          Keywords: Triaged
          Severity: high
          Priority: medium
          Assignee: bugs at gluster.org
          Reporter: ndevos at redhat.com
                CC: bugs at gluster.org, pkarampu at redhat.com
        Depends On: 1396793

+++ This bug was initially created as a clone of Bug #1396793 +++

... snip'd some comments

--- Additional comment from Soumya Koduri on 2016-11-09 13:33:18 EST ---

Ambarish,
If you happen to reproduce the issue, please take the core (using gdb) before
running service stop/restart so as to compare the threads before and after the
crash. Thanks!

--- Additional comment from Ambarish on 2016-11-10 06:34:54 EST ---

I tried it twice,but I could not reproduce the issue post setting
client-io-threads to "off".

The issue is a bit intermittent,so it's hard to say that with certainty
,though. (if that is or is not the culprit).

--- Additional comment from Soumya Koduri on 2016-11-10 06:47:36 EST ---

>From the cores and logs provided here, 

gqas006 :
No symbol table info available.
#1  0x00007fb6f120180c in iot_worker (data=0x7fb6d4022be0) at io-threads.c:180
        conf = 0x7fb6d4022be0
        this = <optimized out>
        stub = 0x0
        sleep_till = {tv_sec = 1478711539, tv_nsec = 0}
        ret = <optimized out>
        pri = -1
        sleep = {tv_sec = 0, tv_nsec = 0}
        bye = _gf_false
        __FUNCTION__ = "iot_worker"

gqas005 :
Thread 10 (Thread 0x7f5c45e03700 (LWP 722)):
#0  0x00007f5cdb628a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f5c2e5ae80c in iot_worker (data=0x7f5c10022be0) at io-threads.c:180
        conf = 0x7f5c10022be0
        this = <optimized out>
        stub = 0x0
        sleep_till = {tv_sec = 1478711539, tv_nsec = 0}
        ret = <optimized out>
        pri = -1
        sleep = {tv_sec = 0, tv_nsec = 0}
        bye = _gf_false
        __FUNCTION__ = "iot_worker"

There was iot_worker thread still active in both the cores provided.

In the logs,

gqas006:
Nov  9 12:10:23 gqas006 systemd: Stopping GlusterFS, a clustered file-system
server...
Nov  9 12:10:23 gqas006 systemd: glusterd.service: main process exited,
code=exited, status=15/n/a
Nov  9 12:10:23 gqas006 systemd: Unit glusterd.service entered failed state.
Nov  9 12:10:23 gqas006 systemd: glusterd.service failed.
Nov  9 12:10:23 gqas006 systemd: Starting GlusterFS, a clustered file-system
server...
Nov  9 12:10:27 gqas006 kernel: ganesha.nfsd[18212]: segfault at 7fb6f39e780c
ip 00007fb6f39e780c sp 00007fb7001e9500 error 15
Nov  9 12:10:27 gqas006 systemd: Started GlusterFS, a clustered file-system
server.
Nov  9 12:10:29 gqas006 systemd: Stopping GlusterFS, a clustered file-system
server...
Nov  9 12:10:29 gqas006 systemd: glusterd.service: main process exited,
code=exited, status=15/n/a
Nov  9 12:10:29 gqas006 systemd: Unit glusterd.service entered failed state.
Nov  9 12:10:29 gqas006 systemd: glusterd.service failed.
Nov  9 12:10:29 gqas006 systemd: Stopping RPC bind service...

gqas005:
Nov  9 12:10:23 gqas005 systemd: Stopping GlusterFS, a clustered file-system
server...
Nov  9 12:10:23 gqas005 systemd: glusterd.service: main process exited,
code=exited, status=15/n/a
Nov  9 12:10:23 gqas005 systemd: Unit glusterd.service entered failed state.
Nov  9 12:10:23 gqas005 systemd: glusterd.service failed.
Nov  9 12:10:23 gqas005 systemd: Starting GlusterFS, a clustered file-system
server...
Nov  9 12:10:27 gqas005 kernel: ganesha.nfsd[32318]: segfault at 7f5c3c34880c
ip 00007f5c3c34880c sp 00007f5c3c11d500 error 14 in
write-behind.so[7f5c3c1ee000+200000]
Nov  9 12:10:27 gqas005 systemd: Started GlusterFS, a clustered file-system
server.
Nov  9 12:10:29 gqas005 systemd: nfs-ganesha.service: main process exited,
code=killed, status=11/SEGV
Nov  9 12:10:29 gqas005 systemd: Unit nfs-ganesha.service entered failed state.
Nov  9 12:10:29 gqas005 systemd: nfs-ganesha.service failed.

So the ganesha server has crashed during glusterd service start.

--- Additional comment from Soumya Koduri on 2016-11-10 07:08:34 EST ---

Ambarish could reproduce the issue. And this time the crash happened before the
restart of the services.

Nov 10 05:28:39 gqas014 kernel: ganesha.nfsd[31902]: segfault at 7f4e6817e80c
ip 00007f4e6817e80c sp 00007f4e68178500 error 14 in
libssl.so.1.0.1e[7f4e69726000+64000]
Nov 10 05:28:42 gqas014 systemd: nfs-ganesha.service: main process exited,
code=killed, status=11/SEGV
Nov 10 05:28:42 gqas014 systemd: Unit nfs-ganesha.service entered failed state.
Nov 10 05:28:42 gqas014 systemd: nfs-ganesha.service failed.

Nov 10 05:31:15 gqas014 systemd: Stopping GlusterFS, a clustered file-system
server...
Nov 10 05:31:15 gqas014 systemd: glusterd.service: main process exited,
code=exited, status=15/n/a
Nov 10 05:31:15 gqas014 systemd: Unit glusterd.service entered failed state.
Nov 10 05:31:15 gqas014 systemd: glusterd.service failed.
Nov 10 05:31:15 gqas014 systemd: Starting GlusterFS, a clustered file-system
server...
Nov 10 05:31:19 gqas014 systemd: Started GlusterFS, a clustered file-system
server.
Nov 10 05:31:22 gqas014 systemd: Stopping GlusterFS, a clustered file-system
server...
Nov 10 05:31:22 gqas014 systemd: glusterd.service: main process exited,
code=exited, status=15/n/a
Nov 10 05:31:22 gqas014 systemd: Unit glusterd.service entered failed state.
Nov 10 05:31:22 gqas014 systemd: glusterd.service failed.
Nov 10 05:31:22 gqas014 systemd: Stopping RPC bind service...
Nov 10 05:31:22 gqas014 systemd: Starting RPC bind service...

So maybe the crash is not related to any service restart but just happened to
be at the same time by coincidence.

And this time he could collect the information requested in comment#5,

>> before crash
(gdb) thread 272
[Switching to thread 272 (Thread 0x7f4e68179700 (LWP 31902))]
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
238    62:    movq    %rax, %r14
(gdb) bt
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f4e6817e80c in ?? ()
#2  0x0000000000000000 in ?? ()
(gdb) 

>> the core generated with segfault -

Thread 1 (Thread 0x7f4e68179700 (LWP 31902)):
#0  0x00007f4e6817e80c in ?? ()
No symbol table info available.

Looks like a stack corruption.

>From ganesha.log,

10/11/2016 05:26:35 : epoch 14570000 : gqas014.sbu.lab.eng.bos.redhat.com :
ganesha.nfsd-30752[dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume
testvol exported at : '/'
10/11/2016 05:26:39 : epoch 14570000 : gqas014.sbu.lab.eng.bos.redhat.com :
ganesha.nfsd-30752[dbus_heartbeat] glusterfs_create_export :FSAL :CRIT :Unable
to initialize volume. Export: /testvol
10/11/2016 05:26:40 : epoch 14570000 : gqas014.sbu.lab.eng.bos.redhat.com :
ganesha.nfsd-30752[dbus_heartbeat] fsal_cfg_commit :CONFIG :CRIT :Could not
create export for (/testvol) to (/testvol)
10/11/2016 05:26:40 : epoch 14570000 : gqas014.sbu.lab.eng.bos.redhat.com :
ganesha.nfsd-30752[dbus_heartbeat] gsh_export_addexport :EXPORT :CRIT :0 export
entries in
/var/run/gluster/shared_storage/nfs-ganesha/exports/export.testvol.conf added
because (export create, block validation) errors
10/11/2016 05:26:40 : epoch 14570000 : gqas014.sbu.lab.eng.bos.redhat.com :
ganesha.nfsd-30752[dbus_heartbeat] dbus_message_entrypoint :DBUS :MAJ :Method
(AddExport) on (org.ganesha.nfsd.exportmgr) failed: name =
(org.freedesktop.DBus.Error.InvalidFileContent), message = (0 export entries in
/var/run/gluster/shared_storage/nfs-ganesha/exports/export.testvol.conf added
because (export create, block validation) errors. Details:
Config File
(/var/run/gluster/shared_storage/nfs-ganesha/exports/export.testvol.conf:7): 1
validation errors in block FSAL
Config File
(/var/run/gluster/shared_storage/nfs-ganesha/exports/export.testvol.conf:7):
Errors processing block (FSAL)
Config File
(/var/run/gluster/shared_storage/nfs-ganesha/exports/export.testvol.conf:4): 1
validation errors in block EXPORT
Config File
(/var/run/gluster/shared_storage/nfs-ganesha/exports/export.testvol.conf:4):
Errors found in configuration block EXPORT
)

Volume is not exported. The reason being volume was not started by then. Maybe
there is some issue with exporting a volume via nfs-ganesha (doing glfs_init)
before it is even started.

--- Additional comment from Ambarish on 2016-11-11 00:05:50 EST ---

Soumya,

I tried the steps after keeping my volume in "Started" state,before setting up
the Ganesha cluster and exporting the volume,twice on fresh setups,and I could
not reproduce the crash on multiple tries of system service restarts.

--- Additional comment from Soumya Koduri on 2016-11-11 01:03:25 EST ---

Thanks Amabrish. That almost confirms the theory that this crash is hit only if
a volume is being exported via nfs-ganesha before it is even started. 

Since this is not a recommended configuration, lowering the priority of the bug
for now.

I suspect that probably when the volume is not started, the flow shall be 

glfs_init() -> xlator_init() of all the child subvols -> and then
rpc_connection to brick which shall fail.

Post which "glfs_fini" shall be called. May be since glfs_init() itself failed,
graph would have not been setup and PARENT_DOWN may not have been sent to
io-threads xlator, resulting in the dangling thread.

This is just the theory I have on top of my mind. Will look through the code a
bit. CCin Pranith too.

--- Additional comment from Soumya Koduri on 2016-11-14 04:38:09 EST ---

I could reproduce the issue even with single-node nfs-ganesha server. When the
volume is not started and is being exported , there will be client-io-thread
spawned which doesn't get cleaned up properly leading to crashes at later
point. Assigning it to Pranith as per discussions with him,

--- Additional comment from Pranith Kumar K on 2016-11-20 04:10:46 EST ---

Soumya while testing my change, I encountered the following assert in
nfs-ganesha:
Thread 271 "ganesha.nfsd" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f6dfb4cf700 (LWP 7303)]
0x00007f6eb2892e55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f6eb2892e55 in raise () from /lib64/libc.so.6
#1  0x00007f6eb2894a5a in abort () from /lib64/libc.so.6
#2  0x00007f6eb288b6f7 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f6eb288b7a2 in __assert_fail () from /lib64/libc.so.6
#4  0x00005560c8d508f3 in export_init ()
#5  0x00005560c8d88699 in proc_block ()
#6  0x00005560c8d88fb9 in load_config_from_node ()
#7  0x00005560c8d6053a in gsh_export_addexport ()
#8  0x00005560c8d83eef in dbus_message_entrypoint ()
#9  0x00007f6eb4463713 in _dbus_object_tree_dispatch_and_unlock () from
/lib64/libdbus-1.so.3
#10 0x00007f6eb4454d94 in dbus_connection_dispatch () from
/lib64/libdbus-1.so.3
#11 0x00007f6eb44550ad in _dbus_connection_read_write_dispatch () from
/lib64/libdbus-1.so.3
#12 0x00005560c8d84eef in gsh_dbus_thread ()
#13 0x00007f6eb30715ba in start_thread () from /lib64/libpthread.so.0
#14 0x00007f6eb29617cd in clone () from /lib64/libc.so.6
(gdb) fr 4

Here are the steps which exposed this issue:
In terminal 1:
while true; do gluster --mode=script volume stop r2 ; sleep 30 ; gluster
--mode=script volume start r2; sleep 30; done

In terminal 2:
while true; do /usr/libexec/ganesha/dbus-send.sh /etc/ganesha on r2; sleep 10
&& showmount -e localhost; /usr/libexec/ganesha/dbus-send.sh /etc/ganesha off
r2; showmount -e localhost; done

In termina 3:
watch systemctl status nfs-ganesha

Since this is not related to the bug we are discussing in this bz, I changed
the script so that we do "while true; do /usr/libexec/ganesha/dbus-send.sh
/etc/ganesha on r2; sleep 10 && showmount -e localhost;
/usr/libexec/ganesha/dbus-send.sh /etc/ganesha off r2; showmount -e localhost;
done" in a loop when the brick is online and offline. That didn't crash. So
sending the fix from gluster side. Soumya, feel free to raise a separate bug
for the "assert failure".

--- Additional comment from Worker Ant on 2016-11-20 10:36:45 CET ---

REVIEW: http://review.gluster.org/15888 (performance/io-threads: Exit threads
in fini() as well) posted (#1) for review on master by Pranith Kumar Karampuri
(pkarampu at redhat.com)

--- Additional comment from Worker Ant on 2016-11-22 08:01:21 CET ---

COMMIT: http://review.gluster.org/15888 committed in master by Raghavendra G
(rgowdapp at redhat.com) 
------
commit 25817a8c868b6c1b8149117f13e4216a99e453aa
Author: Pranith Kumar K <pkarampu at redhat.com>
Date:   Fri Nov 18 13:30:08 2016 +0530

    performance/io-threads: Exit threads in fini() as well

    Problem:
    io-threads starts the thread in 'init()' but doesn't clean them up
    on 'fini()'. It relies on PARENT_DOWN to exit threads but there can
    be cases where event before PARENT_UP the graph init code can think
    of issuing fini(). This code path is hit when glfs_init() is called
    on a volume that is in 'stopped' state. It leads to a crash in ganesha
    process, because the io-thread tries to access freed memory.

    Fix:
    Ideal fix would be to wait for all fops in io-thread list to be completed
on
    PARENT_DOWN, and have fini() do cleanup of threads. Because there is no
proper
    documentation about how PARENT_DOWN/fini are supposed to be used,
    we are getting different kinds of sequences in different higher level
protocols.
    So for now cleaning up in both PARENT_DOWN and fini(). Fuse doesn't call
fini()
    gfapi is not calling PARENT_DOWN in some cases, so for now I don't see
    another way out.

    BUG: 1396793
    Change-Id: I9c9154e7d57198dbaff0f30d3ffc25f6d8088aec
    Signed-off-by: Pranith Kumar K <pkarampu at redhat.com>
    Reviewed-on: http://review.gluster.org/15888
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    Reviewed-by: Raghavendra G <rgowdapp at redhat.com>

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1396793
[Bug 1396793] [Ganesha] : Ganesha crashes intermittently during nfs-ganesha
restarts.
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.