[Bugs] [Bug 1253212] New: snapd crashed due to stack overflow
bugzilla at redhat.com
Thu Aug 13 08:57:00 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1253212
Bug ID: 1253212
Summary: snapd crashed due to stack overflow
Product: GlusterFS
Version: 3.7.0
Component: protocol
Severity: urgent
Assignee: kparthas at redhat.com
Reporter: kparthas at redhat.com
CC: bugs at gluster.org, gluster-bugs at redhat.com,
rgowdapp at redhat.com
Depends On: 1235571, 1235582
+++ This bug was initially created as a clone of Bug #1235582 +++
+++ This bug was initially created as a clone of Bug #1235571 +++
Description of problem:
Snapd crashed on one of the storage nodes.
Version-Release number of selected component (if applicable):
[root at darkknightrises ~]# rpm -qa | grep glusterfs
glusterfs-libs-3.7.1-5.el6rhs.x86_64
glusterfs-api-3.7.1-5.el6rhs.x86_64
samba-vfs-glusterfs-4.1.17-7.el6rhs.x86_64
glusterfs-3.7.1-5.el6rhs.x86_64
glusterfs-fuse-3.7.1-5.el6rhs.x86_64
glusterfs-cli-3.7.1-5.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-5.el6rhs.x86_64
glusterfs-debuginfo-3.7.1-5.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-5.el6rhs.x86_64
glusterfs-server-3.7.1-5.el6rhs.x86_64
How reproducible:
1/1
Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume
2. FUSE-mount the volume and start I/O from the mount point
3. Enable quota and set a usage limit
4. Enable USS
5. Set snap-activate-on-create to enable for snapshots
6. Enable auto-delete for snapshots
7. Set cluster.enable-shared-storage to enable
8. Schedule a snapshot every 5 mins
[root at darkknightrises ~]# snap_scheduler.py list
JOB_NAME SCHEDULE OPERATION VOLUME NAME
--------------------------------------------------------------------
job1 */5 * * * * Snapshot Create vol0
9. Edit the scheduled time to every two minutes
10. Restart the ntpd service on all the storage nodes
11. Run gluster v status vol0 and observe the crash
Actual results:
Snapd crashed
Expected results:
Snapd should not crash
Additional info:
[root at darkknightrises ~]# gluster v info vol0
Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 553670d5-8f05-4d41-878e-921d59f117ae
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: node1:/rhs/brick1/b01
Brick2: node2:/rhs/brick1/b02
Brick3: node3:/rhs/brick1/b03
Brick4: node4:/rhs/brick1/04
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
features.uss: enable
features.show-snapshot-directory: enable
features.barrier: enable
auto-delete: enable
snap-activate-on-create: enable
cluster.enable-shared-storage: enable
===============================================================
[root at darkknightrises /]# gluster v status vol0
Status of volume: vol0
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick node1:/rhs/brick1/b01 49152 0 Y 24731
Brick node2:/rhs/brick1/b02 49152 0 Y 16359
Brick node3:/rhs/brick1/b03 49152 0 Y 10149
Brick node4:/rhs/brick1/04 49152 0 Y 5512
Snapshot Daemon on localhost N/A N/A N 25133
NFS Server on localhost 2049 0 Y 25671
Self-heal Daemon on localhost N/A N/A Y 25681
Quota Daemon on localhost N/A N/A Y 26666
Snapshot Daemon on node3 49153 0 Y 10525
NFS Server on node3 2049 0 Y 10991
Self-heal Daemon on node3 N/A N/A Y 11001
Quota Daemon on node3 N/A N/A Y 11853
Snapshot Daemon on node4 49154 0 Y 5918
NFS Server on node4 2049 0 Y 6396
Self-heal Daemon on node4 N/A N/A Y 6406
Quota Daemon on node4 N/A N/A Y 7255
Snapshot Daemon on node2 49154 0 Y 16714
NFS Server on node2 2049 0 Y 17201
Self-heal Daemon on node2 N/A N/A Y 17211
Quota Daemon on node2 N/A N/A Y 18168
bt
========================================
#44214 0x00007f473c1f5def in client_dump_version_cbk (req=<value optimized
out>, iov=<value optimized out>, count=<value optimized out>,
myframe=0x7f4024701c24) at client-handshake.c:1623
1623 gf_msg (frame->this->name, GF_LOG_WARNING, ENOTCONN,
(gdb) up
#44215 0x00007f475360cb1e in saved_frames_unwind (saved_frames=0x7f45d80022d0)
at rpc-clnt.c:366
366 trav->rpcreq->cbkfn (trav->rpcreq, &iov, 1,
trav->frame);
(gdb) up
#44216 0x00007f475360cc0e in saved_frames_destroy (frames=0x7f45d80022d0) at
rpc-clnt.c:383
383 saved_frames_unwind (frames);
(gdb) down
#44215 0x00007f475360cb1e in saved_frames_unwind (saved_frames=0x7f45d80022d0)
at rpc-clnt.c:366
366 trav->rpcreq->cbkfn (trav->rpcreq, &iov, 1,
trav->frame);
(gdb) up
#44216 0x00007f475360cc0e in saved_frames_destroy (frames=0x7f45d80022d0) at
rpc-clnt.c:383
383 saved_frames_unwind (frames);
(gdb) up
#44217 0x00007f475360ccdb in rpc_clnt_connection_cleanup (conn=0x7f45d0e86f40)
at rpc-clnt.c:536
536 saved_frames_destroy (saved_frames);
(gdb) up
#44218 0x00007f475360d29f in rpc_clnt_notify (trans=<value optimized out>,
mydata=0x7f45d0e86f40, event=<value optimized out>, data=<value optimized out>)
at rpc-clnt.c:843
843 rpc_clnt_connection_cleanup (conn);
(gdb) up
#44219 0x00007f4753608928 in rpc_transport_notify (this=<value optimized out>,
event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:543
543 ret = this->notify (this, this->mydata, event, data);
(gdb) up
#44220 0x00007f4748370741 in socket_event_poll_err (fd=<value optimized out>,
idx=<value optimized out>, data=0x7f45d0ee01c0, poll_in=<value optimized out>,
poll_out=0, poll_err=0)
at socket.c:1205
1205 rpc_transport_notify (this, RPC_TRANSPORT_DISCONNECT, this);
(gdb) up
#44221 socket_event_handler (fd=<value optimized out>, idx=<value optimized
out>, data=0x7f45d0ee01c0, poll_in=<value optimized out>, poll_out=0,
poll_err=0) at socket
----------------------------------------
and some frames at the top of the stack:
#0  0x00007f47521ee06a in vfprintf () from /lib64/libc.so.6
#1 0x00007f47522a9b10 in __vsnprintf_chk () from /lib64/libc.so.6
#2 0x00007f47522a9a4a in __snprintf_chk () from /lib64/libc.so.6
#3 0x00007f4753856a5d in snprintf (buf=0x7f4024f1bb98 "") at
/usr/include/bits/stdio2.h:65
#4 gf_backtrace_append (buf=0x7f4024f1bb98 "") at common-utils.c:3547
#5 gf_backtrace_fillframes (buf=0x7f4024f1bb98 "") at common-utils.c:3593
#6 0x00007f4753856ac1 in gf_backtrace_save (buf=<value optimized out>) at
common-utils.c:3631
#7 0x00007f475383cfc0 in _gf_log_callingfn (domain=0x7f47538add11 "",
file=<value optimized out>, function=0x7f47538b3850 "gf_mem_set_acct_info",
line=62, level=GF_LOG_ERROR,
fmt=0x7f47538b3718 "Assertion failed: xl->mem_acct != NULL") at
logging.c:835
#8 0x00007f475386dd34 in gf_mem_set_acct_info (xl=0x7f4753ad4120,
alloc_ptr=0x7f401a76ffe8, size=437, type=39, typestr=0x7f47538b35cc
"gf_common_mt_asprintf") at mem-pool.c:62
#9 0x00007f475386ddf6 in __gf_malloc (size=437, type=39,
typestr=0x7f47538b35cc "gf_common_mt_asprintf") at mem-pool.c:146
#10 0x00007f475386df04 in gf_vasprintf (string_ptr=0x7f401a770228,
format=0x7f47538adc22 "[%s] %s [%s:%d:%s] %s %d-%s: ", arg=<value optimized
out>) at mem-pool.c:220
#11 0x00007f475386dfc8 in gf_asprintf (string_ptr=<value optimized out>,
format=<value optimized out>) at mem-pool.c:238
#12 0x00007f475383d1fe in _gf_log_callingfn (domain=0x7f47538add11 "",
file=<value optimized out>, function=0x7f47538b3850 "gf_mem_set_acct_info",
line=62, level=GF_LOG_ERROR,
fmt=0x7f47538b3718 "Assertion failed: xl->mem_acct != NULL") at
logging.c:869
#13 0x00007f475386dd34 in gf_mem_set_acct_info (xl=0x7f4753ad4120,
alloc_ptr=0x7f401a770488, size=437, type=39, typestr=0x7f47538b35cc
"gf_common_mt_asprintf") at mem-pool.c:62
#14 0x00007f475386ddf6 in __gf_malloc (size=437, type=39,
typestr=0x7f47538b35cc "gf_common_mt_asprintf") at mem-pool.c:146
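Frames #7 through #14 above repeat the same pair of calls: gf_mem_set_acct_info hits the failed assertion on xl->mem_acct and logs it, and the logging path allocates a message buffer, which runs the same assertion and logs again, until the stack is exhausted. Below is a minimal, self-contained sketch of that mutual recursion; the names and structures are simplified stand-ins, not the actual GlusterFS code.

/* Illustrative sketch of the recursion in frames #7-#14.
 * Simplified stand-in names; not the actual GlusterFS functions. */
#include <stdio.h>
#include <stdlib.h>

struct xlator {
    void *mem_acct;            /* NULL in the crashing process */
};

static char *log_msg(struct xlator *xl, const char *msg);

/* Accounting allocator: complains when xl->mem_acct is missing. */
static void *acct_malloc(struct xlator *xl, size_t size)
{
    if (xl->mem_acct == NULL)
        log_msg(xl, "Assertion failed: xl->mem_acct != NULL");
    return malloc(size);
}

/* Logger: formats the message into a freshly allocated buffer ... */
static char *log_msg(struct xlator *xl, const char *msg)
{
    /* ... using the same accounting allocator, which asserts and logs
     * again -> unbounded recursion -> stack overflow. */
    char *buf = acct_malloc(xl, 256);
    if (buf)
        snprintf(buf, 256, "[ERROR] %s", msg);
    return buf;
}

int main(void)
{
    struct xlator xl = { .mem_acct = NULL };
    acct_malloc(&xl, 64);      /* never returns: the recursion overflows the stack */
    return 0;
}

The review posted in the comment below (http://review.gluster.org/11399) breaks exactly this cycle by checking xl->mem_acct only when memory accounting is enabled.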
--- Additional comment from Anand Avati on 2015-06-25 05:02:53 EDT ---
REVIEW: http://review.gluster.org/11399 (core: check for xl->mem_acct only if
memory accounting is enabled) posted (#2) for review on master by Krishnan
Parthasarathi (kparthas at redhat.com)
--- Additional comment from krishnan parthasarathi on 2015-06-26 05:17:20 EDT
---
RCA
----
The stack overflow was seen when older snapshots were being deleted while new
ones were being created concurrently. In the setup detailed above, the snapshot
scheduler creates snapshots periodically and auto-delete of snapshots is
enabled. When the number of snapshots of the volume in the system exceeds the
configured soft-limit, snapshots are (auto-)deleted. The crash happened when a
scheduled snapshot-create coincided with the auto-delete triggered
snapshot-delete operation.
Implementation detail
----------------------
The snapshot daemon uses the gfapi interface to serve user-serviceable
snapshots. gfapi creates a new glfs object for every snapshot (volume) it
services. This object is 'linked' with a global xlator object until the glfs
object is fully initialized (i.e., until the set-volume operation is complete).
The global xlator object's ctx (glusterfs_ctx_t) is modified in a thread-unsafe
manner and could refer to a destroyed ctx (one that belonged to the glfs
representing a deleted snapshot).
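For illustration, a minimal sketch of the lifetime race described above. The types and names used here (glfs_t, xlator_t, glusterfs_ctx_t as plain structs, init_snapshot_glfs, delete_snapshot_glfs) are simplified stand-ins, not the real gfapi or libglusterfs structures: one thread re-points the shared global xlator at a new glfs' ctx while another thread tears down a glfs whose ctx the global xlator may still reference.

/* Illustrative sketch only: simplified stand-ins for the glfs/xlator/ctx
 * relationship; not the actual gfapi implementation. */
#include <pthread.h>
#include <stdlib.h>

typedef struct { int id; } glusterfs_ctx_t;

typedef struct {
    glusterfs_ctx_t *ctx;      /* shared, updated without synchronisation */
} xlator_t;

typedef struct {
    glusterfs_ctx_t *ctx;      /* per-snapshot context */
} glfs_t;

static xlator_t global_xlator; /* one per snapd process */

/* New snapshot being initialised: its ctx is 'linked' into the global
 * xlator until the set-volume operation completes. */
static void *init_snapshot_glfs(void *arg)
{
    glfs_t *fs = arg;
    global_xlator.ctx = fs->ctx;   /* unsynchronised write to shared state */
    /* ... handshake RPCs may run with THIS pointing at global_xlator ... */
    return NULL;
}

/* Old snapshot being deleted: its ctx is destroyed, yet global_xlator.ctx
 * may still point at it, i.e. at freed memory. */
static void *delete_snapshot_glfs(void *arg)
{
    glfs_t *fs = arg;
    free(fs->ctx);
    fs->ctx = NULL;
    return NULL;
}

int main(void)
{
    glfs_t old_fs = { .ctx = calloc(1, sizeof(glusterfs_ctx_t)) };
    glfs_t new_fs = { .ctx = calloc(1, sizeof(glusterfs_ctx_t)) };
    global_xlator.ctx = old_fs.ctx;    /* global currently refers to the old snapshot */

    pthread_t create_thr, delete_thr;
    pthread_create(&create_thr, NULL, init_snapshot_glfs, &new_fs);
    pthread_create(&delete_thr, NULL, delete_snapshot_glfs, &old_fs);
    pthread_join(create_thr, NULL);
    pthread_join(delete_thr, NULL);

    /* Depending on the interleaving, global_xlator.ctx may now be a
     * dangling pointer into the deleted snapshot's freed ctx. */
    return 0;
}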
Fix outline
------------
All initialisation management operations (e.g., RPCs like DUMP_VERSION,
SET_VOLUME, etc.) must refer to the corresponding translator objects in the
glfs' graph.
--- Additional comment from Anand Avati on 2015-06-27 01:41:43 EDT ---
REVIEW: http://review.gluster.org/11436 (client: set THIS to client's xl in
non-FOP RPCs) posted (#1) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)
--- Additional comment from Anand Avati on 2015-07-17 04:59:53 EDT ---
REVIEW: http://review.gluster.org/11436 (rpc: add owner xlator argument to
rpc_clnt_new) posted (#2) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)
--- Additional comment from Anand Avati on 2015-08-05 13:40:52 EDT ---
REVIEW: http://review.gluster.org/11436 (rpc: add owner xlator argument to
rpc_clnt_new) posted (#3) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)
--- Additional comment from Anand Avati on 2015-08-13 02:12:59 EDT ---
COMMIT: http://review.gluster.org/11436 committed in master by Raghavendra G
(rgowdapp at redhat.com)
------
commit f7668938cd7745d024f3d2884e04cd744d0a69ab
Author: Krishnan Parthasarathi <kparthas at redhat.com>
Date: Sat Jun 27 11:04:25 2015 +0530
rpc: add owner xlator argument to rpc_clnt_new
The @owner argument tells the RPC layer which xlator owns
the connection and to which xlator THIS needs to be set during
network notifications like CONNECT and DISCONNECT.
Code paths that originate from the head of a (volume) graph and use
STACK_WIND ensure that the RPC local endpoint has the right xlator saved
in the frame of the call (callback pair). This guarantees that the
callback is executed in the right xlator context.
The client handshake process, which includes fetching brick ports from
glusterd and setting the lk-version on the brick for the session, doesn't
have the correct xlator set in its frames. The problem lies with RPC
notifications: they don't have the provision to set THIS to the xlator
that is registered with the corresponding RPC programs. e.g., the
RPC_CLNT_CONNECT event received by protocol/client doesn't have THIS set
to its xlator. This implies that calls (and callbacks) originating from
this thread don't have the right xlator set either.
The fix is to save the xlator registered with the RPC connection
during rpc_clnt_new. e.g., protocol/client's xlator would be saved with
the RPC connection that it 'owns'. RPC notifications such as CONNECT,
DISCONNECT, etc. inherit THIS from the RPC connection's xlator.
Change-Id: I9dea2c35378c511d800ef58f7fa2ea5552f2c409
BUG: 1235582
Signed-off-by: Krishnan Parthasarathi <kparthas at redhat.com>
Reviewed-on: http://review.gluster.org/11436
Tested-by: Gluster Build System <jenkins at build.gluster.com>
Tested-by: NetBSD Build System <jenkins at build.gluster.org>
Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
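To make the ownership pattern in the commit message concrete, here is a condensed sketch. The types and functions below (rpc_clnt_t, rpc_clnt_init, rpc_clnt_dispatch, client_notify) are invented for illustration and are not the real rpc-clnt interface; the point is only that the connection remembers the xlator that owns it, and notifications set THIS from that owner before dispatching.

/* Sketch of the "owner xlator" pattern described in the commit message.
 * Invented, simplified types; not the real rpc-clnt interface. */
#include <stdio.h>

typedef struct xlator {
    const char *name;
} xlator_t;

/* Stand-in for the per-thread THIS used throughout the codebase. */
static __thread xlator_t *THIS;

typedef enum { RPC_CLNT_CONNECT, RPC_CLNT_DISCONNECT } rpc_clnt_event_t;

typedef struct {
    xlator_t *owner;   /* xlator that owns this connection */
    void (*notify)(xlator_t *this, rpc_clnt_event_t event);
} rpc_clnt_t;

/* The owner is captured when the connection is created, e.g.
 * protocol/client would pass itself here. */
static void rpc_clnt_init(rpc_clnt_t *rpc, xlator_t *owner,
                          void (*notify)(xlator_t *, rpc_clnt_event_t))
{
    rpc->owner = owner;
    rpc->notify = notify;
}

/* Network events inherit THIS from the owning xlator, so CONNECT and
 * DISCONNECT handlers (and the call unwinding they trigger) run in the
 * right xlator context instead of whatever THIS happened to be. */
static void rpc_clnt_dispatch(rpc_clnt_t *rpc, rpc_clnt_event_t event)
{
    xlator_t *saved = THIS;
    THIS = rpc->owner;
    rpc->notify(rpc->owner, event);
    THIS = saved;
}

/* Example owner: a protocol/client-like xlator logging its events. */
static void client_notify(xlator_t *this, rpc_clnt_event_t event)
{
    printf("%s: %s (THIS=%s)\n", this->name,
           event == RPC_CLNT_CONNECT ? "CONNECT" : "DISCONNECT",
           THIS ? THIS->name : "(unset)");
}

int main(void)
{
    xlator_t client = { .name = "vol0-client-0" };
    rpc_clnt_t rpc;

    rpc_clnt_init(&rpc, &client, client_notify);
    rpc_clnt_dispatch(&rpc, RPC_CLNT_CONNECT);
    rpc_clnt_dispatch(&rpc, RPC_CLNT_DISCONNECT);
    return 0;
}

Saving and restoring the previous THIS around the dispatch keeps the sketch safe even if notifications arrive on a thread that already has an xlator context.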
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1235571
[Bug 1235571] snapd crashed due to stack overflow
https://bugzilla.redhat.com/show_bug.cgi?id=1235582
[Bug 1235582] snapd crashed due to stack overflow