[Bugs] [Bug 1253212] New: snapd crashed due to stack overflow
bugzilla at redhat.com
Thu Aug 13 08:57:00 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1253212
Bug ID: 1253212
Summary: snapd crashed due to stack overflow
Product: GlusterFS
Version: 3.7.0
Component: protocol
Severity: urgent
Assignee: kparthas at redhat.com
Reporter: kparthas at redhat.com
CC: bugs at gluster.org, gluster-bugs at redhat.com,
rgowdapp at redhat.com
Depends On: 1235571, 1235582
+++ This bug was initially created as a clone of Bug #1235582 +++
+++ This bug was initially created as a clone of Bug #1235571 +++
Description of problem:
Snapd crashed on one of the storage nodes.
Version-Release number of selected component (if applicable):
[root at darkknightrises ~]# rpm -qa | grep glusterfs
glusterfs-libs-3.7.1-5.el6rhs.x86_64
glusterfs-api-3.7.1-5.el6rhs.x86_64
samba-vfs-glusterfs-4.1.17-7.el6rhs.x86_64
glusterfs-3.7.1-5.el6rhs.x86_64
glusterfs-fuse-3.7.1-5.el6rhs.x86_64
glusterfs-cli-3.7.1-5.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-5.el6rhs.x86_64
glusterfs-debuginfo-3.7.1-5.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-5.el6rhs.x86_64
glusterfs-server-3.7.1-5.el6rhs.x86_64
How reproducible:
1/1
Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume
2. FUSE-mount the volume and start I/O from the mount point
3. Enable quota and set a usage limit
4. Enable USS
5. Set snap-activate-on-create to enable for snapshots
6. Enable auto-delete for snapshots
7. Set cluster.enable-shared-storage to enable
8. Schedule a snapshot every 5 mins
[root at darkknightrises ~]# snap_scheduler.py list
JOB_NAME SCHEDULE OPERATION VOLUME NAME
--------------------------------------------------------------------
job1 */5 * * * * Snapshot Create vol0
9. Edit the scheduled time to every two minutes
10. Restart the ntpd service on all the storage nodes
11. Run gluster v status vol0 and observe the crash
Actual results:
Snapd crashed
Expected results:
Snapd should not crash
Additional info:
[root at darkknightrises ~]# gluster v info vol0
Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 553670d5-8f05-4d41-878e-921d59f117ae
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: node1:/rhs/brick1/b01
Brick2: node2:/rhs/brick1/b02
Brick3: node3:/rhs/brick1/b03
Brick4: node4:/rhs/brick1/04
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
features.uss: enable
features.show-snapshot-directory: enable
features.barrier: enable
auto-delete: enable
snap-activate-on-create: enable
cluster.enable-shared-storage: enable
===============================================================
[root at darkknightrises /]# gluster v status vol0
Status of volume: vol0
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick node1:/rhs/brick1/b01 49152 0 Y 24731
Brick node2:/rhs/brick1/b02 49152 0 Y 16359
Brick node3:/rhs/brick1/b03 49152 0 Y 10149
Brick node4:/rhs/brick1/04 49152 0 Y 5512
Snapshot Daemon on localhost N/A N/A N 25133
NFS Server on localhost 2049 0 Y 25671
Self-heal Daemon on localhost N/A N/A Y 25681
Quota Daemon on localhost N/A N/A Y 26666
Snapshot Daemon on node3 49153 0 Y 10525
NFS Server on node3 2049 0 Y 10991
Self-heal Daemon on node3 N/A N/A Y 11001
Quota Daemon on node3 N/A N/A Y 11853
Snapshot Daemon on node4 49154 0 Y 5918
NFS Server on node4 2049 0 Y 6396
Self-heal Daemon on node4 N/A N/A Y 6406
Quota Daemon on node4 N/A N/A Y 7255
Snapshot Daemon on node2 49154 0 Y 16714
NFS Server on node2 2049 0 Y 17201
Self-heal Daemon on node2 N/A N/A Y 17211
Quota Daemon on node2 N/A N/A Y 18168
bt
========================================
#44214 0x00007f473c1f5def in client_dump_version_cbk (req=<value optimized
out>, iov=<value optimized out>, count=<value optimized out>,
myframe=0x7f4024701c24) at client-handshake.c:1623
1623 gf_msg (frame->this->name, GF_LOG_WARNING, ENOTCONN,
(gdb) up
#44215 0x00007f475360cb1e in saved_frames_unwind (saved_frames=0x7f45d80022d0)
at rpc-clnt.c:366
366 trav->rpcreq->cbkfn (trav->rpcreq, &iov, 1,
trav->frame);
(gdb) up
#44216 0x00007f475360cc0e in saved_frames_destroy (frames=0x7f45d80022d0) at
rpc-clnt.c:383
383 saved_frames_unwind (frames);
(gdb) down
#44215 0x00007f475360cb1e in saved_frames_unwind (saved_frames=0x7f45d80022d0)
at rpc-clnt.c:366
366 trav->rpcreq->cbkfn (trav->rpcreq, &iov, 1,
trav->frame);
(gdb) up
#44216 0x00007f475360cc0e in saved_frames_destroy (frames=0x7f45d80022d0) at
rpc-clnt.c:383
383 saved_frames_unwind (frames);
(gdb) up
#44217 0x00007f475360ccdb in rpc_clnt_connection_cleanup (conn=0x7f45d0e86f40)
at rpc-clnt.c:536
536 saved_frames_destroy (saved_frames);
(gdb) up
#44218 0x00007f475360d29f in rpc_clnt_notify (trans=<value optimized out>,
mydata=0x7f45d0e86f40, event=<value optimized out>, data=<value optimized out>)
at rpc-clnt.c:843
843 rpc_clnt_connection_cleanup (conn);
(gdb) up
#44219 0x00007f4753608928 in rpc_transport_notify (this=<value optimized out>,
event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:543
543 ret = this->notify (this, this->mydata, event, data);
(gdb) up
#44220 0x00007f4748370741 in socket_event_poll_err (fd=<value optimized out>,
idx=<value optimized out>, data=0x7f45d0ee01c0, poll_in=<value optimized out>,
poll_out=0, poll_err=0)
at socket.c:1205
1205 rpc_transport_notify (this, RPC_TRANSPORT_DISCONNECT, this);
(gdb) up
#44221 socket_event_handler (fd=<value optimized out>, idx=<value optimized
out>, data=0x7f45d0ee01c0, poll_in=<value optimized out>, poll_out=0,
poll_err=0) at socket
----------------------------------------
and some frames at the top of the stack:
#0  0x00007f47521ee06a in vfprintf () from /lib64/libc.so.6
#1 0x00007f47522a9b10 in __vsnprintf_chk () from /lib64/libc.so.6
#2 0x00007f47522a9a4a in __snprintf_chk () from /lib64/libc.so.6
#3 0x00007f4753856a5d in snprintf (buf=0x7f4024f1bb98 "") at
/usr/include/bits/stdio2.h:65
#4 gf_backtrace_append (buf=0x7f4024f1bb98 "") at common-utils.c:3547
#5 gf_backtrace_fillframes (buf=0x7f4024f1bb98 "") at common-utils.c:3593
#6 0x00007f4753856ac1 in gf_backtrace_save (buf=<value optimized out>) at
common-utils.c:3631
#7 0x00007f475383cfc0 in _gf_log_callingfn (domain=0x7f47538add11 "",
file=<value optimized out>, function=0x7f47538b3850 "gf_mem_set_acct_info",
line=62, level=GF_LOG_ERROR,
fmt=0x7f47538b3718 "Assertion failed: xl->mem_acct != NULL") at
logging.c:835
#8 0x00007f475386dd34 in gf_mem_set_acct_info (xl=0x7f4753ad4120,
alloc_ptr=0x7f401a76ffe8, size=437, type=39, typestr=0x7f47538b35cc
"gf_common_mt_asprintf") at mem-pool.c:62
#9 0x00007f475386ddf6 in __gf_malloc (size=437, type=39,
typestr=0x7f47538b35cc "gf_common_mt_asprintf") at mem-pool.c:146
#10 0x00007f475386df04 in gf_vasprintf (string_ptr=0x7f401a770228,
format=0x7f47538adc22 "[%s] %s [%s:%d:%s] %s %d-%s: ", arg=<value optimized
out>) at mem-pool.c:220
#11 0x00007f475386dfc8 in gf_asprintf (string_ptr=<value optimized out>,
format=<value optimized out>) at mem-pool.c:238
#12 0x00007f475383d1fe in _gf_log_callingfn (domain=0x7f47538add11 "",
file=<value optimized out>, function=0x7f47538b3850 "gf_mem_set_acct_info",
line=62, level=GF_LOG_ERROR,
fmt=0x7f47538b3718 "Assertion failed: xl->mem_acct != NULL") at
logging.c:869
#13 0x00007f475386dd34 in gf_mem_set_acct_info (xl=0x7f4753ad4120,
alloc_ptr=0x7f401a770488, size=437, type=39, typestr=0x7f47538b35cc
"gf_common_mt_asprintf") at mem-pool.c:62
#14 0x00007f475386ddf6 in __gf_malloc (size=437, type=39,
typestr=0x7f47538b35cc "gf_common_mt_asprintf") at mem-pool.c:146
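Frames #7 through #14 above repeat the same pair of calls: gf_mem_set_acct_info hits the failed assertion on xl->mem_acct and logs it, and the logging path allocates a message buffer, which runs the same assertion and logs again, until the stack is exhausted. Below is a minimal, self-contained sketch of that mutual recursion; the names and structures are simplified stand-ins, not the actual GlusterFS code.

/* Illustrative sketch of the recursion in frames #7-#14.
 * Simplified stand-in names; not the actual GlusterFS functions. */
#include <stdio.h>
#include <stdlib.h>

struct xlator {
    void *mem_acct;            /* NULL in the crashing process */
};

static char *log_msg(struct xlator *xl, const char *msg);

/* Accounting allocator: complains when xl->mem_acct is missing. */
static void *acct_malloc(struct xlator *xl, size_t size)
{
    if (xl->mem_acct == NULL)
        log_msg(xl, "Assertion failed: xl->mem_acct != NULL");
    return malloc(size);
}

/* Logger: formats the message into a freshly allocated buffer ... */
static char *log_msg(struct xlator *xl, const char *msg)
{
    /* ... using the same accounting allocator, which asserts and logs
     * again -> unbounded recursion -> stack overflow. */
    char *buf = acct_malloc(xl, 256);
    if (buf)
        snprintf(buf, 256, "[ERROR] %s", msg);
    return buf;
}

int main(void)
{
    struct xlator xl = { .mem_acct = NULL };
    acct_malloc(&xl, 64);      /* never returns: the recursion overflows the stack */
    return 0;
}

The review posted in the comment below (http://review.gluster.org/11399) breaks exactly this cycle by checking xl->mem_acct only when memory accounting is enabled.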
--- Additional comment from Anand Avati on 2015-06-25 05:02:53 EDT ---
REVIEW: http://review.gluster.org/11399 (core: check for xl->mem_acct only if
memory accounting is enabled) posted (#2) for review on master by Krishnan
Parthasarathi (kparthas at redhat.com)
--- Additional comment from krishnan parthasarathi on 2015-06-26 05:17:20 EDT
---
RCA
----
The stack overflow was seen when older snapshots were being deleted while new
ones were being created concurrently. In the setup detailed above, the snapshot
scheduler creates snapshots periodically and auto-delete of snapshots is
enabled. When the number of snapshots of the volume in the system exceeds the
configured soft-limit, snapshots are (auto-)deleted. The crash happened when a
scheduled snapshot-create coincided with the auto-delete triggered
snapshot-delete operation.
Implementation detail
----------------------
The snapshot daemon uses the gfapi interface to serve user-serviceable
snapshots. gfapi creates a new glfs object for every snapshot (volume) it
services. This object is 'linked' with a global xlator object until the glfs
object is fully initialized (i.e., until the set-volume operation is complete).
The global xlator object's ctx (glusterfs_ctx_t) is modified in a thread-unsafe
manner and could refer to a destroyed ctx (one that belonged to the glfs
representing a deleted snapshot).
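For illustration, a minimal sketch of the lifetime race described above. The types and names used here (glfs_t, xlator_t, glusterfs_ctx_t as plain structs, init_snapshot_glfs, delete_snapshot_glfs) are simplified stand-ins, not the real gfapi or libglusterfs structures: one thread re-points the shared global xlator at a new glfs' ctx while another thread tears down a glfs whose ctx the global xlator may still reference.

/* Illustrative sketch only: simplified stand-ins for the glfs/xlator/ctx
 * relationship; not the actual gfapi implementation. */
#include <pthread.h>
#include <stdlib.h>

typedef struct { int id; } glusterfs_ctx_t;

typedef struct {
    glusterfs_ctx_t *ctx;      /* shared, updated without synchronisation */
} xlator_t;

typedef struct {
    glusterfs_ctx_t *ctx;      /* per-snapshot context */
} glfs_t;

static xlator_t global_xlator; /* one per snapd process */

/* New snapshot being initialised: its ctx is 'linked' into the global
 * xlator until the set-volume operation completes. */
static void *init_snapshot_glfs(void *arg)
{
    glfs_t *fs = arg;
    global_xlator.ctx = fs->ctx;   /* unsynchronised write to shared state */
    /* ... handshake RPCs may run with THIS pointing at global_xlator ... */
    return NULL;
}

/* Old snapshot being deleted: its ctx is destroyed, yet global_xlator.ctx
 * may still point at it, i.e. at freed memory. */
static void *delete_snapshot_glfs(void *arg)
{
    glfs_t *fs = arg;
    free(fs->ctx);
    fs->ctx = NULL;
    return NULL;
}

int main(void)
{
    glfs_t old_fs = { .ctx = calloc(1, sizeof(glusterfs_ctx_t)) };
    glfs_t new_fs = { .ctx = calloc(1, sizeof(glusterfs_ctx_t)) };
    global_xlator.ctx = old_fs.ctx;    /* global currently refers to the old snapshot */

    pthread_t create_thr, delete_thr;
    pthread_create(&create_thr, NULL, init_snapshot_glfs, &new_fs);
    pthread_create(&delete_thr, NULL, delete_snapshot_glfs, &old_fs);
    pthread_join(create_thr, NULL);
    pthread_join(delete_thr, NULL);

    /* Depending on the interleaving, global_xlator.ctx may now be a
     * dangling pointer into the deleted snapshot's freed ctx. */
    return 0;
}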
Fix outline
------------
All initialisation management operations (e.g., RPCs like DUMP_VERSION,
SET_VOLUME, etc.) must refer to the corresponding translator objects in the
glfs' graph.
--- Additional comment from Anand Avati on 2015-06-27 01:41:43 EDT ---
REVIEW: http://review.gluster.org/11436 (client: set THIS to client's xl in
non-FOP RPCs) posted (#1) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)
--- Additional comment from Anand Avati on 2015-07-17 04:59:53 EDT ---
REVIEW: http://review.gluster.org/11436 (rpc: add owner xlator argument to
rpc_clnt_new) posted (#2) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)
--- Additional comment from Anand Avati on 2015-08-05 13:40:52 EDT ---
REVIEW: http://review.gluster.org/11436 (rpc: add owner xlator argument to
rpc_clnt_new) posted (#3) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)
--- Additional comment from Anand Avati on 2015-08-13 02:12:59 EDT ---
COMMIT: http://review.gluster.org/11436 committed in master by Raghavendra G
(rgowdapp at redhat.com)
------
commit f7668938cd7745d024f3d2884e04cd744d0a69ab
Author: Krishnan Parthasarathi <kparthas at redhat.com>
Date: Sat Jun 27 11:04:25 2015 +0530
rpc: add owner xlator argument to rpc_clnt_new
The @owner argument tells the RPC layer which xlator owns
the connection and to which xlator THIS needs to be set during
network notifications like CONNECT and DISCONNECT.
Code paths that originate from the head of a (volume) graph and use
STACK_WIND ensure that the RPC local endpoint has the right xlator saved
in the frame of the call (callback pair). This guarantees that the
callback is executed in the right xlator context.
The client handshake process, which includes fetching brick ports from
glusterd and setting the lk-version on the brick for the session, doesn't
have the correct xlator set in its frames. The problem lies with RPC
notifications: they don't have the provision to set THIS to the xlator
that is registered with the corresponding RPC programs. e.g., the
RPC_CLNT_CONNECT event received by protocol/client doesn't have THIS set
to its xlator. This implies that calls (and callbacks) originating from
this thread don't have the right xlator set either.
The fix is to save the xlator registered with the RPC connection
during rpc_clnt_new. e.g., protocol/client's xlator would be saved with
the RPC connection that it 'owns'. RPC notifications such as CONNECT,
DISCONNECT, etc. inherit THIS from the RPC connection's xlator.
Change-Id: I9dea2c35378c511d800ef58f7fa2ea5552f2c409
BUG: 1235582
Signed-off-by: Krishnan Parthasarathi <kparthas at redhat.com>
Reviewed-on: http://review.gluster.org/11436
Tested-by: Gluster Build System <jenkins at build.gluster.com>
Tested-by: NetBSD Build System <jenkins at build.gluster.org>
Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
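To make the ownership pattern in the commit message concrete, here is a condensed sketch. The types and functions below (rpc_clnt_t, rpc_clnt_init, rpc_clnt_dispatch, client_notify) are invented for illustration and are not the real rpc-clnt interface; the point is only that the connection remembers the xlator that owns it, and notifications set THIS from that owner before dispatching.

/* Sketch of the "owner xlator" pattern described in the commit message.
 * Invented, simplified types; not the real rpc-clnt interface. */
#include <stdio.h>

typedef struct xlator {
    const char *name;
} xlator_t;

/* Stand-in for the per-thread THIS used throughout the codebase. */
static __thread xlator_t *THIS;

typedef enum { RPC_CLNT_CONNECT, RPC_CLNT_DISCONNECT } rpc_clnt_event_t;

typedef struct {
    xlator_t *owner;   /* xlator that owns this connection */
    void (*notify)(xlator_t *this, rpc_clnt_event_t event);
} rpc_clnt_t;

/* The owner is captured when the connection is created, e.g.
 * protocol/client would pass itself here. */
static void rpc_clnt_init(rpc_clnt_t *rpc, xlator_t *owner,
                          void (*notify)(xlator_t *, rpc_clnt_event_t))
{
    rpc->owner = owner;
    rpc->notify = notify;
}

/* Network events inherit THIS from the owning xlator, so CONNECT and
 * DISCONNECT handlers (and the call unwinding they trigger) run in the
 * right xlator context instead of whatever THIS happened to be. */
static void rpc_clnt_dispatch(rpc_clnt_t *rpc, rpc_clnt_event_t event)
{
    xlator_t *saved = THIS;
    THIS = rpc->owner;
    rpc->notify(rpc->owner, event);
    THIS = saved;
}

/* Example owner: a protocol/client-like xlator logging its events. */
static void client_notify(xlator_t *this, rpc_clnt_event_t event)
{
    printf("%s: %s (THIS=%s)\n", this->name,
           event == RPC_CLNT_CONNECT ? "CONNECT" : "DISCONNECT",
           THIS ? THIS->name : "(unset)");
}

int main(void)
{
    xlator_t client = { .name = "vol0-client-0" };
    rpc_clnt_t rpc;

    rpc_clnt_init(&rpc, &client, client_notify);
    rpc_clnt_dispatch(&rpc, RPC_CLNT_CONNECT);
    rpc_clnt_dispatch(&rpc, RPC_CLNT_DISCONNECT);
    return 0;
}

Saving and restoring the previous THIS around the dispatch keeps the sketch safe even if notifications arrive on a thread that already has an xlator context.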
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1235571
[Bug 1235571] snapd crashed due to stack overflow
https://bugzilla.redhat.com/show_bug.cgi?id=1235582
[Bug 1235582] snapd crashed due to stack overflow