[Bugs] [Bug 1423065] New: Deadlock in gf_timer calls and a possible core ( illegal memory access) in AFR on dropping FUSE cache
bugzilla at redhat.com
bugzilla at redhat.com
Fri Feb 17 00:56:21 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1423065
Bug ID: 1423065
Summary: Deadlock in gf_timer calls and a possible core
(illegal memory access) in AFR on dropping FUSE cache
Product: GlusterFS
Version: 3.10
Component: replicate
Assignee: bugs at gluster.org
Reporter: srangana at redhat.com
CC: bugs at gluster.org
Description of problem:
- FUSE mount point became non-responsive post a "echo 3 >
/proc/sys/vm/drop_caches" based drop of cache on the client
- On investigation, the glusterfsd process was deadlocked in the timer code
- Further this deadlock occurred due to a SEGFAULT in possibly AFR routines
Version-Release number of selected component (if applicable):
3.10
How reproducible:
Happened once, retesting as I file the bug to see if it happens again.
Steps to Reproduce:
It happened when I ran the gluster bench script from here [1]
But, on further debugging, the real cause is the dropping of cache, so I would
state, the steps to reproduce it so run a few IOZone jobs and then some
smallfile jobs, and keep dropping cache in between each test like in [1] for
this to happen.
[1]
https://github.com/gluster/gbench/blob/master/bench-tests/bt-0000-0001/GlusterBench.py
Additional info:
Providing core file details and hang details here,
NOTE Thread 10: Attempting reg->lock in timer.c and waiting there
Thread 10 (Thread 0x7f376c728700 (LWP 65253)):
#0 0x00007f3774132f4d in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00007f377412ed02 in _L_lock_791 () from /lib64/libpthread.so.0
No symbol table info available.
#2 0x00007f377412ec08 in pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f37752d8475 in gf_timer_proc (data=0x7f37771a70f0) at timer.c:135
at = <optimized out>
need_cbk = 0 '\000'
now = 277103100605893
now_ts = {tv_sec = 277103, tv_nsec = 100605893}
reg = 0x7f37771a70f0
sleepts = {tv_sec = 1, tv_nsec = 0}
event = <optimized out>
tmp = <optimized out>
old_THIS = <optimized out>
#4 0x00007f377412cdc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5 0x00007f3773a7321d in clone () from /lib64/libc.so.6
No symbol table info available.
NOTE: Thread 3 SEG faulted in frame #7, but after taking the same reg->lock,
thus invoking the cleanup routine, that again tries to acquire the same lock in
frame 3
[Switching to thread 3 (Thread 0x7f375ffff700 (LWP 65262))]
#0 0x00007f3774132f4d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007f3774132f4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f377412ed02 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007f377412ec08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f37752d8a45 in gf_timer_call_cancel (ctx=ctx at entry=0x7f3777153010,
event=0x7f374c0009b0) at timer.c:98
#4 0x00007f37752ce7a3 in gf_log_disable_suppression_before_exit
(ctx=0x7f3777153010) at logging.c:428
#5 0x00007f37752d4a55 in gf_print_trace (signum=<optimized out>,
ctx=<optimized out>) at common-utils.c:714
#6 <signal handler called>
#7 gf_timer_call_cancel (ctx=0x7f3777153010, event=0x2b1a134558a5f533) at
timer.c:100
#8 0x00007f37675c5263 in afr_spb_choice_timeout_cancel
(this=this at entry=0x7f3760021f20, inode=inode at entry=0x7f3758e193f0) at
afr-common.c:713
#9 0x00007f37675c5369 in afr_forget (this=0x7f3760021f20,
inode=0x7f3758e193f0) at afr-common.c:4478
#10 0x00007f37752d8d12 in __inode_ctx_free (inode=inode at entry=0x7f3758e193f0)
at inode.c:332
#11 0x00007f37752d9ed2 in __inode_destroy (inode=0x7f3758e193f0) at inode.c:353
#12 inode_table_prune (table=table at entry=0x7f37600be5d0) at inode.c:1585
#13 0x00007f37752da1b4 in inode_unref (inode=0x7f3758e193f0) at inode.c:569
#14 0x00007f376c730731 in do_forget (this=this at entry=0x7f3777191ee0,
unique=<optimized out>, nodeid=<optimized out>, nlookup=<optimized out>) at
fuse-bridge.c:623
#15 0x00007f376c7307ca in fuse_batch_forget (this=0x7f3777191ee0,
finh=0x7f3741667960, msg=0x7f3741667988) at fuse-bridge.c:661
#16 0x00007f376c74652e in fuse_thread_proc (data=0x7f3777191ee0) at
fuse-bridge.c:5068
#17 0x00007f377412cdc5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f3773a7321d in clone () from /lib64/libc.so.6
NOTE: The seg fault in frame #7 seems to be due to bad inode context pointer
from AFR, when forgetting the inodes,
#6 <signal handler called>
No symbol table info available.
#7 gf_timer_call_cancel (ctx=0x7f3777153010, event=0x2b1a134558a5f533) at
timer.c:100
reg = 0x7f37771a70f0
fired = _gf_false
__FUNCTION__ = "gf_timer_call_cancel"
#8 0x00007f37675c5263 in afr_spb_choice_timeout_cancel
(this=this at entry=0x7f3760021f20, inode=inode at entry=0x7f3758e193f0) at
afr-common.c:713
ctx = 0x7f374642d980
ret = <optimized out>
__FUNCTION__ = "afr_spb_choice_timeout_cancel"
#9 0x00007f37675c5369 in afr_forget (this=0x7f3760021f20,
inode=0x7f3758e193f0) at afr-common.c:4478
ctx_int = 0
ctx = 0x0
#10 0x00007f37752d8d12 in __inode_ctx_free (inode=inode at entry=0x7f3758e193f0)
at inode.c:332
index = 2
xl = 0x7f3760021f20
old_THIS = 0x7f3777191ee0
__FUNCTION__ = "__inode_ctx_free"
#11 0x00007f37752d9ed2 in __inode_destroy (inode=0x7f3758e193f0) at inode.c:353
No locals.
#12 inode_table_prune (table=table at entry=0x7f37600be5d0) at inode.c:1585
Some further context:
(gdb) f 7
#7 gf_timer_call_cancel (ctx=0x7f3777153010, event=0x2b1a134558a5f533) at
timer.c:100
100 fired = event->fired;
(gdb) p event
$1 = (gf_timer_t *) 0x2b1a134558a5f533
(gdb) p *event
Cannot access memory at address 0x2b1a134558a5f533
(gdb) f 8
#8 0x00007f37675c5263 in afr_spb_choice_timeout_cancel
(this=this at entry=0x7f3760021f20, inode=inode at entry=0x7f3758e193f0) at
afr-common.c:713
713 gf_timer_call_cancel (this->ctx, ctx->timer);
(gdb) p ctx
$2 = (afr_inode_ctx_t *) 0x7f374642d980
(gdb) p *ctx
$3 = {read_subvol = 4379182848, spb_choice = -1, timer = 0x2b1a134558a5f533,
need_refresh = (_gf_true | unknown: 1487271126)}
(gdb) f 7
#7 gf_timer_call_cancel (ctx=0x7f3777153010, event=0x2b1a134558a5f533) at
timer.c:100
100 fired = event->fired;
(gdb) info registers
rax 0x0 0
rbx 0x2b1a134558a5f533 3105816081586320691
rcx 0x0 0
<edited>
(gdb) disassemble
Dump of assembler code for function gf_timer_call_cancel:
<edited>
0x00007f37752d89b3 <+115>: je 0x7f37752d8a40
<gf_timer_call_cancel+256>
0x00007f37752d89b9 <+121>: callq 0x7f37752c1a80 <pthread_spin_lock at plt>
=> 0x00007f37752d89be <+126>: mov 0x38(%rbx),%r13d
0x00007f37752d89c2 <+130>: test %r13d,%r13d
0x00007f37752d89c5 <+133>: jne 0x7f37752d89e9
<gf_timer_call_cancel+169>
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
More information about the Bugs
mailing list