[Bugs] [Bug 1423065] New: Deadlock in gf_timer calls and a possible core ( illegal memory access) in AFR on dropping FUSE cache

bugzilla at redhat.com bugzilla at redhat.com
Fri Feb 17 00:56:21 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1423065

            Bug ID: 1423065
           Summary: Deadlock in gf_timer calls and a possible core
                    (illegal memory access) in AFR on dropping FUSE cache
           Product: GlusterFS
           Version: 3.10
         Component: replicate
          Assignee: bugs at gluster.org
          Reporter: srangana at redhat.com
                CC: bugs at gluster.org



Description of problem:
- FUSE mount point became non-responsive post a "echo 3 >
/proc/sys/vm/drop_caches" based drop of cache on the client

- On investigation, the glusterfsd process was deadlocked in the timer code

- Further this deadlock occurred due to a SEGFAULT in possibly AFR routines

Version-Release number of selected component (if applicable):
3.10

How reproducible:
Happened once, retesting as I file the bug to see if it happens again.

Steps to Reproduce:
It happened when I ran the gluster bench script from here [1]

But, on further debugging, the real cause is the dropping of cache, so I would
state, the steps to reproduce it so run a few IOZone jobs and then some
smallfile jobs, and keep dropping cache in between each test like in [1] for
this to happen.

[1]
https://github.com/gluster/gbench/blob/master/bench-tests/bt-0000-0001/GlusterBench.py

Additional info:

Providing core file details and hang details here,

NOTE Thread 10: Attempting reg->lock in timer.c and waiting there
Thread 10 (Thread 0x7f376c728700 (LWP 65253)):
#0  0x00007f3774132f4d in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f377412ed02 in _L_lock_791 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00007f377412ec08 in pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f37752d8475 in gf_timer_proc (data=0x7f37771a70f0) at timer.c:135
        at = <optimized out>
        need_cbk = 0 '\000'
        now = 277103100605893
        now_ts = {tv_sec = 277103, tv_nsec = 100605893}
        reg = 0x7f37771a70f0
        sleepts = {tv_sec = 1, tv_nsec = 0}
        event = <optimized out>
        tmp = <optimized out>
        old_THIS = <optimized out>
#4  0x00007f377412cdc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x00007f3773a7321d in clone () from /lib64/libc.so.6
No symbol table info available.

NOTE: Thread 3 SEG faulted in frame #7, but after taking the same reg->lock,
thus invoking the cleanup routine, that again tries to acquire the same lock in
frame 3

[Switching to thread 3 (Thread 0x7f375ffff700 (LWP 65262))]
#0  0x00007f3774132f4d in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f3774132f4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f377412ed02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f377412ec08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f37752d8a45 in gf_timer_call_cancel (ctx=ctx at entry=0x7f3777153010,
event=0x7f374c0009b0) at timer.c:98
#4  0x00007f37752ce7a3 in gf_log_disable_suppression_before_exit
(ctx=0x7f3777153010) at logging.c:428
#5  0x00007f37752d4a55 in gf_print_trace (signum=<optimized out>,
ctx=<optimized out>) at common-utils.c:714
#6  <signal handler called>
#7  gf_timer_call_cancel (ctx=0x7f3777153010, event=0x2b1a134558a5f533) at
timer.c:100
#8  0x00007f37675c5263 in afr_spb_choice_timeout_cancel
(this=this at entry=0x7f3760021f20, inode=inode at entry=0x7f3758e193f0) at
afr-common.c:713
#9  0x00007f37675c5369 in afr_forget (this=0x7f3760021f20,
inode=0x7f3758e193f0) at afr-common.c:4478
#10 0x00007f37752d8d12 in __inode_ctx_free (inode=inode at entry=0x7f3758e193f0)
at inode.c:332
#11 0x00007f37752d9ed2 in __inode_destroy (inode=0x7f3758e193f0) at inode.c:353
#12 inode_table_prune (table=table at entry=0x7f37600be5d0) at inode.c:1585
#13 0x00007f37752da1b4 in inode_unref (inode=0x7f3758e193f0) at inode.c:569
#14 0x00007f376c730731 in do_forget (this=this at entry=0x7f3777191ee0,
unique=<optimized out>, nodeid=<optimized out>, nlookup=<optimized out>) at
fuse-bridge.c:623
#15 0x00007f376c7307ca in fuse_batch_forget (this=0x7f3777191ee0,
finh=0x7f3741667960, msg=0x7f3741667988) at fuse-bridge.c:661
#16 0x00007f376c74652e in fuse_thread_proc (data=0x7f3777191ee0) at
fuse-bridge.c:5068
#17 0x00007f377412cdc5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f3773a7321d in clone () from /lib64/libc.so.6

NOTE: The seg fault in frame #7 seems to be due to bad inode context pointer
from AFR, when forgetting the inodes,

#6  <signal handler called>
No symbol table info available.
#7  gf_timer_call_cancel (ctx=0x7f3777153010, event=0x2b1a134558a5f533) at
timer.c:100
        reg = 0x7f37771a70f0
        fired = _gf_false
        __FUNCTION__ = "gf_timer_call_cancel"
#8  0x00007f37675c5263 in afr_spb_choice_timeout_cancel
(this=this at entry=0x7f3760021f20, inode=inode at entry=0x7f3758e193f0) at
afr-common.c:713
        ctx = 0x7f374642d980
        ret = <optimized out>
        __FUNCTION__ = "afr_spb_choice_timeout_cancel"
#9  0x00007f37675c5369 in afr_forget (this=0x7f3760021f20,
inode=0x7f3758e193f0) at afr-common.c:4478
        ctx_int = 0
        ctx = 0x0
#10 0x00007f37752d8d12 in __inode_ctx_free (inode=inode at entry=0x7f3758e193f0)
at inode.c:332
        index = 2
        xl = 0x7f3760021f20
        old_THIS = 0x7f3777191ee0
        __FUNCTION__ = "__inode_ctx_free"
#11 0x00007f37752d9ed2 in __inode_destroy (inode=0x7f3758e193f0) at inode.c:353
No locals.
#12 inode_table_prune (table=table at entry=0x7f37600be5d0) at inode.c:1585

Some further context:
(gdb) f 7
#7  gf_timer_call_cancel (ctx=0x7f3777153010, event=0x2b1a134558a5f533) at
timer.c:100
100                     fired = event->fired;
(gdb) p event
$1 = (gf_timer_t *) 0x2b1a134558a5f533
(gdb) p *event
Cannot access memory at address 0x2b1a134558a5f533
(gdb) f 8
#8  0x00007f37675c5263 in afr_spb_choice_timeout_cancel
(this=this at entry=0x7f3760021f20, inode=inode at entry=0x7f3758e193f0) at
afr-common.c:713
713                             gf_timer_call_cancel (this->ctx, ctx->timer);
(gdb) p ctx
$2 = (afr_inode_ctx_t *) 0x7f374642d980
(gdb) p *ctx
$3 = {read_subvol = 4379182848, spb_choice = -1, timer = 0x2b1a134558a5f533,
need_refresh = (_gf_true | unknown: 1487271126)}
(gdb) f 7
#7  gf_timer_call_cancel (ctx=0x7f3777153010, event=0x2b1a134558a5f533) at
timer.c:100
100                     fired = event->fired;
(gdb) info registers
rax            0x0      0
rbx            0x2b1a134558a5f533       3105816081586320691
rcx            0x0      0
<edited>

(gdb) disassemble 
Dump of assembler code for function gf_timer_call_cancel:
<edited>
   0x00007f37752d89b3 <+115>:   je     0x7f37752d8a40
<gf_timer_call_cancel+256>
   0x00007f37752d89b9 <+121>:   callq  0x7f37752c1a80 <pthread_spin_lock at plt>
=> 0x00007f37752d89be <+126>:   mov    0x38(%rbx),%r13d
   0x00007f37752d89c2 <+130>:   test   %r13d,%r13d
   0x00007f37752d89c5 <+133>:   jne    0x7f37752d89e9
<gf_timer_call_cancel+169>

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.


More information about the Bugs mailing list