[Bugs] [Bug 1445408] New: gluster volume stop hangs

bugzilla at redhat.com bugzilla at redhat.com
Tue Apr 25 15:35:20 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1445408

            Bug ID: 1445408
           Summary: gluster volume stop hangs
           Product: GlusterFS
           Version: 3.10
         Component: glusterd
          Keywords: Triaged
          Assignee: bugs at gluster.org
          Reporter: amukherj at redhat.com
                CC: bugs at gluster.org
        Depends On: 1441910
            Blocks: 1441932



+++ This bug was initially created as a clone of Bug #1441910 +++

Description of problem:

While testing some basic glusterd commands, I ended up in a situation
where 'gluster volume stop' hung. Backtraces of all glusterd threads
follow ("t a a bt" is gdb shorthand for "thread apply all bt"):

(gdb) t a a bt

Thread 8 (Thread 0x7f3f213d5700 (LWP 31710)):
#0  0x00007f3f28a56dd3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f3f2a3748ef in event_dispatch_epoll_worker (data=0x250df70) at event-epoll.c:665
#2  0x00007f3f2917d5ba in start_thread () from /lib64/libpthread.so.0
#3  0x00007f3f28a567cd in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f3f21bd6700 (LWP 31709)):
#0  0x00007f3f29182bc0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3f2539349b in hooks_worker (args=<optimized out>) at glusterd-hooks.c:529
#2  0x00007f3f2917d5ba in start_thread () from /lib64/libpthread.so.0
#3  0x00007f3f28a567cd in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f3f25e4e700 (LWP 31581)):
#0  0x00007f3f28a1c14d in nanosleep () from /lib64/libc.so.6
#1  0x00007f3f28a1c09a in sleep () from /lib64/libc.so.6
#2  0x00007f3f252f92e2 in glusterd_wait_for_blockers (priv=0x7f3f2a653050) at glusterd-op-sm.c:6052
#3  0x00007f3f253014ec in glusterd_op_commit_perform (op=op@entry=GD_OP_STOP_VOLUME,
    dict=dict@entry=0x7f3f100790b0, op_errstr=op_errstr@entry=0x7f3f1840c040,
    rsp_dict=rsp_dict@entry=0x7f3f10004f30) at glusterd-op-sm.c:6075
#4  0x00007f3f25390c6d in gd_commit_op_phase (op=GD_OP_STOP_VOLUME, op_ctx=op_ctx@entry=0x7f3f1c01c860,
    req_dict=0x7f3f100790b0, op_errstr=op_errstr@entry=0x7f3f1840c040, txn_opinfo=txn_opinfo@entry=0x7f3f1840c060)
    at glusterd-syncop.c:1413
#5  0x00007f3f253920ed in gd_sync_task_begin (op_ctx=op_ctx@entry=0x7f3f1c01c860, req=req@entry=0x7f3f18006ee0)
    at glusterd-syncop.c:1942
#6  0x00007f3f253923bc in glusterd_op_begin_synctask (req=req@entry=0x7f3f18006ee0, op=op@entry=GD_OP_STOP_VOLUME,
    dict=0x7f3f1c01c860) at glusterd-syncop.c:2007
#7  0x00007f3f2537bdac in __glusterd_handle_cli_stop_volume (req=req@entry=0x7f3f18006ee0)
    at glusterd-volume-ops.c:628
#8  0x00007f3f252ec7bd in glusterd_big_locked_handler (req=0x7f3f18006ee0,
    actor_fn=0x7f3f2537bbd0 <__glusterd_handle_cli_stop_volume>) at glusterd-handler.c:81
#9  0x00007f3f2a355f20 in synctask_wrap () at syncop.c:375
#10 0x00007f3f2899c2c0 in ?? () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f3f2664f700 (LWP 31580)):
#0  0x00007f3f29182f69 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3f2a357d49 in syncenv_task (proc=proc@entry=0x24f5c10) at syncop.c:603
#2  0x00007f3f2a358920 in syncenv_processor (thdata=0x24f5c10) at syncop.c:695
#3  0x00007f3f2917d5ba in start_thread () from /lib64/libpthread.so.0
#4  0x00007f3f28a567cd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f3f26e50700 (LWP 31579)):
#0  0x00007f3f28a1c14d in nanosleep () from /lib64/libc.so.6
#1  0x00007f3f28a1c09a in sleep () from /lib64/libc.so.6
#2  0x00007f3f2a34750a in pool_sweeper (arg=<optimized out>) at mem-pool.c:465
#3  0x00007f3f2917d5ba in start_thread () from /lib64/libpthread.so.0
#4  0x00007f3f28a567cd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f3f27651700 (LWP 31578)):
#0  0x00007f3f291869d6 in sigwait () from /lib64/libpthread.so.0
#1  0x00000000004085c7 in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2095
#2  0x00007f3f2917d5ba in start_thread () from /lib64/libpthread.so.0
#3  0x00007f3f28a567cd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f3f27e52700 (LWP 31577)):
#0  0x00007f3f2918648d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f3f2a32f2e6 in gf_timer_proc (data=0x24f37d0) at timer.c:164
#2  0x00007f3f2917d5ba in start_thread () from /lib64/libpthread.so.0
#3  0x00007f3f28a567cd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f3f2a7fe780 (LWP 31576)):
#0  0x00007f3f2917e6ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f3f2a374e48 in event_dispatch_epoll (event_pool=0x24ecf30) at event-epoll.c:759
#2  0x00000000004059b0 in main (argc=<optimized out>, argv=<optimized out>) at glusterfsd.c:2505
(gdb) t 6
[Switching to thread 6 (Thread 0x7f3f25e4e700 (LWP 31581))]
#0  0x00007f3f28a1c14d in nanosleep () from /lib64/libc.so.6
(gdb) f 4
#4  0x00007f3f25390c6d in gd_commit_op_phase (op=GD_OP_STOP_VOLUME, op_ctx=op_ctx@entry=0x7f3f1c01c860,
    req_dict=0x7f3f100790b0, op_errstr=op_errstr@entry=0x7f3f1840c040, txn_opinfo=txn_opinfo@entry=0x7f3f1840c060)
    at glusterd-syncop.c:1413
1413            ret = glusterd_op_commit_perform (op, req_dict, op_errstr, rsp_dict);
(gdb) f 3
#3  0x00007f3f253014ec in glusterd_op_commit_perform (op=op@entry=GD_OP_STOP_VOLUME,
    dict=dict@entry=0x7f3f100790b0, op_errstr=op_errstr@entry=0x7f3f1840c040,
    rsp_dict=rsp_dict@entry=0x7f3f10004f30) at glusterd-op-sm.c:6075
6075                            glusterd_wait_for_blockers (this->private);
(gdb) f 2
#2  0x00007f3f252f92e2 in glusterd_wait_for_blockers (priv=0x7f3f2a653050) at glusterd-op-sm.c:6052
6052                    sleep (1);
(gdb) p priv.blockers
$1 = 4294967294

priv.blockers holds 4294967294, i.e. the unsigned representation of -2, so
the counter has underflowed below zero rather than drained to it. This
counter was introduced in https://review.gluster.org/#/c/16927 . Debugging
continues.
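
For context, here is a minimal sketch of the wait loop seen in thread 6
(the stub struct and the field's type are assumptions inferred from the
backtrace and the gdb output above, not the verbatim GlusterFS source);
it shows why an underflowed counter hangs volume stop:

    /* Sketch of the waiting pattern at glusterd-op-sm.c:6052; the stub
     * struct stands in for glusterd_conf_t. */
    #include <stdint.h>
    #include <unistd.h>

    typedef struct {
        uint32_t blockers;   /* outstanding brick-attach requests */
    } glusterd_conf_stub_t;

    static void
    glusterd_wait_for_blockers (glusterd_conf_stub_t *priv)
    {
        /* Returns only when blockers reaches 0. Once the counter has
         * underflowed to 4294967294 ((uint32_t)-2), it will for all
         * practical purposes never drain, so the commit phase of
         * volume stop spins here forever. */
        while (priv->blockers)
            sleep (1);
    }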


--- Additional comment from Worker Ant on 2017-04-13 03:59:47 EDT ---

REVIEW: https://review.gluster.org/17055 (glusterd: fix
glusterd_wait_for_blockers to go in infinite loop) posted (#1) for review on
master by Atin Mukherjee (amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-04-13 08:11:11 EDT ---

REVIEW: https://review.gluster.org/17055 (glusterd: fix
glusterd_wait_for_blockers to go in infinite loop) posted (#2) for review on
master by Atin Mukherjee (amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-04-13 08:11:18 EDT ---

REVIEW: https://review.gluster.org/17055 (glusterd: fix
glusterd_wait_for_blockers to go in infinite loop) posted (#3) for review on
master by Atin Mukherjee (amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-04-13 14:15:17 EDT ---

COMMIT: https://review.gluster.org/17055 committed in master by Jeff Darcy
(jeff at pl.atyp.us) 
------
commit 090c8866eb3ae174be50dec8d9d5ecf978d18a45
Author: Atin Mukherjee <amukherj at redhat.com>
Date:   Thu Apr 13 13:20:18 2017 +0530

    glusterd: fix glusterd_wait_for_blockers to go in infinite loop

    In send_attach_req () conf->blockers is bumped up before
    rpc_clnt_submit, but it is bumped down twice: once from the callback
    and once from the negative-ret handling, which can happen whenever
    the rpc submit fails.

    Change-Id: Icb820694034cbfcb3d427911e192ac4a0f4540f6
    BUG: 1441910
    Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-on: https://review.gluster.org/17055
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Jeff Darcy <jeff at pl.atyp.us>
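
To make the accounting error concrete, here is a self-contained simulation
of the pattern the commit message describes (the function names below are
hypothetical stand-ins, not the actual send_attach_req()/rpc_clnt_submit()
code):

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t blockers;          /* stands in for conf->blockers */

    static void attach_cbk (void) { blockers--; }  /* callback path */

    /* Stand-in for rpc_clnt_submit(): simulate a failed submit that
     * still fires the callback, as the commit message describes. */
    static int rpc_submit_fails (void) { attach_cbk (); return -1; }

    static int
    send_attach_req_buggy (void)
    {
        int ret;

        blockers++;                    /* bumped up once before submit */
        ret = rpc_submit_fails ();
        if (ret < 0)
            blockers--;                /* pre-fix: bumped down a second
                                        * time in the error handling */
        return ret;
    }

    int
    main (void)
    {
        send_attach_req_buggy ();      /* net effect: 0 + 1 - 2 */
        printf ("blockers = %u\n", blockers);  /* prints 4294967295 */
        return 0;
    }

Two such failed requests leave the counter at 4294967294, exactly the value
seen in gdb; the fix pairs each increment with exactly one decrement.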


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1441910
[Bug 1441910] gluster volume stop hangs
https://bugzilla.redhat.com/show_bug.cgi?id=1441932
[Bug 1441932] Gluster operations fails with another transaction in progress
as volume delete acquires lock and won't release