[Bugs] [Bug 1233019] New: glusterd :- after volume create command time out, deadlock has been observed among glusterd and all command keep failing with error "Another transaction is in progress"

bugzilla at redhat.com bugzilla at redhat.com
Thu Jun 18 05:13:12 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1233019

            Bug ID: 1233019
           Summary: glusterd :- after volume create command time out,
                    deadlock has been observed among glusterd and all
                    command keep failing with error "Another transaction
                    is in progress"
           Product: GlusterFS
           Version: 3.7.1
         Component: glusterd
          Severity: high
          Assignee: kparthas at redhat.com
          Reporter: kparthas at redhat.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com
        Depends On: 1206134



+++ This bug was initially created as a clone of Bug #1206134 +++

Description of problem:
=======================
The gluster volume create command fails with a time-out error, after which all
gluster commands fail with the error "Another transaction is in progress"
because of a deadlock.

Version-Release number of selected component (if applicable):
=============================================================
3.7dev-0.803.gitf64666f.el6.x86_64

How reproducible:
================
intermittent


Steps to Reproduce:
1. Install 3.7dev-0.803.gitf64666f.el6.x86_64 on a cluster.

2. Create a volume with the command below, which returned a time-out error:
[root at rhs-client38 ~]# gluster v create BitRot1 replica 3
rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1
rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1
rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
Error : Request timed out

3. After a while (10-15 minutes), a status check showed that all commands were
failing as below:

[root at rhs-client38 ~]# gluster v create BitRot1 replica 3
rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1
rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1
rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
volume create: BitRot1: failed: Volume BitRot1 already exists

[root at rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please
try again after sometime.
[root at rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please
try again after sometime.
[root at rhs-client38 ~]# less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log 
[root at rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please
try again after sometime.
[root at rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please
try again after sometime.


Actual results:
===============
Commands fail with "Another transaction is in progress" because of the
deadlock.



Additional info:
===============
(gdb) bt
#0  0x0000003291a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003291a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003291a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f96281958bb in rpc_clnt_disable (rpc=0x7f9618001860) at
rpc-clnt.c:1712
#4  0x00007f962819587e in rpc_clnt_trigger_destroy (rpc=<value optimized out>)
at rpc-clnt.c:1634
#5  rpc_clnt_unref (rpc=<value optimized out>) at rpc-clnt.c:1670
#6  0x00007f962819a765 in rpc_clnt_start_ping (rpc_ptr=0x7f9618001860) at
rpc-clnt-ping.c:265
#7  0x00007f96283e1d30 in gf_timer_proc (ctx=0x2080010) at timer.c:183
#8  0x0000003291a079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x00000032912e88fd in clone () from /lib64/libc.so.6
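
The backtrace shows the timer thread calling pthread_mutex_lock() on a mutex
it is effectively already inside (conn->lock, taken by the ping-timer path
before the unref cascades into rpc_clnt_disable). A minimal sketch of that
pattern, with hypothetical names rather than the actual GlusterFS code:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* A default pthread mutex is non-recursive: the thread that holds it
 * cannot take it again. In this bug, the "second lock" happens when
 * the last unref, issued inside the critical section, tries to
 * re-acquire conn->lock during destruction. */
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Attempt to re-acquire conn_lock while this thread already holds it.
 * pthread_mutex_trylock() returns EBUSY here; a blocking
 * pthread_mutex_lock() in the same spot would hang forever, which is
 * exactly the deadlock captured in the backtrace above. */
static int relock_while_held(void)
{
    int rc;

    pthread_mutex_lock(&conn_lock);
    rc = pthread_mutex_trylock(&conn_lock);
    if (rc == 0)
        pthread_mutex_unlock(&conn_lock); /* undo unexpected success */
    pthread_mutex_unlock(&conn_lock);
    return rc;
}
```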


Log snippet:
[2015-03-26 07:35:49.746017] E
[glusterd-volume-ops.c:321:__glusterd_handle_create_volume] 0-management:
Volume BitRot1 already exists
[2015-03-26 07:35:57.377967] I
[glusterd-handler.c:1321:__glusterd_handle_cli_get_volume] 0-glusterd: Received
get vol req
[2015-03-26 07:38:53.196387] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652]
))))) 0-management: Lock for BitRot1 held by
d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:38:53.196419] E [glusterd-syncop.c:1694:gd_sync_task_begin]
0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:39:10.912649] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652]
))))) 0-management: Lock for BitRot1 held by
d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:39:10.912682] E [glusterd-syncop.c:1694:gd_sync_task_begin]
0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:40:08.276495] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652]
))))) 0-management: Lock for BitRot1 held by
d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:40:08.276534] E [glusterd-syncop.c:1694:gd_sync_task_begin]
0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:40:57.076025] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652]
))))) 0-management: Lock for BitRot1 held by
d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:40:57.076056] E [glusterd-syncop.c:1694:gd_sync_task_begin]
0-management: Unable to acquire lock for BitRot1
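
The repeated "Lock for BitRot1 held by d25fd6c1-..." messages follow from the
per-volume management lock only ever being released by its holder; once the
holder thread deadlocks, every later command is refused. An illustrative
sketch of that behavior (hypothetical names and layout, not the actual
glusterd implementation):

```c
#include <assert.h>
#include <string.h>

/* Illustrative per-volume lock: records the holder's UUID and
 * refuses any other holder. */
struct mgmt_v3_lock {
    char holder_uuid[64]; /* empty string means unlocked */
};

/* Returns 0 on success, -1 when a different transaction holds the
 * lock -- the "Another transaction is in progress" case. */
static int mgmt_v3_lock_acquire(struct mgmt_v3_lock *l, const char *uuid)
{
    if (l->holder_uuid[0] != '\0' && strcmp(l->holder_uuid, uuid) != 0)
        return -1;
    strncpy(l->holder_uuid, uuid, sizeof(l->holder_uuid) - 1);
    l->holder_uuid[sizeof(l->holder_uuid) - 1] = '\0';
    return 0;
}

/* Only the holder releases the lock. If the holder is deadlocked, as
 * in this bug, release never runs and the lock stays held forever. */
static void mgmt_v3_lock_release(struct mgmt_v3_lock *l)
{
    l->holder_uuid[0] = '\0';
}
```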

--- Additional comment from Rachana Patel on 2015-03-26 08:25:09 EDT ---

sosreport at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1206134/

--- Additional comment from Anand Avati on 2015-03-30 08:42:17 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside
conn->lock) posted (#2) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)

--- Additional comment from Anand Avati on 2015-04-06 08:31:26 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside
conn->lock) posted (#3) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)

--- Additional comment from Anand Avati on 2015-04-06 13:16:17 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside
conn->lock) posted (#4) for review on master by Niels de Vos
(ndevos at redhat.com)

--- Additional comment from Anand Avati on 2015-04-07 02:27:54 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside
conn->lock) posted (#5) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)

--- Additional comment from Anand Avati on 2015-04-09 23:05:24 EDT ---

COMMIT: http://review.gluster.org/9613 committed in master by Vijay Bellur
(vbellur at redhat.com) 
------
commit d448fd187dde46bfb0d20354613912f6aa477904
Author: Krishnan Parthasarathi <kparthas at redhat.com>
Date:   Mon Feb 9 17:10:49 2015 +0530

    rpc: fix deadlock when unref is inside conn->lock

    In ping-timer implementation, the timer event takes a ref on the rpc
    object. This ref needs to be removed after every timeout event.
    ping-timer mechanism could be holding the last ref. For e.g, when a peer
    is detached and its rpc object was unref'd. In this case, ping-timer
    mechanism would try to acquire conn->mutex to perform the 'last' unref
    while being inside the critical section already. This will result in a
    deadlock.

    Change-Id: I74f80dd08c9348bd320a1c6d12fc8cd544fa4aea
    BUG: 1206134
    Signed-off-by: Krishnan Parthasarathi <kparthas at redhat.com>
    Reviewed-on: http://review.gluster.org/9613
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Vijay Bellur <vbellur at redhat.com>
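
A hedged sketch of the general fix pattern the commit message describes
(illustrative names, not the actual rpc-clnt code): decide under the lock
whether this unref is the last one, but run the destroy only after the lock
has been released, so a caller already inside conn->lock never re-acquires it
via destruction.

```c
#include <pthread.h>
#include <stdlib.h>

struct conn {
    pthread_mutex_t lock;
    int refcount;
};

static struct conn *conn_new(void)
{
    struct conn *c = calloc(1, sizeof(*c));
    pthread_mutex_init(&c->lock, NULL);
    c->refcount = 1;
    return c;
}

static void conn_destroy(struct conn *c)
{
    pthread_mutex_destroy(&c->lock);
    free(c);
}

static void conn_ref(struct conn *c)
{
    pthread_mutex_lock(&c->lock);
    c->refcount++;
    pthread_mutex_unlock(&c->lock);
}

/* Returns 1 if this was the last ref (object destroyed), else 0. */
static int conn_unref(struct conn *c)
{
    int last;

    pthread_mutex_lock(&c->lock);
    last = (--c->refcount == 0);
    pthread_mutex_unlock(&c->lock); /* drop the lock first... */

    if (last)
        conn_destroy(c);            /* ...then destroy safely */
    return last;
}
```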

--- Additional comment from krishnan parthasarathi on 2015-04-10 02:50:54 EDT
---

The following link provides a test case written for the GlusterFS regression
test framework. It was not merged into the repository since it is
Linux-specific, but it can be used to recreate this issue.

http://review.gluster.com/#/c/9613/4/tests/bugs/rpc/bug-1206134.t

--- Additional comment from John Skeoch on 2015-04-19 20:24:48 EDT ---

User racpatel at redhat.com's account has been closed

--- Additional comment from Niels de Vos on 2015-05-14 13:27:03 EDT ---

This bug is getting closed because a release has been made available that
should address the reported issue. In case the problem is still not fixed with
glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailinglists [1], packages
for several distributions should become available in the near future. Keep an
eye on the Gluster Users mailinglist [2] and the update infrastructure for your
distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1206134
[Bug 1206134] glusterd :- after volume create command time out, deadlock
has been observed among glusterd and all command keep failing with error
"Another transaction is in progress"