[Bugs] [Bug 1233019] New: glusterd :- after volume create command time out, deadlock has been observed among glusterd and all command keep failing with error "Another transaction is in progress"

bugzilla at redhat.com bugzilla at redhat.com
Thu Jun 18 05:13:12 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1233019

            Bug ID: 1233019
           Summary: glusterd :- after volume create command time out,
                    deadlock has been observed among glusterd and all
                    command keep failing with error "Another transaction
                    is in progress"
           Product: GlusterFS
           Version: 3.7.1
         Component: glusterd
          Severity: high
          Assignee: kparthas at redhat.com
          Reporter: kparthas at redhat.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com
        Depends On: 1206134



+++ This bug was initially created as a clone of Bug #1206134 +++

Description of problem:
=======================
The gluster volume create command fails with a time-out error, after which all
gluster commands fail with the error "Another transaction is in progress"
because of a deadlock.

Version-Release number of selected component (if applicable):
=============================================================
3.7dev-0.803.gitf64666f.el6.x86_64

How reproducible:
================
intermittent


Steps to Reproduce:
1. Install 3.7dev-0.803.gitf64666f.el6.x86_64 on a cluster.

2. Create a volume with the command below, which returned a time-out error:
[root at rhs-client38 ~]# gluster v create BitRot1 replica 3
rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1
rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1
rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
Error : Request timed out

3. After a while (10-15 minutes), a status check showed that all commands were
failing as below:

[root at rhs-client38 ~]# gluster v create BitRot1 replica 3
rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1
rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1
rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
volume create: BitRot1: failed: Volume BitRot1 already exists

[root at rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please
try again after sometime.
[root at rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please
try again after sometime.
[root at rhs-client38 ~]# less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log 
[root at rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please
try again after sometime.
[root at rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please
try again after sometime.


Actual results:
===============
Commands fail with "Another transaction is in progress" because of the
deadlock.



Additional info:
===============
(gdb) bt
#0  0x0000003291a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003291a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003291a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f96281958bb in rpc_clnt_disable (rpc=0x7f9618001860) at
rpc-clnt.c:1712
#4  0x00007f962819587e in rpc_clnt_trigger_destroy (rpc=<value optimized out>)
at rpc-clnt.c:1634
#5  rpc_clnt_unref (rpc=<value optimized out>) at rpc-clnt.c:1670
#6  0x00007f962819a765 in rpc_clnt_start_ping (rpc_ptr=0x7f9618001860) at
rpc-clnt-ping.c:265
#7  0x00007f96283e1d30 in gf_timer_proc (ctx=0x2080010) at timer.c:183
#8  0x0000003291a079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x00000032912e88fd in clone () from /lib64/libc.so.6
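
The backtrace shows the timer thread calling pthread_mutex_lock() on a mutex
it is effectively already inside (conn->lock, taken by the ping-timer path
before the unref cascades into rpc_clnt_disable). A minimal sketch of that
pattern, with hypothetical names rather than the actual GlusterFS code:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* A default pthread mutex is non-recursive: the thread that holds it
 * cannot take it again. In this bug, the "second lock" happens when
 * the last unref, issued inside the critical section, tries to
 * re-acquire conn->lock during destruction. */
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Attempt to re-acquire conn_lock while this thread already holds it.
 * pthread_mutex_trylock() returns EBUSY here; a blocking
 * pthread_mutex_lock() in the same spot would hang forever, which is
 * exactly the deadlock captured in the backtrace above. */
static int relock_while_held(void)
{
    int rc;

    pthread_mutex_lock(&conn_lock);
    rc = pthread_mutex_trylock(&conn_lock);
    if (rc == 0)
        pthread_mutex_unlock(&conn_lock); /* undo unexpected success */
    pthread_mutex_unlock(&conn_lock);
    return rc;
}
```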


Log snippet:
[2015-03-26 07:35:49.746017] E
[glusterd-volume-ops.c:321:__glusterd_handle_create_volume] 0-management:
Volume BitRot1 already exists
[2015-03-26 07:35:57.377967] I
[glusterd-handler.c:1321:__glusterd_handle_cli_get_volume] 0-glusterd: Received
get vol req
[2015-03-26 07:38:53.196387] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652]
))))) 0-management: Lock for BitRot1 held by
d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:38:53.196419] E [glusterd-syncop.c:1694:gd_sync_task_begin]
0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:39:10.912649] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652]
))))) 0-management: Lock for BitRot1 held by
d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:39:10.912682] E [glusterd-syncop.c:1694:gd_sync_task_begin]
0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:40:08.276495] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652]
))))) 0-management: Lock for BitRot1 held by
d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:40:08.276534] E [glusterd-syncop.c:1694:gd_sync_task_begin]
0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:40:57.076025] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b]
(-->
/usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652]
))))) 0-management: Lock for BitRot1 held by
d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:40:57.076056] E [glusterd-syncop.c:1694:gd_sync_task_begin]
0-management: Unable to acquire lock for BitRot1
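
The repeated "Lock for BitRot1 held by d25fd6c1-..." messages follow from the
per-volume management lock only ever being released by its holder; once the
holder thread deadlocks, every later command is refused. An illustrative
sketch of that behavior (hypothetical names and layout, not the actual
glusterd implementation):

```c
#include <assert.h>
#include <string.h>

/* Illustrative per-volume lock: records the holder's UUID and
 * refuses any other holder. */
struct mgmt_v3_lock {
    char holder_uuid[64]; /* empty string means unlocked */
};

/* Returns 0 on success, -1 when a different transaction holds the
 * lock -- the "Another transaction is in progress" case. */
static int mgmt_v3_lock_acquire(struct mgmt_v3_lock *l, const char *uuid)
{
    if (l->holder_uuid[0] != '\0' && strcmp(l->holder_uuid, uuid) != 0)
        return -1;
    strncpy(l->holder_uuid, uuid, sizeof(l->holder_uuid) - 1);
    l->holder_uuid[sizeof(l->holder_uuid) - 1] = '\0';
    return 0;
}

/* Only the holder releases the lock. If the holder is deadlocked, as
 * in this bug, release never runs and the lock stays held forever. */
static void mgmt_v3_lock_release(struct mgmt_v3_lock *l)
{
    l->holder_uuid[0] = '\0';
}
```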

--- Additional comment from Rachana Patel on 2015-03-26 08:25:09 EDT ---

sosreport at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1206134/

--- Additional comment from Anand Avati on 2015-03-30 08:42:17 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside
conn->lock) posted (#2) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)

--- Additional comment from Anand Avati on 2015-04-06 08:31:26 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside
conn->lock) posted (#3) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)

--- Additional comment from Anand Avati on 2015-04-06 13:16:17 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside
conn->lock) posted (#4) for review on master by Niels de Vos
(ndevos at redhat.com)

--- Additional comment from Anand Avati on 2015-04-07 02:27:54 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside
conn->lock) posted (#5) for review on master by Krishnan Parthasarathi
(kparthas at redhat.com)

--- Additional comment from Anand Avati on 2015-04-09 23:05:24 EDT ---

COMMIT: http://review.gluster.org/9613 committed in master by Vijay Bellur
(vbellur at redhat.com) 
------
commit d448fd187dde46bfb0d20354613912f6aa477904
Author: Krishnan Parthasarathi <kparthas at redhat.com>
Date:   Mon Feb 9 17:10:49 2015 +0530

    rpc: fix deadlock when unref is inside conn->lock

    In ping-timer implementation, the timer event takes a ref on the rpc
    object. This ref needs to be removed after every timeout event.
    ping-timer mechanism could be holding the last ref. For e.g, when a peer
    is detached and its rpc object was unref'd. In this case, ping-timer
    mechanism would try to acquire conn->mutex to perform the 'last' unref
    while being inside the critical section already. This will result in a
    deadlock.

    Change-Id: I74f80dd08c9348bd320a1c6d12fc8cd544fa4aea
    BUG: 1206134
    Signed-off-by: Krishnan Parthasarathi <kparthas at redhat.com>
    Reviewed-on: http://review.gluster.org/9613
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Vijay Bellur <vbellur at redhat.com>
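
A hedged sketch of the general fix pattern the commit message describes
(illustrative names, not the actual rpc-clnt code): decide under the lock
whether this unref is the last one, but run the destroy only after the lock
has been released, so a caller already inside conn->lock never re-acquires it
via destruction.

```c
#include <pthread.h>
#include <stdlib.h>

struct conn {
    pthread_mutex_t lock;
    int refcount;
};

static struct conn *conn_new(void)
{
    struct conn *c = calloc(1, sizeof(*c));
    pthread_mutex_init(&c->lock, NULL);
    c->refcount = 1;
    return c;
}

static void conn_destroy(struct conn *c)
{
    pthread_mutex_destroy(&c->lock);
    free(c);
}

static void conn_ref(struct conn *c)
{
    pthread_mutex_lock(&c->lock);
    c->refcount++;
    pthread_mutex_unlock(&c->lock);
}

/* Returns 1 if this was the last ref (object destroyed), else 0. */
static int conn_unref(struct conn *c)
{
    int last;

    pthread_mutex_lock(&c->lock);
    last = (--c->refcount == 0);
    pthread_mutex_unlock(&c->lock); /* drop the lock first... */

    if (last)
        conn_destroy(c);            /* ...then destroy safely */
    return last;
}
```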

--- Additional comment from krishnan parthasarathi on 2015-04-10 02:50:54 EDT
---

The following link provides a test case written for the GlusterFS regression
test framework. It was not merged into the repository since it is
Linux-specific, but it can be used to recreate this issue.

http://review.gluster.com/#/c/9613/4/tests/bugs/rpc/bug-1206134.t

--- Additional comment from John Skeoch on 2015-04-19 20:24:48 EDT ---

User racpatel at redhat.com's account has been closed

--- Additional comment from Niels de Vos on 2015-05-14 13:27:03 EDT ---

This bug is getting closed because a release has been made available that
should address the reported issue. In case the problem is still not fixed with
glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailinglists [1], packages
for several distributions should become available in the near future. Keep an
eye on the Gluster Users mailinglist [2] and the update infrastructure for your
distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1206134
[Bug 1206134] glusterd :- after volume create command time out, deadlock
has been observed among glusterd and all command keep failing with error
"Another transaction is in progress"