[Bugs] [Bug 1729463] New: gluster v geo-rep status command timing out
bugzilla at redhat.com
bugzilla at redhat.com
Fri Jul 12 11:06:34 UTC 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1729463
Bug ID: 1729463
Summary: gluster v geo-rep status command timing out
Product: GlusterFS
Version: mainline
Status: NEW
Component: glusterd
Keywords: Regression
Severity: high
Assignee: bugs at gluster.org
Reporter: srakonde at redhat.com
CC: amukherj at redhat.com, avishwan at redhat.com,
bmekala at redhat.com, bugs at gluster.org,
csaba at redhat.com, khiremat at redhat.com,
kiyer at redhat.com, nchilaka at redhat.com,
rallan at redhat.com, rhs-bugs at redhat.com,
sankarshan at redhat.com, srakonde at redhat.com,
storage-qa-internal at redhat.com, vbellur at redhat.com,
vdas at redhat.com
Depends On: 1727785
Blocks: 1696809
Target Milestone: ---
Classification: Community
Description of problem:
=====================
I am trying to fetch the status of geo-rep , and didn't mention the
master-slave details.
While giving such command, it is timing out
[root at rhs-gp-srv5 ~]# gluster v geo-rep status
Error : Request timed out
geo-replication command failed
Version-Release number of selected component (if applicable):
======================
6.0.7
How reproducible:
=================
consistent
Steps to Reproduce:
1. created a georep session b/w 4x3 and 4x(6+2) ec volume
2. started the session
3. status works fine if we give master-slave details as part of command
Actual results:
===========
but times out if we don't give master-salve details
Expected results:
================
shouldn't timeout
--- Additional comment from nchilaka on 2019-07-08 14:10:55 IST ---
glusterd log o/p on node where command was triggered
[2019-07-08 08:38:08.229980] I [MSGID: 106061]
[glusterd-geo-rep.c:308:__glusterd_handle_gsync_set] 0-management: master not
found, while handling geo-replication options
[2019-07-08 08:38:08.230020] I [MSGID: 106061]
[glusterd-geo-rep.c:315:__glusterd_handle_gsync_set] 0-management: slave not
found, while handling geo-replication options
[2019-07-08 08:38:08.230079] W [MSGID: 106061]
[glusterd-geo-rep.c:2388:glusterd_op_gsync_args_get] 0-management: master not
found
[2019-07-08 08:38:08.344099] I [MSGID: 106327]
[glusterd-geo-rep.c:4653:glusterd_read_status_file] 0-management: Using passed
config
template(/var/lib/glusterd/geo-replication/nftvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nftvol-slave/gsyncd.conf).
The message "I [MSGID: 106488]
[glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management:
Received get vol req" repeated 2 times between [2019-07-08 08:37:39.774894] and
[2019-07-08 08:37:39.782975]
o/p of glusterd.log on other nodes
[2019-07-08 08:38:08.907384] E [glusterd-op-sm.c:8204:glusterd_op_sm]
(-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24d5e) [0x7fc36af1cd5e]
-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x1d12e) [0x7fc36af1512e]
-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x44604) [0x7fc36af3c604]
) 0-management: Unable to get transaction opinfo for transaction ID
:276653aa-a2b4-404a-8a69-e93c6576f20a
--- Additional comment from Sunny Kumar on 2019-07-08 15:24:18 IST ---
Hi,
Do you have machine available where you are observing this behavior, I tried on
local setup its not reproducible.
-Sunny
--- Additional comment from nchilaka on 2019-07-08 15:31:05 IST ---
(In reply to Sunny Kumar from comment #3)
> Hi,
>
> Do you have machine available where you are observing this behavior, I tried
> on local setup its not reproducible.
>
> -Sunny
yes, I pinged you the details
--- Additional comment from Sunny Kumar on 2019-07-08 16:34:48 IST ---
Update:
Looks like glusterd bug: adding Sanju to look at it. Thanks Sanju for helping
in analysis and please take this forward.
[Switching to Thread 0x7fc36cbdc700 (LWP 22324)]
Breakpoint 1, 0x00007fc36af32670 in glusterd_set_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0 0x00007fc36af32670 in glusterd_set_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1 0x00007fc36af14cb1 in __glusterd_handle_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2 0x00007fc36af1cd5e in glusterd_big_locked_handler () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3 0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#4 0x00007fc375452180 in ?? () from /lib64/libc.so.6
#5 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Breakpoint 2, 0x00007fc36af329f0 in glusterd_clear_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0 0x00007fc36af329f0 in glusterd_clear_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1 0x00007fc36af34f29 in glusterd_op_ac_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2 0x00007fc36af3c514 in glusterd_op_sm () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3 0x00007fc36af14d63 in __glusterd_handle_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4 0x00007fc36af1cd5e in glusterd_big_locked_handler () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#5 0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#6 0x00007fc375452180 in ?? () from /lib64/libc.so.6
#7 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
bt
^C
Program received signal SIGINT, Interrupt.
[Switching to Thread 0x7fc3773054c0 (LWP 22320)]
0x00007fc375c44017 in pthread_join () from /lib64/libpthread.so.0
(gdb) i b
Num Type Disp Enb Address What
1 breakpoint keep y 0x00007fc36af32670 <glusterd_set_txn_opinfo>
breakpoint already hit 1 time
2 breakpoint keep y 0x00007fc36af329f0 <glusterd_clear_txn_opinfo>
breakpoint already hit 1 time
(gdb) b glusterd_get_txn_opinfo
Breakpoint 3 at 0x7fc36af323c0
(gdb) c
Continuing.
[Switching to Thread 0x7fc36cbdc700 (LWP 22324)]
Breakpoint 3, 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1 0x00007fc36af14c2a in __glusterd_handle_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2 0x00007fc36af1cd5e in glusterd_big_locked_handler () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3 0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#4 0x00007fc375452180 in ?? () from /lib64/libc.so.6
#5 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Breakpoint 1, 0x00007fc36af32670 in glusterd_set_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0 0x00007fc36af32670 in glusterd_set_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1 0x00007fc36af14cb1 in __glusterd_handle_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2 0x00007fc36af1cd5e in glusterd_big_locked_handler () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3 0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#4 0x00007fc375452180 in ?? () from /lib64/libc.so.6
#5 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Breakpoint 3, 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1 0x00007fc36af3c434 in glusterd_op_sm () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2 0x00007fc36af14d63 in __glusterd_handle_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3 0x00007fc36af1cd5e in glusterd_big_locked_handler () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4 0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#5 0x00007fc375452180 in ?? () from /lib64/libc.so.6
#6 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Breakpoint 3, 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1 0x00007fc36af34e02 in glusterd_op_ac_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2 0x00007fc36af3c514 in glusterd_op_sm () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3 0x00007fc36af14d63 in __glusterd_handle_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4 0x00007fc36af1cd5e in glusterd_big_locked_handler () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#5 0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#6 0x00007fc375452180 in ?? () from /lib64/libc.so.6
#7 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Breakpoint 2, 0x00007fc36af329f0 in glusterd_clear_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0 0x00007fc36af329f0 in glusterd_clear_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1 0x00007fc36af34f29 in glusterd_op_ac_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2 0x00007fc36af3c514 in glusterd_op_sm () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3 0x00007fc36af14d63 in __glusterd_handle_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4 0x00007fc36af1cd5e in glusterd_big_locked_handler () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#5 0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#6 0x00007fc375452180 in ?? () from /lib64/libc.so.6
#7 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Breakpoint 3, 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1 0x00007fc36af32a58 in glusterd_clear_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2 0x00007fc36af34f29 in glusterd_op_ac_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3 0x00007fc36af3c514 in glusterd_op_sm () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4 0x00007fc36af14d63 in __glusterd_handle_stage_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#5 0x00007fc36af1cd5e in glusterd_big_locked_handler () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#6 0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#7 0x00007fc375452180 in ?? () from /lib64/libc.so.6
#8 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
[Switching to Thread 0x7fc36c3db700 (LWP 22325)]
Breakpoint 3, 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1 0x00007fc36af3c434 in glusterd_op_sm () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2 0x00007fc36af1512e in __glusterd_handle_commit_op () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3 0x00007fc36af1cd5e in glusterd_big_locked_handler () from
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4 0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#5 0x00007fc375452180 in ?? () from /lib64/libc.so.6
#6 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
-Sunny
--- Additional comment from Sanju on 2019-07-08 16:42:10 IST ---
restoring the needinfo
--- Additional comment from Atin Mukherjee on 2019-07-08 18:34:26 IST ---
I am not sure how comment 5 can confirm we're actually experiencing a hang
issue. Can we please have the complete analysis captured here?
--- Additional comment from Sanju on 2019-07-08 23:06:57 IST ---
Atin,
In the setup provided, I've observed that "gluster v geo-rep status" command is
timing out. When I looked at the glusterd logs from all the nodes, I see below
error logs in the receiver nodes.
[2019-07-08 16:58:22.361593] E [glusterd-op-sm.c:8204:glusterd_op_sm]
(-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24d5e) [0x7f301f805d5e]
-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x1d12e) [0x7f301f7fe12e]
-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x44604) [0x7f301f825604]
) 0-management: Unable to get transaction opinfo for transaction ID
:ec3f67c0-906d-4921-bf51-f8ef73b6dea9
I took glusterd process into gdb in one of the peer node and kept break points
at glusterd_set_txn_opinfo(), glusterd_get_txn_opinfo() and
glusterd_clear_txn_opinfo(). While debugging I observed that we are trying to
get the txn opinfo in commit phase after clearing it in staging phase. I
believe this is related to the changes we have done to fix the mem-leak in
txn-opinfo.
I'm trying to construct a theory based on our changes and the behaviour of this
bug.
Thanks,
Sanju
--- Additional comment from Sanju on 2019-07-11 15:56:20 IST ---
Update:
For no volname operations, no unlock event is triggered where we clean up the
txn_opinfo. For such transactions, we set the skip_locking flag as true in
staging phase. We clean up the txn_opinfo once staging is completed.
The above is working as expected in case of "gluster v status" where the opcode
is "GF_CLI_STATUS_ALL" as we are not sending commit req to peers, so in commit
phase we don't need the txn_opinfo. In this transaction, we do nothing in the
commit phase at the peers and we pass on the volume list to CLI from originator
node only. and then CLI sends volume status req for each volume separately to
glusterd.
But with "gluster v geo-replication status", where we are not specifying the
volume name, the txn_opinfo is cleaned up after staging phase as skip_locking
is set to true. and glusterd tries to get the cleared txn_opinfo in the commit
phase. Here, below is the bt for no volname operations:
(gdb) bt
#0 glusterd_get_gsync_status_all (rsp_dict=0x7f927c016018, node=0x7f9284225c50
"server2")
at glusterd-geo-rep.c:5084
#1 0x00007f928bd4d6a8 in glusterd_get_gsync_status (dict=0x7f927c019df8,
op_errstr=0x7f9284226f20,
rsp_dict=0x7f927c016018) at glusterd-geo-rep.c:5135
#2 0x00007f928bd4fa5f in glusterd_op_gsync_set (dict=0x7f927c019df8,
op_errstr=0x7f9284226f20,
rsp_dict=0x7f927c016018) at glusterd-geo-rep.c:5691
#3 0x00007f928bcc9240 in glusterd_op_commit_perform (op=GD_OP_GSYNC_SET,
dict=0x7f927c019df8,
op_errstr=0x7f9284226f20, rsp_dict=0x7f927c016018) at glusterd-op-sm.c:5924
#4 0x00007f928bcc86fc in glusterd_op_ac_commit_op (event=0x7f927c0159a0,
ctx=0x7f927c013210)
at glusterd-op-sm.c:5625
#5 0x00007f928bccdc37 in glusterd_op_sm () at glusterd-op-sm.c:7857
#6 0x00007f928bca6b56 in __glusterd_handle_commit_op
(req=req at entry=0x7f928401ae18)
at glusterd-handler.c:999
#7 0x00007f928bcaa2ed in glusterd_big_locked_handler (req=0x7f928401ae18,
actor_fn=0x7f928bca6960 <__glusterd_handle_commit_op>) at
glusterd-handler.c:79
#8 0x00007f929833a46a in synctask_wrap () at syncop.c:272
#9 0x00007f92968a3bf0 in ?? () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
glusterd_get_gsync_status_all(), in turn calls glusterd_get_gsync_status_mst()
for all the volumes.
a snippet from glusterd_get_gsync_status_all():
cds_list_for_each_entry(volinfo, &priv->volumes, vol_list)
{
ret = glusterd_get_gsync_status_mst(volinfo, rsp_dict, node);
if (ret)
goto out;
}
So, we need this txn_opinfo in commit phase for geo-rep status. The fix would
be not to clear the txn_opinfo after staging phase and clean it after commit
phase in case of geo-rep operations.
In originator, it will be cleared in gd_sync_task_begin():
if (txn_id) {
if (global)
(void)gd_unlock_op_phase(conf, op, &op_ret, req, op_ctx, op_errstr,
global, is_acquired, *txn_id, &txn_opinfo,
cluster_lock);
else
(void)gd_unlock_op_phase(conf, op, &op_ret, req, op_ctx, op_errstr,
volname, is_acquired, *txn_id,
&txn_opinfo,
cluster_lock);
/* Clearing the transaction opinfo */
ret = glusterd_clear_txn_opinfo(txn_id);
if (ret)
gf_msg(this->name, GF_LOG_ERROR, 0, GD_MSG_TRANS_OPINFO_CLEAR_FAIL,
"Unable to clear transaction's "
"opinfo for transaction ID : %s",
uuid_utoa(*txn_id));
}
In case of receivers, it will be cleared in glusterd_op_ac_commit_op():
/* for no volname transactions, the txn_opinfo needs to be cleaned up
* as there's no unlock event triggered
*/
if (need_cleanup && txn_id && txn_op_info.skip_locking)
ret = glusterd_clear_txn_opinfo(txn_id);
As per the above details, I expected the below change to make the things work,
but with no luck. I still see this command is hung and I don't see any error
messages in glusterd.log
[root at server2 glusterfs]# git diff
diff --git a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
index 91533489d..8b0b5901d 100644
--- a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
+++ b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
@@ -5537,7 +5537,8 @@ out:
* txn_opinfo can't be cleared as that'll lead to a race of referring
op_ctx
* after it's being freed.
*/
- if (txn_op_info.skip_locking && priv->op_version >= GD_OP_VERSION_6_0)
+ if (req_ctx->op != GD_OP_GSYNC_SET && txn_op_info.skip_locking &&
+ priv->op_version >= GD_OP_VERSION_6_0)
ret = glusterd_clear_txn_opinfo(txn_id);
if (rsp_dict)
[root at server2 glusterfs]#
I will continue to investigate on this and update the BZ.
Thanks,
Sanju
--- Additional comment from Sanju on 2019-07-12 16:32:42 IST ---
RCA:
Please read comment 15 to understand this comment well.
In op-sm framework, every stage goes through glusterd_op_sm() function and sets
txn_opinfo based on the txn_id except for the unlock phase.
If the state of op-machine is GD_OP_STATE_STAGED and skip_locking is true, then
we don't set the txn_opinfo because:
1. skip_locking is true says that it is a no volname operation
2. the state of op-machine is GD_OP_STATE_STAGED says that staging phase is
completed.
No volname operations such as "gluster v status" does nothing at the peers, so
we don't send the commit req to peers and clear the txn_opinfo after staging
phase. and we don't need to set txn_opinfo in later phases.
a snippet from glusterd_op_sm() which does the above-explained functionality:
if ((state[event_type].next_state == GD_OP_STATE_DEFAULT) &&
(event_type == GD_OP_EVENT_UNLOCK)) {
/* Clearing the transaction opinfo */
ret = glusterd_clear_txn_opinfo(&event->txn_id);
if (ret)
gf_msg(this->name, GF_LOG_ERROR, 0,
GD_MSG_TRANS_OPINFO_CLEAR_FAIL,
"Unable to clear "
"transaction's opinfo");
} else {
if ((priv->op_version < GD_OP_VERSION_6_0) ||
!(event_type == GD_OP_EVENT_STAGE_OP &&
opinfo.state.state == GD_OP_STATE_STAGED &&
opinfo.skip_locking)) {
ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
if (ret)
gf_msg(this->name, GF_LOG_ERROR, 0,
GD_MSG_TRANS_OPINFO_SET_FAIL,
"Unable to set "
"transaction's opinfo");
}
}
But with geo-rep operation, if we don't specify the volname then also we have
to send the commit req to peers and also we need to set the txn_opinfo in
commit stage. To fix this issue, we can mark skip_locking as true only when it
is a no volname operation and it is not a geo-rep operation. So that it can set
the txn_opinfo in commit stage also.
Thanks,
Sanju
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1727785
[Bug 1727785] gluster v geo-rep status command timing out
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
More information about the Bugs
mailing list