[Bugs] [Bug 1710159] glusterd: While upgrading (3-node cluster) 'gluster v status' times out on node to be upgraded
bugzilla at redhat.com
bugzilla at redhat.com
Wed May 15 06:13:49 UTC 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1710159
--- Comment #2 from Sanju <srakonde at redhat.com> ---
Root cause:
With commit 34e010d64, we have added some conditions to set txn-opinfo to avoid
the memory leak in txn-opinfo object. But, in a heterogeneous cluster the
upgraded and non-upgraded nodes are following different conditions to set
txn-opinfo. This is leading the get-txn-opinfo operation to fail and eventually
the process hungs.
[root at server2 glusterfs]# git show 34e010d64
commit 34e010d64905b7387de57840d3fb16a326853c9b
Author: Atin Mukherjee <amukherj at redhat.com>
Date: Mon Mar 18 16:08:04 2019 +0530
glusterd: fix txn-id mem leak
This commit ensures the following:
1. Don't send commit op request to the remote nodes when gluster v
status all is executed as for the status all transaction the local
commit gets the name of the volumes and remote commit ops are
technically a no-op. So no need for additional rpc requests.
2. In op state machine flow, if the transaction is in staged state and
op_info.skip_locking is true, then no need to set the txn id in the
priv->glusterd_txn_opinfo dictionary which never gets freed.
Fixes: bz#1691164
Change-Id: Ib6a9300ea29633f501abac2ba53fb72ff648c822
Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
diff --git a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
index 6495a9d88..84c34f1fe 100644
--- a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
+++ b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
@@ -5652,6 +5652,9 @@ glusterd_op_ac_stage_op(glusterd_op_sm_event_t *event,
void *ctx)
dict_t *dict = NULL;
xlator_t *this = NULL;
uuid_t *txn_id = NULL;
+ glusterd_op_info_t txn_op_info = {
+ {0},
+ };
this = THIS;
GF_ASSERT(this);
@@ -5686,6 +5689,7 @@ glusterd_op_ac_stage_op(glusterd_op_sm_event_t *event,
void *ctx)
ret = -1;
goto out;
}
+ ret = glusterd_get_txn_opinfo(&event->txn_id, &txn_op_info);
ret = dict_set_bin(rsp_dict, "transaction_id", txn_id, sizeof(*txn_id));
if (ret) {
@@ -5704,6 +5708,12 @@ out:
gf_msg_debug(this->name, 0, "Returning with %d", ret);
+ /* for no volname transactions, the txn_opinfo needs to be cleaned up
+ * as there's no unlock event triggered
+ */
+ if (txn_op_info.skip_locking)
+ ret = glusterd_clear_txn_opinfo(txn_id);
+
if (rsp_dict)
dict_unref(rsp_dict);
@@ -8159,12 +8169,16 @@ glusterd_op_sm()
"Unable to clear "
"transaction's opinfo");
} else {
- ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
- if (ret)
- gf_msg(this->name, GF_LOG_ERROR, 0,
- GD_MSG_TRANS_OPINFO_SET_FAIL,
- "Unable to set "
- "transaction's opinfo");
+ if (!(event_type == GD_OP_EVENT_STAGE_OP &&
+ opinfo.state.state == GD_OP_STATE_STAGED &&
+ opinfo.skip_locking)) { <---- now, upgraded nodes
will not set txn-opinfo when this condition is false, so the
glusterd_get_txn_opinfo() after this is failing. previously we used to set
txn-opinfo in every state of op-sm and glusterd_get_txn_opinfo will be called
in every phase. We need to add an op-version check for this change.
+ ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
+ if (ret)
+ gf_msg(this->name, GF_LOG_ERROR, 0,
+ GD_MSG_TRANS_OPINFO_SET_FAIL,
+ "Unable to set "
+ "transaction's opinfo");
+ }
}
glusterd_destroy_op_event_ctx(event);
diff --git a/xlators/mgmt/glusterd/src/glusterd-syncop.c
b/xlators/mgmt/glusterd/src/glusterd-syncop.c
index 45b221c2e..9bab2cfd5 100644
--- a/xlators/mgmt/glusterd/src/glusterd-syncop.c
+++ b/xlators/mgmt/glusterd/src/glusterd-syncop.c
@@ -1392,6 +1392,8 @@ gd_commit_op_phase(glusterd_op_t op, dict_t *op_ctx,
dict_t *req_dict,
char *errstr = NULL;
struct syncargs args = {0};
int type = GF_QUOTA_OPTION_TYPE_NONE;
+ uint32_t cmd = 0;
+ gf_boolean_t origin_glusterd = _gf_false;
this = THIS;
GF_ASSERT(this);
@@ -1449,6 +1451,20 @@ commit_done:
gd_syncargs_init(&args, op_ctx);
synctask_barrier_init((&args));
peer_cnt = 0;
+ origin_glusterd = is_origin_glusterd(req_dict);
+
+ if (op == GD_OP_STATUS_VOLUME) {
+ ret = dict_get_uint32(req_dict, "cmd", &cmd);
+ if (ret)
+ goto out;
+
+ if (origin_glusterd) {
+ if ((cmd & GF_CLI_STATUS_ALL)) {
+ ret = 0;
+ goto out;
+ }
+ }
+ }
RCU_READ_LOCK;
cds_list_for_each_entry_rcu(peerinfo, &conf->peers, uuid_list)
(END)
Thanks,
Sanju
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
More information about the Bugs
mailing list