[Bugs] [Bug 1557304] New: [Glusterd] Volume operations fail on a (tiered) volume because of a stale lock held by one of the nodes
bugzilla at redhat.com
Fri Mar 16 12:01:04 UTC 2018
https://bugzilla.redhat.com/show_bug.cgi?id=1557304
Bug ID: 1557304
Summary: [Glusterd] Volume operations fail on a (tiered) volume
because of a stale lock held by one of the nodes
Product: GlusterFS
Version: 3.12
Component: glusterd
Severity: urgent
Assignee: bugs at gluster.org
Reporter: amukherj at redhat.com
CC: amukherj at redhat.com, asriram at redhat.com,
bkunal at redhat.com, bmohanra at redhat.com,
bsrirama at redhat.com, bturner at redhat.com,
bugs at gluster.org, gyadav at redhat.com, olim at redhat.com,
pousley at redhat.com, rcyriac at redhat.com,
rhs-bugs at redhat.com, sanandpa at redhat.com,
sbairagy at redhat.com, storage-qa-internal at redhat.com,
vbellur at redhat.com
Depends On: 1499004
Blocks: 1425681, 1503239
+++ This bug was initially created as a clone of Bug #1499004 +++
+++ This bug was initially created as a clone of Bug #1425681 +++
Description of problem:
=======================
Had a tiered volume on a 6-node cluster, with 2 x (4+2) as the cold tier and a
1 x 4 plain distribute hot tier. I/O was running from both FUSE and NFS
mounts, (of course) in different directories. While testing the watermarks on
the volume, I lowered the low and high watermark values, which caused the data
percentage on the hot tier to exceed the high watermark - which should result
in demotions (only).
I was monitoring the demotions via 'gluster volume tier <volname> status'.
After a while, that command started failing with 'Another transaction is in
progress for <volname>. Please try again after sometime', and it has been
stuck in that state for a day now. The glusterd logs complain about 'another
lock being held by <uuid>'.
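For reference, the watermark change and the monitoring were along these lines
(a sketch only, using the 'ozone' volume from the output below; the watermark
percentages here are placeholders, not the exact values used in this test):

  # Lower the tiering watermarks so hot-tier usage crosses the high
  # watermark and demotions kick in (placeholder values)
  gluster volume set ozone cluster.watermark-low 5
  gluster volume set ozone cluster.watermark-hi 10

  # Watch the demotion counters
  watch -n 10 'gluster volume tier ozone status'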
(Probably unrelated, but FYI:) While monitoring the demotions via 'gluster
volume tier <volname> status' and waiting for them to complete, I created a
new 2 x 2 dist-rep volume and set 'nfs.disable' to 'off'. Soon after that,
when I repeated the 'tier status' command, it started failing with '...another
transaction is in progress...'.
Restarting glusterd (as advised by Atin) on the node that was holding the lock
appears to have brought the volume back to normal.
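A rough sketch of that manual recovery, assuming the lock-holder UUID shows up
in glusterd.log roughly as quoted above (the exact log wording can differ
between versions):

  # Find which peer UUID glusterd reports as holding the lock
  grep -i 'lock.*held by' /var/log/glusterfs/glusterd.log | tail -5

  # Map that UUID to a peer via 'gluster peer status' (or check
  # /var/lib/glusterd/glusterd.info if it is the local node), then
  # restart glusterd on that node to drop the stale lock
  systemctl restart glusterd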
Sosreports at:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/
Version-Release number of selected component (if applicable):
===========================================================
3.8.4-14
How reproducible:
=================
1:1
Additional info:
==================
[root at dhcp46-221 ~]#
[root at dhcp46-221 ~]# gluster peer stauts
unrecognized word: stauts (position 1)
[root at dhcp46-221 ~]# gluster peer status
Number of Peers: 5
Hostname: dhcp46-242.lab.eng.blr.redhat.com
Uuid: 838465bf-1fd8-4f85-8599-dbc8367539aa
State: Peer in Cluster (Connected)
Other names:
10.70.46.242
Hostname: 10.70.46.239
Uuid: b9af0965-ffe7-4827-b610-2380a8fa810b
State: Peer in Cluster (Connected)
Hostname: 10.70.46.240
Uuid: 5bff39d7-cd9c-4dbb-86eb-2a7ba6dfea3d
State: Peer in Cluster (Connected)
Hostname: 10.70.46.218
Uuid: c2fbc432-b7a9-4db1-9b9d-a8d82e998923
State: Peer in Cluster (Connected)
Hostname: 10.70.46.222
Uuid: 81184471-cbf7-47aa-ba41-21f32bb644b0
State: Peer in Cluster (Connected)
[root at dhcp46-221 ~]# vim /var/log/glusterfs/glusterd.log
[root at dhcp46-221 ~]# gluster v status
Another transaction is in progress for ozone. Please try again after sometime.
Status of volume: vola
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.46.239:/bricks/brick3/vola_0 49152 0 Y 5259
Brick 10.70.46.240:/bricks/brick3/vola_1 49152 0 Y 20012
Brick 10.70.46.242:/bricks/brick3/vola_2 49153 0 Y 21512
Brick 10.70.46.218:/bricks/brick3/vola_3 49155 0 Y 28705
NFS Server on localhost 2049 0 Y 31911
Self-heal Daemon on localhost N/A N/A Y 31743
NFS Server on dhcp46-242.lab.eng.blr.redhat
.com 2049 0 Y 21788
Self-heal Daemon on dhcp46-242.lab.eng.blr.
redhat.com N/A N/A Y 21563
NFS Server on 10.70.46.239 2049 0 Y 5699
Self-heal Daemon on 10.70.46.239 N/A N/A Y 5291
NFS Server on 10.70.46.218 2049 0 Y 28899
Self-heal Daemon on 10.70.46.218 N/A N/A Y 28759
NFS Server on 10.70.46.240 2049 0 Y 20201
Self-heal Daemon on 10.70.46.240 N/A N/A Y 20061
NFS Server on 10.70.46.222 2049 0 Y 1784
Self-heal Daemon on 10.70.46.222 N/A N/A Y 1588
Task Status of Volume vola
------------------------------------------------------------------------------
There are no active volume tasks
[root at dhcp46-221 ~]#
[root at dhcp46-221 ~]# rpm -qa | grep gluster
glusterfs-libs-3.8.4-14.el7rhgs.x86_64
glusterfs-fuse-3.8.4-14.el7rhgs.x86_64
glusterfs-rdma-3.8.4-14.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-14.el7rhgs.x86_64
glusterfs-cli-3.8.4-14.el7rhgs.x86_64
glusterfs-events-3.8.4-14.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-server-3.8.4-14.el7rhgs.x86_64
python-gluster-3.8.4-14.el7rhgs.noarch
glusterfs-geo-replication-3.8.4-14.el7rhgs.x86_64
glusterfs-3.8.4-14.el7rhgs.x86_64
glusterfs-api-3.8.4-14.el7rhgs.x86_64
[root at dhcp46-221 ~]#
[root at dhcp46-221 ~]#
########## after glusterd restart ################
[root at dhcp46-221 ~]#
[root at dhcp46-221 ~]# gluster v status ozone
Status of volume: ozone
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.46.222:/bricks/brick2/ozone_tie
r3 49152 0 Y 18409
Brick 10.70.46.221:/bricks/brick2/ozone_tie
r2 49152 0 Y 16208
Brick 10.70.46.218:/bricks/brick2/ozone_tie
r1 49152 0 Y 1655
Brick 10.70.46.242:/bricks/brick2/ozone_tie
r0 49152 0 Y 23869
Cold Bricks:
Brick 10.70.46.239:/bricks/brick0/ozone_0 49156 0 Y 18567
Brick 10.70.46.240:/bricks/brick0/ozone_1 49156 0 Y 21626
Brick 10.70.46.242:/bricks/brick0/ozone_2 49156 0 Y 10841
Brick 10.70.46.218:/bricks/brick0/ozone_3 49153 0 Y 27354
Brick 10.70.46.221:/bricks/brick0/ozone_4 49154 0 Y 2139
Brick 10.70.46.222:/bricks/brick0/ozone_5 49154 0 Y 4378
Brick 10.70.46.239:/bricks/brick1/ozone_6 49157 0 Y 18587
Brick 10.70.46.240:/bricks/brick1/ozone_7 49157 0 Y 21646
Brick 10.70.46.242:/bricks/brick1/ozone_8 49157 0 Y 10861
Brick 10.70.46.218:/bricks/brick1/ozone_9 49154 0 Y 27353
Brick 10.70.46.221:/bricks/brick1/ozone_10 49155 0 Y 2159
Brick 10.70.46.222:/bricks/brick1/ozone_11 49155 0 Y 4398
NFS Server on localhost 2049 0 Y 5622
Self-heal Daemon on localhost N/A N/A Y 5630
Quota Daemon on localhost N/A N/A Y 5639
NFS Server on 10.70.46.239 2049 0 Y 15129
Self-heal Daemon on 10.70.46.239 N/A N/A Y 15152
Quota Daemon on 10.70.46.239 N/A N/A Y 15189
NFS Server on 10.70.46.240 2049 0 Y 25626
Self-heal Daemon on 10.70.46.240 N/A N/A Y 25647
Quota Daemon on 10.70.46.240 N/A N/A Y 25657
NFS Server on dhcp46-242.lab.eng.blr.redhat
.com 2049 0 Y 20513
Self-heal Daemon on dhcp46-242.lab.eng.blr.
redhat.com N/A N/A Y 20540
Quota Daemon on dhcp46-242.lab.eng.blr.redh
at.com N/A N/A Y 20565
NFS Server on 10.70.46.222 2049 0 Y 6509
Self-heal Daemon on 10.70.46.222 N/A N/A Y 6532
Quota Daemon on 10.70.46.222 N/A N/A Y 6549
NFS Server on 10.70.46.218 2049 0 Y 11094
Self-heal Daemon on 10.70.46.218 N/A N/A Y 11120
Quota Daemon on 10.70.46.218 N/A N/A Y 11143
Task Status of Volume ozone
------------------------------------------------------------------------------
Task : Tier migration
ID : 19fb4787-d9de-4436-8f15-86ff39fbc7bb
Status : in progress
[root at dhcp46-221 ~]# gluster v tier ozone status
Node Promoted files Demoted files Status
--------- --------- --------- ---------
localhost 0 2033 in progress
dhcp46-242.lab.eng.blr.redhat.com 0 2025 in progress
10.70.46.239 14 0 in progress
10.70.46.240 0 0 in progress
10.70.46.218 0 2238 in progress
10.70.46.222 0 2167 in progress
Tiering Migration Functionality: ozone: success
[root at dhcp46-221 ~]#
--- Additional comment from Worker Ant on 2017-10-05 15:08:42 EDT ---
REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#2) for review on master by Gaurav Yadav
(gyadav at redhat.com)
--- Additional comment from Worker Ant on 2017-10-05 23:22:30 EDT ---
REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#3) for review on master by Gaurav Yadav
(gyadav at redhat.com)
--- Additional comment from Worker Ant on 2017-10-06 03:32:47 EDT ---
REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#4) for review on master by Gaurav Yadav
(gyadav at redhat.com)
--- Additional comment from Worker Ant on 2017-10-06 10:44:57 EDT ---
REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#5) for review on master by Gaurav Yadav
(gyadav at redhat.com)
--- Additional comment from Worker Ant on 2017-10-10 12:29:10 EDT ---
REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#6) for review on master by Gaurav Yadav
(gyadav at redhat.com)
--- Additional comment from Worker Ant on 2017-10-10 13:39:02 EDT ---
REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#7) for review on master by Gaurav Yadav
(gyadav at redhat.com)
--- Additional comment from Worker Ant on 2017-10-10 21:51:58 EDT ---
REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#8) for review on master by Gaurav Yadav
(gyadav at redhat.com)
--- Additional comment from Worker Ant on 2017-10-12 01:32:14 EDT ---
REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#9) for review on master by Gaurav Yadav
(gyadav at redhat.com)
--- Additional comment from Worker Ant on 2017-10-17 00:59:37 EDT ---
REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#10) for review on master by Atin Mukherjee
(amukherj at redhat.com)
--- Additional comment from Worker Ant on 2017-10-17 11:44:54 EDT ---
COMMIT: https://review.gluster.org/18437 committed in master by Atin Mukherjee
(amukherj at redhat.com)
------
commit 614904fa7a31bf6f69074238b7e710a20e05e1bb
Author: Gaurav Yadav <gyadav at redhat.com>
Date: Thu Oct 5 23:44:46 2017 +0530
glusterd : introduce timer in mgmt_v3_lock
Problem:
In a multinode environment, if two op-sm transactions are
initiated on one of the receiver nodes at the same time,
there is a possibility that glusterd ends up holding a
stale lock.
Solution:
During mgmt_v3_lock, a callback is registered with
gf_timer_call_after which releases the lock after a certain
period of time.
Change-Id: I16cc2e5186a2e8a5e35eca2468b031811e093843
BUG: 1499004
Signed-off-by: Gaurav Yadav <gyadav at redhat.com>
--- Additional comment from Shyamsundar on 2017-12-08 12:42:08 EST ---
This bug is getting closed because a release has been made available that
should address the reported issue. In case the problem is still not fixed with
glusterfs-3.13.0, please open a new bug report.
glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages
for several distributions should become available in the near future. Keep an
eye on the Gluster Users mailing list [2] and the update infrastructure for
your distribution.
[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1425681
[Bug 1425681] [Glusterd] Volume operations fail on a (tiered) volume
because of a stale lock held by one of the nodes
https://bugzilla.redhat.com/show_bug.cgi?id=1499004
[Bug 1499004] [Glusterd] Volume operations fail on a (tiered) volume
because of a stale lock held by one of the nodes
https://bugzilla.redhat.com/show_bug.cgi?id=1503239
[Bug 1503239] [Glusterd] Volume operations fail on a (tiered) volume
because of a stale lock held by one of the nodes
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.