[Bugs] [Bug 1557304] New: [Glusterd] Volume operations fail on a (tiered) volume because of a stale lock held by one of the nodes

bugzilla at redhat.com bugzilla at redhat.com
Fri Mar 16 12:01:04 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1557304

            Bug ID: 1557304
           Summary: [Glusterd] Volume operations fail on a (tiered) volume
                    because of a stale lock held by one of the nodes
           Product: GlusterFS
           Version: 3.12
         Component: glusterd
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: amukherj at redhat.com
                CC: amukherj at redhat.com, asriram at redhat.com,
                    bkunal at redhat.com, bmohanra at redhat.com,
                    bsrirama at redhat.com, bturner at redhat.com,
                    bugs at gluster.org, gyadav at redhat.com, olim at redhat.com,
                    pousley at redhat.com, rcyriac at redhat.com,
                    rhs-bugs at redhat.com, sanandpa at redhat.com,
                    sbairagy at redhat.com, storage-qa-internal at redhat.com,
                    vbellur at redhat.com
        Depends On: 1499004
            Blocks: 1425681, 1503239



+++ This bug was initially created as a clone of Bug #1499004 +++

+++ This bug was initially created as a clone of Bug #1425681 +++

Description of problem:
=======================
Had a tiered volume with a 2*(4+2) cold tier and a plain distribute 1*4 hot
tier in a 6-node cluster. I/O was taking place from both fuse and NFS mounts,
(of course) in different directories. During watermark testing on the volume,
the low and high watermark values were reduced, which caused the data
percentage of the hot tier to exceed the high watermark - which should result
in demotions (only).
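
For reference, a minimal shell sketch of that watermark change (the volume
name 'ozone' is taken from the transcripts below; the exact watermark values
are an assumption, not necessarily the ones used in this run):

    # Check the current tiering watermarks.
    gluster volume get ozone cluster.watermark-hi
    gluster volume get ozone cluster.watermark-low

    # Lower both watermarks so that hot-tier usage exceeds the high
    # watermark, which should trigger demotions only.
    gluster volume set ozone cluster.watermark-low 10
    gluster volume set ozone cluster.watermark-hi 20

    # Monitor the demotions.
    gluster volume tier ozone status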

The demotions were being monitored via the command 'gluster volume tier
<volname> status'. After a while, the command started failing with 'Another
transaction is in progress for <volname>. Please try again after sometime',
and it has been stuck in that state for a day now. The glusterd logs complain
of 'another lock being held by <uuid>'.

(Probably unrelated, but for the record:)
While monitoring the demotions with 'gluster volume tier <volname> status' and
waiting for them to complete, I created a new 2*2 dist-rep volume and set
'nfs.disable' to 'off'.
Soon after that, when I repeated the 'tier status' command, it started failing
with '...another transaction is in progress...'.

Restarting glusterd (as advised by Atin) on the node that held the lock seems
to have brought the volume back to normal.
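
For completeness, a rough sketch of that workaround (the grep pattern and log
location are what applied on this setup and may vary by version):

    # On any node, find the peer UUID that the stale-lock messages refer to.
    grep -i "lock" /var/log/glusterfs/glusterd.log | tail

    # Map that UUID to a hostname.
    gluster pool list

    # On the node that holds the stale lock, restart glusterd.
    systemctl restart glusterd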

Sosreports at :
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/


Version-Release number of selected component (if applicable):
===========================================================
3.8.4-14


How reproducible:
=================
1:1


Additional info:
==================

[root at dhcp46-221 ~]# 
[root at dhcp46-221 ~]# gluster peer stauts
unrecognized word: stauts (position 1)
[root at dhcp46-221 ~]# gluster peer status
Number of Peers: 5

Hostname: dhcp46-242.lab.eng.blr.redhat.com
Uuid: 838465bf-1fd8-4f85-8599-dbc8367539aa
State: Peer in Cluster (Connected)
Other names:
10.70.46.242

Hostname: 10.70.46.239
Uuid: b9af0965-ffe7-4827-b610-2380a8fa810b
State: Peer in Cluster (Connected)

Hostname: 10.70.46.240
Uuid: 5bff39d7-cd9c-4dbb-86eb-2a7ba6dfea3d
State: Peer in Cluster (Connected)

Hostname: 10.70.46.218
Uuid: c2fbc432-b7a9-4db1-9b9d-a8d82e998923
State: Peer in Cluster (Connected)

Hostname: 10.70.46.222
Uuid: 81184471-cbf7-47aa-ba41-21f32bb644b0
State: Peer in Cluster (Connected)
[root at dhcp46-221 ~]# vim /var/log/glusterfs/glusterd.log
[root at dhcp46-221 ~]# gluster v status
Another transaction is in progress for ozone. Please try again after sometime.

Status of volume: vola
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.239:/bricks/brick3/vola_0    49152     0          Y       5259 
Brick 10.70.46.240:/bricks/brick3/vola_1    49152     0          Y       20012
Brick 10.70.46.242:/bricks/brick3/vola_2    49153     0          Y       21512
Brick 10.70.46.218:/bricks/brick3/vola_3    49155     0          Y       28705
NFS Server on localhost                     2049      0          Y       31911
Self-heal Daemon on localhost               N/A       N/A        Y       31743
NFS Server on dhcp46-242.lab.eng.blr.redhat
.com                                        2049      0          Y       21788
Self-heal Daemon on dhcp46-242.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       21563
NFS Server on 10.70.46.239                  2049      0          Y       5699 
Self-heal Daemon on 10.70.46.239            N/A       N/A        Y       5291 
NFS Server on 10.70.46.218                  2049      0          Y       28899
Self-heal Daemon on 10.70.46.218            N/A       N/A        Y       28759
NFS Server on 10.70.46.240                  2049      0          Y       20201
Self-heal Daemon on 10.70.46.240            N/A       N/A        Y       20061
NFS Server on 10.70.46.222                  2049      0          Y       1784 
Self-heal Daemon on 10.70.46.222            N/A       N/A        Y       1588 

Task Status of Volume vola
------------------------------------------------------------------------------
There are no active volume tasks

[root at dhcp46-221 ~]# 
[root at dhcp46-221 ~]# rpm -qa | grep gluster
glusterfs-libs-3.8.4-14.el7rhgs.x86_64
glusterfs-fuse-3.8.4-14.el7rhgs.x86_64
glusterfs-rdma-3.8.4-14.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-14.el7rhgs.x86_64
glusterfs-cli-3.8.4-14.el7rhgs.x86_64
glusterfs-events-3.8.4-14.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-server-3.8.4-14.el7rhgs.x86_64
python-gluster-3.8.4-14.el7rhgs.noarch
glusterfs-geo-replication-3.8.4-14.el7rhgs.x86_64
glusterfs-3.8.4-14.el7rhgs.x86_64
glusterfs-api-3.8.4-14.el7rhgs.x86_64

[root at dhcp46-221 ~]# 
[root at dhcp46-221 ~]# 


########## after glusterd restart   ################

[root at dhcp46-221 ~]# 
[root at dhcp46-221 ~]# gluster v status ozone
Status of volume: ozone
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.46.222:/bricks/brick2/ozone_tie
r3                                          49152     0          Y       18409
Brick 10.70.46.221:/bricks/brick2/ozone_tie
r2                                          49152     0          Y       16208
Brick 10.70.46.218:/bricks/brick2/ozone_tie
r1                                          49152     0          Y       1655 
Brick 10.70.46.242:/bricks/brick2/ozone_tie
r0                                          49152     0          Y       23869
Cold Bricks:
Brick 10.70.46.239:/bricks/brick0/ozone_0   49156     0          Y       18567
Brick 10.70.46.240:/bricks/brick0/ozone_1   49156     0          Y       21626
Brick 10.70.46.242:/bricks/brick0/ozone_2   49156     0          Y       10841
Brick 10.70.46.218:/bricks/brick0/ozone_3   49153     0          Y       27354
Brick 10.70.46.221:/bricks/brick0/ozone_4   49154     0          Y       2139 
Brick 10.70.46.222:/bricks/brick0/ozone_5   49154     0          Y       4378 
Brick 10.70.46.239:/bricks/brick1/ozone_6   49157     0          Y       18587
Brick 10.70.46.240:/bricks/brick1/ozone_7   49157     0          Y       21646
Brick 10.70.46.242:/bricks/brick1/ozone_8   49157     0          Y       10861
Brick 10.70.46.218:/bricks/brick1/ozone_9   49154     0          Y       27353
Brick 10.70.46.221:/bricks/brick1/ozone_10  49155     0          Y       2159 
Brick 10.70.46.222:/bricks/brick1/ozone_11  49155     0          Y       4398 
NFS Server on localhost                     2049      0          Y       5622 
Self-heal Daemon on localhost               N/A       N/A        Y       5630 
Quota Daemon on localhost                   N/A       N/A        Y       5639 
NFS Server on 10.70.46.239                  2049      0          Y       15129
Self-heal Daemon on 10.70.46.239            N/A       N/A        Y       15152
Quota Daemon on 10.70.46.239                N/A       N/A        Y       15189
NFS Server on 10.70.46.240                  2049      0          Y       25626
Self-heal Daemon on 10.70.46.240            N/A       N/A        Y       25647
Quota Daemon on 10.70.46.240                N/A       N/A        Y       25657
NFS Server on dhcp46-242.lab.eng.blr.redhat
.com                                        2049      0          Y       20513
Self-heal Daemon on dhcp46-242.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       20540
Quota Daemon on dhcp46-242.lab.eng.blr.redh
at.com                                      N/A       N/A        Y       20565
NFS Server on 10.70.46.222                  2049      0          Y       6509 
Self-heal Daemon on 10.70.46.222            N/A       N/A        Y       6532 
Quota Daemon on 10.70.46.222                N/A       N/A        Y       6549 
NFS Server on 10.70.46.218                  2049      0          Y       11094
Self-heal Daemon on 10.70.46.218            N/A       N/A        Y       11120
Quota Daemon on 10.70.46.218                N/A       N/A        Y       11143

Task Status of Volume ozone
------------------------------------------------------------------------------
Task                 : Tier migration      
ID                   : 19fb4787-d9de-4436-8f15-86ff39fbc7bb
Status               : in progress         

[root at dhcp46-221 ~]# gluster v tier ozone status
Node                 Promoted files       Demoted files        Status           
---------            ---------            ---------            ---------        
localhost            0                    2033                 in progress      
dhcp46-242.lab.eng.blr.redhat.com 0                    2025                 in
progress         
10.70.46.239         14                   0                    in progress      
10.70.46.240         0                    0                    in progress      
10.70.46.218         0                    2238                 in progress      
10.70.46.222         0                    2167                 in progress      
Tiering Migration Functionality: ozone: success
[root at dhcp46-221 ~]#

--- Additional comment from Worker Ant on 2017-10-05 15:08:42 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#2) for review on master by Gaurav Yadav
(gyadav at redhat.com)

--- Additional comment from Worker Ant on 2017-10-05 23:22:30 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#3) for review on master by Gaurav Yadav
(gyadav at redhat.com)

--- Additional comment from Worker Ant on 2017-10-06 03:32:47 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#4) for review on master by Gaurav Yadav
(gyadav at redhat.com)

--- Additional comment from Worker Ant on 2017-10-06 10:44:57 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#5) for review on master by Gaurav Yadav
(gyadav at redhat.com)

--- Additional comment from Worker Ant on 2017-10-10 12:29:10 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#6) for review on master by Gaurav Yadav
(gyadav at redhat.com)

--- Additional comment from Worker Ant on 2017-10-10 13:39:02 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#7) for review on master by Gaurav Yadav
(gyadav at redhat.com)

--- Additional comment from Worker Ant on 2017-10-10 21:51:58 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#8) for review on master by Gaurav Yadav
(gyadav at redhat.com)

--- Additional comment from Worker Ant on 2017-10-12 01:32:14 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#9) for review on master by Gaurav Yadav
(gyadav at redhat.com)

--- Additional comment from Worker Ant on 2017-10-17 00:59:37 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in
mgmt_v3_lock) posted (#10) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-10-17 11:44:54 EDT ---

COMMIT: https://review.gluster.org/18437 committed in master by Atin Mukherjee
(amukherj at redhat.com) 
------
commit 614904fa7a31bf6f69074238b7e710a20e05e1bb
Author: Gaurav Yadav <gyadav at redhat.com>
Date:   Thu Oct 5 23:44:46 2017 +0530

    glusterd : introduce timer in mgmt_v3_lock

    Problem:
    In a multinode environment, if two op-sm transactions are
    initiated on one of the receiver nodes at the same time,
    glusterd may end up holding a stale lock.

    Solution:
    During mgmt_v3_lock, a registration is made with gf_timer_call_after,
    which releases the lock after a certain period of time.

    Change-Id: I16cc2e5186a2e8a5e35eca2468b031811e093843
    BUG: 1499004
    Signed-off-by: Gaurav Yadav <gyadav at redhat.com>
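
As an illustration of the scenario the commit describes (not of the fix
itself), a hedged shell sketch that fires two volume transactions at the same
node nearly simultaneously; the volume name 'testvol' is hypothetical. One of
the two commands is expected to fail with the locking error, and with this fix
a lock left behind by a dead transaction is released once the timer fires
instead of requiring a glusterd restart:

    # Two concurrent transactions against the same volume on one node;
    # one of them should fail with "Another transaction is in progress
    # for testvol. Please try again after sometime."
    gluster volume status testvol &
    gluster volume set testvol nfs.disable off &
    wait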

--- Additional comment from Shyamsundar on 2017-12-08 12:42:08 EST ---

This bug is getting closed because a release has been made available that
should address the reported issue. In case the problem is still not fixed with
glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages
for several distributions should become available in the near future. Keep an
eye on the Gluster Users mailing list [2] and the update infrastructure for
your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1425681
[Bug 1425681] [Glusterd] Volume operations fail on a (tiered) volume
because of a stale lock held by one of the nodes
https://bugzilla.redhat.com/show_bug.cgi?id=1499004
[Bug 1499004] [Glusterd] Volume operations fail on a (tiered) volume
because of a stale lock held by one of the nodes
https://bugzilla.redhat.com/show_bug.cgi?id=1503239
[Bug 1503239] [Glusterd] Volume operations fail on a (tiered) volume
because of a stale lock held by one of the nodes

