[Bugs] [Bug 1175765] New: [USS]: When snapd is crashed gluster volume stop/delete operation fails making the cluster in inconsistent state

bugzilla at redhat.com bugzilla at redhat.com
Thu Dec 18 14:18:39 UTC 2014


https://bugzilla.redhat.com/show_bug.cgi?id=1175765

            Bug ID: 1175765
           Summary: [USS]: When snapd is crashed gluster volume
                    stop/delete operation fails making the cluster in
                    inconsistent state
           Product: GlusterFS
           Version: 3.6.1
         Component: core
          Keywords: ZStream
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: vmallika at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    gluster-bugs at redhat.com, rjoseph at redhat.com,
                    ssamanta at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1168607, 1168803



+++ This bug was initially created as a clone of Bug #1168803 +++

+++ This bug was initially created as a clone of Bug #1168607 +++

Description of problem:
When snapd has crashed, the gluster volume stop/delete operation fails,
which leaves the cluster in an inconsistent state.


Version-Release number of selected component (if applicable):

[root@dhcp42-244 core]# rpm -qa | grep gluster
gluster-nagios-common-0.1.3-2.el6rhs.noarch
samba-glusterfs-3.6.509-169.1.el6rhs.x86_64
glusterfs-libs-3.6.0.34-1.el6rhs.x86_64
glusterfs-server-3.6.0.34-1.el6rhs.x86_64
glusterfs-cli-3.6.0.34-1.el6rhs.x86_64
glusterfs-3.6.0.34-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.34-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.34-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.34-1.el6rhs.x86_64
vdsm-gluster-4.14.7.2-1.el6rhs.noarch
gluster-nagios-addons-0.1.10-2.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.34-1.el6rhs.x86_64
glusterfs-api-3.6.0.34-1.el6rhs.x86_64
[root@dhcp42-244 core]#


How reproducible:
Always

Steps to Reproduce:
1. Create a 2*2 dist-rep volume and start it
2. Mount the volume from client (Fuse)
3. Take 2 snapshots(snap1 and snap2) and enable USS
4. Send lookups continuously on .snaps; snapd crashes unexpectedly
   (hit BZ 1168497)
5. Try to stop the volume; it fails with the error "Commit failed on
localhost. Please check the log file for more details"
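The steps above can be sketched as the following commands. Host names and brick paths are placeholders modeled on this report; the snapd crash itself came from hitting BZ 1168497 and cannot be triggered deterministically, so step 4 only drives the load that exposed it.

```shell
# 1. Create and start a 2x2 distributed-replicate volume (paths illustrative)
gluster volume create testvol replica 2 \
    host1:/rhs/brick1/testvol host2:/rhs/brick2/testvol \
    host3:/rhs/brick3/testvol host4:/rhs/brick4/testvol
gluster volume start testvol

# 2. Mount the volume over FUSE on a client
mount -t glusterfs host1:/testvol /mnt/fusemnt

# 3. Take two snapshots and enable USS
gluster snapshot create snap1 testvol
gluster snapshot create snap2 testvol
gluster volume set testvol features.uss on

# 4. Drive continuous lookups on .snaps (the load under which snapd crashed)
while true; do ls /mnt/fusemnt/.snaps >/dev/null 2>&1; done &

# 5. After snapd crashes, the stop fails in the commit phase
gluster volume stop testvol
```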

See the "Additional info" section for more details.

Actual results:
The volume is in "Stopped" state on one node but in "Started" state on
another, which leaves the cluster in an inconsistent state.

Expected results:
Volume should be stopped/deleted successfully.


Additional info:
Node1:
=====
[root@dhcp42-244 core]# gluster peer status
Number of Peers: 3

Hostname: 10.70.43.6
Uuid: 2c0d5fe8-a014-4978-ace7-c663e4cc8d91
State: Peer in Cluster (Connected)

Hostname: 10.70.42.204
Uuid: 2a2a1b36-37e3-4336-b82a-b09dcc2f745e
State: Peer in Cluster (Connected)

Hostname: 10.70.42.10
Uuid: 77c49bfc-6cb4-44f3-be12-41447a3a452e
State: Peer in Cluster (Connected)
[root@dhcp42-244 core]#


[root@dhcp42-244 ~]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
features.uss: on
features.barrier: disable
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.uss: on
features.barrier: disable
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256
[root@dhcp42-244 ~]#


[root@dhcp42-244 core]# gluster volume stop testvol
Stopping volume will make its data inaccessible. Do you want to continue? (y/n)
y
volume stop: testvol: failed: Commit failed on localhost. Please check the log
file for more details.
[root@dhcp42-244 core]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.barrier: disable
features.uss: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
features.barrier: disable
features.uss: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to
continue? (y/n) y
volume delete: testvol: failed: Cannot delete Volume testvol ,as it has 2
snapshots. To delete the volume, first delete all the snapshots under it.

[root@dhcp42-244 core]# gluster snapshot list testvol
snap1
snap2

[root@dhcp42-244 core]# gluster snapshot delete snap1
Deleting snap will erase all the information about the snap. Do you still want
to continue? (y/n) y
snapshot delete: snap1: snap removed successfully
[root@dhcp42-244 core]# gluster snapshot delete snap2
Deleting snap will erase all the information about the snap. Do you still want
to continue? (y/n) y
snapshot delete: snap2: snap removed successfully

[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to
continue? (y/n) y
volume delete: testvol: failed: Staging failed on 10.70.43.6. Error: Volume
testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.10. Error: Volume testvol has been started.Volume
needs to be stopped before deletion.
Staging failed on 10.70.42.204. Error: Volume testvol has been started.Volume
needs to be stopped before deletion.
[root@dhcp42-244 core]#

[root@dhcp42-244 core]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.barrier: disable
features.uss: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
features.barrier: disable
features.uss: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to
continue? (y/n) y
volume delete: testvol: failed: Staging failed on 10.70.43.6. Error: Volume
testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.204. Error: Volume testvol has been started.Volume
needs to be stopped before deletion.
Staging failed on 10.70.42.10. Error: Volume testvol has been started.Volume
needs to be stopped before deletion.
[root@dhcp42-244 core]#

Node2:
=====
[root@dhcp43-6 ~]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Started ---> "gluster volume info testvol" at Node2 shows the
                      volume in Started state whereas at Node1 it is
                      in Stopped state, which leaves the cluster in an
                      inconsistent state
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: off
features.barrier: disable
features.uss: off
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable



Client Logs:
============
[root@dhcp43-190 fusemnt]# cd .snaps
-bash: cd: .snaps: Transport endpoint is not connected
[root@dhcp43-190 fusemnt]#
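When USS is enabled, the snapshot daemon is listed in the volume status output, which is a quick way to confirm whether snapd is still up before attempting a stop/delete (the exact output layout may vary across 3.6.x builds):

```shell
# With features.uss on, "Snapshot Daemon" appears in the process list;
# a crashed snapd shows as offline or without a PID.
gluster volume status testvol
```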

--- Additional comment from Anand Avati on 2014-11-28 00:18:02 EST ---

REVIEW: http://review.gluster.org/9206 (glusterd/uss: if snapd is not running,
return success from glusterd_handle_snapd_option) posted (#1) for review on
master by Atin Mukherjee (amukherj at redhat.com)

--- Additional comment from Anand Avati on 2014-11-28 01:16:32 EST ---

REVIEW: http://review.gluster.org/9206 (glusterd/uss: if snapd is not running,
return success from glusterd_handle_snapd_option) posted (#2) for review on
master by Atin Mukherjee (amukherj at redhat.com)

--- Additional comment from Anand Avati on 2014-12-01 02:05:30 EST ---

COMMIT: http://review.gluster.org/9206 committed in master by Krishnan
Parthasarathi (kparthas at redhat.com) 
------
commit 92242ecd1047fe23ca494555edd6033685522c82
Author: Atin Mukherjee <amukherj at redhat.com>
Date:   Fri Nov 28 10:46:20 2014 +0530

    glusterd/uss: if snapd is not running, return success from
glusterd_handle_snapd_option

    glusterd_handle_snapd_option was returning failure if snapd is not running
    because of which gluster commands were failing.

    Change-Id: I22286f4ecf28b57dfb6fb8ceb52ca8bdc66aec5d
    BUG: 1168803
    Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-on: http://review.gluster.org/9206
    Reviewed-by: Kaushal M <kaushal at redhat.com>
    Reviewed-by: Avra Sengupta <asengupt at redhat.com>
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Vijaikumar Mallikarjuna <vmallika at redhat.com>
    Reviewed-by: Krishnan Parthasarathi <kparthas at redhat.com>
    Tested-by: Krishnan Parthasarathi <kparthas at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1168803
[Bug 1168803] [USS]: When snapd is crashed gluster volume stop/delete
operation fails making the cluster in inconsistent state