[Bugs] [Bug 1168803] New: [USS]: When snapd has crashed, gluster volume stop/delete operations fail, leaving the cluster in an inconsistent state
bugzilla@redhat.com
Fri Nov 28 04:58:08 UTC 2014
https://bugzilla.redhat.com/show_bug.cgi?id=1168803
Bug ID: 1168803
Summary: [USS]: When snapd has crashed, gluster volume
stop/delete operations fail, leaving the cluster in an
inconsistent state
Product: GlusterFS
Version: mainline
Component: core
Keywords: ZStream
Severity: high
Assignee: bugs@gluster.org
Reporter: amukherj@redhat.com
CC: bugs@gluster.org, gluster-bugs@redhat.com,
rjoseph@redhat.com, ssamanta@redhat.com,
storage-qa-internal@redhat.com
Depends On: 1168607
+++ This bug was initially created as a clone of Bug #1168607 +++
Description of problem:
When snapd has crashed, the gluster volume stop/delete operation fails,
leaving the cluster in an inconsistent state.
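A quick way to check whether snapd is still running on a node (a minimal
sketch; it assumes the daemon's command line contains "snapd", which is how
glusterd spawns the per-volume USS snapshot daemon):

# Look for the per-volume USS snapshot daemon process.
# glusterd starts one snapd per volume once features.uss is enabled;
# if nothing is listed while USS is on, snapd has died.
ps aux | grep '[s]napd'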
Version-Release number of selected component (if applicable):
[root@dhcp42-244 core]# rpm -qa | grep gluster
gluster-nagios-common-0.1.3-2.el6rhs.noarch
samba-glusterfs-3.6.509-169.1.el6rhs.x86_64
glusterfs-libs-3.6.0.34-1.el6rhs.x86_64
glusterfs-server-3.6.0.34-1.el6rhs.x86_64
glusterfs-cli-3.6.0.34-1.el6rhs.x86_64
glusterfs-3.6.0.34-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.34-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.34-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.34-1.el6rhs.x86_64
vdsm-gluster-4.14.7.2-1.el6rhs.noarch
gluster-nagios-addons-0.1.10-2.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.34-1.el6rhs.x86_64
glusterfs-api-3.6.0.34-1.el6rhs.x86_64
[root@dhcp42-244 core]#
How reproducible:
Always
Steps to Reproduce:
1. Create a 2 x 2 distributed-replicate volume and start it
2. Mount the volume from a client (FUSE)
3. Take 2 snapshots (snap1 and snap2) and enable USS
4. Send lookups continuously on .snaps; snapd crashes unexpectedly
(hit BZ 1168497)
5. Try to stop the volume; it fails with the error "Commit failed on
localhost. Please check the log file for more details"
See the "Additional info" section for more details; a scripted form of
these steps is sketched below.
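For reference, a scripted form of the reproduction steps (a sketch only:
the hostnames and brick paths below are placeholders, and the crash itself
depends on BZ 1168497):

# 2 x 2 distributed-replicate volume (hostN and brick paths are placeholders).
gluster volume create testvol replica 2 \
  host1:/rhs/brick1/testvol host2:/rhs/brick2/testvol \
  host3:/rhs/brick3/testvol host4:/rhs/brick4/testvol
gluster volume start testvol

# FUSE-mount on the client.
mount -t glusterfs host1:/testvol /mnt/fusemnt

# Two snapshots, then enable USS.
gluster snapshot create snap1 testvol
gluster snapshot create snap2 testvol
gluster volume set testvol features.uss on

# Send lookups on .snaps continuously until snapd crashes (BZ 1168497).
while true; do ls /mnt/fusemnt/.snaps >/dev/null; done

# With snapd dead, the stop fails at the commit phase on localhost.
gluster volume stop testvol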
Actual results:
The volume is in the "Stopped" state on one node but in the "Started" state
on another, leaving the cluster in an inconsistent state.
Expected results:
Volume should be stopped/deleted successfully.
Additional info:
Node1:
=====
[root@dhcp42-244 core]# gluster peer status
Number of Peers: 3
Hostname: 10.70.43.6
Uuid: 2c0d5fe8-a014-4978-ace7-c663e4cc8d91
State: Peer in Cluster (Connected)
Hostname: 10.70.42.204
Uuid: 2a2a1b36-37e3-4336-b82a-b09dcc2f745e
State: Peer in Cluster (Connected)
Hostname: 10.70.42.10
Uuid: 77c49bfc-6cb4-44f3-be12-41447a3a452e
State: Peer in Cluster (Connected)
[root@dhcp42-244 core]#
[root@dhcp42-244 ~]# gluster volume info
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
features.uss: on
features.barrier: disable
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256
Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.uss: on
features.barrier: disable
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256
[root@dhcp42-244 ~]#
[root@dhcp42-244 core]# gluster volume stop testvol
Stopping volume will make its data inaccessible. Do you want to continue? (y/n)
y
volume stop: testvol: failed: Commit failed on localhost. Please check the log
file for more details.
[root@dhcp42-244 core]# gluster volume info
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.barrier: disable
features.uss: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
features.barrier: disable
features.uss: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to
continue? (y/n) y
volume delete: testvol: failed: Cannot delete Volume testvol ,as it has 2
snapshots. To delete the volume, first delete all the snapshots under it.
[root@dhcp42-244 core]# gluster snapshot list testvol
snap1
snap2
[root@dhcp42-244 core]# gluster snapshot delete snap1
Deleting snap will erase all the information about the snap. Do you still want
to continue? (y/n) y
snapshot delete: snap1: snap removed successfully
[root@dhcp42-244 core]# gluster snapshot delete snap2
Deleting snap will erase all the information about the snap. Do you still want
to continue? (y/n) y
snapshot delete: snap2: snap removed successfully
[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to
continue? (y/n) y
volume delete: testvol: failed: Staging failed on 10.70.43.6. Error: Volume
testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.10. Error: Volume testvol has been started.Volume
needs to be stopped before deletion.
Staging failed on 10.70.42.204. Error: Volume testvol has been started.Volume
needs to be stopped before deletion.
[root@dhcp42-244 core]#
[root@dhcp42-244 core]# gluster volume info
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: on
features.barrier: disable
features.uss: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
features.barrier: disable
features.uss: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
[root@dhcp42-244 core]# gluster volume delete testvol
Deleting volume will erase all information about the volume. Do you want to
continue? (y/n) y
volume delete: testvol: failed: Staging failed on 10.70.43.6. Error: Volume
testvol has been started.Volume needs to be stopped before deletion.
Staging failed on 10.70.42.204. Error: Volume testvol has been started.Volume
needs to be stopped before deletion.
Staging failed on 10.70.42.10. Error: Volume testvol has been started.Volume
needs to be stopped before deletion.
[root@dhcp42-244 core]#
Node2:
=====
[root@dhcp43-6 ~]# gluster volume info
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Started ---> "gluster volume info testvol" on Node2 shows
the volume in the "Started" state, whereas Node1
shows it as "Stopped", leaving the cluster in an
inconsistent state
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: off
features.barrier: disable
features.uss: off
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
Client Logs:
============
[root@dhcp43-190 fusemnt]# cd .snaps
-bash: cd: .snaps: Transport endpoint is not connected
[root@dhcp43-190 fusemnt]#