[Bugs] [Bug 1728766] New: Volume start failed when shd is down in one of the nodes in the cluster

bugzilla at redhat.com
Wed Jul 10 15:57:22 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1728766

            Bug ID: 1728766
           Summary: Volume start failed when shd is down in one of the
                    nodes in the cluster
           Product: GlusterFS
           Version: mainline
                OS: Linux
            Status: NEW
         Component: glusterd
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: rkavunga at redhat.com
                CC: amukherj at redhat.com, anepatel at redhat.com,
                    bmekala at redhat.com, bugs at gluster.org,
                    nchilaka at redhat.com, rhs-bugs at redhat.com,
                    rkavunga at redhat.com, sankarshan at redhat.com,
                    srakonde at redhat.com, storage-qa-internal at redhat.com,
                    vbellur at redhat.com, vdas at redhat.com
        Depends On: 1726219
            Blocks: 1696809
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1726219 +++

Description of problem:


The gluster v info output is not consistent across the cluster: two nodes report
the volume as Stopped, while one node reports it as Started.
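A quick way to collect the same comparison from all three peers in one shot is
sketched below (the IPs are the peers in this setup; passwordless root SSH
between the nodes is an assumption):

# run the same query on every peer and print the results side by side
for h in 10.70.35.50 10.70.46.216 10.70.46.132; do
    echo "== $h =="
    ssh root@$h "gluster v info test3 | egrep 'Volume ID|Status'"
done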


Node1: 
[root@dhcp35-50 ~]# gluster v info test3

Volume Name: test3
Type: Replicate
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.50:/bricks/brick1/tes3
Brick2: 10.70.46.216:/bricks/brick1/tes3
Brick3: 10.70.46.132:/bricks/brick1/tes3
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
[root@dhcp35-50 ~]# gluster v status test3
Staging failed on 10.70.46.216. Error: Volume test3 is not started
Staging failed on 10.70.46.132. Error: Volume test3 is not started

Node 2: 

[root@dhcp46-216 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped

Node3:
[root@dhcp46-132 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped
==================================================

Version-Release number of selected component (if applicable):


How reproducible:
2/2

Steps to Reproduce:
1.  Create two replica 3 volumes (a consolidated command sketch follows the step 6 output below)
2.  Stop one volume; execute the command on node 1 (35.50)
[root@dhcp35-50 ~]# gluster v stop test3
Stopping volume will make its data inaccessible. Do you want to continue? (y/n)
y
volume stop: test3: success

3.  Kill shd (the self-heal daemon) on one node
kill -15 5928    # 5928 is the PID of the glustershd process on that node
4.  Check gluster v info from all 3 nodes
The volume is shown in the Stopped state by all three nodes
5. Now start the volume from node 1
# gluster v start test3
volume start: test3: failed: Commit failed on localhost. Please check log file
for details.
The output says the volume start failed.
6. Now check the vol info output on all three nodes

Node1:
[root@dhcp35-50 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Started

Node2:
[root@dhcp46-216 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped

Node3:
[root@dhcp46-132 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped
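For reference, the sequence above can be driven from node 1 roughly as follows.
This is a sketch, not a verbatim transcript: the name and brick paths of the
second volume (test2) are assumed, test3's bricks and the peer IPs come from the
vol info output above, and pgrep/kill stands in for the manual PID lookup used in
step 3.

# step 1: create and start two replica 3 volumes; keeping test2 started is what
# keeps the self-heal daemon (shd) running on every node
gluster v create test2 replica 3 10.70.35.50:/bricks/brick1/tes2 \
    10.70.46.216:/bricks/brick1/tes2 10.70.46.132:/bricks/brick1/tes2
gluster v create test3 replica 3 10.70.35.50:/bricks/brick1/tes3 \
    10.70.46.216:/bricks/brick1/tes3 10.70.46.132:/bricks/brick1/tes3
gluster v start test2
gluster v start test3

# step 2: stop one volume (--mode=script skips the y/n confirmation)
gluster --mode=script v stop test3

# step 3: on one node, kill the self-heal daemon
kill -15 $(pgrep -f glustershd)

# steps 4-6: the start now fails and the peers disagree about the volume state
gluster v info test3 | egrep 'Volume ID|Status'
gluster v start test3          # fails: Commit failed on localhost
gluster v info test3 | egrep 'Volume ID|Status'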


Actual results:

As described above in Steps to Reproduce.

Expected results:

1. The volume should start without any error (confirmed that the volume starts in
the older release glusterfs-fuse-3.12.2-47.2.el7rhgs.x86_64).

2. Command output should be consistent when executed from any node (all
automation cases randomly pick a node as the master for command execution).

3. Volume start force should bring up shd on the node where it was killed, as
confirmed on the older release glusterfs-fuse-3.12.2-47.2.el7rhgs.x86_64; a quick
check is sketched below.
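For expectation 3, a minimal check after a forced start is sketched below
(matching on the glustershd name in the process list is an assumption about how
shd shows up on this build):

# run on the node where shd was killed
gluster v start test3 force
ps aux | grep '[g]lustershd'   # should list the self-heal daemon process again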

Additional info:

There is also a discrepancy in the output of vol status when executed from
different nodes.

[root@dhcp35-50 ~]# gluster v status test3
Staging failed on 10.70.46.132. Error: Volume test3 is not started
Staging failed on 10.70.46.216. Error: Volume test3 is not started

[root@dhcp46-132 ~]# gluster v status test3
Volume test3 is not started

[root@dhcp46-216 ~]# gluster v status test3
Volume test3 is not started

[root@dhcp46-216 ~]# gluster v start test3 force
volume start: test3: failed: Commit failed on dhcp35-50.lab.eng.blr.redhat.com.
Please check log file for details.
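The "Commit failed ... Please check log file for details" message points at the
glusterd log, which on a default install is /var/log/glusterfs/glusterd.log (the
path is the usual default and is assumed here). A quick way to pull the relevant
lines on the failing node:

grep -iE 'commit failed|glustershd' /var/log/glusterfs/glusterd.log | tail -n 20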


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1726219
[Bug 1726219] Volume start failed when shd is down in one of the nodes in the
cluster