[Bugs] [Bug 1760467] New: rebalance start is succeeding when quorum is not met

bugzilla at redhat.com bugzilla at redhat.com
Thu Oct 10 15:20:25 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1760467

            Bug ID: 1760467
           Summary: rebalance start is succeeding when quorum is not met
           Product: GlusterFS
           Version: mainline
          Hardware: x86_64
            Status: NEW
         Component: glusterd
          Keywords: Regression
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: srakonde at redhat.com
                CC: amukherj at redhat.com, bmekala at redhat.com,
                    bugs at gluster.org, rhs-bugs at redhat.com,
                    sheggodu at redhat.com, storage-qa-internal at redhat.com,
                    vbellur at redhat.com
        Depends On: 1760261
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1760261 +++

Description of problem:
On a three-node cluster with server quorum enabled on a replicated volume,
performed an add-brick, stopped glusterd on one node, and then started
rebalance on the volume.

gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started
successfully. Use rebalance status command to check status of the rebalance
process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234

Rebalance start is succeeding when quorum is not met.
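
For context (node count and ratio are taken from the steps and vol info below):
with glusterd stopped on one of the three nodes, the fraction of nodes still up
is

    2 of 3 nodes = ~66.7% < 90% (cluster.server-quorum-ratio)

so server quorum is lost, which the glusterd log below also confirms ("Server
quorum lost for volume testvol. Stopping local bricks."). Rebalance start
should therefore be rejected.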


Version-Release number of selected component (if applicable):
glusterfs-server-6.0-15.el7rhgs.x86_64

How reproducible:
2/2

Steps to Reproduce:
1. On a three-node cluster, create a 1x3 replicate volume.
2. Set cluster.server-quorum-type to "server" and cluster.server-quorum-ratio to 90.
3. Perform an add-brick (3 bricks).
4. Stop glusterd on one node.
5. Perform rebalance start (see the command sketch below).
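
A rough command sketch of the steps above (IPs and brick paths are taken from
the vol info/status output below; the exact invocations and brick order are an
approximation):

# 1x3 replicate volume across the three nodes
gluster volume create testvol replica 3 \
    10.70.35.11:/bricks/brick4/testvol \
    10.70.35.7:/bricks/brick4/testvol \
    10.70.35.73:/bricks/brick4/testvol
gluster volume start testvol
# enable server-side quorum with a 90% ratio (the ratio is a cluster-wide option)
gluster volume set testvol cluster.server-quorum-type server
gluster volume set all cluster.server-quorum-ratio 90
# expand to 2x3 with a second replica set
gluster volume add-brick testvol replica 3 \
    10.70.35.73:/bricks/brick4/ht \
    10.70.35.11:/bricks/brick4/ht \
    10.70.35.7:/bricks/brick4/ht
# stop glusterd on one node, e.g. 10.70.35.73
systemctl stop glusterd
# on one of the remaining nodes -- this should be rejected, but currently succeeds
gluster volume rebalance testvol start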

Actual results:

 gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started
successfully. Use rebalance status command to check status of the rebalance
process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234

Rebalance start is successful even though quorum is not met.

Expected results:
Rebalance start should not succeed when quorum is not met.
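
A rough sketch of the expected behaviour in that state (the exact error text is
an assumption, modelled on how other quorum-protected glusterd operations are
rejected):

gluster vol rebalance testvol start
# expected to fail with something along the lines of:
#   volume rebalance: testvol: failed: Quorum not met. Volume operation not allowed.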


Additional info:

#### gluster vol info
[root at dhcp35-11 ~]# gluster vol info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: c9822762-7dac-47bd-8645-9cfee3d02b00
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.11:/bricks/brick4/testvol
Brick2: 10.70.35.7:/bricks/brick4/testvol
Brick3: 10.70.35.73:/bricks/brick4/testvol
Brick4: 10.70.35.73:/bricks/brick4/ht
Brick5: 10.70.35.11:/bricks/brick4/ht
Brick6: 10.70.35.7:/bricks/brick4/ht
Options Reconfigured:
cluster.server-quorum-type: server
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.server-quorum-ratio: 90



#### gluster vol status 

 gluster vol status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.11:/bricks/brick4/testvol    49152     0          Y       11039
Brick 10.70.35.7:/bricks/brick4/testvol     49152     0          Y       27266
Brick 10.70.35.73:/bricks/brick4/testvol    49152     0          Y       10746
Brick 10.70.35.73:/bricks/brick4/ht         49153     0          Y       11028
Brick 10.70.35.11:/bricks/brick4/ht         49153     0          Y       11338
Brick 10.70.35.7:/bricks/brick4/ht          49153     0          Y       27551
Self-heal Daemon on localhost               N/A       N/A        Y       11363
Self-heal Daemon on 10.70.35.73             N/A       N/A        Y       11053
Self-heal Daemon on dhcp35-7.lab.eng.blr.redhat.com    N/A       N/A        Y       27577

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

#### gluster vol status after stopping glusterd on one node

[root at dhcp35-11 ~]# gluster vol status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.11:/bricks/brick4/testvol    N/A       N/A        N       N/A  
Brick 10.70.35.7:/bricks/brick4/testvol     N/A       N/A        N       N/A  
Brick 10.70.35.11:/bricks/brick4/ht         N/A       N/A        N       N/A  
Brick 10.70.35.7:/bricks/brick4/ht          N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       11363
Self-heal Daemon on dhcp35-7.lab.eng.blr.redhat.com    N/A       N/A        Y       27577

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks



 gluster vol rebalance testvol start
volume rebalance: testvol: success: Rebalance on testvol has been started
successfully. Use rebalance status command to check status of the rebalance
process.
ID: 86cfc8b1-1e24-4244-b8e0-6941f4684234
[root at dhcp35-11 ~]# gluster vol rebalance testvol status
                                   Node   Rebalanced-files     size   scanned   failures   skipped   status   run time in h:m:s
        dhcp35-7.lab.eng.blr.redhat.com                  0   0Bytes         0          0         0   failed             0:00:00
                              localhost                  0   0Bytes         0          0         0   failed             0:00:00
volume rebalance: testvol: success


#### glusterd log after stopping glusterd on one of the nodes
[2019-10-10 09:19:00.361314] I [MSGID: 106004]
[glusterd-handler.c:6521:__glusterd_peer_rpc_notify] 0-management: Peer
<10.70.35.73> (<53117ee2-5182-42c6-8c74-26f43b075a0c>), in state <Peer in
Cluster>, has disconnected from glusterd.
[2019-10-10 09:19:00.361553] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock]
(-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24f6a) [0x7fe6a4b4df6a]
-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2f790) [0x7fe6a4b58790]
-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf3883) [0x7fe6a4c1c883]
) 0-management: Lock for vol testvol not held
[2019-10-10 09:19:00.361570] W [MSGID: 106117]
[glusterd-handler.c:6542:__glusterd_peer_rpc_notify] 0-management: Lock not
released for testvol
[2019-10-10 09:19:00.361607] C [MSGID: 106002]
[glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action] 0-management:
Server quorum lost for volume testvol. Stopping local bricks.
[2019-10-10 09:19:00.361825] I [MSGID: 106542]
[glusterd-utils.c:8775:glusterd_brick_signal] 0-glusterd: sending signal 15 to
brick with pid 11039
[2019-10-10 09:19:01.362068] I [socket.c:871:__socket_shutdown] 0-management:
intentional socket shutdown(16)
[2019-10-10 09:19:01.362680] I [MSGID: 106542]
[glusterd-utils.c:8775:glusterd_brick_signal] 0-glusterd: sending signal 15 to
brick with pid 11338
[2019-10-10 09:19:02.362982] I [socket.c:871:__socket_shutdown] 0-management:
intentional socket shutdown(20)
[2019-10-10 09:19:02.363239] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
/bricks/brick4/testvol on port 49152
[2019-10-10 09:19:02.368590] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
/bricks/brick4/ht on port 49153
[2019-10-10 09:19:02.375567] I [MSGID: 106499]
[glusterd-handler.c:4502:__glusterd_handle_status_volume] 0-management:
Received status volume req for volume testvol
[2019-10-10 09:19:25.717254] I [MSGID: 106539]
[glusterd-utils.c:12461:glusterd_generate_and_set_task_id] 0-management:
Generated task-id 86cfc8b1-1e24-4244-b8e0-6941f4684234 for key rebalance-id
[2019-10-10 09:19:30.751060] I [rpc-clnt.c:1014:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2019-10-10 09:19:30.751284] E [MSGID: 106061]
[glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd:
failed to get index from rsp dict
[2019-10-10 09:19:35.761694] E [MSGID: 106061]
[glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd:
failed to get index from rsp dict
[2019-10-10 09:19:35.767505] I [MSGID: 106172]
[glusterd-handshake.c:1085:__server_event_notify] 0-glusterd: received defrag
status updated
[2019-10-10 09:19:35.773243] I [MSGID: 106007]
[glusterd-rebalance.c:153:__glusterd_defrag_notify] 0-management: Rebalance
process for volume testvol has disconnected.
[2019-10-10 09:19:39.436119] E [MSGID: 106061]
[glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd:
failed to get index from rsp dict
[2019-10-10 09:19:39.436978] E [MSGID: 106061]
[glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd:
failed to get index from rsp dict
[2019-10-10 09:31:36.682991] I [MSGID: 106488]
[glusterd-handler.c:1564:__glusterd_handle_cli_get_volume] 0-management:
Received get vol req
[2019-10-10 09:31:36.684006] I [MSGID: 106488]
[glusterd-handler.c:1564:__glusterd_handle_cli_get_volume] 0-management:
Received get vol req

--- Additional comment from RHEL Product and Program Management on 2019-10-10
15:06:22 IST ---

This bug is automatically being proposed for the next minor release of Red Hat
Gluster Storage by setting the release flag 'rhgs-3.5.0' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Bala Konda Reddy M on 2019-10-10 15:16:01 IST ---

The setup is left in the same state for further debugging.

IP: 10.70.35.11
Credentials: root/1

Regards,
Bala


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1760261
[Bug 1760261] rebalance start is succeeding when quorum is not met