[Bugs] [Bug 1398566] New: self-heal info command hangs after triggering self-heal

bugzilla at redhat.com
Fri Nov 25 09:55:28 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1398566

            Bug ID: 1398566
           Summary: self-heal info command hangs after triggering
                    self-heal
           Product: GlusterFS
           Version: mainline
         Component: replicate
          Keywords: Triaged
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: kdhananj at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    rhs-bugs at redhat.com, sasundar at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1396166



+++ This bug was initially created as a clone of Bug #1396166 +++

Description of problem:
------------------------
After issuing 'gluster volume heal', 'gluster volume heal info' hangs when
compound fops are enabled on the replica 3 volume.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEL 7.3
RHGS 3.2.0 interim build (glusterfs-3.8.4-5.el7rhgs)

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a replica 3 volume
2. Optimize the volume for the VM store use case
3. Enable compound fops on the volume
4. Create a VM and install an OS on it
5. While the OS installation is in progress, kill brick1 on server1
6. After the OS installation completes, bring the brick back up
7. Trigger self-heal on the volume
8. Get the self-heal info
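
For reference, a rough sketch of the CLI sequence behind the steps above, using
the volume and brick names from the 'gluster volume info' output further down.
The virt option group and the compound-fops option name are assumptions here
and may differ on your build:

# Step 1: create and start the replica 3 volume
gluster volume create volume1 replica 3 \
    server1:/gluster/brick1/b1 server2:/gluster/brick1/b1 server3:/gluster/brick1/b1
gluster volume start volume1

# Step 2: apply the virt group options (VM store use case)
gluster volume set volume1 group virt

# Step 3: enable compound fops (option name assumed)
gluster volume set volume1 cluster.use-compound-fops on

# Step 5: on server1, find the brick PID from 'gluster volume status volume1'
# and kill it with 'kill -KILL <brick-pid>'

# Step 6: bring the killed brick back up
gluster volume start volume1 force

# Steps 7 and 8: trigger self-heal, then query heal info (the command that hangs)
gluster volume heal volume1
gluster volume heal volume1 info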

Actual results:
---------------
The 'self-heal info' command hangs.

Expected results:
-----------------
'self-heal info' should report the correct information about unsynced entries.

Additional info:
----------------
When compound-fops is disabled on the volume, this issue is not seen
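
Accordingly, a likely workaround until the hang is fixed is to turn compound
fops back off on the volume (option name assumed to be
cluster.use-compound-fops; adjust to the name used in your build):

gluster volume set volume1 cluster.use-compound-fops off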

--- Additional comment from SATHEESARAN on 2016-11-17 11:19:40 EST ---

1. Cluster info
---------------
There are 3 hosts in the cluster. All of them are VMs running the RHGS interim
build on RHEL 7.3.

[root@server1 ~]# gluster peer status
Number of Peers: 2

Hostname: server2
Uuid: 209154aa-836f-47c1-8446-a5c5d15eb566
State: Peer in Cluster (Connected)

Hostname: server3
Uuid: e88a05e5-7772-4b31-9b7f-a1de1509adb7
State: Peer in Cluster (Connected)

2. gluster volume info
-----------------------
[root@server1 ~]# gluster volume info

Volume Name: volume1
Type: Replicate
Volume ID: aa01f3d2-4ba2-4747-893e-84058788f1dd
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/brick1/b1
Brick2: server2:/gluster/brick1/b1
Brick3: server3:/gluster/brick1/b1
Options Reconfigured:
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
performance.low-prio-threads: 32
features.shard-block-size: 512MB
features.shard: on
storage.owner-gid: 107
storage.owner-uid: 107
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on


--- Additional comment from Krutika Dhananjay on 2016-11-22 10:22:39 EST ---

You do have the brick statedumps too, don't you? Could you please attach those
as well?

-Krutika

--- Additional comment from SATHEESARAN on 2016-11-23 02:04:43 EST ---

(In reply to Krutika Dhananjay from comment #7)
> You do have the brick statedumps too, don't you? Could you please attach
> those as well?
> 
> -Krutika

Hi Krutika,

I have mistakenly re-provisioned my third server in the cluster to simulate a
failed-node scenario.

But I have the brick statedumps from server1 and server2. I will attach them.
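
For completeness, a sketch of how the brick statedumps mentioned above are
typically captured on each server (the default dump directory is assumed to be
/var/run/gluster; it can be changed via the server.statedump-path option):

# generate statedumps for all bricks of the volume
gluster volume statedump volume1

# the per-brick dump files land in the statedump directory
ls /var/run/gluster/*.dump.*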


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1396166
[Bug 1396166] self-heal info command hangs after triggering self-heal