[Gluster-users] Healing Delays

Sat Oct 1 14:48:22 UTC 2016

This was raised earlier but I don't believe it was ever resolved and it 
is becoming a serious issue for me.

I'm doing rolling upgrades on our three node cluster (Replica 3, 
Sharded, VM Workload).

I update one node, reboot it, wait for healing to complete, do the next one.

Only the heal count does not change, it just does not seem to start. It 
can take hours before it shifts, but once it does, its quite rapid. Node 
1 has restarted and the heal count has been static at 511 shards for 45 
minutes now. Nodes 1 & 2 have low CPU load, node 3 has glusterfsd pegged 
at 800% CPU.

This was *not* the case in earlier versions of gluster (3.7.11 I think), 
healing would start almost right away. I think it started doing this 
when the afr locking improvements where made.

I have experimented with full & diff heal modes, doesn't make any 
difference.

Current:

Gluster Version 4.8.4

Volume Name: datastore4
Type: Replicate
Volume ID: 0ba131ef-311d-4bb1-be46-596e83b2f6ce
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: vnb.proxmox.softlog:/tank/vmdata/datastore4
Brick2: vng.proxmox.softlog:/tank/vmdata/datastore4
Brick3: vna.proxmox.softlog:/tank/vmdata/datastore4
Options Reconfigured:
cluster.self-heal-window-size: 1024
cluster.locking-scheme: granular
cluster.granular-entry-heal: on
performance.readdir-ahead: on
cluster.data-self-heal: on
features.shard: on
cluster.quorum-type: auto
cluster.server-quorum-type: server
nfs.disable: on
nfs.addr-namelookup: off
nfs.enable-ino32: off
performance.strict-write-ordering: off
performance.stat-prefetch: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
cluster.eager-lock: enable
network.remote-dio: enable
features.shard-block-size: 64MB
cluster.background-self-heal-count: 16

Thanks,

-- 
Lindsay Mathieson