[Gluster-users] 3.8.3 Shards Healing Glacier Slow

David Gossage dgossage at carouselchecks.com
Mon Aug 29 09:01:35 UTC 2016


Centos 7 Gluster 3.8.3

Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
Options Reconfigured:
cluster.data-self-heal-algorithm: full
cluster.self-heal-daemon: on
cluster.locking-scheme: granular
features.shard-block-size: 64MB
features.shard: on
performance.readdir-ahead: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
server.allow-insecure: on
cluster.self-heal-window-size: 1024
cluster.background-self-heal-count: 16
performance.strict-write-ordering: off
nfs.disable: on
nfs.addr-namelookup: off
nfs.enable-ino32: off
cluster.granular-entry-heal: on

On Friday I did a rolling upgrade to 3.8.3 with no issues. Following the
steps detailed in previous recommendations, I began the process of
replacing and healing bricks one node at a time (a command sketch follows
the numbered steps below).

1) kill pid of brick
2) reconfigure brick from raid6 to raid10
3) recreate directory of brick
4) gluster volume start <> force
5) gluster volume heal <> full
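For clarity, those steps as commands looked roughly like this (volume name
GLUSTER1 and brick path taken from the volume info above; the RAID6 to
RAID10 rebuild in step 2 was done with the controller tools, so it is not
shown):

    # 1) stop the brick process on the node being rebuilt
    gluster volume status GLUSTER1        # note the PID of this node's brick
    kill <brick pid>

    # 3) once the new filesystem is mounted, recreate the empty brick directory
    mkdir -p /gluster1/BRICK1/1

    # 4) bring the empty brick back online
    gluster volume start GLUSTER1 force

    # 5) trigger a full heal onto it
    gluster volume heal GLUSTER1 full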

The 1st node worked as expected and took 12 hours to heal 1TB of data.
Load was a little heavy but nothing shocking.

About an hour after node 1 finished I began the same process on node 2.
The heal process kicked in as before, and the files in directories visible
from the mount and in .glusterfs healed in a short time.  Then it began the
crawl of .shard, adding those files to the heal count, at which point the
entire process basically ground to a halt.  After 48 hours it has added
only 5900 of the 19k shards to the heal list.  Load on all 3 machines is
negligible.  It was suggested to set cluster.data-self-heal-algorithm to
full and restart the volume, which I did.  No effect.  I tried relaunching
the heal with no effect, regardless of which node I issued it from.  I
started each VM and performed a stat of all files from within it, or a full
virus scan, and that seemed to cause short small spikes in shards added,
but not by much.  The logs show no real messages indicating anything is
going on.  I get occasional hits in the brick log for null lookups, which
makes me think it's not really crawling the .shard directory but waiting
for a shard lookup to add it.  I'll get the following in the brick log, but
not constantly, and sometimes multiple entries for the same shard.

[2016-08-29 08:31:57.478125] W [MSGID: 115009]
[server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type
for (null) (LOOKUP)
[2016-08-29 08:31:57.478170] E [MSGID: 115050]
[server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
LOOKUP (null) (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221)
==> (Invalid argument) [Invalid argument]

This one repeated about 30 times in a row, then nothing for 10 minutes,
then one hit for a different shard by itself.
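In case I'm simply looking in the wrong place, this is roughly what I have
been using to watch the heal (GLUSTER1 being the volume); the shard counts
above come from heal info:

    # per-brick count of entries still pending heal
    gluster volume heal GLUSTER1 statistics heal-count

    # crawl statistics: when the self-heal daemon last started/finished a crawl
    gluster volume heal GLUSTER1 statistics

    # watch the self-heal daemon log for crawl/heal activity
    tail -f /var/log/glusterfs/glustershd.log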

How can I determine if the heal is actually running?  How can I kill it or
force a restart?  Does the node I start it from determine which directory
gets crawled to determine heals?
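If restarting it is the right move, is it just a matter of bouncing the
self-heal daemon and re-issuing the full heal, roughly like this, or is
there a cleaner way?

    # check whether the self-heal daemon is running on each node
    gluster volume status GLUSTER1

    # bounce the self-heal daemon on all nodes
    gluster volume set GLUSTER1 cluster.self-heal-daemon off
    gluster volume set GLUSTER1 cluster.self-heal-daemon on

    # then kick off the full heal again
    gluster volume heal GLUSTER1 full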

David Gossage
Carousel Checks Inc. | System Administrator
Office 708.613.2284