[Gluster-users] canceling full heal 3.8

Sat Aug 27 20:07:32 UTC 2016

On Sat, Aug 27, 2016 at 9:58 AM, David Gossage <dgossage at carouselchecks.com>
wrote:

> On Fri, Aug 26, 2016 at 8:40 PM, David Gossage <
> dgossage at carouselchecks.com> wrote:
>
>> I was in the process of redoing the underlying disk layout for a brick and
>> triggered a full heal.  Then I realized I had skipped a step: applying zfs
>> set xattr=sa, which is kind of important when running ZFS under Linux.
>>
>> Rather than wait however many hours until my TB of data heals, is there a
>> command in 3.8 to cancel a heal begun by gluster volume heal GLUSTER1
>> full?  If not, it won't be the end of the world, just a waste of time to
>> wait and then have to redo it after writing out a TB of data.
>>
>>
> Does the heal process crawl from any particular node when invoked?  I have
> 3 nodes.  I ran the command from node 3; node 2 is the one with files
> needing to be healed; node 1 is the brick I healed yesterday but forgot to
> set xattr=sa on, which usually gives bad performance on zfsonlinux.  I did
> set it about 30 minutes into the heal, figuring better some than none until
> I could redo it again.
>
> 12 hours later the 1TB of data was healed, so I figured I'd move on to node
> 2, then 3.  Then, assuming 12-hour windows for each node, I could redo node 1
> with correct settings before Monday.  When node 1 healed, it first found all
> the visible files from the mount point and .glusterfs, then the numbers
> jumped back up after those were done and it started finding shards.  That
> happened fairly quickly.  The 2nd time around, with node 2, it is crawling
> to a standstill while finding all the shards to heal.  I'm wondering if
> it's doing the crawl from node 1 and the poor settings that existed for the
> first 30 minutes of file heals are slowing it down.  If so, I would hope
> that once the files that were created/healed while the settings weren't
> correct are found and it moves past them, the rest should go faster.
>
> The only errors in any logs are in the brick logs:
>
> [2016-08-27 14:25:10.022786] E [MSGID: 115050]
> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 3251237:
> LOOKUP (null) (00000000-0000-0000-0000-000000000000/4c7d44fc-a0c1-413b-8dc4-2abbbe1d4d4f.423)
> ==> (Invalid argument) [Invalid argument]
> [2016-08-27 14:36:59.234073] W [MSGID: 115009]
> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
> type for (null) (LOOKUP)
> [2016-08-27 14:36:59.234128] E [MSGID: 115050]
> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 3288322:
> LOOKUP (null) (00000000-0000-0000-0000-000000000000/4c7d44fc-a0c1-413b-8dc4-2abbbe1d4d4f.328)
> ==> (Invalid argument) [Invalid argument]
>
> And I would hope that it's just related to the heal process, or that when a
> shard is hit and found not to exist here, it errors out as expected.
>
>
>
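To see which shards the LOOKUP errors quoted above refer to, the brick log can be tallied with a short script. This is only a rough sketch: it assumes the token after "LOOKUP (null)" has the form <parent-gfid>/<base-gfid>.<shard-index>, as in the two entries above, and the function name failed_shard_lookups is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed shape of the token in the errors above:
#   LOOKUP (null) (<parent-gfid>/<base-gfid>.<shard-index>)
SHARD_RE = re.compile(
    r"LOOKUP \(null\) \(([0-9a-f-]{36})/([0-9a-f-]{36})\.(\d+)\)"
)


def failed_shard_lookups(log_text):
    """Return {base_gfid: sorted shard indexes} for failed shard LOOKUPs."""
    # Collapse whitespace so entries that were wrapped across lines still match.
    flat = " ".join(log_text.split())
    shards = defaultdict(set)
    for _parent, base, index in SHARD_RE.findall(flat):
        shards[base].add(int(index))
    return {base: sorted(indexes) for base, indexes in shards.items()}
```

Fed the contents of a brick log, it returns one entry per base file with the shard numbers that failed lookup, which makes it easier to tell whether the errors cluster on a few files or are spread across all of them.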
7 hours after starting the full heal, shards still haven't started healing,
and the count from heal statistics heal-count has only reached 1800 out of
19000 shards.  The shards dir hasn't even been recreated yet.  Creation of
the non-sharded stubs (do they have a more official term?) in the visible
mount point was as speedy as expected.  Shards are painfully slow.

>> *David Gossage*
>> *Carousel Checks Inc. | System Administrator*
>> *Office* 708.613.2284
>>
>
>