[Gluster-users] 3.8.3 Shards Healing Glacier Slow

Krutika Dhananjay kdhananj at redhat.com
Mon Aug 29 12:25:26 UTC 2016


Got it. Thanks.

I tried the same test and shd crashed with SIGABRT (well, that's because I
compiled from src with -DDEBUG).
In any case, this error would prevent full heal from proceeding further.
I'm debugging the crash now. Will let you know when I have the RC.
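
If anyone wants to poke at a similar crash, a backtrace can be pulled
from the shd core with gdb along these lines (a sketch; the core path
is illustrative and depends on your core_pattern):

    # glustershd runs as a glusterfs process; open the core it dumped
    gdb -batch -ex 'thread apply all bt full' \
        /usr/sbin/glusterfs /path/to/core.<pid> > shd-backtrace.txt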

-Krutika

On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <dgossage at carouselchecks.com>
wrote:

>
> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <
> dgossage at carouselchecks.com> wrote:
>
>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay <kdhananj at redhat.com>
>> wrote:
>>
>>> Could you attach both client and brick logs? Meanwhile I will try these
>>> steps out on my machines and see if it is easily recreatable.
>>>
>>>
>> Hoping 7z files are accepted by the mail server.
>>
>
> looks like the zip file is awaiting approval due to its size
>
>>
>> -Krutika
>>>
>>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>>> dgossage at carouselchecks.com> wrote:
>>>
>>>> Centos 7 Gluster 3.8.3
>>>>
>>>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>>> Options Reconfigured:
>>>> cluster.data-self-heal-algorithm: full
>>>> cluster.self-heal-daemon: on
>>>> cluster.locking-scheme: granular
>>>> features.shard-block-size: 64MB
>>>> features.shard: on
>>>> performance.readdir-ahead: on
>>>> storage.owner-uid: 36
>>>> storage.owner-gid: 36
>>>> performance.quick-read: off
>>>> performance.read-ahead: off
>>>> performance.io-cache: off
>>>> performance.stat-prefetch: on
>>>> cluster.eager-lock: enable
>>>> network.remote-dio: enable
>>>> cluster.quorum-type: auto
>>>> cluster.server-quorum-type: server
>>>> server.allow-insecure: on
>>>> cluster.self-heal-window-size: 1024
>>>> cluster.background-self-heal-count: 16
>>>> performance.strict-write-ordering: off
>>>> nfs.disable: on
>>>> nfs.addr-namelookup: off
>>>> nfs.enable-ino32: off
>>>> cluster.granular-entry-heal: on
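>>>>
>>>> (For reference, these are runtime options; e.g., with the volume name
>>>> GLUSTER1 as it appears in the brick logs below, they can be checked
>>>> and changed like so:)
>>>>
>>>>     gluster volume get GLUSTER1 cluster.data-self-heal-algorithm
>>>>     gluster volume set GLUSTER1 cluster.data-self-heal-algorithm full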
>>>>
>>>> On Friday I did a rolling upgrade from 3.8.3->3.8.3 with no issues.
>>>> Following the steps detailed in previous recommendations, I began the
>>>> process of replacing and healing bricks one node at a time (rough
>>>> commands sketched after the list below).
>>>>
>>>> 1) kill pid of brick
>>>> 2) reconfigure brick from raid6 to raid10
>>>> 3) recreate directory of brick
>>>> 4) gluster volume start <> force
>>>> 5) gluster volume heal <> full
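>>>>
>>>> Roughly, as commands (volume name GLUSTER1 as in the brick logs below;
>>>> brick path as above; the pid and the raid rebuild vary per node):
>>>>
>>>>     # 1) find and kill this node's brick process
>>>>     gluster volume status GLUSTER1      # note the brick's PID
>>>>     kill <brick-pid>
>>>>     # 2) rebuild the underlying storage raid6 -> raid10 (outside gluster)
>>>>     # 3) recreate the now-empty brick directory
>>>>     mkdir -p /gluster1/BRICK1/1
>>>>     # 4) force-start the volume so the empty brick rejoins
>>>>     gluster volume start GLUSTER1 force
>>>>     # 5) kick off a full heal to repopulate the brick
>>>>     gluster volume heal GLUSTER1 full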
>>>>
>>>> The 1st node worked as expected and took 12 hours to heal 1TB of data.
>>>> Load was a little heavy but nothing shocking.
>>>>
>>>> About an hour after node 1 finished I began the same process on node 2.
>>>> The heal process kicked in as before, and the files in directories
>>>> visible from the mount and in .glusterfs healed in short order.  Then it
>>>> began a crawl of .shard, adding those files to the heal count, at which
>>>> point the entire process basically ground to a halt.  After 48 hours it
>>>> has added only 5900 of 19k shards to the heal list (counted as sketched
>>>> below the log excerpt).  Load on all 3 machines is negligible.  It was
>>>> suggested to change cluster.data-self-heal-algorithm to full and restart
>>>> the volume, which I did.  No effect.  Tried relaunching the heal; no
>>>> effect, no matter which node I picked.  I started each VM and performed
>>>> a stat of all files from within it, or a full virus scan, and that
>>>> seemed to cause short small spikes in shards added, but not by much.
>>>> The logs show no real messages indicating anything is going on.  I get
>>>> occasional hits in the brick log for null lookups, which makes me think
>>>> it's not really crawling the .shard directory but waiting for a shard
>>>> lookup to add it.  I'll get the following in the brick log, but not
>>>> constantly, and sometimes multiple entries for the same shard.
>>>>
>>>> [2016-08-29 08:31:57.478125] W [MSGID: 115009] [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type for (null) (LOOKUP)
>>>> [2016-08-29 08:31:57.478170] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783: LOOKUP (null) (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid argument) [Invalid argument]
>>>>
>>>> This one repeated about 30 times in a row, then nothing for 10 minutes,
>>>> then a single hit for a different shard by itself.
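>>>>
>>>> For the counts above I've been watching the heal queue roughly like
>>>> this (a sketch; the statistics crawl can itself be slow on a large
>>>> volume):
>>>>
>>>>     # entries still pending heal, per brick
>>>>     gluster volume heal GLUSTER1 info | grep 'Number of entries'
>>>>     # cumulative counts from the self-heal daemon's crawls
>>>>     gluster volume heal GLUSTER1 statistics heal-count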
>>>>
>>>> How can I determine if the heal is actually running?  How can I kill it
>>>> or force a restart?  Does the node I start it from determine which
>>>> directory gets crawled to determine heals?
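>>>>
>>>> The closest I've found is checking and bouncing the self-heal daemon
>>>> itself (a sketch; I'm not certain this restarts the crawl):
>>>>
>>>>     # per-node shd status and PID ("Self-heal Daemon" lines)
>>>>     gluster volume status GLUSTER1
>>>>     # restart the self-heal daemons by toggling the option
>>>>     gluster volume set GLUSTER1 cluster.self-heal-daemon off
>>>>     gluster volume set GLUSTER1 cluster.self-heal-daemon on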
>>>>
>>>> *David Gossage*
>>>> *Carousel Checks Inc. | System Administrator*
>>>> *Office* 708.613.2284
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>
>>>
>>>
>>
>