[Gluster-users] 3.8.3 Shards Healing Glacier Slow

Krutika Dhananjay kdhananj at redhat.com
Tue Aug 30 05:03:07 UTC 2016


Ignore. I just realised you're on 3.7.14, so the problem may not be with
the granular entry self-heal feature.

-Krutika

On Tue, Aug 30, 2016 at 10:14 AM, Krutika Dhananjay <kdhananj at redhat.com>
wrote:

> OK. Do you also have granular-entry-heal on? Just so that I can isolate
> the problem area.
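>
> (If it's quicker, a rough way to check -- <volname> is a placeholder for
> whatever your volume is called:
>
>   gluster volume info <volname> | grep granular-entry-heal
>
> It shows up under "Options Reconfigured" if it has ever been set.)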
>
> -Krutika
>
> On Tue, Aug 30, 2016 at 9:55 AM, Darrell Budic <budic at onholyground.com>
> wrote:
>
>> I noticed that my new brick (replacement disk) did not have a .shard
>> directory created on the brick, if that helps.
>>
>> I removed the affected brick from the volume and then wiped the disk, did
>> an add-brick, and everything healed right up. I didn’t try and set any
>> attrs or anything else, just removed and added the brick as new.
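>>
>> (Roughly, and assuming a replica 3 volume -- the names and paths below are
>> placeholders rather than my exact commands:
>>
>>   gluster volume remove-brick <volname> replica 2 <host>:<brick-path> force
>>   # wipe and re-create the brick filesystem/directory
>>   gluster volume add-brick <volname> replica 3 <host>:<brick-path>
>>
>> Self-heal then repopulated the new brick, .shard included.)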
>>
>> On Aug 29, 2016, at 9:49 AM, Darrell Budic <budic at onholyground.com>
>> wrote:
>>
>> Just to let you know, I'm seeing the same issue under 3.7.14 on CentOS 7.
>> Some content was healed correctly, but now all the shards are queued up in
>> the heal list and nothing is healing. I got brick errors similar to the
>> ones David was getting, logged on the brick that isn't healing:
>>
>> [2016-08-29 03:31:40.436110] E [MSGID: 115050]
>> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822:
>> LOOKUP (null) (00000000-0000-0000-0000-000000000000
>> /0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29) ==> (Invalid argument)
>> [Invalid argument]
>> [2016-08-29 03:31:43.005013] E [MSGID: 115050]
>> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802:
>> LOOKUP (null) (00000000-0000-0000-0000-000000000000
>> /0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40) ==> (Invalid argument)
>> [Invalid argument]
>>
>> This was after replacing the drive the brick was on and trying to get it
>> back into the system by setting the volume's fattr on the brick dir. I'll
>> try the method suggested here on it shortly.
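>>
>> (The fattr step was roughly the following, with placeholder paths and the
>> volume id read off a surviving brick first:
>>
>>   getfattr -n trusted.glusterfs.volume-id -e hex <path-to-good-brick>
>>   setfattr -n trusted.glusterfs.volume-id -v 0x<volume-id> <path-to-new-brick>
>> )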
>>
>>   -Darrell
>>
>>
>> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay <kdhananj at redhat.com>
>> wrote:
>>
>> Got it. Thanks.
>>
>> I tried the same test and shd crashed with SIGABRT (well, that's because
>> I compiled from src with -DDEBUG).
>> In any case, this error would prevent full heal from proceeding further.
>> I'm debugging the crash now. Will let you know when I have the RC.
>>
>> -Krutika
>>
>> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <
>> dgossage at carouselchecks.com> wrote:
>>
>>>
>>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <
>>> dgossage at carouselchecks.com> wrote:
>>>
>>>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay <kdhananj at redhat.com
>>>> > wrote:
>>>>
>>>>> Could you attach both client and brick logs? Meanwhile I will try
>>>>> these steps out on my machines and see if it is easily recreatable.
>>>>>
>>>>>
>>>> Hoping 7z files are accepted by the mail server.
>>>>
>>>
>>> Looks like the zip file is awaiting approval due to its size.
>>>
>>>>
>>>> -Krutika
>>>>>
>>>>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>>>>> dgossage at carouselchecks.com> wrote:
>>>>>
>>>>>> Centos 7 Gluster 3.8.3
>>>>>>
>>>>>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>>>>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>>>>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>>>>> Options Reconfigured:
>>>>>> cluster.data-self-heal-algorithm: full
>>>>>> cluster.self-heal-daemon: on
>>>>>> cluster.locking-scheme: granular
>>>>>> features.shard-block-size: 64MB
>>>>>> features.shard: on
>>>>>> performance.readdir-ahead: on
>>>>>> storage.owner-uid: 36
>>>>>> storage.owner-gid: 36
>>>>>> performance.quick-read: off
>>>>>> performance.read-ahead: off
>>>>>> performance.io-cache: off
>>>>>> performance.stat-prefetch: on
>>>>>> cluster.eager-lock: enable
>>>>>> network.remote-dio: enable
>>>>>> cluster.quorum-type: auto
>>>>>> cluster.server-quorum-type: server
>>>>>> server.allow-insecure: on
>>>>>> cluster.self-heal-window-size: 1024
>>>>>> cluster.background-self-heal-count: 16
>>>>>> performance.strict-write-ordering: off
>>>>>> nfs.disable: on
>>>>>> nfs.addr-namelookup: off
>>>>>> nfs.enable-ino32: off
>>>>>> cluster.granular-entry-heal: on
>>>>>>
>>>>>> On Friday I did a rolling upgrade from 3.8.3->3.8.3 with no issues.
>>>>>> Following the steps detailed in previous recommendations, I began the
>>>>>> process of replacing and healing bricks one node at a time (rough
>>>>>> commands are sketched after the list):
>>>>>>
>>>>>> 1) kill pid of brick
>>>>>> 2) reconfigure brick from raid6 to raid10
>>>>>> 3) recreate directory of brick
>>>>>> 4) gluster volume start <> force
>>>>>> 5) gluster volume heal <> full
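>>>>>>
>>>>>> (Roughly, with GLUSTER1 as the volume name and the brick paths listed
>>>>>> above, that comes out to something like the following -- the brick pid
>>>>>> comes from gluster volume status:
>>>>>>
>>>>>>   kill <brick-pid>
>>>>>>   # rebuild the array and recreate /gluster1/BRICK1/1
>>>>>>   gluster volume start GLUSTER1 force
>>>>>>   gluster volume heal GLUSTER1 full
>>>>>> )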
>>>>>>
>>>>>> The 1st node worked as expected and took 12 hours to heal 1TB of data.
>>>>>> Load was a little heavy but nothing shocking.
>>>>>>
>>>>>> About an hour after node 1 finished, I began the same process on node 2.
>>>>>> The heal process kicked in as before, and the files in the directories
>>>>>> visible from the mount and in .glusterfs healed in short order. Then it
>>>>>> began the crawl of .shard, adding those files to the heal count, at which
>>>>>> point the entire process basically ground to a halt. After 48 hours it
>>>>>> has added only 5900 of the 19k shards to the heal list. Load on all 3
>>>>>> machines is negligible. It was suggested to set
>>>>>> cluster.data-self-heal-algorithm to full and restart the volume, which I
>>>>>> did. No effect. Relaunching the heal also had no effect, regardless of
>>>>>> which node I picked. I started each VM and performed a stat of all files
>>>>>> from within it, or a full virus scan, and that seemed to cause short
>>>>>> small spikes in shards added, but not by much. The logs show no real
>>>>>> messages indicating anything is going on. I get occasional hits in the
>>>>>> brick log for null lookups, which makes me think it is not really
>>>>>> crawling the .shard directory but is waiting for a shard lookup to add
>>>>>> each one. I'll get the following in the brick log, but not constantly,
>>>>>> and sometimes multiple entries for the same shard.
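>>>>>>
>>>>>> (The change and heal relaunch amounted to roughly the following, with
>>>>>> the volume restart in between:
>>>>>>
>>>>>>   gluster volume set GLUSTER1 cluster.data-self-heal-algorithm full
>>>>>>   # restart the volume, then:
>>>>>>   gluster volume heal GLUSTER1 full
>>>>>> )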
>>>>>>
>>>>>> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
>>>>>> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no
>>>>>> resolution type for (null) (LOOKUP)
>>>>>> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
>>>>>> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server:
>>>>>> 12591783: LOOKUP (null) (00000000-0000-0000-00
>>>>>> 00-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==>
>>>>>> (Invalid argument) [Invalid argument]
>>>>>>
>>>>>> This one repeated about 30 times in a row, then nothing for 10 minutes,
>>>>>> then a single hit for a different shard by itself.
>>>>>>
>>>>>> How can I determine whether the heal is actually running? How can I
>>>>>> kill it or force a restart? Does the node I start it from determine
>>>>>> which directory gets crawled to determine heals?
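>>>>>>
>>>>>> (So far I've been watching it with roughly these -- they show what is
>>>>>> queued per brick, but not whether the full-heal crawl is still moving:
>>>>>>
>>>>>>   gluster volume heal GLUSTER1 info
>>>>>>   gluster volume heal GLUSTER1 statistics heal-count
>>>>>>   gluster volume status GLUSTER1
>>>>>> )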
>>>>>>
>>>>>> *David Gossage*
>>>>>> *Carousel Checks Inc. | System Administrator*
>>>>>> *Office* 708.613.2284
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gluster-users mailing list
>>>>>> Gluster-users at gluster.org
>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>>
>>
>

