[Gluster-users] 3.8.3 Shards Healing Glacier Slow

Krutika Dhananjay kdhananj at redhat.com
Tue Aug 30 04:44:54 UTC 2016


OK. Do you also have granular-entry-heal on? Just so that I can isolate
the problem area.
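For reference, a quick way to check (and, if needed, toggle) that option from any node; `VOLNAME` is a placeholder for the actual volume name:

```shell
# Show the current value of the granular entry heal option
gluster volume get VOLNAME cluster.granular-entry-heal

# Enable it explicitly if it is off
gluster volume set VOLNAME cluster.granular-entry-heal on
```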

-Krutika

On Tue, Aug 30, 2016 at 9:55 AM, Darrell Budic <budic at onholyground.com>
wrote:

> I noticed that my new brick (replacement disk) did not have a .shard
> directory created on the brick, if that helps.
>
> I removed the affected brick from the volume and then wiped the disk, did
> an add-brick, and everything healed right up. I didn’t try and set any
> attrs or anything else, just removed and added the brick as new.
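The remove-and-re-add sequence described above, sketched with placeholder names (VOLNAME, host, device, and brick path are assumptions; on a replicated volume the replica count has to be stepped down and back up, and the wipe step is destructive):

```shell
# Drop the affected brick, reducing the replica count (placeholders throughout)
gluster volume remove-brick VOLNAME replica 2 host3:/gluster1/BRICK1/1 force

# Wipe and re-create the brick filesystem and directory
mkfs.xfs -f /dev/sdX      # destroys the disk contents -- double-check the device
mount /dev/sdX /gluster1
mkdir -p /gluster1/BRICK1/1

# Re-add the brick as new and let self-heal repopulate it
gluster volume add-brick VOLNAME replica 3 host3:/gluster1/BRICK1/1
gluster volume heal VOLNAME full
```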
>
> On Aug 29, 2016, at 9:49 AM, Darrell Budic <budic at onholyground.com> wrote:
>
> Just to let you know, I’m seeing the same issue under 3.7.14 on CentOS 7.
> Some content was healed correctly; now all the shards are queued up in the
> heal list, but nothing is healing. Got brick errors logged similar to the
> ones David was getting, on the brick that isn’t healing:
>
> [2016-08-29 03:31:40.436110] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822:
> LOOKUP (null) (00000000-0000-0000-0000-000000000000/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29)
> ==> (Invalid argument) [Invalid argument]
> [2016-08-29 03:31:43.005013] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802:
> LOOKUP (null) (00000000-0000-0000-0000-000000000000/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40)
> ==> (Invalid argument) [Invalid argument]
>
> This was after replacing the drive the brick was on and trying to get it
> back into the system by setting the volume's fattr on the brick dir. I’ll
> try the suggested method here on it shortly.
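The fattr approach mentioned above usually means stamping the new brick directory with the volume-id extended attribute copied from a healthy brick; roughly like this, with paths and the id value as placeholders:

```shell
# On a node with a healthy brick: read the volume id off the brick root
getfattr -n trusted.glusterfs.volume-id -e hex /gluster1/BRICK1/1

# On the node with the replaced disk: write the same id onto the new
# brick directory, then force-start the volume to respawn the brick process
setfattr -n trusted.glusterfs.volume-id -v 0x<ID_FROM_ABOVE> /gluster1/BRICK1/1
gluster volume start VOLNAME force
```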
>
>   -Darrell
>
>
> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay <kdhananj at redhat.com>
> wrote:
>
> Got it. Thanks.
>
> I tried the same test and shd crashed with SIGABRT (well, that's because I
> compiled from src with -DDEBUG).
> In any case, this error would prevent full heal from proceeding further.
> I'm debugging the crash now. Will let you know when I have the RC.
>
> -Krutika
>
> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <
> dgossage at carouselchecks.com> wrote:
>
>>
>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <
>> dgossage at carouselchecks.com> wrote:
>>
>>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay <kdhananj at redhat.com>
>>> wrote:
>>>
>>>> Could you attach both client and brick logs? Meanwhile I will try these
>>>> steps out on my machines and see if it is easily recreatable.
>>>>
>>>>
>>> Hoping 7z files are accepted by mail server.
>>>
>>
>> Looks like the zip file is awaiting approval due to size.
>>
>>>
>>> -Krutika
>>>>
>>>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>>>> dgossage at carouselchecks.com> wrote:
>>>>
>>>>> Centos 7 Gluster 3.8.3
>>>>>
>>>>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>>>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>>>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>>>> Options Reconfigured:
>>>>> cluster.data-self-heal-algorithm: full
>>>>> cluster.self-heal-daemon: on
>>>>> cluster.locking-scheme: granular
>>>>> features.shard-block-size: 64MB
>>>>> features.shard: on
>>>>> performance.readdir-ahead: on
>>>>> storage.owner-uid: 36
>>>>> storage.owner-gid: 36
>>>>> performance.quick-read: off
>>>>> performance.read-ahead: off
>>>>> performance.io-cache: off
>>>>> performance.stat-prefetch: on
>>>>> cluster.eager-lock: enable
>>>>> network.remote-dio: enable
>>>>> cluster.quorum-type: auto
>>>>> cluster.server-quorum-type: server
>>>>> server.allow-insecure: on
>>>>> cluster.self-heal-window-size: 1024
>>>>> cluster.background-self-heal-count: 16
>>>>> performance.strict-write-ordering: off
>>>>> nfs.disable: on
>>>>> nfs.addr-namelookup: off
>>>>> nfs.enable-ino32: off
>>>>> cluster.granular-entry-heal: on
>>>>>
>>>>> Friday did rolling upgrade from 3.8.3->3.8.3, no issues.
>>>>> Following steps detailed in previous recommendations, began the process
>>>>> of replacing and healing bricks one node at a time:
>>>>>
>>>>> 1) kill pid of brick
>>>>> 2) reconfigure brick from raid6 to raid10
>>>>> 3) recreate directory of brick
>>>>> 4) gluster volume start <> force
>>>>> 5) gluster volume heal <> full
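The steps above can be sketched as commands (volume and brick names are taken from the info output earlier; the RAID reshape is hardware-specific and only noted as a comment):

```shell
# 1) find and kill the brick pid on the node being rebuilt
gluster volume status GLUSTER1 | grep ccgl2
kill <BRICK_PID>

# 2) reconfigure the underlying storage raid6 -> raid10 (hardware-specific)

# 3) recreate the brick directory on the rebuilt array
mkdir -p /gluster1/BRICK1/1

# 4) restart the brick process and 5) trigger a full heal
gluster volume start GLUSTER1 force
gluster volume heal GLUSTER1 full
```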
>>>>>
>>>>> 1st node worked as expected, took 12 hours to heal 1TB data.  Load was
>>>>> a little heavy but nothing shocking.
>>>>>
>>>>> About an hour after node 1 finished I began the same process on node 2.
>>>>> Heal process kicked in as before and the files in directories visible
>>>>> from mount and .glusterfs healed in short time.  Then it began a crawl
>>>>> of .shard, adding those files to the heal count, at which point the
>>>>> entire process basically ground to a halt.  After 48 hours, out of 19k
>>>>> shards it has added 5900 to the heal list.  Load on all 3 machines is
>>>>> negligible.  It was suggested to change cluster.data-self-heal-algorithm
>>>>> to full and restart the volume, which I did.  No effect.  Tried
>>>>> relaunching the heal, no effect, regardless of which node I picked.  I
>>>>> started each VM and performed a stat of all files from within it, or a
>>>>> full virus scan, and that seemed to cause short small spikes in shards
>>>>> added, but not by much.  Logs are showing no real messages indicating
>>>>> anything is going on.  I get hits to the brick log on occasion of null
>>>>> lookups, making me think it's not really crawling the shards directory
>>>>> but waiting for a shard lookup to add it.  I'll get the following in the
>>>>> brick log, but not constantly, and sometimes multiple times for the same
>>>>> shard.
>>>>>
>>>>> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
>>>>> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no
>>>>> resolution type for (null) (LOOKUP)
>>>>> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
>>>>> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server:
>>>>> 12591783: LOOKUP (null) (00000000-0000-0000-00
>>>>> 00-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==>
>>>>> (Invalid argument) [Invalid argument]
>>>>>
>>>>> This one repeated about 30 times in a row, then nothing for 10 minutes,
>>>>> then one hit for a different shard by itself.
>>>>>
>>>>> How can I determine if the heal is actually running?  How can I kill it
>>>>> or force a restart?  Does the node I start it from determine which
>>>>> directory gets crawled to determine heals?
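A few ways to watch heal progress without relying on the brick logs; volume name taken from the info output above (whether these fully answer the crawl-origin question is uncertain):

```shell
# Entries currently pending heal, listed per brick
gluster volume heal GLUSTER1 info

# Statistics of past and in-progress heal crawls, per brick:
# shows crawl type, start/end times, and entry counts
gluster volume heal GLUSTER1 statistics

# Just the pending-entry counts, one number per brick
gluster volume heal GLUSTER1 statistics heal-count

# Status of the self-heal daemon processes on each node
gluster volume status GLUSTER1 shd
```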
>>>>>
>>>>> *David Gossage*
>>>>> *Carousel Checks Inc. | System Administrator*
>>>>> *Office* 708.613.2284
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>
>>>>
>>>>
>>>
>>
>
>
>
>
>
