[Gluster-users] 3.8.3 Shards Healing Glacier Slow
David Gossage
dgossage at carouselchecks.com
Mon Aug 29 11:42:42 UTC 2016
On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <atalur at redhat.com> wrote:
> Response inline.
>
> ----- Original Message -----
> > From: "Krutika Dhananjay" <kdhananj at redhat.com>
> > To: "David Gossage" <dgossage at carouselchecks.com>
> > Cc: "gluster-users at gluster.org List" <Gluster-users at gluster.org>
> > Sent: Monday, August 29, 2016 3:55:04 PM
> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> >
> > Could you attach both client and brick logs? Meanwhile I will try these
> > steps out on my machines and see if it is easily recreatable.
> >
> > -Krutika
> >
> > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage
> > <dgossage at carouselchecks.com> wrote:
> >
> >
> >
> > Centos 7 Gluster 3.8.3
> >
> > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > Options Reconfigured:
> > cluster.data-self-heal-algorithm: full
> > cluster.self-heal-daemon: on
> > cluster.locking-scheme: granular
> > features.shard-block-size: 64MB
> > features.shard: on
> > performance.readdir-ahead: on
> > storage.owner-uid: 36
> > storage.owner-gid: 36
> > performance.quick-read: off
> > performance.read-ahead: off
> > performance.io-cache: off
> > performance.stat-prefetch: on
> > cluster.eager-lock: enable
> > network.remote-dio: enable
> > cluster.quorum-type: auto
> > cluster.server-quorum-type: server
> > server.allow-insecure: on
> > cluster.self-heal-window-size: 1024
> > cluster.background-self-heal-count: 16
> > performance.strict-write-ordering: off
> > nfs.disable: on
> > nfs.addr-namelookup: off
> > nfs.enable-ino32: off
> > cluster.granular-entry-heal: on
> >
> > On Friday I did a rolling upgrade from 3.8.3->3.8.3 with no issues.
> > Following the steps detailed in previous recommendations, I began the
> > process of replacing and healing bricks one node at a time.
> >
> > 1) kill pid of brick
> > 2) reconfigure brick from raid6 to raid10
> > 3) recreate directory of brick
> > 4) gluster volume start <> force
> > 5) gluster volume heal <> full
> Hi,
>
> I'd suggest not using full heal; there are a few bugs in it.
> Better safe than sorry ;)
> Instead I'd suggest the following steps:
>
For now I've brought the node down with systemctl stop glusterd, as I was
getting sporadic IO issues and a few VMs paused, so I'm hoping that will
help. I may wait to do this until around 4 PM, when most of the work is
done, in case it shoots the load up.
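
For reference, the exact sequence I used on node 1 was roughly the
following (volume name taken from my brick logs; the brick pid comes from
gluster volume status):

    gluster volume status GLUSTER1     # note the PID of this node's brick
    kill <brick-pid>
    # ...rebuild the array as raid10, remount, recreate the brick dir...
    mkdir -p /gluster1/BRICK1/1
    gluster volume start GLUSTER1 force
    gluster volume heal GLUSTER1 full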
> 1) kill pid of brick
> 2) do whatever reconfiguring of the brick you need
> 3) recreate brick dir
> 4) while the brick is still down, from the mount point:
>    a) create a dummy, non-existent dir under / of the mount.
>
So if node 2 is the down brick, do I pick another node, for example node 3,
and make a test dir under its brick directory that doesn't exist on node 2?
Or should I be doing this over a gluster mount?
>    b) set a non-existent extended attribute on / of the mount.
>
Could you give me an example of an attribute to set? I've read a tad on
this and looked up attributes, but haven't set any myself yet.
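
From what I've read so far, I'm guessing you mean something along these
lines (the attribute name user.dummy-heal is just something I made up for
illustration, and /mnt/glustertest is a stand-in for wherever the fuse
mount lives):

    # on a fuse mount of the volume, while the new brick is still down
    mkdir /mnt/glustertest/dummy-dir-that-never-existed
    setfattr -n user.dummy-heal -v 1 /mnt/glustertest

Is that roughly right, or does the xattr need to go in the trusted.*
namespace?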
> Doing these steps will ensure that heal happens only from the updated
> brick to the down brick.
> 5) gluster v start <> force
> 6) gluster v heal <>
>
Will it matter that the full heal command was run somewhere in gluster the
other day? I'm not sure if it eventually stops or times out.
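
For what it's worth, this is how I've been checking on it (as far as I can
tell these are the standard commands; statistics shows the crawl state):

    gluster volume heal GLUSTER1 info | head -n 20   # pending entries per brick
    gluster volume heal GLUSTER1 statistics          # crawl start/end times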
>
> > The 1st node worked as expected; it took 12 hours to heal 1TB of data.
> > Load was a little heavy but nothing shocking.
> >
> > About an hour after node 1 finished, I began the same process on node 2.
> > The heal process kicked in as before, and the files in directories
> > visible from the mount and in .glusterfs healed in a short time. Then it
> > began a crawl of .shard, adding those files to the heal count, at which
> > point the entire process basically ground to a halt. After 48 hours, out
> > of 19k shards it has added 5900 to the heal list. Load on all 3 machines
> > is negligible. It was suggested to change cluster.data-self-heal-algorithm
> > to full and restart the volume, which I did. No effect. Tried relaunching
> > the heal: no effect, regardless of which node I picked. I started each VM
> > and performed a stat of all files from within it, or a full virus scan,
> > and that seemed to cause short small spikes in shards added, but not by
> > much. Logs are showing no real messages indicating anything is going on.
> > I get hits to the brick log on occasion of null lookups, making me think
> > it's not really crawling the shards directory but waiting for a shard
> > lookup to add it. I'll get the following in the brick log, but not
> > constantly, and sometimes multiple times for the same shard.
> >
> > [2016-08-29 08:31:57.478125] W [MSGID: 115009]
> > [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
> > type for (null) (LOOKUP)
> > [2016-08-29 08:31:57.478170] E [MSGID: 115050]
> > [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
> > LOOKUP (null) (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221)
> > ==> (Invalid argument) [Invalid argument]
> >
> > This one repeated about 30 times in a row, then nothing for 10 minutes,
> > then one hit for a different shard by itself.
> >
> > How can I determine if the heal is actually running? How can I kill it
> > or force a restart? Does the node I start it from determine which
> > directory gets crawled to find heals?
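
(Since writing the above I've also been sanity-checking progress by
counting shards directly on each node's brick, outside of gluster, with
something like:

    find /gluster1/BRICK1/1/.shard -type f | wc -l

and comparing the healthy bricks against the one being healed.)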
> >
> > David Gossage
> > Carousel Checks Inc. | System Administrator
> > Office 708.613.2284
> >
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-users
>
> --
> Thanks,
> Anuradha.
>