[Gluster-users] gluster 5.11 issue: Endless performing entry selfheal / Completed entry selfheal on the same pending heal files after server reboot + very high load

Sun Apr 26 22:13:01 UTC 2020

Hi all,

I've been observing this problem for a long time now and it's time to
finally figure out what's going on.

We're running gluster 5.11 and have a 10TB 1 x 4 = 4 replicate volume. I'll
include its slightly redacted config below.

When I reboot one of the servers and it goes offline for a bit, when it
comes back, heal info tells me there are some files and dirs that are "heal
pending". 0 "split-brain" and "possibly healing" - only "heal pending"
are >0.

   1. For some reason, the server that was rebooted shows "heal pending" 0.
   All other servers show "heal pending" with some number, say 65.
   2. We have cluster.self-heal-daemon enabled.
   3. The logs are full of "performing entry selfheal" and "completed entry
   selfheal" messages that continue to print endlessly.
   4. This "heal pending" number never goes down by itself, but it does if
   I run some operation on it, like md5sum.
   5. When the server goes down for reboot and especially when it comes
   back, the load on ALL servers shoots up through the roof (load of 100+) and
   ends up bringing everything down, including apache and nginx. My theory is
   that self-heal kicks in so hard that it kills IO on these attached Linode
   block devices. However, after some time - say 10 minutes - the load
   subsides, but the "heal pending" remains and the gluster logs continue to
   output "performing entry selfheal" and "completed entry selfheal" messages.
   This load spike has become a huge issue for us because it brings down the
   whole site for entire minutes.
   6. At this point in my investigation, I noticed that the
   selfheal messages actually repeat for the same gfids over and over.
   [2020-04-26 21:32:29.877987] I [MSGID: 108026]
   [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0:
   performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
   [2020-04-26 21:32:29.901246] I [MSGID: 108026]
   [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0:
   Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources=
   sinks=0 1 2
   [2020-04-26 21:32:32.171959] I [MSGID: 108026]
   [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0:
   performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
   [2020-04-26 21:32:32.225828] I [MSGID: 108026]
   [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0:
   Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources=
   sinks=0 1 2
   [2020-04-26 21:32:33.346990] I [MSGID: 108026]
   [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0:
   performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
   [2020-04-26 21:32:33.374413] I [MSGID: 108026]
   [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0:
   Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources=
   sinks=0 1 2
   7. I used gfid-resolver.sh from https://gist.github.com/4392640.git to
   resolve this gfid to the real location and yup - it was one of the files (a
   dir actually) listed as "heal pending" in heal info. As soon as I ran
   md5sum on the file inside (which was also listed in "heal pending"), the
   log messages stopped repeating for this entry and it disappeared from "heal
   pending" heal info. These were the final log lines:
   [2020-04-26 21:32:35.642662] I [MSGID: 108026]
   [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
   0-SNIP_data1-replicate-0: performing metadata selfheal on
   96d282cf-402f-455c-9add-5f03c088a1bc
   [2020-04-26 21:32:35.658714] I [MSGID: 108026]
   [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0:
   Completed metadata selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc.
   sources=0 [1] 2  sinks=3
   [2020-04-26 21:32:35.686509] I [MSGID: 108026]
   [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0:
   performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
   [2020-04-26 21:32:35.720387] I [MSGID: 108026]
   [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0:
   Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources=0
   [1] 2  sinks=3

I have to repeat this song and dance every time I reboot servers and run
md5sum on each "heal pending" file or else the messages will continue
presumably indefinitely. In the meantime, the files seem to be fine when
accessed.

What I don't understand is:

   1. Why doesn't gluster just heal them properly instead of getting stuck?
   Or maybe this was fixed in v6 or v7, which I haven't upgraded to due to
   waiting for another unrelated issue to be fixed?
   2. Why does heal info show 0 "heal pending" files on the server that was
   rebooted, but all other servers show the same number of "heal pending"
   entries >0?
   3. Why are there these insane load spikes upon going down and especially
   coming back online? Is it related to the issue here? I'm pretty sure that
   it didn't happen in previous versions of gluster, when this issue didn't
   manifest - I could easily bring down one of the servers without it creating
   havoc when it comes back online.

Here's the volume info:

Volume Name: SNIP_data1

Type: Replicate

Volume ID: 11ecee7e-d4f8-497a-9994-ceb144d6841e

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 4 = 4

Transport-type: tcp

Bricks:

Brick1: SNIP:/mnt/SNIP_block1/SNIP_data1

Brick2: SNIP:/mnt/SNIP_block1/SNIP_data1

Brick3: SNIP:/mnt/SNIP_block1/SNIP_data1

Brick4: SNIP:/mnt/SNIP_block1/SNIP_data1

Options Reconfigured:

performance.client-io-threads: on

nfs.disable: on

transport.address-family: inet

cluster.self-heal-daemon: enable

performance.cache-size: 1GB

cluster.lookup-optimize: on

performance.read-ahead: off

client.event-threads: 4

server.event-threads: 4

performance.io-thread-count: 32

cluster.readdir-optimize: on

features.cache-invalidation: on

features.cache-invalidation-timeout: 600

performance.stat-prefetch: on

performance.cache-invalidation: on

performance.md-cache-timeout: 600

network.inode-lru-limit: 500000

performance.parallel-readdir: on

performance.readdir-ahead: on

performance.rda-cache-limit: 256MB

network.remote-dio: enable

network.ping-timeout: 5

cluster.quorum-type: fixed

cluster.quorum-count: 1

cluster.granular-entry-heal: enable

cluster.data-self-heal-algorithm: full

Appreciate any insight. Thank you.

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20200426/ec0eac04/attachment.html>