[Bugs] [Bug 1651525] New: Issuing a "heal ... full" on a disperse volume causes permanent high CPU utilization.

Tue Nov 20 09:34:38 UTC 2018

https://bugzilla.redhat.com/show_bug.cgi?id=1651525

            Bug ID: 1651525
           Summary: Issuing a "heal ... full" on a disperse volume causes
                    permanent high CPU utilization.
           Product: GlusterFS
           Version: 5
         Component: disperse
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: aspandey at redhat.com
                CC: bugs at gluster.org, jahernan at redhat.com,
                    jbyers at stonefly.com, vnosov at stonefly.com
        Depends On: 1636631
            Blocks: 1644681
   External Bug ID: Gluster.org Gerrit 21526

+++ This bug was initially created as a clone of Bug #1636631 +++

Issuing a "heal ... full" on a disperse volume causes permanent
high CPU utilization. 

This occurs even when the volume is completely empty. The CPU usage
is not due to healing I/O activity.

This only happens on disperse volumes, not on replica volumes. 

It happens in GlusterFS version 3.12.14, but does not happen
in version 3.7.18.

The high CPU utilization is by the 'glusterfs' SHD (self heal
daemon) process and is easily noticed using 'top'.

The 'glustershd.log' file shows that the disperse volume full
sweep keeps restarting and running forever:

[2018-10-06 00:56:11.245106] I [MSGID: 122059]
[ec-heald.c:415:ec_shd_full_healer] 0-disperse-vol-disperse-0: finished full
sweep on subvol disperse-vol-client-0
The message "I [MSGID: 122059] [ec-heald.c:406:ec_shd_full_healer]
0-disperse-vol-disperse-0: starting full sweep on subvol disperse-vol-client-0"
repeated 2 times between [2018-10-06 00:56:11.243637] and [2018-10-06
00:56:11.246885]
[2018-10-06 00:56:11.247966] I [MSGID: 122059]
[ec-heald.c:415:ec_shd_full_healer] 0-disperse-vol-disperse-0: finished full
sweep on subvol disperse-vol-client-2
The message "I [MSGID: 122059] [ec-heald.c:406:ec_shd_full_healer]
0-disperse-vol-disperse-0: starting full sweep on subvol disperse-vol-client-1"
repeated 3 times between [2018-10-06 00:56:11.239731] and [2018-10-06
00:56:11.248470]
[2018-10-06 00:56:11.248553] I [MSGID: 122059]
[ec-heald.c:406:ec_shd_full_healer] 0-disperse-vol-disperse-0: starting full
sweep on subvol disperse-vol-client-0
The message "I [MSGID: 122059] [ec-heald.c:406:ec_shd_full_healer]
0-disperse-vol-disperse-0: starting full sweep on subvol disperse-vol-client-2"
repeated 3 times between [2018-10-06 00:56:11.242392] and [2018-10-06
00:56:11.251262]
[2018-10-06 00:56:11.251330] I [MSGID: 122059]
[ec-heald.c:406:ec_shd_full_healer] 0-disperse-vol-disperse-0: starting full
sweep on subvol disperse-vol-client-1
The message "I [MSGID: 122059] [ec-heald.c:415:ec_shd_full_healer]
0-disperse-vol-disperse-0: finished full sweep on subvol disperse-vol-client-2"
repeated 2 times between [2018-10-06 00:56:11.247966] and [2018-10-06
00:56:11.253675]
[2018-10-06 00:56:11.253916] I [MSGID: 122059]
[ec-heald.c:406:ec_shd_full_healer] 0-disperse-vol-disperse-0: starting full
sweep on subvol disperse-vol-client-2
The message "I [MSGID: 122059] [ec-heald.c:406:ec_shd_full_healer]
0-disperse-vol-disperse-0: starting full sweep on subvol disperse-vol-client-0"
repeated 5 times between [2018-10-06 00:56:11.248553] and [2018-10-06
00:56:11.256142]
[2018-10-06 00:56:11.256490] I [MSGID: 122059]
[ec-heald.c:415:ec_shd_full_healer] 0-disperse-vol-disperse-0: finished full
sweep on subvol disperse-vol-client-2
The message "I [MSGID: 122059] [ec-heald.c:415:ec_shd_full_healer]
0-disperse-vol-disperse-0: finished full sweep on subvol disperse-vol-client-0"
repeated 8 times between [2018-10-06 00:56:11.245106] and [2018-10-06
00:56:11.257386]
[2018-10-06 00:56:11.257585] I [MSGID: 122059]
[ec-heald.c:406:ec_shd_full_healer] 0-disperse-vol-disperse-0: starting full
sweep on subvol disperse-vol-client-0
[2018-10-06 00:56:11.258907] I [MSGID: 122059]
[ec-heald.c:415:ec_shd_full_healer] 0-disperse-vol-disperse-0: finished full
sweep on subvol disperse-vol-client-0
[2018-10-06 00:56:11.259098] I [MSGID: 122059]
[ec-heald.c:406:ec_shd_full_healer] 0-disperse-vol-disperse-0: starting full
sweep on subvol disperse-vol-client-0
The message "I [MSGID: 122059] [ec-heald.c:406:ec_shd_full_healer]
0-disperse-vol-disperse-0: starting full sweep on subvol disperse-vol-client-1"
repeated 3 times between [2018-10-06 00:56:11.251330] and [2018-10-06
00:56:11.259751]
[2018-10-06 00:56:11.261599] I [MSGID: 122059]
[ec-heald.c:415:ec_shd_full_healer] 0-disperse-vol-disperse-0: finished full
sweep on subvol disperse-vol-client-0
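
To quantify how often the sweep restarts, the "starting"/"finished" messages
in an excerpt like the one above can be tallied per subvolume. The following
is a minimal sketch, not part of GlusterFS: the function name count_sweeps and
the regular expression are illustrative assumptions, and treating a
"repeated N times" summary as N events is also an assumption about how the
log suppresses duplicates.

```python
import re
from collections import Counter

# Matches one sweep message. \s+ tolerates lines wrapped by the mail archive.
# The optional tail picks up the 'repeated N times' summary form.
SWEEP_RE = re.compile(
    r'(starting|finished)\s+full\s+sweep\s+on\s+subvol\s+([\w.-]+)'
    r'(?:"\s+repeated\s+(\d+)\s+times)?'
)


def count_sweeps(log_text):
    """Count sweep start/finish events per subvolume in glustershd.log text.

    Assumption: a 'repeated N times' summary is counted as N events.
    """
    counts = Counter()
    for match in SWEEP_RE.finditer(log_text):
        event, subvol, repeated = match.groups()
        counts[(event, subvol)] += int(repeated) if repeated else 1
    return counts
```

Passing the contents of glustershd.log to count_sweeps() returns a Counter
keyed by (event, subvol). On an idle, empty volume, counts that keep growing
from one run to the next are consistent with the endless sweep described here.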

The only way to reduce the high CPU utilization of the shd glusterfs
process is to kill it and restart it. It is then fine until the
next "heal ... full" is issued on a disperse volume.

--- Additional comment from Shyamsundar on 2018-10-23 10:54:18 EDT ---

Release 3.12 has been EOLd and this bug was still found to be in the NEW state,
hence moving the version to mainline, to triage the same and take appropriate
actions.

--- Additional comment from Xavi Hernandez on 2018-10-31 07:50:19 EDT ---

I've found the problem. Currently, when a directory is healed, a flag is set
that forces heal to be retried. This is necessary after a replace brick because
after healing a directory, new entries to be healed could appear (the only bad
entry just after a replace brick is the root directory). In this case, a new
iteration of the heal process will immediately take those new entries and heal
them, instead of going idle after completing a full sweep of the (previous)
list of bad entries.

However, this approach causes a full self-heal to run endlessly. First it
tries to heal the root directory, which succeeds. This causes the flag to be
set, even if no entries have actually been added to be healed.
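
The loop described above can be reproduced with a toy model. This is only a
sketch under stated assumptions: it is not the code in ec-heald.c, the names
full_sweep and run_full_heal are hypothetical, and the
retry_only_on_new_entries variant is an illustration of the idea "retry only
if something was actually added", not a description of the change posted in
review 21526. A max_sweeps cap stands in for "forever".

```python
def full_sweep(tree, healed):
    """Walk the whole tree from the root, healing every directory.

    `tree` maps each directory path to its list of children (toy model).
    Returns (dirs_healed, new_entries); new_entries counts entries that
    had not been seen by an earlier sweep.
    """
    dirs_healed = 0
    new_entries = 0
    stack = ["/"]
    while stack:
        path = stack.pop()
        if path not in healed:
            healed.add(path)
            new_entries += 1
        if path in tree:  # directories are the keys of `tree`
            dirs_healed += 1
            stack.extend(tree[path])
    return dirs_healed, new_entries


def run_full_heal(tree, retry_only_on_new_entries, max_sweeps=20):
    """Run full sweeps until no retry is requested; return the sweep count."""
    healed = {"/"}  # the root itself is never a "new" entry
    sweeps = 0
    while sweeps < max_sweeps:
        sweeps += 1
        dirs_healed, new_entries = full_sweep(tree, healed)
        if retry_only_on_new_entries:
            retry = new_entries > 0  # hypothetical variant
        else:
            retry = dirs_healed > 0  # flag set whenever a directory is healed
        if not retry:
            break
    return sweeps
```

With an empty volume, modelled as {"/": []}, the unconditional flag keeps
requesting sweeps until the cap is reached, because the root directory is
healed successfully on every pass. The conditional variant stops after the
first sweep, since no new entries appeared.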

--- Additional comment from Worker Ant on 2018-10-31 07:52:30 EDT ---

REVIEW: https://review.gluster.org/21526 (cluster/ec: prevent infinite loop in
self-heal full) posted (#1) for review on master by Xavi Hernandez

--- Additional comment from Worker Ant on 2018-10-31 12:32:31 EDT ---

REVIEW: https://review.gluster.org/21526 (cluster/ec: prevent infinite loop in
self-heal full) posted (#1) for review on master by Xavi Hernandez

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1636631
[Bug 1636631] Issuing a "heal ... full" on a disperse volume causes
permanent high CPU utilization.
https://bugzilla.redhat.com/show_bug.cgi?id=1644681
[Bug 1644681] Issuing a "heal ... full" on a disperse volume causes
permanent high CPU utilization.