[Bugs] [Bug 1807431] New: Setting cluster.heal-timeout requires volume restart

bugzilla at redhat.com bugzilla at redhat.com
Wed Feb 26 10:47:38 UTC 2020


https://bugzilla.redhat.com/show_bug.cgi?id=1807431

            Bug ID: 1807431
           Summary: Setting cluster.heal-timeout requires volume restart
           Product: GlusterFS
           Version: 5
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: selfheal
          Keywords: Triaged
          Severity: low
          Assignee: bugs at gluster.org
          Reporter: ravishankar at redhat.com
                CC: bugs at gluster.org, glenk1973 at hotmail.com,
                    ravishankar at redhat.com
        Depends On: 1743988, 1744548
            Blocks: 1747301
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1744548 +++

+++ This bug was initially created as a clone of Bug #1743988 +++

Description of problem:
Setting `cluster.heal-timeout` requires a volume restart for the new value to take effect.

Version-Release number of selected component (if applicable):
6.5

How reproducible:
Every time

Steps to Reproduce:
1. Provision a 3-peer replica volume (I used three docker containers).
2. Set `cluster.favorite-child-policy` to `mtime`.
3. Mount the volume on one of the containers (say `gluster-0`, serving as a
server and a client).
4. Stop the self-heal daemon.
5. Set `cluster.entry-self-heal`, `cluster.data-self-heal` and
`cluster.metadata-self-heal` to `off`.
6. Set `cluster.quorum-type` to `none`.
7. Write "first write" to file `test.txt` on the mounted volume.
8. Kill the brick process `gluster-2`.
9. Write "second write" to `test.txt`.
10. Force start the volume (`gluster volume start <volume> force`).
11. Kill brick processes `gluster-0` and `gluster-1`.
12. Write "third write" to `test.txt`.
13. Force start the volume.
14. Verify that "split-brain" appears in the output of the `gluster volume heal
<volume> info` command.
15. Set `cluster.heal-timeout` to `60`.
16. Start the self-heal daemon.
17. Issue the `gluster volume heal <volume> info` command after 70 seconds.
18. Verify that the output at step 17 does not contain "split-brain".
19. Verify that the content of `test.txt` is "third write". 

Actual results:
The output at step 17 contains "split-brain".

Expected results:
The output at step 17 should _not_ contain "split-brain".


Additional info:
According to what Ravishankar N said on Slack
(https://gluster.slack.com/archives/CH9M2KF60/p1566346818102000), changing
volume options such as `cluster.heal-timeout` should not require a process
restart. If I add a `gluster volume start <volume> force` command immediately
after step 16 above, then I get the Expected results.

--- Additional comment from Glen K on 2019-08-21 06:04:23 UTC ---

I should add that `cluster.quorum-type` is set to `none` for the test.

--- Additional comment from Ravishankar N on 2019-08-21 09:56:54 UTC ---

Okay, so after some investigation, I don't think this is an issue. When you
change the heal-timeout, it does get propagated to the self-heal daemon. But
since the default value is 600 seconds, the threads that do the heal only wake
up after that time. Once they wake up, subsequent runs do seem to honour the new
heal-timeout value.
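
To make that concrete, here is a minimal sketch of the behaviour (illustrative
names only, not the actual afr-self-heald.c code): each index-healer thread
does a crawl and then sleeps on a timed condition wait whose deadline is
computed from the heal-timeout in effect when the wait starts, so a new value
only applies from the next sleep onwards unless something signals the thread
earlier (for example, a manual heal launch).

#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-in for the per-subvolume healer state. */
struct healer {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             rerun;        /* set when an immediate crawl is requested */
    int             heal_timeout; /* seconds; updated by "volume set"         */
};

/* Sleep until heal_timeout expires or someone signals the condvar. */
static void
healer_wait(struct healer *h)
{
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    pthread_mutex_lock(&h->mutex);
    deadline.tv_sec += h->heal_timeout;   /* uses the value in effect *now* */
    while (!h->rerun &&
           pthread_cond_timedwait(&h->cond, &h->mutex, &deadline) != ETIMEDOUT)
        ;                                 /* spurious wakeup: keep waiting  */
    h->rerun = 0;
    pthread_mutex_unlock(&h->mutex);
}

/* Thread body: wait, crawl, repeat. */
static void *
index_healer(void *arg)
{
    struct healer *h = arg;

    for (;;) {
        healer_wait(h);
        /* ... crawl the index and heal pending entries here ... */
    }
    return NULL;
}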

On a glusterfs 6.5 setup:
#gluster v create testvol replica 2 127.0.0.2:/home/ravi/bricks/brick{1..2} force
#gluster v set testvol client-log-level DEBUG
#gluster v start testvol
#gluster v set testvol heal-timeout 5
#tail -f /var/log/glusterfs/glustershd.log|grep finished
You don't see anything in the log yet about the crawls.
But once you manually launch heal, the threads are woken up and further crawls
happen every 5 seconds.
#gluster v heal testvol

Now in glustershd.log:
[2019-08-21 09:55:02.024160] D [MSGID: 0]
[afr-self-heald.c:843:afr_shd_index_healer] 0-testvol-replicate-0: finished
index sweep on subvol testvol-client-0. 
[2019-08-21 09:55:02.024271] D [MSGID: 0]
[afr-self-heald.c:843:afr_shd_index_healer] 0-testvol-replicate-0: finished
index sweep on subvol testvol-client-1.
[2019-08-21 09:55:08.023252] D [MSGID: 0]
[afr-self-heald.c:843:afr_shd_index_healer] 0-testvol-replicate-0: finished
index sweep on subvol testvol-client-1.
[2019-08-21 09:55:08.023358] D [MSGID: 0]
[afr-self-heald.c:843:afr_shd_index_healer] 0-testvol-replicate-0: finished
index sweep on subvol testvol-client-0.
[2019-08-21 09:55:14.024438] D [MSGID: 0]
[afr-self-heald.c:843:afr_shd_index_healer] 0-testvol-replicate-0: finished
index sweep on subvol testvol-client-1.
[2019-08-21 09:55:14.024546] D [MSGID: 0]
[afr-self-heald.c:843:afr_shd_index_healer] 0-testvol-replicate-0: finished
index sweep on subvol testvol-client-0.

Glen, could you check if that works for you? That is, after setting the
heal-timeout, manually launch heal via `gluster v heal testvol`.

--- Additional comment from Glen K on 2019-08-21 18:15:39 UTC ---

In my steps above, I set the heal-timeout while the self-heal daemon is
stopped:

...
4. Stop the self-heal daemon.
...
15. Set `cluster.heal-timeout` to `60`.
16. Start the self-heal daemon.
...

I would expect that the configuration would certainly take effect after a
restart of the self-heal daemon.

Yes, launching heal manually causes the heal to happen right away, but the
purpose of the test is to verify the heal happens automatically. From a user
perspective, the current behaviour of the heal-timeout setting appears to be at
odds with the "configuration changes take effect without restart" feature; I
think it is reasonable to request that changing the heal-timeout setting
results in the thread sleeps being reset to the new setting.

--- Additional comment from Ravishankar N on 2019-08-22 07:11:53 UTC ---

(In reply to Glen K from comment #3)
> 
> I would expect that the configuration would certainly take effect after a
> restart of the self-heal daemon.

In steps 4 and 16, I assume you toggled `cluster.self-heal-daemon` off and on
respectively. This does not actually kill the shd process per se; it just
disables/enables the heal crawls. In 6.5, a volume start force does restart shd,
so changing the order of the steps should do the trick, i.e.

13. Set `cluster.heal-timeout` to `60`.
14. Force start the volume.
15. Verify that "split-brain" appears in the output of the `gluster volume heal
<volume> info` command.


> Yes, launching heal manually causes the heal to happen right away, but the
> purpose of the test is to verify the heal happens automatically. From a user
> perspective, the current behaviour of the heal-timeout setting appears to be
> at odds with the "configuration changes take effect without restart"
> feature; I think it is reasonable to request that changing the heal-timeout
> setting results in the thread sleeps being reset to the new setting.

Fair enough, I'll attempt a fix on master; let us see how the review goes.
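
Roughly the shape such a change could take (illustrative only, reusing the
struct healer from the sketch in an earlier comment; this is not the merged
patch): when cluster.heal-timeout is reconfigured, store the new value and
signal each healer's condition variable so the threads stop sleeping on the
deadline computed from the old value.

/* Hypothetical reconfigure hook: update the timeout and wake every healer
 * thread so the new value is honoured immediately rather than after the old
 * sleep expires. */
static void
healer_wake(struct healer *h, int new_timeout)
{
    pthread_mutex_lock(&h->mutex);
    h->heal_timeout = new_timeout;
    h->rerun = 1;                  /* request an immediate crawl ...      */
    pthread_cond_signal(&h->cond); /* ... and cut the current sleep short */
    pthread_mutex_unlock(&h->mutex);
}

static void
reconfigure_heal_timeout(struct healer *healers, int count, int new_timeout)
{
    for (int i = 0; i < count; i++)
        healer_wake(&healers[i], new_timeout);
}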

--- Additional comment from Worker Ant on 2019-08-22 12:15:15 UTC ---

REVIEW: https://review.gluster.org/23288 (afr: wake up index healer threads)
posted (#1) for review on master by Ravishankar N

--- Additional comment from Worker Ant on 2019-08-30 04:25:40 UTC ---

REVIEW: https://review.gluster.org/23288 (afr: wake up index healer threads)
merged (#4) on master by Ravishankar N


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1743988
[Bug 1743988] Setting cluster.heal-timeout requires volume restart
https://bugzilla.redhat.com/show_bug.cgi?id=1744548
[Bug 1744548] Setting cluster.heal-timeout requires volume restart
https://bugzilla.redhat.com/show_bug.cgi?id=1747301
[Bug 1747301] Setting cluster.heal-timeout requires volume restart
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

