[Bugs] [Bug 1522646] New: Prevent ec from continue processing heal operations after PARENT_DOWN

Wed Dec 6 07:37:32 UTC 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1522646

            Bug ID: 1522646
           Summary: Prevent ec from continue processing heal operations
                    after PARENT_DOWN
           Product: GlusterFS
           Version: 3.13
         Component: disperse
          Keywords: Triaged
          Assignee: bugs at gluster.org
          Reporter: sheggodu at redhat.com
                CC: aspandey at redhat.com, bugs at gluster.org,
                    jahernan at redhat.com
        Depends On: 1515266
            Blocks: 1505570

+++ This bug was initially created as a clone of Bug #1515266 +++

Description of problem:

EC delays PARENT_DOWN propagation until all pending requests have completed,
but heal operations are ignored. This can cause unexpected results when a heal
operation is running while the volume is being unmounted.

Version-Release number of selected component (if applicable): master

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Ashish Pandey on 2017-11-22 04:51:00 EST ---

Xavi,

This issue is true for afr also. Do you think we can fix it in some common
place like synctask infra? Or do we have to deal with it separately?

Ashish

--- Additional comment from Worker Ant on 2017-11-22 05:33:32 EST ---

REVIEW: https://review.gluster.org/18840 (cluster/ec: Prevent self-heal to work
after PARENT_DOWN) posted (#1) for review on master by Xavier Hernandez

--- Additional comment from Xavi Hernandez on 2017-11-22 13:23:31 EST ---

(In reply to Ashish Pandey from comment #1)
> Xavi,
> 
> This issue is true for afr also. Do you think we can fix it in some common
> place like synctask infra? Or do we have to deal with it separately?
> 
> Ashish

I'm not sure how synctask could help here. It would need access to some
information telling whether the xlator that initiated the operation is shutting
down (I don't think we have this). But even then, aborting a single
operation doesn't guarantee that the caller won't attempt another synctask
operation (for example, healing the next entry of a directory), still delaying
the shutdown and causing multiple failures on fops that have not really failed
(which will probably add noise to the logs).

I think this is better handled inside the xlator itself. If AFR already
tracks ongoing regular operations, I think it would be relatively easy to
include heals in the checks, though I haven't looked at it.
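
To illustrate the point about the caller, here is a minimal sketch in C. It is
not GlusterFS code: every name below is hypothetical, and shutdown is modelled
as arriving just before entry number `shutdown_at`. It contrasts aborting each
heal operation individually with checking for shutdown in the loop that walks
the directory:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one directory heal pass. */
typedef struct {
    int attempted; /* entries for which a heal was started */
    int failed;    /* entries whose heal was aborted */
} sketch_result_t;

/* Stand-in for healing one entry: it is aborted if shutdown is in progress. */
bool sketch_heal_entry(bool shutting_down)
{
    return !shutting_down;
}

/* Abort only inside each operation: the caller still visits every entry,
 * so every entry after the shutdown point is attempted and fails. */
sketch_result_t sketch_heal_dir_per_op(int entries, int shutdown_at)
{
    sketch_result_t r = {0, 0};

    assert(entries >= 0 && shutdown_at >= 0);
    for (int i = 0; i < entries; i++) {
        bool down = (i >= shutdown_at);

        r.attempted++;
        if (!sketch_heal_entry(down))
            r.failed++;
    }
    return r;
}

/* Check in the caller: the loop stops as soon as shutdown is seen,
 * so no further entries are attempted and nothing fails spuriously. */
sketch_result_t sketch_heal_dir_checked(int entries, int shutdown_at)
{
    sketch_result_t r = {0, 0};

    assert(entries >= 0 && shutdown_at >= 0);
    for (int i = 0; i < entries; i++) {
        if (i >= shutdown_at)
            break;
        r.attempted++;
        if (!sketch_heal_entry(false))
            r.failed++;
    }
    return r;
}
```

With 10 entries and shutdown arriving before entry 3, the first variant
attempts all 10 entries and reports 7 failures, while the second stops after
3 entries with no failures.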

--- Additional comment from Worker Ant on 2017-11-28 04:12:20 EST ---

COMMIT: https://review.gluster.org/18840 committed in master by "Xavier
Hernandez" <jahernan at redhat.com> with a commit message- cluster/ec: Prevent
self-heal to work after PARENT_DOWN

When the volume is being stopped, PARENT_DOWN event is received.
This instructs EC to wait until all pending operations are completed
before declaring itself down. However, heal operations are ignored
and allowed to continue even after EC has declared itself down.

This may cause unexpected results and crashes.

To solve this, heal operations are treated exactly like any
other operation and EC won't propagate PARENT_DOWN until all
operations, including healing, are complete. To avoid big delays
if this happens in the middle of a big heal, a check has been
added to quit the current heal if shutdown is detected.

Change-Id: I26645e236ebd115eb22c7ad4972461111a2d2034
BUG: 1515266
Signed-off-by: Xavier Hernandez <jahernan at redhat.com>
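
The behaviour described in the commit message can be sketched roughly as
follows. This is a simplified, single-threaded illustration in C, not the code
from the patch: the structure, its fields and all `sketch_` functions are
hypothetical names, and a real implementation would need locking around the
shared state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the xlator's private state (not the real one). */
typedef struct {
    int  pending;    /* operations in flight, regular fops and heals alike */
    bool shutdown;   /* PARENT_DOWN has been received */
    bool propagated; /* PARENT_DOWN has been passed on */
} sketch_ec_t;

/* Pass PARENT_DOWN on only once shutdown was requested and nothing is
 * in flight any more. */
void sketch_try_propagate(sketch_ec_t *ec)
{
    if (ec->shutdown && ec->pending == 0)
        ec->propagated = true;
}

/* Called when any operation starts, heals included. */
void sketch_op_begin(sketch_ec_t *ec)
{
    ec->pending++;
}

/* Called when any operation finishes; the last one to finish after
 * PARENT_DOWN triggers the propagation. */
void sketch_op_end(sketch_ec_t *ec)
{
    assert(ec->pending > 0);
    ec->pending--;
    sketch_try_propagate(ec);
}

/* Called when PARENT_DOWN arrives. */
void sketch_parent_down(sketch_ec_t *ec)
{
    ec->shutdown = true;
    sketch_try_propagate(ec);
}

/* Heal one block of a larger heal. Returns false, without doing any work,
 * if shutdown has been detected, so the caller can quit the heal early. */
bool sketch_heal_step(sketch_ec_t *ec, int *blocks_healed)
{
    if (ec->shutdown)
        return false;
    (*blocks_healed)++;
    return true;
}
```

In this sketch, a heal registers itself with `sketch_op_begin()` like any other
operation. If `sketch_parent_down()` is called while the heal is running,
`propagated` stays false, the next `sketch_heal_step()` returns false, and only
the heal's `sketch_op_end()` finally sets `propagated`.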

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1515266
[Bug 1515266] Prevent ec from continue processing heal operations after
PARENT_DOWN