[Bugs] [Bug 1216303] Fixes for data self-heal in ec

bugzilla at redhat.com bugzilla at redhat.com
Fri May 8 22:04:22 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1216303



--- Comment #17 from Anand Avati <aavati at redhat.com> ---
COMMIT: http://review.gluster.org/10691 committed in release-3.7 by
Pranith Kumar Karampuri (pkarampu at redhat.com)
------
commit fae1e70ff3309d2b64febaafc70abcaa2771ecf0
Author: Pranith Kumar K <pkarampu at redhat.com>
Date:   Thu Apr 16 09:25:31 2015 +0530

    cluster/ec: metadata/name/entry heal implementation for ec

    Metadata self-heal:
    1) Take an inode lock in domain 'this->name' on the 0-0 range (full file)
    2) Perform lookup and get the xattrs on all the bricks
    3) Choose the brick with the highest version as the source
    4) Setattr uid/gid/permissions
    5) Removexattr stale xattrs
    6) Setxattr existing/new xattrs
    7) Xattrop with the negative of 'dirty' and, for the version xattr, the
       difference between the highest version and the brick's own version
       (see the sketch after this list)
    8) Unlock the lock acquired in 1)
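
    A minimal standalone sketch in C of steps 3) and 7) above (not the
    actual ec xlator code; the struct, field and function names here are
    hypothetical): pick the brick with the highest version as the source,
    then compute the per-brick xattrop deltas that clear 'dirty' and raise
    each brick's version to the highest one.

        #include <stddef.h>
        #include <stdint.h>

        /* Per-brick metadata as read in step 2; the fields stand in for
         * the version and dirty xattrs. */
        struct brick_meta {
            uint64_t version;
            uint64_t dirty;
        };

        /* Step 3: the brick with the highest version becomes the source. */
        size_t pick_source(const struct brick_meta *bricks, size_t n)
        {
            size_t src = 0;
            for (size_t i = 1; i < n; i++)
                if (bricks[i].version > bricks[src].version)
                    src = i;
            return src;
        }

        /* Step 7: per-brick deltas applied through xattrop, which adds
         * them atomically on the brick: -dirty clears the dirty counter
         * and (highest - own) raises the version to the source's value. */
        void heal_deltas(const struct brick_meta *bricks, size_t n,
                         int64_t *dirty_delta, int64_t *version_delta)
        {
            uint64_t highest = bricks[pick_source(bricks, n)].version;
            for (size_t i = 0; i < n; i++) {
                dirty_delta[i]   = -(int64_t)bricks[i].dirty;
                version_delta[i] = (int64_t)(highest - bricks[i].version);
            }
        }

    For example, with three bricks whose versions are {5, 5, 3} and dirty
    values are {1, 0, 2}, the deltas would be dirty {-1, 0, -2} and
    version {0, 0, 2}.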

    Entry self-heal:
    1) Take a directory lock in domain 'this->name:self-heal' on 'NULL' to
       prevent more than one self-heal from running (the locking order is
       sketched after this list)
    2) Take a directory lock in domain 'this->name' on 'NULL'
    3) Perform lookup on version, dirty and remember the values
    4) Unlock the lock acquired in 2)
    5) Readdir on all the bricks and trigger name heals
    6) Xattrop with the negative of 'dirty' and, for the version xattr, the
       difference between the highest version and the brick's own version
    7) Unlock the lock acquired in 1)
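
    A standalone sketch in C of the nested locking order above (the helper
    names are hypothetical, not the ec xlator API): the outer lock in the
    '<xlator-name>:self-heal' domain serializes healers, while the short
    inner lock in the plain '<xlator-name>' domain only covers reading
    version and dirty.

        #include <stdio.h>

        /* Hypothetical stand-ins for the directory-lock helpers. */
        static void lock_dir(const char *domain)
        {
            printf("lock   %s\n", domain);
        }

        static void unlock_dir(const char *domain)
        {
            printf("unlock %s\n", domain);
        }

        void entry_heal_lock_order(const char *xl_name)
        {
            char heal_domain[256];
            snprintf(heal_domain, sizeof(heal_domain), "%s:self-heal",
                     xl_name);

            lock_dir(heal_domain);   /* step 1: one healer at a time       */
            lock_dir(xl_name);       /* step 2: freeze version/dirty reads */
            /* step 3: lookup version and dirty on all the bricks ...      */
            unlock_dir(xl_name);     /* step 4 */
            /* steps 5-6: readdir, name heals, xattrop with the deltas ... */
            unlock_dir(heal_domain); /* step 7 */
        }

    Keeping the inner lock this short presumably means regular fops are
    blocked only while version and dirty are read, not for the whole
    readdir and name-heal phase.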

    Name heal:
    1) Take a 'name' lock in domain 'this->name' on 'NULL'
    2) Perform lookup on 'name' and get the stat and xattr structures
    3) Build a gfid_db where, for each gfid, we know which subvolumes/bricks
       have a file with 'name' (see the sketch after this list)
    4) Delete all the stale files, i.e. those that do not exist on more than
       ec->redundancy bricks
    5) On all the subvolumes/bricks where the entry is missing, create 'name'
       with the same type, gfid, permissions, etc.
    6) Unlock the lock acquired in 1)
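
    A standalone sketch in C of steps 3) and 4) above (hypothetical types
    and names, not the ec xlator's gfid_db code): group the bricks that
    hold 'name' by the gfid they report, and treat a gfid as stale only
    when it is present on no more than ec->redundancy bricks.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        #define MAX_BRICKS 16

        /* One entry per distinct gfid seen for 'name'. */
        struct gfid_entry {
            uint8_t gfid[16];             /* gfid returned by lookup    */
            bool    present[MAX_BRICKS];  /* bricks reporting this gfid */
            size_t  count;                /* how many bricks report it  */
        };

        struct gfid_db {
            struct gfid_entry entries[MAX_BRICKS];
            size_t n;
        };

        /* Step 3: record that brick 'idx' holds 'name' with this gfid.
         * The db is assumed to be zero-initialized by the caller. */
        void gfid_db_add(struct gfid_db *db, const uint8_t gfid[16],
                         size_t idx)
        {
            for (size_t i = 0; i < db->n; i++) {
                if (memcmp(db->entries[i].gfid, gfid, 16) == 0) {
                    db->entries[i].present[idx] = true;
                    db->entries[i].count++;
                    return;
                }
            }
            struct gfid_entry *e = &db->entries[db->n++];
            memcpy(e->gfid, gfid, 16);
            e->present[idx] = true;
            e->count = 1;
        }

        /* Step 4: a gfid is stale (deletable) only when it exists on no
         * more than 'redundancy' bricks; anything present on more bricks
         * than that is kept. */
        bool gfid_is_stale(const struct gfid_entry *e, size_t redundancy)
        {
            return e->count <= redundancy;
        }

    With 3 = 2 + 1, ec->redundancy is 1, so by this rule a gfid seen on
    only one brick counts as stale; the known limitation below describes
    the cases where the heal still preserves the name because it cannot
    decide safely.
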
    Known limitation: with the present design, name heal conservatively
    preserves 'name' when it cannot decide whether to delete it. This can
    happen in the following scenario:
    1) We have a 3 = 2 + 1 (bricks: A, B, C) ec volume and one brick is
       down (let's say A)
    2) rename d1/f1 -> d2/f2 is performed but the rename succeeds on only
       one of the bricks (let's say B)
    3) Name self-heal on d1 and d2 would then re-create the file in both
       d1 and d2, resulting in d1/f1 and d2/f2.

    Because we wanted to prevent data loss in the case above, the following
    scenario is not healable, i.e. it needs manual intervention:
    1) We have a 3 = 2 + 1 (bricks: A, B, C) ec volume and one brick is
       down (let's say A)
    2) We have two hard links, d1/a and d2/b, and another file d3/c, all
       of which existed even before the brick went down
    3) rename d3/c -> d2/b is performed
    4) Name self-heal on d2/b does not heal it, because the d2/b with the
       older gfid will not be deleted. One could ask why not delete the
       link when there is more than one hard link, but that leads to a
       similar data loss issue:
    Scenario:
    1) We have a 3 = 2 + 1 (bricks: A, B, C) ec volume and one brick is
       down (let's say A)
    2) We have two hard links: d1/a and d2/b
    3) rename d1/a -> d3/c and rename d2/b -> d4/d are performed, and both
       operations succeed on only one of the bricks (let's say B)
    4) Name self-heals on the names above, which can run in parallel, can
       each decide to delete the file thinking it has two links; after all
       the self-heals do their unlinks, we are left with data loss.

    Change-Id: I3a68218a47bb726bd684604efea63cf11cfd11be
    BUG: 1216303
    Signed-off-by: Pranith Kumar K <pkarampu at redhat.com>
    Reviewed-on: http://review.gluster.org/10298
    Reviewed-on: http://review.gluster.org/10691
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Tested-by: NetBSD Build System
