[Gluster-devel] Need inputs for solution for renames + entry self-heal data loss in afr

Pranith Kumar Karampuri pkarampu at redhat.com
Tue Oct 25 09:50:42 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1366818 is the bug I am
referring to in the mail above. (Thanks, sankarshan, for pointing out that
I missed the link :-) )

On Tue, Oct 25, 2016 at 3:14 PM, Pranith Kumar Karampuri <
pkarampu at redhat.com> wrote:

> One of the Red Hat QE engineers (Nag Pavan) found a day-1 bug in entry
> self-heal where a file with good data can be replaced by a file with bad
> data when renames and self-heal interleave in a particular way.
> Sample steps (from the bz; a scripted version follows the steps below):
> 1) have a plain replica volume with 2 bricks. start the volume and mount
> it.
> 2) mkdir dir && mkdir newdir && touch dir/file1
> 3) bring first brick down
> 4) echo abc > dir/file1
> 5) bring the first brick back up and quickly bring the second brick down
> before self-heal can be triggered.
> 6) do 'mv dir/file1 newdir/file2' <<--- note that this is the empty copy
> of the file (the write in step 4 never reached the first brick).
> Now bring the second brick back up. If entry self-heal of 'dir' happens
> first, it deletes file1, the copy with content 'abc'. When the 'newdir'
> heal happens afterwards, it creates an empty file2, and the data is lost.
> The same effect can be achieved using 'link' + 'unlink' as well.
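> A scripted version of the steps, as a minimal sketch: it assumes a
> replica-2 volume named 'testvol' with both bricks on one node at
> /bricks/b1 and /bricks/b2, mounted at /mnt/testvol, and that bricks are
> taken down by killing their processes (names are hypothetical; brick PIDs
> can be read from 'gluster volume status testvol'):
>
>     cd /mnt/testvol
>     mkdir dir newdir && touch dir/file1    # steps 1-2
>     kill -9 <pid-of-b1-brick>              # step 3 (placeholder pid)
>     echo abc > dir/file1                   # step 4: write lands only on b2
>     gluster volume start testvol force     # step 5: restart b1 ...
>     kill -9 <pid-of-b2-brick>              # ... and kill b2 quickly
>     mv dir/file1 newdir/file2              # step 6: moves the empty copy
>     gluster volume start testvol force     # b2 back up; if 'dir' heals
>                                            # first, the 'abc' copy is gone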
> The main reason for this problem is that afr entry self-heal currently
> does not check link counts before deleting the final link of an inode: it
> always unlinks the file, recreates it, and then does data heal. In this
> corner case the unlink happens on the good copy of the file, and we either
> lose data or get stale data, depending on what data is present on the
> sink file.
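> For reference, the link count in question is visible on the brick with
> plain stat (the brick path below is hypothetical; note that on a posix
> brick every regular file also carries an internal hard link under
> .glusterfs/, which any real check would have to account for):
>
>     stat -c %h /bricks/b1/dir/file1    # number of hard links to the inode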
> The solution we are proposing is the following:
> 1) Posix will maintain a hidden directory '.glusterfs/anoninode' (we could
> call it lost+found as well) which afr/ec will use to keep 'inodes' until
> their names are resolved.
> 2) When afr or ec needs to heal a directory and a 'name' has to be deleted
> but the inode is still present on the other bricks, it renames the file to
> 'anoninode/<gfid-of-file/dir>' instead of doing unlink/rmdir on it (see
> the sketch after this list).
> 3) For files:
>          a) Both afr and ec already have logic to do 'link' instead of
> creating a new file if the gfid already exists on the brick, so when a
> name is resolved heal works exactly as it does now.
>          b) The self-heal daemon will periodically crawl the first level
> of the 'anoninode' directory and delete the 'inodes' (files with gfid
> strings as names) whose link count is > 1, i.e. whose names have since
> been resolved. It will also delete files whose gfid has ceased to exist
> on the other bricks.
> 4) For directories:
>          a) Both afr and ec need to perform a 'rename' of
> 'anoninode/<dir-gfid>' to the name it resolves to as part of entry
> self-heal, instead of a 'mkdir'.
>          b) If the self-heal daemon crawl detects that a directory was
> deleted on the other bricks, it has to scan the files inside the deleted
> directory and move them into 'anoninode' if their gfids exist on the
> other bricks; otherwise they can be safely deleted.
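> To make the proposal concrete, here is a rough sketch of the flow from a
> brick's point of view. Paths are illustrative, <gfid> stands for the gfid
> string, and gfid_exists_on_other_bricks is a hypothetical helper standing
> in for the inter-brick lookup the self-heal daemon would do:
>
>     # Entry heal: instead of 'unlink dir/file1', park the inode:
>     mv /bricks/b1/dir/file1 /bricks/b1/.glusterfs/anoninode/<gfid>
>
>     # When the name is resolved later, the existing 'link if the gfid
>     # exists' logic re-creates it without losing data:
>     ln /bricks/b1/.glusterfs/anoninode/<gfid> /bricks/b1/newdir/file2
>
>     # Periodic shd crawl over the first level of anoninode (regular
>     # files only; directories are renamed away as in 4a):
>     for f in /bricks/b1/.glusterfs/anoninode/*; do
>         # '> 1' as described above; the exact threshold may need to
>         # account for posix's internal .glusterfs hard link
>         if [ "$(stat -c %h "$f")" -gt 1 ]; then
>             rm "$f"    # another name exists; this link is redundant
>         elif ! gfid_exists_on_other_bricks "$(basename "$f")"; then
>             rm "$f"    # gfid is gone everywhere; safe to delete
>         fi
>     done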
> Please let us know if you see any issues with this approach.
> --
> Pranith
