[Gluster-devel] Need inputs for solution for renames + entry self-heal data loss in afr

Pranith Kumar Karampuri pkarampu at redhat.com
Tue Oct 25 09:44:39 UTC 2016

One of the Red Hat QE engineers (Nag Pavan) found a day-1 bug in entry
self-heal where a file with good data can be replaced by a file with bad
data when renames and self-heal interleave in a particular way.

Sample steps (from the bz):
1) Create a plain replica volume with 2 bricks, start the volume and mount it.
2) mkdir dir && mkdir newdir && touch dir/file1
3) Bring the first brick down.
4) echo abc > dir/file1
5) Bring the first brick back up and quickly bring the second brick down,
before self-heal can be triggered.
6) mv dir/file1 newdir/file2 <<--- note that this renames the empty copy,
since the write in step 4 never reached the first brick.

Now bring the second brick back up. If entry self-heal of 'dir' happens
first, it deletes file1 (the copy with content 'abc'); when the heal of
'newdir' then runs, it creates an empty file2, and the data is lost.

The same can be achieved using 'link' + 'unlink' as well.

The main reason for this problem is that afr entry self-heal currently
does not check link counts before deleting the final link of an inode: it
always unlinks the name, recreates the file, and then performs the data
heal. In this corner case the unlink happens on the good copy of the file,
so we either lose the data or end up with stale data, depending on what is
present on the sink file.
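As a sanity check, the bad ordering can be re-enacted with two ordinary
directories standing in for the bricks (a toy sketch only; no gluster is
involved, and all paths here are made up):

```shell
#!/bin/sh
# Toy re-enactment of the data loss using two directories as "bricks".
# Everything here is illustrative; real bricks carry gfids and xattrs.
set -e
work=$(mktemp -d)
for b in brick1 brick2; do mkdir -p "$work/$b/dir" "$work/$b/newdir"; done

: > "$work/brick1/dir/file1"          # step 2: empty file on both bricks
: > "$work/brick2/dir/file1"
echo abc > "$work/brick2/dir/file1"   # step 4: brick1 is down, so the
                                      # write lands only on brick2
mv "$work/brick1/dir/file1" \
   "$work/brick1/newdir/file2"        # step 6: brick2 is down, so the
                                      # rename happens only on brick1

# Entry self-heal of 'dir' runs first: brick1 has no 'dir/file1', so the
# good copy on brick2 is unlinked -- this is the fatal step.
rm "$work/brick2/dir/file1"
# Entry self-heal of 'newdir' then recreates 'file2' on brick2 from
# brick1's empty copy: 'abc' is gone for good.
cp "$work/brick1/newdir/file2" "$work/brick2/newdir/file2"

wc -c < "$work/brick2/newdir/file2"   # prints 0: the file is empty
```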

The solution we are proposing is the following:

1) Posix will maintain a hidden directory '.glusterfs/anoninode' (we can
call it lost+found as well) which will be used by afr/ec for keeping
'inodes' until their names are resolved.
2) When afr or ec heals a directory and a 'name' has to be deleted but the
inode is still present on the other bricks, it renames the file to
'anoninode/<gfid-of-file/dir>' instead of doing unlink/rmdir on it.
3) For files:
         a) Both afr and ec already have logic to do 'link' instead of
creating a new file if the gfid already exists on the brick, so when a
name is resolved they do exactly what they do now.
         b) The self-heal daemon will periodically crawl the first level
of the 'anoninode' directory and delete the 'inodes' (represented as files
with gfid strings as names) whose link count is > 1, i.e. whose names have
been resolved. It will also delete a file if its gfid has ceased to exist
on the other bricks.
4) For directories:
         a) Both afr and ec need to perform a 'rename' of
'anoninode/<dir-gfid>' to the resolved name as part of entry self-heal,
instead of a 'mkdir'.
         b) If the self-heal daemon's crawl detects that a directory has
been deleted on the other bricks, it has to scan the entries inside that
directory and move them into 'anoninode' if their gfids exist on the other
bricks; otherwise they can be safely deleted.
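To make the proposal concrete, here is a toy walk-through of the rename,
link and crawl steps, with a single plain directory standing in for a
brick (the gfid strings and paths are invented assumptions; the real posix
xlator would operate on gfid handles internally):

```shell
#!/bin/sh
# Illustrative sketch of the anoninode flow; the gfid values are invented
# and a plain directory stands in for a brick.
set -e
brick=$(mktemp -d)
gfid=0a1b2c3d-0000-0000-0000-000000000001     # made-up file gfid
dgfid=0a1b2c3d-0000-0000-0000-000000000002    # made-up directory gfid
mkdir -p "$brick/.glusterfs/anoninode" "$brick/dir" "$brick/newdir"
echo abc > "$brick/dir/file1"

# Step 2: during entry self-heal, rename into anoninode instead of
# unlinking -- the inode (and its data) survives until resolved.
mv "$brick/dir/file1" "$brick/.glusterfs/anoninode/$gfid"

# Step 3a: when 'newdir/file2' is resolved to the same gfid, hard-link
# the existing inode instead of creating a fresh empty file.
ln "$brick/.glusterfs/anoninode/$gfid" "$brick/newdir/file2"

# Step 3b: the crawl drops the anoninode entry once the link count shows
# another name exists (GNU stat; %h is the hard-link count).
if [ "$(stat -c %h "$brick/.glusterfs/anoninode/$gfid")" -gt 1 ]; then
    rm "$brick/.glusterfs/anoninode/$gfid"
fi

# For directories: a directory parked in anoninode is resolved by rename,
# not mkdir, so its contents come along with it.
mkdir "$brick/.glusterfs/anoninode/$dgfid"
: > "$brick/.glusterfs/anoninode/$dgfid/inner"
mv "$brick/.glusterfs/anoninode/$dgfid" "$brick/newdir/subdir"

cat "$brick/newdir/file2"   # prints abc: the good data was preserved
```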

Please let us know if you see any issues with this approach.
