[Gluster-devel] preventing gfid-mismatches because of crashes in afr

raghav rpichai at redhat.com
Wed Mar 12 09:53:24 UTC 2014


On 03/11/2014 09:07 PM, Pranith Kumar Karampuri wrote:
> hi,
>
>     Traditionally afr just remembers, in extended attributes, which copies of a directory are good and which are stale; at the time of self-heal it does a full directory scan, deleting stale entries and creating missing ones. There are two problems with this approach:
> 1) Even creating/deleting/renaming a single entry requires a full scan of the directory.
> 2) If both bricks crash at the same time while a rename is in progress, it can lead to same-name, different-gfid split-brains.
>     Example:
>              0) dir1 has file 'a' with gfid-a, dir2 has file 'b' with gfid-b.
>              1) The user executes rename dir1/a -> dir2/b on the mount, overwriting the original file 'b'.
>              2) On brick-0 the rename succeeds, so dir1 no longer has 'a' and dir2 has file 'b' with gfid-a.
>              3) At this point both brick processes go down (or a data-center shutdown happens etc.), so brick-1 still has dir1 with file 'a' with gfid-a and dir2 with file 'b' with gfid-b.
>              4) When both bricks come back up, dir1 can be healed conservatively: 'a' will be recreated with gfid-a on brick-0 by healing it from brick-1 (incorrectly undoing the rename).
>              5) But for dir2, brick-0 has a file 'b' with gfid-a whereas brick-1 has a file 'b' with gfid-b, and afr at the moment doesn't store any information to figure out which one is correct.
>
> To address this issue, the granularity of the pre-op/post-op for entry operations needs to be increased.
> A filename inside a directory can be uniquely identified by the entry-tuple (parent-gfid, entryname, entry-gfid).
> Example: dir2/b in the example above can be represented as (gfid-of-dir2, b, gfid-b) on brick-1.
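>
> As a rough sketch (illustrative only, not actual code; the type name is made up), such a tuple could be represented in C as:
>
>     #include <uuid/uuid.h>
>
>     /* A name inside a directory is identified by this tuple. */
>     typedef struct {
>             uuid_t parent_gfid;   /* gfid of the parent directory */
>             char   name[256];     /* entry name inside that directory */
>             uuid_t gfid;          /* gfid the name points to */
>     } entry_tuple_t;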
>
> So we need to remember such information for every entry fop along with whether that entry is coming 'in' to the directory or going 'out' of the directory.
> So in the previous example we would have remembered that dir2/b with gfid-b is going out of that directory, so that that entry can be deleted and dir2/b with gfid-a can be healed from brick-0.
>
> The solution that we come up with should broadly provide the following functionality (an interface sketch follows the list):
> 1) Given an entry-tuple it should be able to remember that it is going in or out of that directory.
> 2) Given an existing entry-tuple it should be able to forget it.
> 3) Given an entry-tuple, we should be able to query if that entry-tuple is going in/out.
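>
> In terms of an interface it would be something along these lines (a sketch only; the function names are made up, and entry_tuple_t refers to the tuple sketched above):
>
>     #include <stdbool.h>
>
>     /* Pre-op: remember that 'tuple' is going in to (in = true) or
>      * out of (in = false) its parent directory. */
>     int entry_index_add (entry_tuple_t *tuple, bool in);
>
>     /* Post-op: forget the entry once the fop completes. */
>     int entry_index_del (entry_tuple_t *tuple);
>
>     /* Self-heal: query whether this entry was recorded, and whether
>      * it was going in or out. */
>     int entry_index_lookup (entry_tuple_t *tuple, bool *in);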
>
> This is one possible way to address this issue (a rough sketch of the pre-op follows the list):
> 0) Create the directory .glusterfs/indices/entry and two files, 'in' and 'out', in that directory.
> 1) Every time creat/mknod/symlink/link/mkdir happens, create a hardlink from the path .glusterfs/indices/entry/pargfid/gfid/filename to '.glusterfs/indices/entry/in' as part of the pre-op.
> 2) Every time unlink/rmdir happens, create a hardlink from the path .glusterfs/indices/entry/pargfid/gfid/filename to '.glusterfs/indices/entry/out' as part of the pre-op.
> 3) Every time rename happens, create the following 2 (or 3) hardlinks:
>     - .glusterfs/indices/entry/old-pargfid/gfid/old-filename to '.glusterfs/indices/entry/out'
>     - .glusterfs/indices/entry/new-pargfid/gfid/new-filename to '.glusterfs/indices/entry/in'
> and, if the destination exists:
>     - .glusterfs/indices/entry/new-pargfid/existing-file-gfid/new-filename to '.glusterfs/indices/entry/out'
> 4) Delete the same links as part of the post-op.
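>
> To make the pre-op concrete, here is a rough sketch of the link creation (illustrative only, not actual code; the function name is made up, and error handling plus creation of the intermediate pargfid/gfid directories are omitted):
>
>     #include <limits.h>
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <uuid/uuid.h>
>
>     static int
>     entry_index_link (const char *brick, uuid_t pargfid, uuid_t gfid,
>                       const char *name, int in)
>     {
>             char pargfid_str[37], gfid_str[37];
>             char target[PATH_MAX], linkpath[PATH_MAX];
>
>             uuid_unparse (pargfid, pargfid_str);
>             uuid_unparse (gfid, gfid_str);
>
>             /* the pre-created .glusterfs/indices/entry/{in,out} file */
>             snprintf (target, sizeof (target),
>                       "%s/.glusterfs/indices/entry/%s",
>                       brick, in ? "in" : "out");
>
>             /* .glusterfs/indices/entry/<pargfid>/<gfid>/<name> */
>             snprintf (linkpath, sizeof (linkpath),
>                       "%s/.glusterfs/indices/entry/%s/%s/%s",
>                       brick, pargfid_str, gfid_str, name);
>
>             return link (target, linkpath);
>     }
>
> The post-op in step 4 would then simply unlink() the same linkpath.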
2 questions:

1) How does this approach solve the first problem, where a full scan of
the directory would be required? If we delete these in-transit index
files during the post-op, wouldn't a full directory scan still be
required when one brick is down and entry operations are done on the
other brick? (Or am I missing something here?)

2) After a crash, the indices directory itself need not be intact. Would
that not cause problems if we expect the indices to be crash-consistent?

Regards
Raghav



