[Gluster-devel] preventing gfid-mismatches because of crashes in afr

Tue Mar 11 15:37:35 UTC 2014

hi,

   Traditionally, afr just remembers in extended attributes which of the directories are good and which are stale. At the time of self-heal it does a full directory scan, deletes stale entries, and creates new entries. There are two problems with this approach:
1) Creating, deleting, or renaming even one entry requires a full scan of the directory.
2) If both bricks crash at the same time while a rename is in progress, it can lead to same-name, different-gfid split-brains.
   Example:
            0) dir1 has file 'a' with gfid-a, dir2 has file 'b' with gfid-b.
            1) The user executes rename dir1/a -> dir2/b on the mount, overwriting the original file b.
            2) On brick-0 the rename succeeds, so the end result is that dir1 does not have 'a' and dir2 has file 'b' with gfid-a.
            3) At this point both brick processes go down (a data center shutdown, for example), so brick-1 still has dir1 with file 'a' with 'gfid-a' and dir2 with file 'b' with 'gfid-b'.
            4) Now when both bricks are back up, dir1 can be healed conservatively: 'a' will be recreated with 'gfid-a' and healed from brick-1 to brick-0 (incorrectly undoing the rename).
            5) But for dir2, brick-0 has a file 'b' with gfid-a whereas brick-1 has a file 'b' with 'gfid-b'. At the moment afr doesn't store any information to figure out which one is correct.
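
The example above can be replayed with a toy model. This is only a sketch: each brick is assumed to be a plain dict mapping (directory, name) to a gfid, and the helper names rename() and mismatches() are made up for illustration, they are not afr code.

```python
# Toy model (assumption): a brick is a dict mapping (directory, name) -> gfid.

def rename(brick, src, dst):
    """Apply rename src -> dst on one brick, overwriting dst if it exists."""
    brick[dst] = brick.pop(src)

def mismatches(brick0, brick1):
    """Entries that exist on both bricks under the same name with different gfids."""
    common = brick0.keys() & brick1.keys()
    return sorted(k for k in common if brick0[k] != brick1[k])

initial = {("dir1", "a"): "gfid-a", ("dir2", "b"): "gfid-b"}
brick0 = dict(initial)
brick1 = dict(initial)

# Step 2: the rename lands on brick-0 only, then both bricks go down (step 3).
rename(brick0, ("dir1", "a"), ("dir2", "b"))

print(mismatches(brick0, brick1))   # [('dir2', 'b')]
```

Nothing in either dict says which gfid for dir2/b is the right one, which is the missing information described in step 5.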

To address this issue, the pre-op/post-op of the entry operations needs to become finer-grained: it should track individual entries rather than whole directories.
A filename inside a directory can be uniquely identified by the entry-tuple (parent-gfid, entryname, entry-gfid).
Example: dir2/b in the example above can be represented as (gfid-of-dir2, b, gfid-b) on brick-1.

So we need to remember such information for every entry fop, along with whether that entry is coming 'in' to the directory or going 'out' of the directory.
In the previous example we would have remembered that dir2/b with gfid-b is going out of that directory, so that entry could be deleted and dir2/b with gfid-a could be healed from brick-0.
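
A rough sketch of how such records could settle the dir2/b conflict, assuming the pending records are available as a dict from entry-tuple to direction. EntryTuple, Direction and surviving_gfid() are hypothetical names, and the real heal decision in afr would have more inputs than this.

```python
from dataclasses import dataclass
from enum import Enum

class Direction(Enum):
    IN = "in"
    OUT = "out"

@dataclass(frozen=True)
class EntryTuple:
    pargfid: str   # gfid of the parent directory
    name: str      # entry name inside that directory
    gfid: str      # gfid of the entry itself

def surviving_gfid(pending, pargfid, name, candidates):
    """Return the one candidate gfid for pargfid/name that is not marked OUT.

    Returns None when the records do not single out exactly one candidate.
    """
    keep = [g for g in candidates
            if pending.get(EntryTuple(pargfid, name, g)) is not Direction.OUT]
    return keep[0] if len(keep) == 1 else None

# Records a rename dir1/a -> dir2/b (overwriting b) would leave behind
# if the post-op never ran.
pending = {
    EntryTuple("gfid-of-dir1", "a", "gfid-a"): Direction.OUT,
    EntryTuple("gfid-of-dir2", "b", "gfid-a"): Direction.IN,
    EntryTuple("gfid-of-dir2", "b", "gfid-b"): Direction.OUT,
}

print(surviving_gfid(pending, "gfid-of-dir2", "b", ["gfid-a", "gfid-b"]))  # gfid-a
```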

The solution that we come up with should broadly provide the following functionality:
1) Given an entry-tuple, it should be able to remember whether that entry is going in to or out of the directory.
2) Given an existing entry-tuple, it should be able to forget it.
3) Given an entry-tuple, we should be able to query whether that entry-tuple is going in or out.
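
As a minimal sketch, the three requirements amount to an interface like the one below. It is an in-memory stand-in only; the class name EntryIndex and its method names are assumptions, and any real version has to persist across brick crashes.

```python
class EntryIndex:
    """In-memory stand-in for the index; entries are (pargfid, name, gfid) tuples."""

    def __init__(self):
        self._pending = {}

    def remember(self, entry, direction):
        """Requirement 1: record that entry is going 'in' or 'out'."""
        if direction not in ("in", "out"):
            raise ValueError("direction must be 'in' or 'out'")
        self._pending[entry] = direction

    def forget(self, entry):
        """Requirement 2: drop the record for entry, if there is one."""
        self._pending.pop(entry, None)

    def query(self, entry):
        """Requirement 3: return 'in', 'out', or None when nothing is recorded."""
        return self._pending.get(entry)


index = EntryIndex()
entry = ("gfid-of-dir2", "b", "gfid-b")
index.remember(entry, "out")
print(index.query(entry))   # out
index.forget(entry)
print(index.query(entry))   # None
```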

This is one possible way to address this issue:
0) Create the directory .glusterfs/indices/entry and two files, 'in' and 'out', in that directory.
1) Every time creat/mknod/symlink/link/mkdir happens, create a hardlink from the path .glusterfs/indices/entry/pargfid/gfid/filename to '.glusterfs/indices/entry/in' as part of pre-op.
2) Every time unlink/rmdir happens, create a hardlink from the path .glusterfs/indices/entry/pargfid/gfid/filename to '.glusterfs/indices/entry/out' as part of pre-op.
3) Every time rename happens, create the following 2 or 3 hardlinks:
   - .glusterfs/indices/entry/old-pargfid/gfid/old-filename to '.glusterfs/indices/entry/out'
   - .glusterfs/indices/entry/new-pargfid/gfid/new-filename to '.glusterfs/indices/entry/in'
and, if the destination exists:
   - .glusterfs/indices/entry/new-pargfid/existing-file-gfid/new-filename to '.glusterfs/indices/entry/out'
4) Delete the same hardlinks as part of post-op.
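
The steps above could be prototyped roughly as follows. This is a sketch on an ordinary local directory, not brick code: the function names are hypothetical, gfids are passed as plain strings, and removing the emptied pargfid/gfid directories in post-op is my assumption (the steps only mention deleting the links). The query works by checking which marker file the link shares an inode with.

```python
import os
import tempfile

ENTRY_INDEX = os.path.join(".glusterfs", "indices", "entry")

def init_index(brick):
    """Step 0: create the index directory and the 'in' and 'out' marker files."""
    base = os.path.join(brick, ENTRY_INDEX)
    os.makedirs(base, exist_ok=True)
    for marker in ("in", "out"):
        open(os.path.join(base, marker), "a").close()

def _link_path(brick, pargfid, gfid, name):
    return os.path.join(brick, ENTRY_INDEX, pargfid, gfid, name)

def pre_op(brick, pargfid, gfid, name, direction):
    """Steps 1 and 2: hardlink pargfid/gfid/name to the 'in' or 'out' marker."""
    path = _link_path(brick, pargfid, gfid, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    os.link(os.path.join(brick, ENTRY_INDEX, direction), path)

def rename_pre_op(brick, gfid, old_pargfid, old_name, new_pargfid, new_name,
                  existing_gfid=None):
    """Step 3: two links, plus a third when the destination already exists."""
    pre_op(brick, old_pargfid, gfid, old_name, "out")
    pre_op(brick, new_pargfid, gfid, new_name, "in")
    if existing_gfid is not None:
        pre_op(brick, new_pargfid, existing_gfid, new_name, "out")

def post_op(brick, pargfid, gfid, name):
    """Step 4: delete the link; also prune directories left empty (assumption)."""
    path = _link_path(brick, pargfid, gfid, name)
    os.unlink(path)
    try:
        os.removedirs(os.path.dirname(path))
    except OSError:
        pass  # directory still holds other pending links

def query(brick, pargfid, gfid, name):
    """Return 'in' or 'out' for a pending entry-tuple, or None if not recorded."""
    path = _link_path(brick, pargfid, gfid, name)
    if not os.path.lexists(path):
        return None
    for marker in ("in", "out"):
        if os.path.samefile(path, os.path.join(brick, ENTRY_INDEX, marker)):
            return marker
    return None

with tempfile.TemporaryDirectory() as brick:
    init_index(brick)
    rename_pre_op(brick, "gfid-a", "gfid-of-dir1", "a", "gfid-of-dir2", "b",
                  existing_gfid="gfid-b")
    print(query(brick, "gfid-of-dir2", "gfid-b", "b"))   # out
    print(query(brick, "gfid-of-dir2", "gfid-a", "b"))   # in
    post_op(brick, "gfid-of-dir2", "gfid-b", "b")
    print(query(brick, "gfid-of-dir2", "gfid-b", "b"))   # None
```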

To improve upon the solution we can do some optimizations:
The maximum filename length is 255 bytes, and pargfid and gfid take 16 bytes each.
So
1) If the name of the file that is created/deleted/renamed is <= 223 bytes (filename-max-len(255) - twice-gfid-len(32) = 223), then instead of representing the entry-tuple as pargfid/gfid/filename (i.e. two directories and a filename) it can be represented as a single modified filename, pargfidgfidfilename: the first 16 bytes are pargfid, the next 16 bytes are gfid, and the rest is the filename. Use this filename as the link to 'in' or 'out'. (2 mkdirs are saved)
2) If the name of the file that is created/deleted/renamed is > 223 and <= 239 bytes (filename-max-len(255) - gfid-len(16) = 239), then we can probably use pargfid/gfidfilename as the link. (1 mkdir is saved)
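
The length arithmetic behind these two cases can be checked with a small sketch. It derives both thresholds from the 255-byte and 16-byte figures above (255 - 32 = 223 and 255 - 16 = 239). The function name index_layout() is made up, and the sketch only picks a layout; how 16 raw gfid bytes would be turned into valid filename bytes is left open.

```python
NAME_MAX = 255   # maximum filename length in bytes, as stated above
GFID_LEN = 16    # bytes per gfid, as stated above

def index_layout(filename):
    """Pick the index layout for a filename; returns (layout, mkdirs_saved)."""
    n = len(filename.encode("utf-8"))
    if n == 0 or n > NAME_MAX:
        raise ValueError("filename must be 1..%d bytes" % NAME_MAX)
    if n <= NAME_MAX - 2 * GFID_LEN:
        # pargfid + gfid + filename fit in one path component
        return ("pargfidgfidfilename", 2)
    if n <= NAME_MAX - GFID_LEN:
        # gfid + filename fit in one component under a pargfid directory
        return ("pargfid/gfidfilename", 1)
    # fall back to two directories and the plain filename
    return ("pargfid/gfid/filename", 0)

for length in (10, 223, 224, 239, 240, 255):
    print(length, index_layout("x" * length))
```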

Let me know your thoughts, and do let me know if there is an easier way that satisfies all the functionality I listed above.

Thanks to Niels for listening to the initial approach and reading the initial draft :-).

Pranith