[Gluster-devel] preventing gfid-mismatches because of crashes in afr

Fri Mar 14 05:02:21 UTC 2014

----- Original Message -----
> From: "raghav" <rpichai at redhat.com>
> To: gluster-devel at nongnu.org, "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> Sent: Wednesday, March 12, 2014 3:23:24 PM
> Subject: Re: [Gluster-devel] preventing gfid-mismatches because of crashes in afr
> 
> On 03/11/2014 09:07 PM, Pranith Kumar Karampuri wrote:
> > hi,
> >
> >     Traditionally afr just records in extended attributes which copies of a
> >     directory are good and which are stale, and then at self-heal time does
> >     a full directory scan, deleting stale entries and creating new ones.
> >     There are two problems with this approach:
> > 1) Even creating/deleting/renaming one entry requires a full scan of the
> > directory.
> > 2) If both bricks crash at the same time while a rename is going on, then
> > it can lead to same-name, different gfid split-brains.
> >     Example:
> >              0) dir1 has file 'a' with gfid-a, dir2 has file 'b' with
> >              gfid-b.
> >              1) user executes rename dir1/a -> dir2/b on the mount
> >              over-writing the original file b.
> >              2) On brick-0 rename succeeds so the end result is dir1 does
> >              not have 'a' and dir2 has file 'b' with gfid-a
> >              3) at this point both the brick processes go down or data
> >              center shutdown happens etc, so brick-1 still has dir1 with
> >              file 'a' with 'gfid-a' and dir2 with file 'b' with 'gfid-b'.
> >              4) Now when both bricks are back up, dir1 can be healed
> >              conservatively: 'a' is recreated on brick-0 with 'gfid-a' by
> >              healing it from brick-1 to brick-0 (incorrectly undoing the
> >              rename).
> >              5) But for dir2, brick-0 has a file 'b' with 'gfid-a'
> >              whereas brick-1 has a file 'b' with 'gfid-b', and afr at
> >              the moment doesn't store any information to figure out which
> >              one is correct.
> >
> > To address this issue, the granularity of the pre-op/post-op of entry
> > operations needs to be made finer.
> > A filename inside a directory can be uniquely identified by the entry-tuple
> > (parent-gfid, entryname, entry-gfid).
> > Example: For dir2/b in the example above we can represent it as
> > (gfid-of-dir2, b, gfid-b) on brick-1
> >
> > So we need to remember such information for every entry fop along with
> > whether that entry is coming 'in' to the directory or going 'out' of the
> > directory.
> > So in the previous example we would have remembered that dir2/b with gfid-b
> > is going out of that directory, so that entry could be deleted and dir2/b
> > with gfid-a could be healed from brick-0.
> >
> > The solution that we come up with should have the following functionalities
> > broadly:
> > 1) Given an entry-tuple it should be able to remember that it is going in
> > or out of that directory.
> > 2) Given an existing entry-tuple it should be able to forget it.
> > 3) Given an entry-tuple, we should be able to query if that entry-tuple is
> > going in/out.
> >
> > This is one possible way to address this issue:
> > 0) Create the directory .glusterfs/indices/entry and two files, 'in' and
> > 'out', in that directory.
> > 1) Every time creat/mknod/symlink/link/mkdir happens, create a hardlink from
> > the path .glusterfs/indices/entry/pargfid/gfid/filename to
> > '.glusterfs/indices/entry/in' as part of pre-op.
> > 2) Every time unlink/rmdir happens, create a hardlink from the path
> > .glusterfs/indices/entry/pargfid/gfid/filename to
> > '.glusterfs/indices/entry/out' as part of pre-op.
> > 3) Every time rename happens, create the following 2 or 3 hardlinks:
> >     - .glusterfs/indices/entry/old-pargfid/gfid/old-filename to
> >     '.glusterfs/indices/entry/out'
> >     - .glusterfs/indices/entry/new-pargfid/gfid/new-filename to
> >     '.glusterfs/indices/entry/in'
> > and if the destination exists:
> >     - .glusterfs/indices/entry/new-pargfid/existing-file-gfid/new-filename
> >     to '.glusterfs/indices/entry/out'
> > 4) Delete the same files as part of post-op.
> 2 questions:
> 
> 1) How does this approach solve the 1st case, where a scan of the full
> directory would be required? If we delete these transit files during
> the post-op, wouldn't that still require a full scan if one brick is down
> and directory entry operations are done on the other brick? (Or am I
> missing something here?)

Oops, I missed that part. As part of post-op we can ask the xlator to not delete this file.
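
A minimal sketch of the index described in the quoted proposal, in Python rather than in xlator code, to make the remember/forget/query operations and the "do not delete in post-op" idea concrete. The class EntryIndex, the helper rename_pre_op and the ok_on_all_bricks flag are hypothetical names for illustration only. In particular, the condition under which post-op keeps the record (the operation did not succeed on every brick) is an assumption; the thread does not spell it out.

```python
import os
import tempfile


class EntryIndex:
    """Toy model of .glusterfs/indices/entry (hypothetical helper, not afr code)."""

    def __init__(self, brick_root):
        self.base = os.path.join(brick_root, ".glusterfs", "indices", "entry")
        os.makedirs(self.base, exist_ok=True)
        for name in ("in", "out"):
            # Anchor files that every index record is hardlinked to.
            open(os.path.join(self.base, name), "a").close()

    def _path(self, pargfid, gfid, filename):
        return os.path.join(self.base, pargfid, gfid, filename)

    def remember(self, pargfid, gfid, filename, direction):
        """Pre-op: record that the entry-tuple is going 'in' or 'out'."""
        if direction not in ("in", "out"):
            raise ValueError(direction)
        path = self._path(pargfid, gfid, filename)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        if os.path.lexists(path):
            os.unlink(path)  # replace an older record for the same tuple
        os.link(os.path.join(self.base, direction), path)

    def forget(self, pargfid, gfid, filename):
        """Drop the record for an entry-tuple, if there is one."""
        path = self._path(pargfid, gfid, filename)
        if os.path.lexists(path):
            os.unlink(path)

    def query(self, pargfid, gfid, filename):
        """Return 'in', 'out', or None when nothing is recorded."""
        path = self._path(pargfid, gfid, filename)
        if not os.path.exists(path):
            return None
        for direction in ("in", "out"):
            if os.path.samefile(path, os.path.join(self.base, direction)):
                return direction
        return None

    def post_op(self, pargfid, gfid, filename, ok_on_all_bricks):
        """Assumption: forget only when the fop succeeded on every brick,
        so a record left behind tells self-heal exactly what to fix."""
        if ok_on_all_bricks:
            self.forget(pargfid, gfid, filename)


def rename_pre_op(idx, old_par, old_name, new_par, new_name, gfid, dest_gfid=None):
    """Create the 2 or 3 records the proposal lists for a rename."""
    idx.remember(old_par, gfid, old_name, "out")
    idx.remember(new_par, gfid, new_name, "in")
    if dest_gfid is not None:
        # The destination existed, so its old gfid is leaving the directory.
        idx.remember(new_par, dest_gfid, new_name, "out")


# The rename dir1/a -> dir2/b from the example, overwriting the old 'b'.
idx = EntryIndex(tempfile.mkdtemp())
rename_pre_op(idx, "gfid-dir1", "a", "gfid-dir2", "b", "gfid-a", dest_gfid="gfid-b")
direction_of_old_b = idx.query("gfid-dir2", "gfid-b", "b")
```

With a record like this left in place, self-heal could look up (gfid-of-dir2, b, gfid-b), see that it was going out, and pick the copy with gfid-a without scanning the whole directory.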

> 
> 2) During a crash, indices directory itself need not be intact. Would
> that not cause problems if we expect indices to be crash consistent?

In all known journal-based filesystems entry operations are ordered, so it will be best effort anyway.

According to the POSIX standard, entry operations on a directory are supposed to be durable, but most journal-based filesystems don't guarantee that in their default mode, for performance reasons. Instead they only preserve the ordering of entry operations. The POSIX standard is in the process of being modified to make ordering the requirement instead of durability.
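
If best-effort ordering turned out not to be enough for the index, one option would be to pay for durability explicitly. The sketch below is an assumption for illustration, not something proposed in this thread: it creates the index hardlink and then calls fsync on the parent directory, which is the usual way on Linux to push a new directory entry to disk. The name link_durably is hypothetical.

```python
import os


def link_durably(anchor, path):
    """Hardlink `path` to `anchor`, then fsync the directory holding `path`.

    Hypothetical helper: without the directory fsync, the new entry is
    only ordered by the journal and may be lost in a crash.
    """
    os.link(anchor, path)
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # flush the directory entry itself
    finally:
        os.close(dir_fd)
```

This costs an extra synchronous flush on every entry fop, which is exactly the performance trade-off that makes filesystems default to ordering only.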

Pranith
> 
> Regards
> Raghav
>