[Gluster-devel] trusted.glusterfs.version xattr

Thu May 8 20:52:31 UTC 2008

--- Derek Price <derek at ximbiot.com> wrote:
> Okay, I think we may be operating with slightly
> different assumptions about the way things are 
> currently happening, so to start off:

Perhaps.  I have certainly been making plenty
of assumptions! :)

> 2.  I had the understanding due to a comment in
> another thread that the only operations that were 
> going across the AFR wire in the case of a 
> rename was a remove and then a create.  If this
> assumption is wrong, somebody please correct me.

I don't know if this is correct, but on a 
normal POSIX filesystem I believe the 
reverse would usually happen: first the 
new name would be linked to the inode, 
and then the old name would be unlinked.  
I am indeed assuming that this is what 
glusterfs is doing, but perhaps it is not?
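
To make the ordering I have in mind concrete, here is a
minimal sketch of a rename done as link-then-unlink on a
local POSIX filesystem.  The helper name rename_via_link
is made up for illustration.  This says nothing about what
glusterfs or AFR actually sends over the wire, and a real
rename(2) does this atomically in one call.

```python
import os
import tempfile


def rename_via_link(old: str, new: str) -> None:
    """Illustrative only: rename a regular file as link + unlink."""
    os.link(old, new)   # step 1: the new name now refers to the same inode
    os.unlink(old)      # step 2: the old name is dropped; the inode survives


with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "a")
    new = os.path.join(d, "z")
    with open(old, "w") as f:
        f.write("data\n")

    inode_before = os.stat(old).st_ino
    rename_via_link(old, new)
    inode_after = os.stat(new).st_ino

    same_inode = inode_before == inode_after   # the inode is preserved
    old_gone = not os.path.exists(old)         # only the name changed
```

The point is only that at no moment is the inode without
a name, which is what makes the sequence order-sensitive.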

...
> If, however, the only operations permitted on
> directory listings is to present a new list of 
> the current files and directories (basically adds 
> and removes), then your method breaks down because
> if a directory version changes then all children of 
> that directory must be assumed to be changed.  
> For example, in the second FS state shown below, 
> there is no way to distinguish which of the three 
> files has changed.  Given:
> 
> /:2 a/:2 b/:2 c/:4 /A:1
>   	             /B:1
>                    /C:1

Well, you do not appear to be using my 
versioning scheme here: you are not using
the parent ids on creation.  The parent id
versioning scheme would look something 
more like: 

Starting with

 /a/b/c/A    /a/b/c/B    /a/b/c/C
 /:2         /:2         /:2
 a:2/2       a:2/2       a:2/2       
 b:2/2/2     b:2/2/2     b:2/2/2
 c:2/2/2/4   c:2/2/2/5   c:2/2/2/6
 A:2/2/2/4/1 B:2/2/2/4/1 C:2/2/2/6/1

B and C would not be able to have the same
versions as A since c would get bumped 
between each add (this is not important 
in this example though, just mentioning it 
to be consistent).

> 	$ cd /a/b/c
> 	$ rm C

 /a/b/c
 /:2
 a:2/2       
 b:2/2/2
 c:2/2/2/7

> 	$ echo new content >C

 /a/b/c
 /:2
 a:2/2       
 b:2/2/2
 c:2/2/2/8
 C:2/2/2/8/1

If we now look at the three files, it is easy
to see which ones have been modified.

In the beginning above we had:
 A:2/2/2/4/1 B:2/2/2/4/1 C:2/2/2/6/1

Now we have:
 A:2/2/2/4/1 B:2/2/2/4/1 C:2/2/2/8/1

This means that C is a completely new
file and is not even a candidate for
a quick sync, since more than just
the final # (1) has changed.  Note 
that A and B have not changed; their 
versions do not need to be bumped 
simply because c has changed!  Does 
this make more sense?
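
The comparison above can be sketched in a few lines.  This
is only a toy model of the scheme as I described it, not
glusterfs code: the helper names bump and classify are made
up, versions are plain slash-separated strings, and I am
assuming the directory is bumped before the new child is
stamped (which is what the c:2/2/2/8 and C:2/2/2/8/1 lines
above imply).

```python
def bump(version: str) -> str:
    """Increment the last component of a slash-separated version."""
    head, _, last = version.rpartition("/")
    bumped = str(int(last) + 1)
    return f"{head}/{bumped}" if head else bumped


def classify(old: str, new: str) -> str:
    """Compare a file's version before and after, as in the example."""
    if old == new:
        return "unchanged"
    # Same parent prefix, only the final number differs: same file, edited.
    if old.rpartition("/")[0] == new.rpartition("/")[0]:
        return "quick-sync candidate"
    # The parent prefix differs: the file was created anew.
    return "new file"


# State copied from the example, just before "rm C".
dir_c = "2/2/2/6"
files = {"A": "2/2/2/4/1", "B": "2/2/2/4/1", "C": "2/2/2/6/1"}
before = dict(files)

# $ rm C
del files["C"]
dir_c = bump(dir_c)           # c becomes 2/2/2/7

# $ echo new content >C
dir_c = bump(dir_c)           # c becomes 2/2/2/8
files["C"] = dir_c + "/1"     # C is stamped with its parent's version

results = {name: classify(before[name], files[name]) for name in files}
for name in sorted(results):
    print(name, results[name])
```

A and B come out unchanged and C comes out as a new file,
without anything under c having been bumped.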

> renders:
> 
> /:2 a/:2 b/:2 c/:6 /A:1
> 		     /B:1
> 	             /C:1

Again, you are not using the parent ids 
on creation here.

> My solution solves this last problem (if it, in
> fact, even exists), 

As shown above, I believe it does not exist. :)

> though not the efficient rename issue (if it, in
> fact, even exists).  

It does exist, doesn't it?  What is in 
question is whether the parent id solves 
it, right?

> I was leaving the transaction journaling issue
> (basically what you represent is already being 
> maintained on a per-directory basis), as a 
> future update which would be easier to integrate
> once the first problem was solved.

Well, I was not assuming any form of
journaling.  I was assuming that renames 
are efficient and that they consist of
link/unlink sequences, which do indeed
need to be serialized to work.  This is
the real question: could the directory 
healing scheme potentially reorder the
link/unlink operations?  If so, then
renames will not work efficiently.  

Let's walk my previous example through
the directory healing process.  Before 
the heal we would have ended with server 
1 like this:

/a/b/c/ -> /a/b/c/file
 /:v4       /:v4
 a:v4/2     a:v4/2
 b:v4/2/2   b:v4/2/1
 c:v4/2/2/1 c:v4/2/2/2
            file:v4/2/2/2/1

Server 2 did not get these updates, so it
still looks like this:

/   -> /a/  -> /a/b/ -> /a/b/c/ -> /a/b/c/file
/:v1   /:v2    /:v2     /:v2       /:v2
       a:v2/1  a:v2/2   a:v2/2     a:v2/2
               b:v2/2/1 b:v2/2/2   b:v2/2/1
                        c:v2/2/2/1 c:v2/2/2/2
                                   file:v2/2/2/2/1

When server 2 rejoins and the cat happens: 
>  cat /a/b/c/file

/ is out of date so it is updated to:

 /a/b/c/file        ->  /z/b/c/file        

Although this was probably a mistake
in my original scheme, since it would
in fact be two steps, not one.  These 
steps would be:

 /a          ->  /a  & /z
 /a & /z     ->  /z

which would have bumped the / version twice.
However, I do not believe this would matter.
Ultimately, on healing, the directory would 
simply see the latest of the two versions,
which would allow the rename to persist:
the inode for 'a' was never deleted, it
just ended up linked at a different spot,
at z, in /.  

However, this was a simple "nice" rename: we
renamed a file within the same directory, so on 
healing, the inode will not be lost.  But
what happens if we move a file/directory 
from a higher-level directory to a lower-level 
directory?  I fear that the higher-level 
directory will probably get healed before the 
lower-level directory, which will cause the 
inode to be lost during the heal!  I am not 
sure how this can be solved without a 
journal.  Perhaps someone who understands 
AFR renames could help out here?
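
Here is a toy model of the failure I am worried about.
Everything in it is an assumption for illustration: the
heal function, the idea that both servers share inode
numbers, and the rule that an inode with no remaining
names is discarded.  It is not how AFR self-heal is
implemented; it only shows why heal order could matter
when a file moves from /hi/ down into /hi/lo/.

```python
def make_scenario():
    """Stale server still has /hi/file; fresh server has /hi/lo/file."""
    stale = {"/hi/": {"lo": 2, "file": 7}, "/hi/lo/": {}}
    fresh = {"/hi/": {"lo": 2}, "/hi/lo/": {"file": 7}}
    inodes = {2, 7}   # inodes whose data the stale server still holds
    return stale, fresh, inodes


def heal(stale, fresh, inodes, order):
    """Heal one directory at a time; return paths needing a full copy."""
    full_copies = []
    for path in order:
        have, want = stale[path], fresh[path]
        # Removes: drop names the up-to-date server no longer lists.
        for name in list(have):
            if name not in want:
                ino = have.pop(name)
                linked = any(ino in d.values() for d in stale.values())
                if not linked:
                    inodes.discard(ino)   # last name gone: data is lost
        # Adds: link names the up-to-date server has gained.
        for name, ino in want.items():
            if name not in have:
                if ino not in inodes:
                    full_copies.append(path + name)   # must re-fetch data
                    inodes.add(ino)
                have[name] = ino
    return full_copies


stale, fresh, inodes = make_scenario()
top_down = heal(stale, fresh, inodes, ["/hi/", "/hi/lo/"])

stale, fresh, inodes = make_scenario()
bottom_up = heal(stale, fresh, inodes, ["/hi/lo/", "/hi/"])

print("top-down full copies: ", top_down)
print("bottom-up full copies:", bottom_up)
```

In this model the top-down heal loses the inode and has
to copy the whole file again, while bottom-up keeps it.
Healing bottom-up is not a fix, though: it would lose the
inode for a move in the opposite direction, which is why
I suspect ordering alone is not enough.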

Cheers,

-Martin
