[Gluster-devel] some problems of afr

Wed Aug 12 16:07:29 UTC 2009

Over the past few days I have tested afr as carefully as I can, and I have
found some problems:
1. *The mop stats problem.* Afr has no mop->stats function of its own, so it
falls back to the default stats function. This causes a problem when gluster
is configured as unify + afr. I have fixed this bug, and I will share the
patch later.

2. *The first_up_child problem.* From reading the source code and the afr
webpage, I have found that some of the fops (readdir, for example) use the
first up child of afr to perform the action. This is obviously not correct.
Consider an afr made of subvolumes client0 and client1, where client0
connects to brick0 of server0 and client1 connects to brick1 of server1.
After hours of running normally, server0 goes down, so client0 loses its
connection and afr has to use client1 as its first_up_child. When server0 is
restored, afr uses client0 as the first_up_child again. If the user does not
remember which files were newly created while server0 was down, some of
those files will not be recovered by "ls -lsR" until server0's next stop.
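
The scenario above can be sketched with a small toy model. This is not
GlusterFS code: the dictionaries, the create() helper and the readdir()
helper are hypothetical stand-ins, and the only rule taken from the
description is that readdir is served by the first child that is up.

```python
# Toy model, not GlusterFS code: each child is a dict with an "up" flag
# and the set of file names stored on its brick.

def first_up_child(children):
    """Return the index of the first child that is up, or None."""
    for index, child in enumerate(children):
        if child["up"]:
            return index
    return None


def readdir(children):
    """List the directory using only the first up child."""
    index = first_up_child(children)
    if index is None:
        raise OSError("no child is up")
    return sorted(children[index]["files"])


def create(children, name):
    """Create a file on every child that is currently up."""
    for child in children:
        if child["up"]:
            child["files"].add(name)


children = [
    {"name": "client0", "up": True, "files": set()},
    {"name": "client1", "up": True, "files": set()},
]

create(children, "a")                      # both up: "a" lands on both bricks
children[0]["up"] = False                  # server0 goes down
create(children, "b")                      # "b" lands only on client1
listing_while_down = readdir(children)     # served by client1
children[0]["up"] = True                   # server0 comes back
listing_after_restore = readdir(children)  # served by client0 again

print(listing_while_down)     # ['a', 'b']
print(listing_after_restore)  # ['a']
```

In this model the file "b" disappears from the listing as soon as client0
is back, which is the behaviour described above.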

My opinion: 1) first_up_child can be replaced by first_reference_child. For
example, when afr first starts up, the first_reference_child is client0. When
client0 stops, it becomes client1 and stays client1 even after client0 comes
back; only if client1 dies does it return to client0. First_reference_child
may also have problems, but I think it can do a better job than
first_up_child. 2) File recovery should be done automatically when a client
is restored, so you may need some logs for this. This would solve the
first_up_child problem completely.
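
A minimal sketch of the first_reference_child idea in 1), assuming the
reference only moves when the current reference child goes down. The class
name ReferenceSelector and its methods are hypothetical and do not exist in
the afr source.

```python
class ReferenceSelector:
    """Toy model of the proposed first_reference_child (hypothetical)."""

    def __init__(self, child_count):
        self.up = [True] * child_count
        self.reference = 0  # at startup the reference is the first child

    def set_up(self, index, is_up):
        """Record a child going up or down and update the reference."""
        self.up[index] = is_up
        # Move the reference only when there is none or it has gone down.
        if self.reference is None or not self.up[self.reference]:
            self.reference = self._next_up()

    def _next_up(self):
        """Return the first child that is up, or None if all are down."""
        for index, is_up in enumerate(self.up):
            if is_up:
                return index
        return None


selector = ReferenceSelector(2)
history = [selector.reference]
# client0 stops, client0 comes back, client1 dies
for index, is_up in [(0, False), (0, True), (1, False)]:
    selector.set_up(index, is_up)
    history.append(selector.reference)

print(history)  # [0, 1, 1, 0]
```

The reference stays on client1 after client0 returns, so files created
while server0 was down remain visible to readdir.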

3. *Wrong file recovery.* The problem is very simple to reproduce. Create a
directory named *DIR* in the gluster root directory (say */mnt/gl*), and
create a file named *FILE* in *DIR* (*/mnt/gl/DIR/FILE*). Do all of this
while all the sub-volumes of afr (say client0 and client1) are up. Then turn
server0 down and run "rm -rf DIR" under */mnt/gl*. After that, restore
server0 so that client0 is working again, and run "ls -lsR" under */mnt/gl*.
You will see that *DIR* and *FILE* are still there.
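
The reproduction above can be modelled in a few lines. This is only a
sketch under a strong assumption: that recovery copies any entry that
exists on one brick but not on the other, with no record of deletions. The
naive_heal() function is hypothetical and is a simplification, not the
actual afr self-heal algorithm.

```python
def naive_heal(brick_a, brick_b):
    """Copy every path missing from one brick to the other (assumed rule)."""
    merged = brick_a | brick_b
    return set(merged), set(merged)


# Both sub-volumes are up: DIR and DIR/FILE exist on both bricks.
brick0 = {"DIR", "DIR/FILE"}
brick1 = {"DIR", "DIR/FILE"}

# server0 is down, so "rm -rf DIR" reaches only brick1.
brick1 = {
    path for path in brick1
    if path != "DIR" and not path.startswith("DIR/")
}
after_rm = sorted(brick1)

# server0 comes back and "ls -lsR" triggers recovery.
brick0, brick1 = naive_heal(brick0, brick1)

print(after_rm)        # []
print(sorted(brick1))  # ['DIR', 'DIR/FILE']
```

Without some record that DIR was deleted, a merge of this kind cannot tell
a deleted entry from one that was never replicated.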

This problem exists because gluster fails to remove DIR during the data sync
process when client0 comes back up, so it recovers the files in the reverse
direction. I have written a routine that removes the files in the directory
recursively and finally deletes the directory itself. Sometimes it prevents
the wrong file recovery, and sometimes it fails. The reason seems
complicated; I will post more later.
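
The order of operations in such a routine (contents first, directory last)
can be illustrated on a local filesystem. This is not the routine mentioned
above, which lives inside afr and is not shown here; remove_tree() is a
hypothetical helper that only demonstrates the depth-first removal order.

```python
import os
import tempfile


def remove_tree(path):
    """Remove the contents of a directory first, then the directory itself."""
    for entry in os.listdir(path):
        child = os.path.join(path, entry)
        if os.path.isdir(child) and not os.path.islink(child):
            remove_tree(child)   # descend into real subdirectories
        else:
            os.unlink(child)     # regular files and symlinks
    os.rmdir(path)               # the directory is empty at this point


# Build a small tree in a temporary directory and remove it.
root = tempfile.mkdtemp()
target = os.path.join(root, "DIR")
os.makedirs(os.path.join(target, "SUB"))
with open(os.path.join(target, "FILE"), "w") as handle:
    handle.write("data")

remove_tree(target)
still_exists = os.path.exists(target)
os.rmdir(root)

print(still_exists)  # False
```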

Thanks for your attention. I hope that I have described my view clearly,
even though I was badly in need of sleep.