[Gluster-devel] proposals to afr

Krishna Srinivas krishna at zresearch.com
Tue Oct 23 07:25:18 UTC 2007


Alexy,

I could not exactly understand the algorithm and what it is fixing :( Can
you explain it once again?

AFR does not differentiate between its children as master/slaves.
When it writes the data, it sends the write() operation simultaneously
to all the children.
Let's call them the first child, 2nd child, etc. instead of master/slave.
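
To make that concrete, here is a rough stand-alone toy sketch of the
fan-out (plain C; the child names and fail flag are invented for
illustration, this is not glusterfs code):

/* Toy model of how AFR fans out a write: every child gets the same
 * write(), there is no master/slave distinction. */
#include <stdio.h>

#define NCHILDREN 3

struct child { const char *name; int fail; };

static int child_write(struct child *c, const char *buf)
{
    if (c->fail) {
        printf("write to %s FAILED\n", c->name);
        return -1;
    }
    printf("write to %s ok: %s\n", c->name, buf);
    return 0;
}

int main(void)
{
    struct child children[NCHILDREN] = {
        { "child-1", 0 }, { "child-2", 0 }, { "child-3", 1 }
    };
    int ok = 0;

    /* the same write is sent to every child; the call succeeds as long
     * as at least one child accepted it */
    for (int i = 0; i < NCHILDREN; i++)
        if (child_write(&children[i], "hello") == 0)
            ok++;

    printf("%d of %d children took the write\n", ok, NCHILDREN);
    return ok ? 0 : 1;
}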

more inline....

On 10/22/07, Kevan Benson <kbenson at a-1networks.com> wrote:
> Alexey Filin wrote:
> > Hi,
> >
> > may I propose some ideas to be implemented inside afr to increase its
> > reliability?
> >
> > * First idea: an extra extended attribute named e.g. afr_op_counter provides
> > info about the operations currently being performed on the file, so operations
> > changing a file's (meta)data are done this way:
> >
> > 1) afr_master.increase_afr_op_counter <for file in namespace>
> > 2) real operation over file (meta)data
> > 3) afr_master.start_op -> afr_slave.increase_afr_op_counter <for file on a
> > slave>
> > 4) loop over all slaves by 2)-3)
> >
> > during close():
> >
> > 1) afr_master.zero_op -> afr_slave.zero_afr_op_counter <for file on a slave>
> > 2) loop over all slaves by 1)
> > 3) afr_master.zero_afr_op_counter <for file in namespace>
> >
> > with this scheme all operations that finished incorrectly are disclosed in a
> > simple and fast way (a non-zero counter). The scheme does not replace the
> > afr version xattr; it is a complement that allows finding inconsistent replicas
> > when close() doesn't update the xattr on slaves due to an afr master crash
> >
>
> Hmm, sort of like a trusted_afr_version minor number, that gets set
> while in an operation.  Essentially equivalent to taking a file with an
> afr version of 3 and making it 3.5 for the duration of the operation,
> and 4 on close.  Any files on slaves that show they are in an op but no
> operation is actually in place need to be self-healed.  Sounds good to
> me, but then again, I'm not a GlusterFS dev. ;)
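
If I understand the proposal correctly, the counter would work roughly
like this toy sketch (plain C; the replica names and the flow are my
guesses, not real glusterfs code):

/* Rough illustration of the proposed afr_op_counter idea: bump a
 * per-replica counter before touching the replica, zero it again on
 * close().  A crash in between leaves the counter non-zero, which
 * marks the replica as possibly inconsistent. */
#include <stdio.h>

#define NREPLICAS 3

struct replica { const char *name; int op_counter; };

int main(void)
{
    struct replica reps[NREPLICAS] = {
        { "namespace", 0 }, { "slave-1", 0 }, { "slave-2", 0 }
    };
    int crashed = 1;   /* pretend the afr master died before close() */

    /* before the real operation: bump the counter on every replica */
    for (int i = 0; i < NREPLICAS; i++)
        reps[i].op_counter++;

    /* ... the real write to the replicas would happen here ... */

    /* close(): zero the counters again -- skipped because we "crashed" */
    if (!crashed)
        for (int i = 0; i < NREPLICAS; i++)
            reps[i].op_counter = 0;

    /* later: any replica still carrying a non-zero counter was left by
     * an operation that never finished and should be self-healed */
    for (int i = 0; i < NREPLICAS; i++)
        if (reps[i].op_counter != 0)
            printf("%s: op_counter=%d -> needs self-heal\n",
                   reps[i].name, reps[i].op_counter);
    return 0;
}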

What situation are we trying to handle here that is not handled in the
way it works now?

There is one situation: suppose the first write fails on the 1st child (crash)
and succeeds on the 2nd child (so the second write will not happen on the 1st
child), and then the second write fails on the 2nd child (crash). Now close()
will not increment the version on either child, so the data is inconsistent
but the version numbers remain the same. We can handle this: when a write
fails on one of the children, increment the version on all the other children
so that during the next open() we sync it.
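
A minimal sketch of that fix, assuming each child carries a version
number (the structures and names here are invented, not the actual
translator code):

/* When a write fails on one child, bump the version on every child
 * that did take the write; the next open() then sees a version
 * mismatch and can sync the stale child. */
#include <stdio.h>

#define NCHILDREN 2

struct child { const char *name; int version; int up; };

static void afr_write(struct child *c, int n)
{
    int failed = 0;

    /* try the write on every child; a down child means it failed there
     * (the actual data transfer is elided in this toy model) */
    for (int i = 0; i < n; i++)
        if (!c[i].up)
            failed = 1;

    /* on failure, bump the version on the children that did succeed */
    if (failed)
        for (int i = 0; i < n; i++)
            if (c[i].up)
                c[i].version++;
}

int main(void)
{
    struct child ch[NCHILDREN] = {
        { "child-1", 3, 1 }, { "child-2", 3, 1 }
    };

    ch[0].up = 0;              /* child-1 crashes before the write    */
    afr_write(ch, NCHILDREN);  /* child-2 takes it, version becomes 4 */

    /* next open(): the versions differ, so the stale copy gets synced */
    printf("%s v%d, %s v%d -> sync %s from %s\n",
           ch[0].name, ch[0].version, ch[1].name, ch[1].version,
           ch[1].version > ch[0].version ? ch[0].name : ch[1].name,
           ch[1].version > ch[0].version ? ch[1].name : ch[0].name);
    return 0;
}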

>
> > * Second idea: an afr journal on the master (for data, or metadata only, like
> > in modern local FS's), to keep all updates in it during operations with afr
> > slaves and to recover after an afr crash
> >
>
> I'm not sure a journal's necessary with self heal.  It would speed up
> recovery of failed processes in some cases, but slow it down in others.
> There should be another copy of the data by the nature of AFR, so self
> heal can recover the problem on a node by the copy operation it does
> currently.  It might be somewhat slower for small operations, but it's
> quite simple and functional.

correct.

>
> As it is now, if a node dies during a write, the file's
> trusted_afr_version isn't incremented on that node, and the next read of
> the file when the node is active will overwrite the inconsistent file
> with the good copy from another node.  The client experiences a delay
> while glusterfs waits for the failed node to time out before it continues
> its writes, and then continues on.  Besides the delay, node failures
> (and the subsequent automatic repair of the FS) are transparent to the
> client with regard to AFR.

correct.
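
A toy illustration of that current behaviour, assuming a per-copy
trusted_afr_version and a copy-over style heal (the structures and
names are invented for the sketch):

/* On the next open()/read, compare the version of each copy and
 * overwrite the lower-versioned (stale) copy with the good one. */
#include <stdio.h>
#include <string.h>

struct copy { const char *node; int version; char data[32]; };

static void self_heal(struct copy *a, struct copy *b)
{
    struct copy *good  = (a->version >= b->version) ? a : b;
    struct copy *stale = (good == a) ? b : a;

    if (good->version != stale->version) {
        memcpy(stale->data, good->data, sizeof stale->data);
        stale->version = good->version;
        printf("healed %s from %s (now v%d: %s)\n",
               stale->node, good->node, stale->version, stale->data);
    }
}

int main(void)
{
    /* node2 died during the write, so its version was never bumped */
    struct copy node1 = { "node1", 4, "new contents" };
    struct copy node2 = { "node2", 3, "old contents" };

    self_heal(&node1, &node2);
    return 0;
}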

Regards
Krishna

>
> --
>
> -Kevan Benson
> -A-1 Networks
>




