[Gluster-devel] AFR documentation

Thu Oct 18 06:10:14 UTC 2007

Hi All,

There have been a lot of mails recently on AFR and HA. I guess the
reason was the lack of documentation on self-heal, which has been
pending for a long time. If any user thinks that their mail was not
responded to, or that the response was incomplete, please follow it
up on the mailing list or IRC.

There is a change in AFR's functionality regarding the "option
replicate" feature. We realised that design-wise it is not good to
have this inside AFR and that it is better to have it outside AFR as a
separate translator. This will not affect users who have been
using "option replicate *:n" where n is the number of subvols.
To people making use of this option, we regret the inconvenience caused.
The pattern matching translator will be available in the 1.4 release,
so the overall functionality is not being compromised.

The reasons for taking the "option replicate" feature out of AFR:
* It does not belong there. This is evident from the fact that the
  "option replicate" option has to be exactly the same across all the
  AFRs. If it must be the same everywhere, we should be able to
  specify it at a place common to all AFRs instead of repeating the
  same option in each AFR.
* "option replicate" was making the working of self-heal more
  complicated. We had to come up with workarounds
  to make it work. In the long run workarounds are not good.

Here is the document which will be put up on the wiki. Any feedback
on what should be added, or suggestions of any kind, will be
appreciated.

=========
AFR provides RAID-1-like functionality. AFR replicates files and directories
across the subvolumes. Hence if AFR has four subvolumes, there will be four
copies of all files and directories. AFR provides HA, i.e. in case one of the
subvolumes goes down (e.g. server crash, network disconnection), AFR will
still service requests from the redundant copies.

AFR also provides self-heal functionality, i.e. when the crashed servers
come back up, the outdated files and directories will be updated with the
latest versions. AFR uses extended attributes of the backend file system
to track the versions of files and directories in order to provide the
self-heal feature.

* Note that the previously supported "option replicate *html:2,*txt:1"
  pattern matching feature has been moved out of AFR. It will be provided
  as a separate translator in 1.4.

volume afr-example
  type cluster/afr
  subvolumes brick1 brick2 brick3
end-volume

This sample configuration will replicate all directories and files on brick1,
brick2 and brick3. Each subvolume can be another translator (storage/posix
or protocol/client).

All read() operations are served from the first alive child. If all
three subvols are up, read() will be done on brick1. If brick1 is down,
read() will be done on brick2. If read() was being done on brick1
and it goes down, we fall back to brick2, and this is completely
transparent to the user applications.
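
The read policy above can be sketched in a few lines of Python. This is
only an illustration of the "first alive child" rule, not AFR's actual
code; the function name pick_read_child and the is_up mapping are
made up for the example.

```python
from typing import Dict, Optional, Sequence


def pick_read_child(subvolumes: Sequence[str],
                    is_up: Dict[str, bool]) -> Optional[str]:
    """Return the first subvolume, in configured order, that is up.

    Hypothetical helper: 'is_up' stands in for whatever AFR uses
    internally to track which children are reachable.
    """
    for name in subvolumes:
        if is_up.get(name, False):
            return name
    # No child is alive, so the read cannot be served.
    return None


bricks = ["brick1", "brick2", "brick3"]
all_up = {"brick1": True, "brick2": True, "brick3": True}
first_down = {"brick1": False, "brick2": True, "brick3": True}
```

With all_up the function picks brick1; with first_down it falls back
to brick2, which matches the behaviour described above.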

In 1.4 we will have:
* a feature where the user can specify the subvol from which AFR has
  to do read() operations (this will help users who have one of the
  subvols as local storage/posix)
* a feature to allow scheduling of read() operations amongst the
  subvols in round-robin fashion.

The order of the subvolumes list should be the same across all the AFRs,
as the subvolumes will be used as lock servers. TODO: details on working
of locking.

Self-Heal
AFR has a self-heal feature, which replaces outdated file and directory
copies with the most recent versions. For example, consider the following
config:

volume afr-example
  type cluster/afr
  subvolumes brick1 brick2
end-volume

File self-heal
Now if we create a file foo.txt on afr-example, the file will be created
on brick1 and brick2. The file will have two extended attributes associated
with it in the backend filesystem. One is trusted.afr.createtime and the
other is trusted.afr.version. The trusted.afr.createtime xattr holds the
create time (in seconds since the epoch) and trusted.afr.version
is a number that is incremented each time the file is modified. This
increment happens during close (in case any write was done before close).

Suppose brick1 goes down and we edit foo.txt, so the version on brick2
gets incremented. Now brick1 comes back up. When we open() foo.txt, AFR
will check whether the versions of the copies are the same. If they are
not, the outdated copy is replaced by the latest copy and its version is
updated. After the sync the open() proceeds in the usual manner and the
application calling open() can continue with its access to the file.

Suppose brick1 goes down, and we delete foo.txt and create a file with
the same name again, i.e. foo.txt. Now brick1 comes back up. Clearly
there is a chance that the version on brick1 is higher than the version
on brick2. This is where the createtime extended attribute helps in
deciding which copy is the outdated one. Hence we need to consider both
createtime and version to decide on the latest copy.
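
A minimal sketch of this decision in Python follows. It assumes that
createtime is compared first and version only breaks ties between copies
with the same createtime; the text above says both are considered but
does not spell out the exact order, so treat that as an assumption. The
CopyAttrs type and latest_copy function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CopyAttrs:
    """Hypothetical view of one copy's AFR xattrs on a subvolume."""
    subvolume: str
    createtime: int  # trusted.afr.createtime, seconds since the epoch
    version: int     # trusted.afr.version


def latest_copy(copies: Iterable[CopyAttrs]) -> CopyAttrs:
    # Assumption: a newer createtime always wins, and version is only
    # compared between copies created at the same time.
    return max(copies, key=lambda c: (c.createtime, c.version))


# foo.txt was deleted and recreated while brick1 was down, so brick1
# holds an older file with a higher version number.
stale = CopyAttrs("brick1", createtime=1000, version=7)
fresh = CopyAttrs("brick2", createtime=2000, version=1)
```

Here latest_copy([stale, fresh]) selects the brick2 copy even though
its version is lower, which is the case createtime is meant to cover.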

The version attribute is incremented during the close() call. The version
will not be incremented if no write() was done. If the fd passed to
close() was obtained from a create() call, we also create
the createtime extended attribute.
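
The close() bookkeeping can be sketched as below. This is a simplified
model that keeps the xattrs in a Python dict; the on_close function, its
parameters, and the choice of treating a missing version as 0 are
assumptions made for the example, not details taken from AFR.

```python
import time
from typing import Dict, Optional

CREATETIME = "trusted.afr.createtime"
VERSION = "trusted.afr.version"


def on_close(xattrs: Dict[str, int], wrote: bool, created: bool,
             now: Optional[int] = None) -> Dict[str, int]:
    """Update the (simulated) xattrs of a file when its fd is closed."""
    if created:
        # The fd came from create(), so record the create time.
        xattrs[CREATETIME] = int(time.time()) if now is None else now
    if wrote:
        # Only bump the version if a write() happened on this fd.
        # Assumption: a missing version attribute counts as 0.
        xattrs[VERSION] = xattrs.get(VERSION, 0) + 1
    return xattrs


attrs = on_close({}, wrote=True, created=True, now=1000)
```

A later close() without any write() leaves the version untouched, while
a close() after a write() increments it by one.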

Directory self-heal
Suppose brick1 goes down, we delete foo.txt, and brick1 comes back up.
Now we should not recreate foo.txt on brick2; instead we should delete
foo.txt on brick1. We handle this situation by having the createtime and
version attributes on the directory, similar to the file. When lookup()
is done on the directory, we compare the createtime/version attributes
of the copies, work out which files need to be deleted, delete those
files, and update the extended attributes of the outdated directory copy.
Each time a directory is modified (a file or a subdirectory is created
or deleted inside the directory) while one of the subvols is down, we
increment the directory's version.
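
Once the latest directory copy has been identified, the deletion step
amounts to a set difference between the two listings. The sketch below
shows only that step, under the assumption that the entry names of both
copies are already available; entries_to_delete is a hypothetical
helper, not an AFR function.

```python
from typing import Iterable, List


def entries_to_delete(latest_entries: Iterable[str],
                      outdated_entries: Iterable[str]) -> List[str]:
    """Names present in the outdated directory copy but not in the
    latest one. These are the entries that were deleted while the
    outdated subvolume was down."""
    return sorted(set(outdated_entries) - set(latest_entries))


# brick2 is the latest copy (foo.txt was deleted while brick1 was down).
brick2_listing = ["bar.txt"]
brick1_listing = ["bar.txt", "foo.txt"]
to_delete = entries_to_delete(brick2_listing, brick1_listing)
```

In this example to_delete contains only foo.txt, so foo.txt is removed
from brick1 rather than being copied back to brick2.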

lookup() is a call initiated by the kernel on a file or directory
just before any access to that file or directory. In glusterfs, by
default, lookup() will not be called again if it was already called
within the past one second on that particular file or directory.

The extended attributes can be seen in the backend filesystem using
the getfattr command. (getfattr -n trusted.afr.version <file>)
========