[Gluster-devel] afr logic

Alexey Filin alexey.filin at gmail.com
Wed Oct 17 16:01:46 UTC 2007


Hi Kevan,

Consistency of AFR'ed files is an important question with respect to
failures in the backend filesystem too. AFR is a remedy for node failures,
not for backend filesystem failures (at least not directly); in the latter
case files can be changed "legally", bypassing GlusterFS, by fsck after a
hardware/software failure, and those changes have to be handled on the
corrupted replica, otherwise reads of the same file can return different
data (especially with the forthcoming load-balanced reads across replicas).
Fortunately, rsync'ing the original should produce a consistent replica in
that case too (provided cluster/stripe under AFR treats all replicas
equally); unfortunately, extended attributes are not copied by rsync (I
tested it), and they can be needed during repair.
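
For example, something along these lines could re-apply the trusted.*
attributes by hand after an rsync. It is only a rough sketch: it assumes
Linux, a Python that provides os.listxattr/os.getxattr/os.setxattr, and
root access (the trusted.* namespace is hidden from ordinary users), and
the paths and function name are made up for illustration.

#!/usr/bin/env python3
# Rough sketch: re-apply the trusted.* extended attributes (e.g.
# trusted.afr.version) that rsync left behind, copying them from a source
# backend file to its freshly rsync'ed replica. Assumes Linux and root.
import os
import sys

def copy_trusted_xattrs(src, dst):
    """Copy every trusted.* xattr from src to dst."""
    for name in os.listxattr(src):
        if not name.startswith("trusted."):
            continue
        value = os.getxattr(src, name)
        os.setxattr(dst, name, value)
        print("%s=%r copied to %s" % (name, value, dst))

if __name__ == "__main__":
    # e.g.: copy_xattrs.py /export/brick1/data/file /export/brick2/data/file
    copy_trusted_xattrs(sys.argv[1], sys.argv[2])

With the os.setxattr() call dropped, the same loop doubles as a read-only
check for replicas that are missing trusted.afr.version entirely.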

It seems GlusterFS could try to handle hardware/software failures in the
backend filesystem with checksums stored in extended attributes. The
checksums would have to be calculated per file chunk, because a single
whole-file checksum requires full recalculation after appending or
changing one byte in a gigabyte file. Even then, after a corruption
GlusterFS would have to either recalculate the checksums of all files on
the corrupted filesystem (which may take far too long, the same problem as
with rsync'ing) or obtain a list of corrupted files from the backend
filesystem in some way (e.g. via a flag set by fsck in extended
attributes).
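
To make the chunking point concrete, here is a rough sketch of per-chunk
checksums; the chunk size, the backend path and the helper names are
invented for illustration and say nothing about how GlusterFS would
actually compute or store them.

#!/usr/bin/env python3
# Rough sketch: per-chunk checksums, so that a one-byte append to a
# gigabyte file only forces recomputation of the final chunk instead of a
# whole-file checksum. Nothing here mirrors real GlusterFS internals.
import hashlib
import os

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per chunk (arbitrary choice)

def chunk_digest(path, index, chunk_size=CHUNK_SIZE):
    """SHA-1 hex digest of chunk number `index` of the file."""
    with open(path, "rb") as f:
        f.seek(index * chunk_size)
        return hashlib.sha1(f.read(chunk_size)).hexdigest()

def touched_chunks(offset, length, chunk_size=CHUNK_SIZE):
    """Chunk indices covered by a write of `length` bytes at `offset`."""
    first = offset // chunk_size
    last = (offset + max(length, 1) - 1) // chunk_size
    return range(first, last + 1)

if __name__ == "__main__":
    path = "/export/brick1/data/bigfile"        # hypothetical backend file
    append_offset = os.path.getsize(path)       # a 1-byte append at EOF...
    for i in touched_chunks(append_offset, 1):  # ...touches only one chunk
        print("recompute chunk", i, chunk_digest(path, i))

Storing one digest per chunk (e.g. in extended attributes, as above) keeps
the work after a localized write proportional to the size of the write
rather than to the size of the file.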

Maybe some kind of distributed RAID is a better solution; a first step in
that direction has already been taken by cluster/stripe (unfortunately one
implementation, DDRaid, http://sources.redhat.com/cluster/ddraid/ by
Daniel Phillips, seems to be suspended). On the other hand it may be too
computationally and network intensive, and RAID underneath the backend
filesystem may be the best solution even taking the disk space overhead
into account.

I'm very interested to hear the GlusterFS developers' thoughts on this and
to clear up any misunderstanding on my part.

Regards, Alexey.

On 10/16/07, Kevan Benson <kbenson at a-1networks.com> wrote:
>
>
> When an afr encounters a file that exists on multiple shares but doesn't
> have the trusted.afr.version attribute set, it sets that attribute on all
> of the files and assumes they contain the same data.
>
> I.e. if you manually create the files directly on the servers with
> different content, appending to the file through the client will set the
> trusted.afr.version for both files and append to both files, but the
> files will still contain different content (the content from before the
> append).
>
> Now, this would be really hard to reproduce outside of this contrived
> example; it would probably require a failed write to all afr subvolumes,
> possibly at different points in the write operation, in which case the
> file content can't be trusted anyway, so it's really not a big deal.  I
> only mention it in case it might not be the desired behavior, and
> because it might be useful to have the first specified afr subvolume
> supply the file to the others when none of them has the
> trusted.afr.version attribute set, as when pre-populating the share
> (such as rsyncs from a dynamic source).  The problem is easily mitigated
> (rsync to a single share and trigger a self-heal, or rsync to the client
> mount point); I just figured I'd mention it, and that's only required if
> you really NEED to pre-populate data.
>
> --
>
> -Kevan Benson
> -A-1 Networks
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>


