[Gluster-devel] AFR self-heal issues.

Sam Douglas sam.douglas32 at gmail.com
Tue Feb 19 02:51:45 UTC 2008


Hi,

== Background ==

We are setting up GlusterFS on a compute cluster. Each node has two
disk partitions /media/gluster1 and /media/gluster2 which are used for
the cluster storage.

We are currently using builds from TLA (patch 671 as of now).

I have a script that generates GlusterFS client configurations,
creating AFR instances over pairs of nodes in the cluster. Here is a
snippet from our current configuration:

# Client definitions
volume client-cn2-1
        type protocol/client
        option transport-type tcp/client
        option remote-host cn2
        option remote-subvolume brick1
end-volume

volume client-cn2-2
        type protocol/client
        option transport-type tcp/client
        option remote-host cn2
        option remote-subvolume brick2
end-volume

volume client-cn3-1
        type protocol/client
        option transport-type tcp/client
        option remote-host cn3
        option remote-subvolume brick1
end-volume

volume client-cn3-2
        type protocol/client
        option transport-type tcp/client
        option remote-host cn3
        option remote-subvolume brick2
end-volume

### snip - you get the idea ###

# Generated AFR volumes
volume afr-cn2-cn3
        type cluster/afr
        subvolumes client-cn2-1 client-cn3-2
end-volume

volume afr-cn3-cn4
        type cluster/afr
        subvolumes client-cn3-1 client-cn4-2
end-volume


### and so on ###

volume unify
        type cluster/unify
        option scheduler rr
        option namespace namespace
        subvolumes  afr-cn2-cn3 afr-cn3-cn4 afr-cn4-cn5 ...
end-volume


== Self healing program ==

I wrote a quick C program (medic) that uses the nftw function to walk
a directory tree, opening every regular file and readlink()ing every
symlink. This seems effective at forcing AFR to self-heal.
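
For reference, a minimal sketch of the approach (not the exact medic
source; error handling omitted):

#define _XOPEN_SOURCE 500   /* for nftw() */
#include <ftw.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <limits.h>

/* Open regular files and readlink symlinks; the access itself is what
   triggers AFR's self-heal on each entry. */
static int visit(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    char buf[PATH_MAX];

    if (typeflag == FTW_F) {
        int fd = open(path, O_RDONLY);
        if (fd >= 0)
            close(fd);
    } else if (typeflag == FTW_SL) {
        readlink(path, buf, sizeof(buf));
    }
    return 0;   /* keep walking even if one entry fails */
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    /* FTW_PHYS: report symlinks as FTW_SL instead of following them */
    return nftw(argv[1], visit, 64, FTW_PHYS) != 0;
}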


== Playing with AFR ==

We have a test cluster of 6 nodes set up.

In this setup, cluster node 2 is involved in 'afr-cn2-cn3' and
'afr-cn7-cn2'.

I copy a large directory tree (such as /usr) onto the cluster
filesystem, then 'cripple' node cn2 by deleting the data from its
backends and restarting glusterfsd on that system, to emulate the node
going offline or losing data.

(at this point, all the data is still available on the filesystem)

Running medic over the filesystem mount will now cause the data to be
copied back onto cn2's appropriate volumes and all is happy.

Opening every file on the filesystem seems a stupid waste of time if
you know which volumes have gone down (and when you have over 20TB in
hundreds of thousands of files, that is a considerable waste of time),
so I looked into mounting parts of the client translator tree at
separate mount points and running medic over just those.

 # mkdir /tmp/glfs
 # generate_client_conf > /tmp/glusterfs.vol
 # glusterfs -f /tmp/glusterfs.vol -n afr-cn2-cn3 /tmp/glfs
 # ls /tmp/glfs
    home/
    [Should be: home/ usr/]

A `cd /tmp/glfs/usr/` will succeed and usr/ itself will be
self-healed, but its contents will not be. Similarly, a
`cat /tmp/glfs/usr/include/stdio.h` will output the contents of the
file and cause it to be self-healed.

Changing the order of the subvolumes of the 'afr-cn2-cn3' volume so
that the up-to-date client is listed first causes the directory to be
listed correctly.
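
In our case that would be something like the following (assuming
client-cn3-2 holds the surviving copy after cn2 was crippled):

volume afr-cn2-cn3
        type cluster/afr
        subvolumes client-cn3-2 client-cn2-1
end-volume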

This seems to me like a minor-ish bug in cluster/afr's readdir
functionality.

-- Sam Douglas
