[Gluster-users] AFR Version used for self-heal

Fri Feb 26 04:32:44 UTC 2016

On 02/25/2016 08:20 PM, Ravishankar N wrote:
> On 02/25/2016 11:36 PM, Kyle Maas wrote:
>> How can I tell what AFR version a cluster is using for self-heal?
> If all your servers and clients are 3.7.8, then they are by default
> running afr-v2.  Afr-v2 was a re-write of afr that went in for 3.6,
> so any gluster package from then on has this code, you don't need to
> explicitly enable anything.

That was what I thought until I ran across this IRC log where JoeJulian
asked if it was explicitly enabled:

https://irclog.perlgeek.de/gluster/2015-10-29

>>
>> The reason I ask is that I have a two-node replicated 3.7.8 cluster (no
>> arbiters) which has locking behavior during self-heal which looks very
>> similar to that of AFRv1 (only heals one file at a time per self-heal
>> daemon, appears to lock the full inode while it's healing it instead of
>> just ranges, etc.),
> Both v1 and v2 use range locks while healing a given file, so clients
> shouldn't block when heals happen. What is the problem you're facing?
> Are your clients also at 3.7.8?

Primary symptoms are:

1. While a self-heal is running, only one file at a time is healed per
brick.  As I understand it, AFRv2 and up should allow for multiple files
to be healed concurrently or at least multiple ranges within a file,
particularly with io-thread-count set to >1.  During a self-heal,
neither I/O nor network is saturated, which leads me to believe that I'm
looking at a single synchronous self-healing process.

2. More troubling is that during a self-heal, clients cannot so much as
list the files on the volume until the self-heal is done.  No errors.
No timeouts.  They just freeze.  As soon as the self-heal is complete,
they unfreeze and list the contents.

3. Any file access during a self-heal also freezes, just like a
directory listing, until the self-heal is done.  This wreaks havoc on
users who have files open when one of the bricks is rebooted and has to
be healed, since with as much data as is stored on this cluster, a
self-heal can take almost 24 hours.
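
To put numbers on the client freezes described above, something like the
following could be run on a client while a heal is in progress.  This is
only a rough sketch, not anything Gluster-specific: it just times plain
directory listings.  The function names and the mount point in the note
below are hypothetical.

```python
import os
import time


def time_listing(path):
    """Return (seconds_elapsed, entry_count) for one listing of path."""
    start = time.monotonic()
    entries = os.listdir(path)  # blocks for as long as the mount blocks
    return time.monotonic() - start, len(entries)


def probe(path, samples=5, interval=1.0):
    """Time `samples` listings of `path`, pausing `interval` seconds between them."""
    results = []
    for i in range(samples):
        results.append(time_listing(path))
        if i < samples - 1:
            time.sleep(interval)
    return results
```

Calling probe("/mnt/gv0") (an assumed mount point; substitute the real
one) returns a list of (seconds, entry count) pairs.  If the listing
freezes as described, the call simply blocks until the mount responds
again, and the elapsed time for that sample shows how long the freeze
lasted.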

I experience the same problems when I run without any clients other than
the bricks themselves mounting the volume, so yes, it happens with the
clients on 3.7.8 as well.

Warm Regards,
Kyle Maas