[Gluster-users] AFR Version used for self-heal

Fri Feb 26 05:48:41 UTC 2016

On 02/26/2016 10:02 AM, Kyle Maas wrote:
> On 02/25/2016 08:20 PM, Ravishankar N wrote:
>> On 02/25/2016 11:36 PM, Kyle Maas wrote:
>>> How can I tell what AFR version a cluster is using for self-heal?
>> If all your servers and clients are 3.7.8, then they are by default
>> running afr-v2.  Afr-v2 was a re-write of afr that went in for 3.6,
>> so any gluster package from then on has this code; you don't need to
>> explicitly enable anything.
> That was what I thought until I ran across this IRC log where JoeJulian
> asked if it was explicitly enabled:
>
> https://irclog.perlgeek.de/gluster/2015-10-29
>
>>> The reason I ask is that I have a two-node replicated 3.7.8 cluster (no
>>> arbiters) which has locking behavior during self-heal which looks very
>>> similar to that of AFRv1 (only heals one file at a time per self-heal
>>> daemon, appears to lock the full inode while it's healing it instead of
>>> just ranges, etc.),
>>   Both v1 and v2 use range locks while healing a given file, so clients
>> shouldn't block when heals happen. What is the problem you're facing?
>> Are your clients also at 3.7.8?
> Primary symptoms are:
>
> 1. While a self-heal is running, only one file at a time is healed per
> brick.  As I understand it, AFRv2 and up should allow for multiple files
> to be healed concurrently or at least multiple ranges within a file,
> particularly with io-thread-count set to >1.  During a self-heal,
> neither I/O nor network is saturated, which leads me to believe that I'm
> looking at a single synchronous self-healing process.
The self-heal daemon on each node processes one file at a time per
replica, so in that sense it is serial. We are working on the
multi-threaded self-heal patch (http://review.gluster.org/#/c/13329/)
for parallel heals.
>
> 3. More troubling is that during a self-heal, clients cannot so much as
> list the files on the volume until the self-heal is done.  No errors.
> No timeouts.  They just freeze.  As soon as the self-heal is complete,
> they unfreeze and list the contents.
I'm guessing http://review.gluster.org/#/c/13207/ would fix that. But as
a workaround, can you see if 'gluster vol set volname data-self-heal
off' makes them more responsive?
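
A minimal sketch of that workaround, assuming a placeholder volume name
`myvol` (not from this thread). It uses the fully qualified option name
`cluster.data-self-heal`; including `cluster.metadata-self-heal` and
`cluster.entry-self-heal` as well is an assumption about what "client-side
heal" covers here, not something confirmed above. The helper only prints
the commands so they can be reviewed before being run on a server node.

```shell
# Print (do not run) the commands that toggle client-side heal on a volume.
# "myvol" is a placeholder volume name; substitute your own.
client_heal_cmds() {
  vol="$1"     # volume name
  state="$2"   # "on" or "off"
  # The metadata and entry options are included as an assumption;
  # the message above only names data-self-heal.
  for opt in cluster.data-self-heal cluster.metadata-self-heal cluster.entry-self-heal; do
    echo "gluster volume set $vol $opt $state"
  done
}

client_heal_cmds myvol off
```

To apply the printed commands, pipe the output to `sh` on one of the
servers; calling the helper with `on` instead of `off` prints the commands
that turn the options back on.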
>
> 4. Any file access during a self-heal also freezes, just like a
> directory listing, until the self-heal is done.
Same as above: please see if disabling client-side heal helps.

Regards,
Ravi

> This wreaks havoc on
> users who have files open when one of the bricks is rebooted and has to
> be healed, since with as much data as is stored on this cluster, a
> self-heal can take almost 24 hours.
>
> I experience the same problems when I run without any clients other than
> the bricks themselves mounting the volume, so yes, it happens with the
> clients on 3.7.8 as well.
>
> Warm Regards,
> Kyle Maas
>