[Gluster-users] A few queries on self-healing and AFR (glusterfs 3.4.2)

Thu Feb 5 12:30:13 UTC 2015

Thank you, Krutika. We are currently planning to migrate our system to
3.5.3, which should be done within a month.

Please do look at my follow-up mail, though, and also at
http://www.gluster.org/pipermail/gluster-users/2015-February/020519.html,
which is another thread I started some time back. I now find that the two
are basically the same problem.

Here is what I found. I have the following setup:

> > Volume Name: replicated_vol
> > Type: Replicate
> > Volume ID: 26d111e3-7e4c-479e-9355-91635ab7f1c2
> > Status: Started
> > Number of Bricks: 1 x 2 = 2
> > Transport-type: tcp
> > Bricks:
> > Brick1: serv0:/mnt/bricks/replicated_vol/brick
> > Brick2: serv1:/mnt/bricks/replicated_vol/brick
> > Options Reconfigured:
> > diagnostics.client-log-level: INFO
> > network.ping-timeout: 10
> > nfs.enable-ino32: on
> > cluster.self-heal-daemon: on
> > nfs.disable: off

replicated_vol is mounted using mount.glusterfs at /mnt/replicated_vol on
both servers. Using `netstat`, I found that while the mount client
(/usr/sbin/glusterfs) on serv1 was connected to three ports (the local
glusterd and the local and remote glusterfsd), the mount client on serv0 was
connected only to the local glusterfsd and glusterd. In effect, none of the
write requests serviced by the mount client on serv0 were being sent to
glusterfsd on serv1. Writes made through serv0 reached serv1 only later,
when the self-heal daemon (shd) healed them once every cluster.heal-timeout.
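
A minimal sketch of the `netstat` check described above, in Python. It assumes the output of `netstat -tnp` (run as root, so that the PID/Program column is filled in) has been saved as text, and that you know which brick endpoints the client ought to reach, for example from the port column of `gluster volume status`. The helper names and every address and port in the sample are hypothetical.

```python
# Sketch: which expected brick endpoints has a client process no live TCP
# connection to? Input is text captured from `netstat -tnp`; the sample
# addresses and ports below are hypothetical.


def established_remotes(netstat_text, program):
    """Return the remote 'host:port' strings of ESTABLISHED TCP connections
    owned by `program` (matched against the PID/Program name column)."""
    remotes = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Data rows: Proto Recv-Q Send-Q Local Foreign State PID/Program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue  # header lines and anything that is not TCP
        state, owner = fields[5], fields[6]
        if state == "ESTABLISHED" and owner.split("/", 1)[-1] == program:
            remotes.add(fields[4])
    return remotes


def missing_endpoints(netstat_text, program, expected):
    """Expected endpoints (e.g. one per brick) with no live connection."""
    return sorted(set(expected) - established_remotes(netstat_text, program))


# Hypothetical capture on serv0 (10.0.0.1): the mount client talks to the
# local glusterd (24007) and the local brick, but not to the brick on serv1.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address    Foreign Address  State       PID/Program name
tcp        0      0 10.0.0.1:1021    10.0.0.1:24007   ESTABLISHED 2001/glusterfs
tcp        0      0 10.0.0.1:1020    10.0.0.1:49152   ESTABLISHED 2001/glusterfs
tcp        0      0 10.0.0.1:22      10.0.0.9:50000   ESTABLISHED 900/sshd
"""

if __name__ == "__main__":
    bricks = ["10.0.0.1:49152", "10.0.0.2:49153"]
    print(missing_endpoints(SAMPLE, "glusterfs", bricks))
```

On the sample capture this reports `10.0.0.2:49153`, the (hypothetical) brick on serv1 that the serv0 client never reached.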

Further investigation revealed the cause: the mount client on serv0 had
stale information about the port on which glusterfsd on serv1 listens. On
Jan 30, serv1 was rebooted, after which its brick port changed, but the
mount client on serv0 was never made aware of this and kept trying to
connect to the old port number every 3 seconds (filling up my /var/log in
the process).
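
The failure mode can be modelled in a few lines. This is a hypothetical simulation, not GlusterFS code: `PortMap` stands in for glusterd's record of which port each brick listens on, and the two client classes contrast a client that keeps retrying the port it looked up once (the behaviour observed on serv0) with one that asks the port map again after a failed connect. Each `connect()` call is one attempt; the 3-second retry interval is not modelled.

```python
# Hypothetical model of a client that caches a brick's listen port.
# Not GlusterFS code: PortMap stands in for glusterd's port registry.


class PortMap:
    """Maps brick name -> port its brick process currently listens on."""

    def __init__(self):
        self.ports = {}

    def register(self, brick, port):
        self.ports[brick] = port

    def lookup(self, brick):
        return self.ports[brick]


class CachingClient:
    """Looks the port up once and never refreshes it (stale-port behaviour)."""

    def __init__(self, portmap, brick, listening):
        self.portmap, self.brick, self.listening = portmap, brick, listening
        self.port = portmap.lookup(brick)

    def connect(self):
        # Succeeds only if something is really listening on the cached port.
        return self.port in self.listening


class RefreshingClient(CachingClient):
    """After a failed connect, re-query the port map and try once more."""

    def connect(self):
        if super().connect():
            return True
        self.port = self.portmap.lookup(self.brick)
        return super().connect()


def simulate_reboot():
    brick = "serv1:/mnt/bricks/replicated_vol/brick"
    pmap = PortMap()
    listening = set()  # ports currently open on serv1

    def start_brick(port):
        listening.clear()  # any previous brick process is gone
        listening.add(port)
        pmap.register(brick, port)

    start_brick(49152)  # hypothetical port before the reboot
    stale = CachingClient(pmap, brick, listening)
    fresh = RefreshingClient(pmap, brick, listening)
    before = (stale.connect(), fresh.connect())

    start_brick(49153)  # after the reboot the brick listens elsewhere
    after = (stale.connect(), fresh.connect())
    return before, after


if __name__ == "__main__":
    print(simulate_reboot())
```

Both clients connect before the reboot; afterwards only the client that re-queries the port map recovers, while the caching one fails on every attempt, which matches the endless reconnect messages in the log.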

More technical details can be found in the thread linked above. I'd greatly
appreciate some advice on what to look at next. Also, we do not have a
firewall on our servers; they're only test setups, not production.

Thanks again,
Anirban

From:   Krutika Dhananjay <kdhananj at redhat.com>
To:     A Ghoshal <a.ghoshal at tcs.com>
Cc:     gluster-users at gluster.org
Date:   02/05/2015 05:44 PM
Subject:        Re: [Gluster-users] A few queries on self-healing and AFR (glusterfs 3.4.2)

> From: "A Ghoshal" <a.ghoshal at tcs.com>
> To: gluster-users at gluster.org
> Sent: Tuesday, February 3, 2015 12:00:15 AM
> Subject: [Gluster-users] A few queries on self-healing and AFR (glusterfs 3.4.2)
>
> Hello,
>
> I have a replica-2 volume in which I store a large number of files that
> are updated frequently (critical log files, etc.). My files are generally
> stable, but one thing that does worry me from time to time is that files
> show up on one of the bricks in the output of
> `gluster v <volname> heal info`. These entries disappear on their own
> after a while (I am guessing when cluster.heal-timeout expires and the
> self-heal daemon triggers another heal). For certain files, this could be
> a bit of a bother in terms of fault tolerance.

In 3.4.x, even files that are currently being modified are listed in the
heal-info output. That could be why the file(s) disappear from the output
after a while, in which case reducing cluster.heal-timeout might not solve
the problem. Since 3.5.1, heal-info reports _only_ those files that are
truly undergoing heal.
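
A simplified model of why in-flight writes can show up in 3.4.x heal-info output. It assumes AFR's usual transaction shape: a pre-op that marks the change as pending on every replica, the write itself, and a post-op that clears the mark once all replicas have succeeded. The per-brick counters in `Changelog` stand in for AFR's changelog extended attributes, and `heal_info_naive()` is a hypothetical helper that, like the 3.4.x behaviour described above, lists any file with a non-zero counter. This is an illustration, not GlusterFS code.

```python
from collections import defaultdict


class Changelog:
    """Per-brick pending counters, standing in for AFR's changelog xattrs."""

    def __init__(self, bricks):
        self.pending = {b: defaultdict(int) for b in bricks}

    def pre_op(self, path):
        # Before the write: mark the change as pending on every replica.
        for counters in self.pending.values():
            counters[path] += 1

    def post_op(self, path, all_succeeded):
        # After the write: clear the mark only if every replica succeeded;
        # otherwise it stays set so the self-heal daemon will pick it up.
        if all_succeeded:
            for counters in self.pending.values():
                counters[path] -= 1

    def heal_info_naive(self):
        # Lists every file with a non-zero counter, including files whose
        # write is merely still in flight (the 3.4.x behaviour).
        return sorted(
            {p for c in self.pending.values() for p, n in c.items() if n}
        )


if __name__ == "__main__":
    log = Changelog(("serv0", "serv1"))
    log.pre_op("/app.log")
    print(log.heal_info_naive())  # listed while the write is in flight
    log.post_op("/app.log", all_succeeded=True)
    print(log.heal_info_naive())  # gone once the post-op completes
```

In this model a file that is merely being written appears in the listing and then vanishes by itself, with no heal involved, which matches the entries that "disappear on their own".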

> I was wondering if there is a way I could force AFR to return
> write-completion to the application only _after_ the data has been written
> successfully to both replicas (something like atomic writes), even at the
> cost of performance. This way I could ensure that my bricks are always in
> sync.

AFR has always returned write-completion status to the application only 
_after_ the data is written to all replicas. The appearance of files under 
modification in heal-info output might have led you to think the changes 
have not (yet) been synced to the other replica(s).
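
A minimal sketch of the acknowledgement rule described above: the write is sent to every replica in parallel, and the caller gets a result only after all replicas have answered. `Replica` and `replicated_write()` are hypothetical names and the in-memory replicas stand in for bricks; real AFR also takes locks and marks changelogs around the write, which this sketch omits.

```python
from concurrent.futures import ThreadPoolExecutor


class Replica:
    """In-memory stand-in for a brick."""

    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, {}

    def write(self, path, payload):
        if not self.up:
            raise ConnectionError(f"{self.name} unreachable")
        self.data[path] = payload
        return self.name


def replicated_write(replicas, path, payload):
    """Send the write to all replicas in parallel and return only after
    every replica has replied. Returns the names of replicas that failed;
    a non-empty list means the copies now differ and need healing later."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = {pool.submit(r.write, path, payload): r for r in replicas}
        failed = []
        for fut, replica in futures.items():
            try:
                fut.result()  # blocks until this replica has answered
            except ConnectionError:
                failed.append(replica.name)
    return failed


if __name__ == "__main__":
    a, b = Replica("serv0"), Replica("serv1")
    print(replicated_write([a, b], "/app.log", b"entry"))
```

Because the function waits on every future before returning, a caller never sees completion while one replica is still pending; whether a partial failure is reported to the application as success or error is a separate policy question that the sketch leaves to the caller.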

> The other thing I could possibly do is reduce my cluster.heal-timeout
> (currently 600 seconds). Is it a bad idea to set it to something as small
> as, say, 60 seconds for volumes where redundancy is a prime concern?
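
Some rough arithmetic on that trade-off, assuming (as described in this thread) that the self-heal daemon starts a crawl once every `cluster.heal-timeout` seconds, so a file that goes out of sync waits at most one interval for the next crawl. The time each crawl itself takes, which grows with the number of files, is ignored; `heal_schedule()` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def heal_schedule(heal_timeout_s):
    """Return (crawls per day, worst-case wait, mean wait) in seconds for a
    file that goes out of sync at a random moment, assuming one crawl per
    heal-timeout interval and ignoring how long each crawl takes."""
    crawls_per_day = SECONDS_PER_DAY // heal_timeout_s
    worst_wait = heal_timeout_s  # went out of sync just after a crawl
    mean_wait = heal_timeout_s / 2  # uniform over the interval
    return crawls_per_day, worst_wait, mean_wait


if __name__ == "__main__":
    for timeout in (600, 60):
        crawls, worst, mean = heal_schedule(timeout)
        print(f"heal-timeout={timeout}s: {crawls} crawls/day, "
              f"wait up to {worst}s (about {mean:.0f}s on average)")
```

Going from 600 s to 60 s cuts the worst-case window from ten minutes to one, at the price of ten times as many crawls (1440 instead of 144 a day) on every volume that has the option set; the option is changed per volume with `gluster volume set <volname> cluster.heal-timeout 60`.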

> One question, though: is healing by the self-heal daemon done using a
> separate thread for each replicated volume, or a single thread for all
> volumes? The reason I ask is that I have a large number of replicated
> volumes (17, to be precise), though I do have a reasonably powerful
> multicore processor array and plenty of RAM, and top indicates that the
> load on system resources is quite moderate.

There is an infrastructure component in Gluster called syncop, with which
multiple heal jobs are handled by a handful of threads. It can scale up to a
maximum of 16 threads, depending on the load. It is safe to assume that there
will be one healer thread per replica set, but if the load is not too high,
a single thread may do all the healing.
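
A toy model of the scaling behaviour described above: one heal job per replica set, served by a pool that grows with the backlog up to 16 threads and stays at a single thread when there is little to do. `healer_threads()` and `run_heals()` are hypothetical, and the "one thread per pending job" rule is an assumption made for illustration; it is not syncop's actual scaling algorithm.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_THREADS, MAX_THREADS = 1, 16  # the 16-thread ceiling is from the reply


def healer_threads(pending_jobs):
    """Threads a syncop-like pool might run for a given backlog
    (assumed rule: one per pending job, clamped to [1, 16])."""
    return max(MIN_THREADS, min(MAX_THREADS, pending_jobs))


def run_heals(replica_sets):
    """Run one dummy heal job per replica set on a bounded pool."""
    workers = healer_threads(len(replica_sets))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda rs: f"healed {rs}", replica_sets))
    return workers, results


if __name__ == "__main__":
    workers, results = run_heals([f"vol{i}" for i in range(17)])
    print(workers, len(results))
```

With 17 replica sets pending at once, the model caps the pool at 16 threads, so one job waits for a free thread; with a single pending job, one thread does all the work.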

-Krutika

> Thanks,
> Anirban

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users
