[Gluster-users] Heal-failed - what does it really tell us?

Fri Jul 24 00:51:12 UTC 2015

You had a split brain at one point.
RHEV adds an interesting dimension to this.
I have run into this before. It probably happened during an update to the gluster servers, or during a sequential restart of the gluster processes or servers.

So, first thing: there is a nasty daily cron job, created by a package included in the Red Hat base, that runs a yum update every day. This is one of the many reasons why my production kickstarts are always nobase installs.

The big reason this happens with RHEV is that a node is rebooted, or its gluster server processes are restarted, and then the same thing happens to another node in a 2 brick cluster too quickly. Essentially, while a self heal operation is still running, the second node, which is the master source, goes offline, and instead of fencing the volume the client fails over to the incomplete copy.
The result is actually a split brain, but the funny thing when you add RHEV into the mix is that everything keeps working. So unless you are using a tool like splunk or a properly configured logwatch cron job on your syslog server, you never know anything is wrong until you restart gluster on one of the servers.

So you did have a split brain; you just didn't know it.
The easiest way to prevent this is to use a replica 3 brick layout on your volumes and to keep tighter control over when reboots, process restarts, and updates happen.
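
To make the sequence above concrete, here is a rough toy model in Python. It is only a sketch under my own assumptions: the ReplicaSet class, the rolling_restart helper and the write ids are made up for illustration, and this is not how gluster actually tracks pending heals. It replays "restart one node, then lose the good copy before the heal finishes" against a 2 brick replica without quorum, and against a 3 brick replica that refuses writes unless a majority of bricks is up.

```python
class QuorumLost(Exception):
    """Raised when too few bricks are up to accept a write."""


class ReplicaSet:
    """Toy replica set: each brick just remembers which writes it received."""

    def __init__(self, names, enforce_quorum=False):
        self.writes = {name: set() for name in names}
        self.up = set(names)
        self.enforce_quorum = enforce_quorum

    def stop(self, name):
        self.up.discard(name)

    def start(self, name):
        # The brick comes back with whatever it had; no heal is modelled.
        self.up.add(name)

    def write(self, write_id):
        if not self.up:
            raise QuorumLost("no bricks are up")
        # Majority rule: strictly more than half of all bricks must be up.
        if self.enforce_quorum and 2 * len(self.up) <= len(self.writes):
            raise QuorumLost("fewer than a majority of bricks are up")
        for name in self.up:
            self.writes[name].add(write_id)

    def clean_sources(self):
        """Bricks that hold every write any brick has seen."""
        everything = set().union(*self.writes.values())
        return sorted(n for n, w in self.writes.items() if w == everything)


def rolling_restart(rs, first, second):
    """Restart `first`, then take `second` down before any heal has run."""
    rs.write("w1")      # all bricks up
    rs.stop(first)
    rs.write("w2")      # `first` misses w2
    rs.start(first)     # back up, but stale
    rs.stop(second)     # a brick holding w2 goes away
    rs.write("w3")      # lands on the stale brick
    rs.start(second)
    return rs.clean_sources()


two = rolling_restart(ReplicaSet(["A", "B"]), "B", "A")
three = rolling_restart(ReplicaSet(["A", "B", "C"], enforce_quorum=True), "B", "A")
print(two)    # [] -> neither brick has everything, so no clean heal source
print(three)  # ['C'] -> the third brick saw every write and can act as source
```

In the 2 brick run each brick ends up with a write the other one lacks, which is the split brain. In the 3 brick run the third brick saw every write for this particular sequence, and with only one of three bricks up the write would have been refused instead of landing on a lone stale copy. The model says nothing about longer sequences of failures.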
‎
We have a replica 2, where the second node was freshly added about a 
week ago and as far as I can tell is fully replicated. This is storage 
for a RHEV cluster and the total space currently in use is about 3.5TB.

When I run "gluster v heal gluster-rhev info heal-failed" it currently 
lists 866 files on the original and 1 file on the recently added node. 
What I find most interesting is that the single file listed on the 
second node is a lease file belonging to a VM template.

Some obvious questions come to mind: What is that output supposed to 
mean? Does it in fact even have a useful meaning at all? How can the 
files be in a heal-failed condition and not also be in a split-brain 
condition?

My interpretation of "heal-failed" is that the listed files are not yet 
fully in sync across nodes (and are therefore by definition in a 
split-brain condition) but that doesn't match the output of the command. 
However, that can't be the same as the gluster interpretation because 
how can a template file which has received no reads or writes possibly 
be in a heal-failed condition a week after the initial volume heal?
