[Gluster-users] Heal-failed - what does it really tell us?
prmarino1 at gmail.com
Fri Jul 24 00:51:12 UTC 2015
You had a split brain at one point.
RHEV adds an interesting dimension to this.
I have run into this before; it probably happened during an update to
the Gluster servers or during a sequential restart of the Gluster
processes or servers.
First thing: there is a nasty cron.daily job, created by a package
included in the Red Hat base group, that runs a yum update every day.
This is one of the many reasons why my production kickstarts are always
nobase installs.
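To check whether that auto-update job is live on your nodes, something
like the following should work (I'm assuming the stock yum-cron package
here; the package and service names vary between RHEL 6 and 7):

    # RHEL 6 style init scripts
    chkconfig --list yum-cron
    service yum-cron stop && chkconfig yum-cron off

    # RHEL 7 / systemd
    systemctl stop yum-cron && systemctl disable yum-cron

    # or just look for the cron job file directly
    ls /etc/cron.daily/ | grep -i yum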
The big reason this happens with RHEV is that a node is rebooted, or
the Gluster server processes are restarted, and another node in a
2-brick replica has the same thing happen too quickly. Essentially,
while a self-heal operation is in progress, the second node, which is
the source of the heal, goes offline, and instead of fencing the volume
the client fails over to the incomplete copy.
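If you are stuck on replica 2 for now, the quorum options at least make
the volume fence itself instead of serving the stale copy. An untested
sketch, with option names from the 3.x docs and the volume name taken
from your post below:

    # client-side quorum: disallow writes when too few bricks are up
    gluster volume set gluster-rhev cluster.quorum-type auto

    # server-side quorum: bricks are killed when peer quorum is lost
    gluster volume set gluster-rhev cluster.server-quorum-type server

Note that on a pure 2-node cluster this trades availability for
consistency, since losing either node can take the volume offline;
replica 3 is the real fix.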
The result is actually a split brain, but the funny thing when you add
RHEV into the mix is that everything keeps working, so unless you are
using a tool like Splunk or a properly configured logwatch cron job on
your syslog server, you never know anything is wrong until you restart
Gluster on one of the servers.
So you did have a split brain; you just didn't know it.
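You can usually confirm it after the fact with the heal info variants
and the self-heal daemon log, e.g.:

    gluster volume heal gluster-rhev info split-brain
    # compare against the plain heal queue
    gluster volume heal gluster-rhev info
    # the shd log also complains loudly when it hits one
    grep -i split-brain /var/log/glusterfs/glustershd.log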
The easiest way to prevent this is to use a replica 3 brick structure
on your volumes and to keep tighter control over when reboots, process
restarts, and updates happen.
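Going from replica 2 to replica 3 is a one-liner once the third server
is peered; the hostname and brick path below are made up, and you would
want to kick off a full heal afterwards:

    gluster peer probe server3
    gluster volume add-brick gluster-rhev replica 3 \
        server3:/bricks/gluster-rhev/brick
    gluster volume heal gluster-rhev full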
We have a replica 2, where the second node was freshly added about a
week ago and, as far as I can tell, is fully replicated. This is storage
for a RHEV cluster and the total space currently in use is about 3.5TB.
When I run "gluster v heal gluster-rhev info heal-failed" it currently
lists 866 files on the original and 1 file on the recently added node.
What I find most interesting is that the single file listed on the
second node is a lease file belonging to a VM template.
Some obvious questions come to mind: What is that output supposed to
mean? Does it in fact even have a useful meaning at all? How can the
files be in a heal-failed condition and not also be in a split-brain
condition?
My interpretation of "heal-failed" is that the listed files are not yet
fully in sync across the nodes (and are therefore, by definition, in a
split-brain condition), but that doesn't match the output of the
command. That can't be Gluster's interpretation either, because how
could a template file which has received no reads or writes possibly be
in a heal-failed condition a week after the initial volume heal?
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users