[Gluster-users] A question about healing

Thu Jul 11 20:44:07 UTC 2013

Hi gurus,

So I have a cluster that I've set up and I'm banging on.  It consists
of four machines with two drives in each machine.  (By the way, the
3.2.5 version that comes with stock Ubuntu 12.04 seems to have a lot of
bugs/instability.  I was screwing it up daily just by putting it through
some heavy-use tests.  Then I downloaded 3.3.1 from the PPA, and so far
things seem a LOT more stable.  I haven't managed to break anything yet,
although the night is still young.)

I'm dumping data to it like mad, and I decide to simulate a filesystem
error by remounting half of the cluster's drives in read-only mode with
"mount -o remount,ro".

The cluster seems to slow just slightly, but it keeps on ticking.  Great.

Then I remount the drives read-write again.  Everything is still
running, awesome.  I look at the log files and I see lots of stuff like
this:

> [2013-07-11 16:20:13.387798] E [posix.c:1853:posix_open]
> 0-bkupc1-posix: open on
> /export/a/glusterfs/BACKUPS/2013-07-10/cat/export/c/data/dir2/OLD/.dir.old.tar.gz.hdbx3z:
> File exists
> [2013-07-11 16:20:13.387819] I [server3_1-fops.c:1538:server_open_cbk]
> 0-bkupc1-server: 24283714: OPEN
> /BACKUPS/2013-07-10/cat/export/c/data/dir2/OLD/.dir.old.tar.gz.hdbx3z
> (0c079382-e88b-432c-83e3-79bd9f5b8bb9) ==> -1 (File exists)

I believe that this particular file was still open for writing when I
took away write access to half of the cluster.  Once the rsync process
I'm running finishes with that file the errors cease.
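
In case it helps anyone looking at the same thing, here is a rough
sketch of how these errors could be tallied.  It assumes each log entry
sits on a single line in the real log (the excerpt above is wrapped for
mail), and the function name count_open_eexist is just something I made
up for illustration.

```shell
# Count posix_open errors that ended in "File exists" in one brick log.
# Assumption: one log entry per line, in the format quoted above.
count_open_eexist() {
    # $1: path to a brick log file
    grep -E '^\[[0-9-]+ [0-9:.]+\] E \[posix\.c:[0-9]+:posix_open\]' "$1" \
        | grep -c 'File exists'
}
```

It would be called with the path to a brick log, e.g.
"count_open_eexist /path/to/brick.log".  The exact log location depends
on the install, so that path is a placeholder.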

Next I started thumbing through the manual looking at the healing
process, and started a heal with "gluster volume heal bkupc1 start".
The manual lists "volume heal <VOLUME> info heal-failed" as a command
that I can check out, so I ran that and got a list of files for which
healing has failed.  There are quite a few of them.
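
For comparison, a small sketch that dumps the heal lists side by side.
As far as I can tell from the 3.3 manual, "info" also accepts "healed"
and "split-brain" next to "heal-failed".  The heal_report name and the
GLUSTER override variable are my own inventions, there only so the
function can be dry-run without a cluster.

```shell
# Print each heal list for a volume, one section per list.
# GLUSTER defaults to the real CLI; set GLUSTER=echo for a dry run.
heal_report() {
    vol=$1
    for kind in healed heal-failed split-brain; do
        echo "== $kind =="
        "${GLUSTER:-gluster}" volume heal "$vol" info "$kind"
    done
}
```

On my setup that would be "heal_report bkupc1", run as root on one of
the servers.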

But what exactly does this mean?  The manual doesn't say what steps, if
any, I should take to fix a failed heal.  Is there anything that I
should do?  Or will this eventually sort itself out?

Thanks for your time,

Michael Peek