[Gluster-users] bad replication event....

Matthew Temple mht at mail.dfci.harvard.edu
Tue May 28 22:27:32 UTC 2013


After following the instructions to convert a distributed volume to a
distributed-replicated volume on v3.3, we found we were getting a very
large number of "possible split-brain" errors all over the place, in many of
our Gluster directories.  Somewhere in this process, some files were
actually lost.  The error logs are long, and finding exactly what happened
is difficult, but this was a 70TB cluster of 3 bricks (or 6, if you count
the replicas).  I am beginning to feel that rsync within an
InfiniBand-based network is the safest way to replicate (I'm not so worried
about bricks going offline), and I had been quite happy with Gluster's
performance before I decided to be "responsible" and replicate...
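For background, the conversion itself was essentially the documented
add-brick step that raises the replica count, followed by a full heal --
roughly the following, where the server names and brick paths are
placeholders rather than our real layout:

    gluster volume add-brick gf1 replica 2 \
        server4:/export/brick1 server5:/export/brick2 server6:/export/brick3
    # then kick off a full self-heal to populate the new replicas
    gluster volume heal gf1 full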

I have changed the Gluster mount to NFS-based instead of native-client-based.
 I don't know exactly what to do to /force/ Gluster to leave itself alone
-- mount all of the underlying bricks read-only on their home systems?
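Concretely, the clients are now mounted via Gluster's built-in NFSv3
server, and the remount is what I have in mind for the bricks -- the
hostnames and paths below are just examples:

    # client side: NFS mount of the volume (Gluster NFS is v3 over TCP)
    mount -t nfs -o vers=3,mountproto=tcp server1:/gf1 /mnt/gf1
    # on each server: remount the filesystem backing the brick read-only
    mount -o remount,ro /export/brick1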

All of that being said, can anyone answer this question:
*Is there any orderly way to tell a Gluster volume to stop replicating and
revert to distributed-only?*  (My guess at the command is near the end of
this mail.)  Then I can let rsync do its job while still allowing read-only
access to the distributed cluster.  In the meantime, I am seeing self-heal
errors on the client machine that look like this:

[2013-05-23 08:08:57.240567] E
[afr-self-heal-metadata.c:472:afr_sh_metadata_fix] 0-gf1-replicate-1:
Unable to self-heal permissions/ownership of '/cfce1/data' (possible
split-brain). Please fix the file on all backend volumes
[2013-05-23 08:08:57.240959] E
[afr-self-heal-common.c:2160:afr_self_heal_completion_cbk]
0-gf1-replicate-1: background  meta-data entry self-heal failed on
/cfce1/data
[2013-05-23 08:09:16.900705] E
[afr-self-heal-metadata.c:472:afr_sh_metadata_fix] 0-gf1-replicate-1:
Unable to self-heal permissions/ownership of '/cfce1/data' (possible
split-brain). Please fix the file on all backend volumes
[2013-05-23 08:09:16.901100] E
[afr-self-heal-common.c:2160:afr_self_heal_completion_cbk]
0-gf1-replicate-1: background  meta-data entry self-heal failed on
/cfce1/data
[2013-05-23 08:09:20.654338] E
[afr-self-heal-metadata.c:472:afr_sh_metadata_fix] 0-gf1-replicate-1:
Unable to self-heal permissions/ownership of '/cfce1/data' (possible
split-brain). Please fix the file on all backend volumes
[2013-05-23 08:09:20.654695] E
[afr-self-heal-common.c:2160:afr_self_heal_completion_cbk]
0-gf1-replicate-1: background  meta-data entry self-heal failed on
/cfce1/data
[2013-05-23 08:09:25.829579] E
[afr-self-heal-metadata.c:472:afr_sh_metadata_fix] 0-gf1-replicate-1:
Unable to self-heal permissions/ownership of '/cfce1/data' (possible
split-brain). Please fix the file on all backend volumes
[2013-05-23 08:09:25.829973] E
[afr-self-heal-common.c:2160:afr_self_heal_completion_cbk]
0-gf1-replicate-1: background  meta-data entry self-heal failed on
/cfce1/data
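The only checking I know how to do is to look at the directory directly on
the backing bricks and compare ownership, mode, and the AFR changelog
xattrs between the two copies -- something like this on each server, with
an example brick path:

    stat /export/brick1/cfce1/data
    # dump all xattrs (trusted.afr.*, etc.) in hex for comparison
    getfattr -d -m . -e hex /export/brick1/cfce1/data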

"data" is a great huge directory, but at least it's replaceable.   After
actually losing files, I'm nervous.
This happened just after a discussion with one of the frequent contributors
here, who told me, "I don't do replication with Gluster because
replication is REALLY difficult."
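To make the first question concrete, here is my guess at what the revert
would look like.  I haven't run this, and I don't know whether remove-brick
on 3.3 really accepts a replica count this way -- brick names are
placeholders again:

    # drop the replica count back to 1 by removing the second set of bricks
    gluster volume remove-brick gf1 replica 1 \
        server4:/export/brick1 server5:/export/brick2 server6:/export/brick3 force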

Can anyone help?

Matt Temple
------
Matt Temple
Director, Research Computing
Dana-Farber Cancer Institute.