Hello Bill,

I will let the more technical engineers working on replication recovery respond. However, I am interested in your feedback on the split-brain scenario, as I have been thinking about these issues.

It would seem to me that it could be very easy to minimize split-brain issues within the Gluster platform with some reasonable monitoring and controls:

1. Two nodes are replicating.
2. Heartbeats become undetectable, for whatever reason.
3. A timer starts on the surviving nodes. If the heartbeat does not return within XXX seconds, the surviving nodes send a message to a storage operations console and text a storage operator that a potential split brain is occurring.
4. The operator determines whether both nodes are surviving and serving data.
5. The operator decides to take one of the replicas down, eliminating the split brain until the heartbeat issue has been repaired.
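
To make steps 2 and 3 concrete, here is a rough sketch of the timer-and-alert logic in Python. It is only an illustration under stated assumptions, not anything Gluster provides: the class name HeartbeatWatch, the notify callback (standing in for the operations console and text message), and the injectable clock are all hypothetical. Steps 4 and 5 stay with the human operator.

```python
import time
from typing import Callable


class HeartbeatWatch:
    """Hypothetical helper: remembers when the peer was last heard from
    and alerts the operator once per outage."""

    def __init__(self, timeout_s: float, notify: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s      # the "XXX seconds" from step 3
        self.notify = notify            # assumed hook: console message / SMS
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen = clock()
        self.alerted = False

    def heartbeat_received(self) -> None:
        # The peer is reachable again: restart the timer and re-arm the alert.
        self.last_seen = self.clock()
        self.alerted = False

    def check(self) -> bool:
        """Call periodically. Returns True only when an alert was just sent."""
        silent_for = self.clock() - self.last_seen
        if silent_for >= self.timeout_s and not self.alerted:
            self.notify(f"Potential split brain: no heartbeat for {silent_for:.0f}s")
            self.alerted = True
            return True
        return False
```

A surviving node would call heartbeat_received() whenever a heartbeat arrives and check() on a short interval. Alerting once per outage, rather than on every check, keeps the operator from being flooded while deciding which replica to take down.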

Although this requires direct operator intervention and processes, it works in inter/intra-DC replication scenarios in which a DC has failed and an operator must declare a disaster.

Do you think from an operational perspective that this process would be an advantage?
I've googled this but didn't really see it addressed. Most posts discuss 
cleaning up after a split brain, etc.

In a simple replication setup, is there any problem with copying files 
from one of the ACTIVE bricks directly rather than going through the 
client mount?
Does it affect Gluster-specific locking/healing/writing if the file 
involved is a large log file that may have data sent to it during the copy?

I understand that files should only be modified via the client mount, 
but in a pure read situation such as a backup, keeping that traffic off 
the mount network would be an advantage.
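
For concreteness, the kind of read-only copy in question might look like the Python sketch below. The function name backup_from_brick and both path arguments are hypothetical. The only Gluster-specific assumption is that each brick holds a .glusterfs directory of internal metadata, which is skipped here. The sketch does not settle whether this is safe alongside locking or healing, and a log file that is still being appended to may be captured mid-write.

```python
import shutil
from pathlib import Path


def backup_from_brick(brick_dir: str, dest_dir: str) -> Path:
    """Copy the files under a brick directory to dest_dir.

    The brick is only read from. dest_dir must not exist yet.
    """
    # .glusterfs is Gluster's internal bookkeeping, so it is left out.
    shutil.copytree(brick_dir, dest_dir,
                    ignore=shutil.ignore_patterns(".glusterfs"))
    return Path(dest_dir)
```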


