[Gluster-users] One node goes offline, the other node loses its connection to its local Gluster volume

Thu Mar 6 22:18:13 UTC 2014

On 02/22/2014 05:44 PM, Greg Scott wrote:
>
> I have 2 nodes named fw1 and fw2.  When I ifdown the NIC I'm using for 
> Gluster on either node, that node cannot see  its Gluster volume, but 
> the other node can see it after a timeout.  As soon as I ifup that 
> NIC, everyone can see everything again.
>
> Is this expected behavior?  When that interconnect drops, I want both 
> nodes to see their own local copy and then sync everything back up 
> when the interconnect connects again.
>
If a client loses communication on an open TCP connection to a server,
there is a timeout period (42 seconds by default) during which the
client waits for communication to resume. It waits because dropping and
re-establishing hundreds to potentially tens of thousands of file
descriptors and locks is a very expensive process that disrupts the
entire environment.

With the test you're describing, the clients connect to both servers'
IP addresses (hopefully resolved from hostnames) on the same network.
When you down a NIC, its address is no longer available. Not only can
the remote client no longer connect to it, but your local client cannot
either, because the address no longer exists.

In your real-life scenario, losing the interconnect would not affect the
existence of either machine's IP address, so after the ping-timeout,
operations would resume with each node working against its own local
copy (a split-brain configuration). As long as no changes were made to
the same file on both servers, the self-heal will do exactly what you
expect when the connection is reestablished.

However, what you're counting on is the most common cause of
split-brain: each client, connected to only one server, independently
modifies the same file. When the connection is reestablished, the
self-heal runs and that file is marked as split-brain. It is then
inaccessible from the client mount until an admin resolves it.
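
To make the two outcomes concrete, here is a toy Python model of what
happens at reconnect. It is only a sketch of the idea, not Gluster's
actual self-heal code: the Replica class, the heal_plan function and the
file paths are all made up for illustration, and the real bookkeeping is
more involved than a per-file counter.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one server's copy of the data."""
    name: str
    # file path -> number of writes made while the peer was unreachable
    pending: dict = field(default_factory=dict)

    def write_while_partitioned(self, path):
        self.pending[path] = self.pending.get(path, 0) + 1


def heal_plan(a, b):
    """Decide what to do with each changed file once the replicas reconnect."""
    plan = {}
    for path in sorted(set(a.pending) | set(b.pending)):
        changed_a = a.pending.get(path, 0) > 0
        changed_b = b.pending.get(path, 0) > 0
        if changed_a and changed_b:
            # Both sides changed the file independently: no safe winner.
            plan[path] = "split-brain"
        elif changed_a:
            plan[path] = f"heal {b.name} from {a.name}"
        else:
            plan[path] = f"heal {a.name} from {b.name}"
    return plan


fw1 = Replica("fw1")
fw2 = Replica("fw2")

# Illustrative paths only. Each node changes a file of its own...
fw1.write_while_partitioned("/etc/rules.conf")
fw2.write_while_partitioned("/var/log/fw2.log")
# ...and both nodes change the same file during the outage.
fw1.write_while_partitioned("/shared/state.db")
fw2.write_while_partitioned("/shared/state.db")

for path, action in heal_plan(fw1, fw2).items():
    print(path, "->", action)
```

Files touched on only one side have an obvious source for the heal. The
file touched on both sides does not, which is why it needs an admin.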

You can avoid split-brain using a couple of quorum techniques. The one
that would seem to satisfy your requirements leaves your volume
read-only for the duration of the outage.
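
The read-only behaviour can be pictured with a very small Python
sketch. This is an assumption-laden illustration of a fixed quorum
check, not Gluster's implementation: writes_allowed and QUORUM_COUNT are
hypothetical names, and the count of 2 is assumed for a two-node
replica. Check the Gluster documentation for the actual volume options.

```python
def writes_allowed(reachable_bricks, quorum_count):
    """Allow writes only while enough replicas are reachable."""
    return reachable_bricks >= quorum_count


REPLICAS = 2
QUORUM_COUNT = 2  # assumption: both replicas must be reachable to write

for reachable in (2, 1):
    mode = "read-write" if writes_allowed(reachable, QUORUM_COUNT) else "read-only"
    print(f"{reachable}/{REPLICAS} bricks reachable: {mode}")
```

With both nodes up, the volume is writable. With the interconnect down,
each node reaches only its own copy, so neither side accepts writes and
no file can diverge.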