[Gluster-users] Issue detecting dead peer

Kemp, Joseph A. (JKEMP) JKEMP at arinc.com
Wed Feb 5 18:28:18 UTC 2014

I am running some tests using two KVM hosts, each with a CentOS 6.5 instance running Gluster 3.4.2.  Each Gluster instance acts as both server and client, mounting the Gluster volume it also serves.  During my tests there is no file access occurring on the Gluster volume.

I am seeing an issue when I forcibly disconnect node1 from the network.  Node2 can take several minutes to detect that node1 is disconnected.  During this time, running "gluster peer status" on node2 shows node1 as connected.  The first run of "gluster volume status" takes two minutes to time out and then returns with no output.  Subsequent runs of "gluster volume status" return quickly with "Another transaction is in progress. Please try again after sometime."  Eventually "gluster peer status" shows node1 as disconnected.  At that point "gluster volume status" starts to return quickly.

This behavior is only seen when I do a "service network stop" on node1 to simulate a node failure.  If I do a "service glusterd stop" on node1 to cleanly shut down Gluster, node2 sees node1 as disconnected immediately, and the volume status commands return immediately.
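
To put a number on the delay, a script along these lines could be run on node2 right after "service network stop" is issued on node1.  It is only a sketch: it assumes "gluster peer status" prints a "Hostname:" line followed by a "State: ... (Connected)" or "State: ... (Disconnected)" line for each peer, and the function names, polling interval and timeouts are illustrative choices rather than anything Gluster provides.

```python
import re
import subprocess
import time


def parse_peer_states(output):
    """Map each peer hostname to True if its state ends in "(Connected)".

    Assumes the "Hostname:" / "State: ... (Connected)" layout described above.
    """
    states = {}
    hostname = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Hostname:"):
            hostname = line.split(":", 1)[1].strip()
        elif line.startswith("State:") and hostname is not None:
            # The connection state is the last parenthesised word on the line.
            match = re.search(r"\(([^)]*)\)\s*$", line)
            states[hostname] = bool(match) and match.group(1) == "Connected"
            hostname = None
    return states


def seconds_until_disconnected(peer, interval=1.0, timeout=600.0):
    """Poll "gluster peer status" until `peer` is no longer shown as connected.

    Returns the elapsed seconds, or None if `timeout` expires first.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            result = subprocess.run(
                ["gluster", "peer", "status"],
                capture_output=True, text=True, timeout=30,
            )
            if parse_peer_states(result.stdout).get(peer) is False:
                return time.monotonic() - start
        except subprocess.TimeoutExpired:
            pass  # the CLI itself hung; keep polling
        time.sleep(interval)
    return None
```

Calling seconds_until_disconnected("node1") on node2 would then report roughly how long the peer stayed listed as connected, which makes it easier to compare the "service network stop" and "service glusterd stop" cases.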

What is the mechanism by which a node detects that a peer has failed?  The delay I am seeing is worrisome for a production environment.


System Administration
ARINC Direct
