[Gluster-devel] two-node HA cluster failover test - failed again :(

Wed Apr 9 11:10:20 UTC 2008

Hello all,

After upgrading to 1.3.8pre5, I performed a simple failover test of my
two-node HA Gluster cluster (wherein one of the nodes is unplugged from
the network).  Unfortunately, the results were - once again - absolutely
disastrous.

After unplugging one of the two nodes, the cluster became incredibly
unstable, and the mountpoint on the client bounced between
non-existent and simply bizarre.  This condition remained even after
plugging the node back into the network.  Restarting glusterfsd on both
storage nodes did not help at all.

At this point I would be very interested to know if anybody has set up
a functioning two-node HA cluster using AFR, which can withstand one of
the nodes temporarily disappearing.  Is this something Gluster is
designed to do, or am I expecting too much ?

For those following along, a discussion of the first failover test is
available from the gluster-devel archives :
http://lists.gnu.org/archive/html/gluster-devel/2008-04/msg00010.html

The environment is identical to that described in the email linked
above, so I won't describe it again here.  This time, however, I had
full DEBUG logging enabled.  I have made these logs (all 3000+
lines) available on pastebin :
dfsC (node that stayed up) : http://pastebin.ca/978162
dfsD (node that was unplugged) : http://pastebin.ca/978166

I've also provided a cut/paste of my user session from the client's
perspective (dfsA).  Note that the only thing I did was try to "ls" the
mountpoint.  I also ran "date" a handful of times to provide a point of
reference :
http://pastebin.ca/978149

-- 
Daniel Maher <dma AT witbe.net>