[Gluster-devel] two-node HA cluster failover test - failed again :(
Vikas Gorur
vikas at zresearch.com
Wed Apr 9 11:49:49 UTC 2008
Excerpts from Daniel Maher's message of Wed Apr 09 16:40:20 +0530 2008:
>
> Hello all,
>
> After upgrading to 1.3.8pre5, I performed a simple failover test of my
> two-node HA Gluster cluster (wherein one of the nodes is unplugged from
> the network). Unfortunately, the results were - once again - absolutely
> disastrous.
>
> After unplugging one of the two nodes, the cluster became incredibly
> unstable, and the mountpoint on the client bounced between
> non-existent and simply bizarre states. This condition persisted even
> after plugging the node back into the network. Restarting glusterfsd on
> both storage nodes did not help at all.
>
> At this point I would be very interested to know if anybody has set up
> a functioning two-node HA cluster using AFR which can withstand one of
> the nodes temporarily disappearing. Is this something Gluster is
> designed to do, or am I expecting too much?
This is definitely something GlusterFS is designed to handle. I've set up
this configuration in our lab and am looking into it.
> For those following along, a discussion of the first failover test is
> available from the gluster-devel archives :
> http://lists.gnu.org/archive/html/gluster-devel/2008-04/msg00010.html
>
> The environment is identical to that described in the email linked
> above, so I won't describe it again here. This time, however, I had
> full DEBUG logging enabled. I have made these logs (all 3000+
> lines) available on pastebin :
> dfsC (node that stayed up) : http://pastebin.ca/978162
> dfsD (node that was unplugged) : http://pastebin.ca/978166
Is the order of subvolumes for AFR the same in both of your server
specfiles? Specifically, on dfsC you should have

  subvolumes gfs-dfsD-ds gfs-ds

and on dfsD you should have

  subvolumes gfs-ds gfs-dfsC-ds

That way, the first subvolume listed is dfsD's brick and the second is
dfsC's brick on both servers. Is this the case? If not, failover will
not work.
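For illustration, a minimal sketch of what dfsC's server specfile might
look like with that ordering (volume names, paths, and hostnames here
are assumptions based on your earlier mail, not your actual config):

```
# --- dfsC server specfile (sketch, names assumed) ---

volume gfs-ds
  type storage/posix            # local brick on dfsC
  option directory /data/gfs    # assumed export path
end-volume

volume gfs-dfsD-ds
  type protocol/client          # dfsD's brick, reached over the network
  option transport-type tcp/client
  option remote-host dfsD       # assumed hostname
  option remote-subvolume gfs-ds
end-volume

volume gfs-afr
  type cluster/afr
  # dfsD's brick first, dfsC's second -- the same logical order
  # that dfsD's own specfile must use (gfs-ds gfs-dfsC-ds there)
  subvolumes gfs-dfsD-ds gfs-ds
end-volume
```

The point is that AFR identifies its children by position, so both
servers must list the same underlying bricks in the same order for
self-heal and failover to behave correctly.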
Vikas
--
http://vikas.80x25.org/