[Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

Wed Jul 10 10:17:07 UTC 2013

On 07/09/2013 06:47 AM, Greg Scott wrote:
> I don't get this.  I have a replicated volume and 2 nodes. My 
> challenge is, when I take one node offline, the other node can no 
> longer access the volume until both nodes are back online again.
> Details:
> I have 2 nodes, fw1 and fw2.   Each node has an XFS file system, 
> /gluster-fw1 on node fw1 and /gluster-fw2 on node fw2.   Node fw1 is at 
> IP Address 192.168.253.1.  Node fw2 is at 192.168.253.2.
> I create a gluster volume named firewall-scripts which is a replica of 
> those two XFS file systems.  The volume holds a bunch of config files 
> common to both fw1 and fw2.  The application is an active/standby pair 
> of firewalls and the idea is to keep config files in a gluster volume.
> When both nodes are online, everything works as expected. But when I 
> take either node offline, node fw2 behaves badly:
> [root@chicago-fw2 ~]# ls /firewall-scripts
> ls: cannot access /firewall-scripts: Transport endpoint is not connected
> And when I bring the offline node back online, node fw2 eventually 
> behaves normally again.
> What's up with that?  Gluster is supposed to be resilient and 
> self-healing and able to stand up to this sort of abuse. So I must be 
> doing something wrong.
> Here is how I set up everything -- it doesn't get much simpler than 
> this and my setup is right out of the Getting Started Guide but using my 
> own names.
> Here are the steps I followed, all from fw1:
> gluster peer probe 192.168.253.2
> gluster peer status
> Create and start the volume:
> gluster volume create firewall-scripts replica 2 transport tcp 
> 192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2
> gluster volume start firewall-scripts
> On fw1:
> mkdir /firewall-scripts
> mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts
> and add this line to /etc/fstab:
> 192.168.253.1:/firewall-scripts /firewall-scripts glusterfs 
> defaults,_netdev 0 0
> on fw2:
> mkdir /firewall-scripts
> mount -t glusterfs 192.168.253.2:/firewall-scripts /firewall-scripts
> and add this line to /etc/fstab:
> 192.168.253.2:/firewall-scripts /firewall-scripts glusterfs 
> defaults,_netdev 0 0
> That's it.  That's the whole setup.  When both nodes are online, 
> everything replicates beautifully.  But take one node offline and it 
> all falls apart.
> Here is the output from gluster volume info, identical on both nodes:
> [root@chicago-fw1 etc]# gluster volume info
> Volume Name: firewall-scripts
> Type: Replicate
> Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: 192.168.253.1:/gluster-fw1
> Brick2: 192.168.253.2:/gluster-fw2
> [root@chicago-fw1 etc]#
> Looking at /var/log/glusterfs/firewall-scripts.log on fw2, I see 
> errors like this every couple of seconds:
> [2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 
> 0-firewall-scripts-replicate-0: no subvolumes up
> [2013-07-09 00:59:04.706515] W [fuse-bridge.c:1132:fuse_err_cbk] 
> 0-glusterfs-fuse: 3160: FLUSH() ERR => -1 (Transport endpoint is not 
> connected)
> And then when I bring fw1 back online, I see these messages on fw2:
> [2013-07-09 01:01:35.006782] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 
> 0-firewall-scripts-client-0: changing port to 49152 (from 0)
> [2013-07-09 01:01:35.006932] W [socket.c:514:__socket_rwv] 
> 0-firewall-scripts-client-0: readv failed (No data available)
> [2013-07-09 01:01:35.018546] I 
> [client-handshake.c:1658:select_server_supported_programs] 
> 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num 
> (1298437), Version (330)
> [2013-07-09 01:01:35.019273] I 
> [client-handshake.c:1456:client_setvolume_cbk] 
> 0-firewall-scripts-client-0: Connected to 192.168.253.1:49152, 
> attached to remote volume '/gluster-fw1'.
> [2013-07-09 01:01:35.019356] I 
> [client-handshake.c:1468:client_setvolume_cbk] 
> 0-firewall-scripts-client-0: Server and Client lk-version numbers are 
> not same, reopening the fds
> [2013-07-09 01:01:35.019441] I 
> [client-handshake.c:1308:client_post_handshake] 
> 0-firewall-scripts-client-0: 1 fds open - Delaying child_up until they 
> are re-opened
> [2013-07-09 01:01:35.020070] I 
> [client-handshake.c:930:client_child_up_reopen_done] 
> 0-firewall-scripts-client-0: last fd open'd/lock-self-heal'd - 
> notifying CHILD-UP
> [2013-07-09 01:01:35.020282] I [afr-common.c:3698:afr_notify] 
> 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' 
> came back up; going online.
> [2013-07-09 01:01:35.020616] I 
> [client-handshake.c:450:client_set_lk_version_cbk] 
> 0-firewall-scripts-client-0: Server lk version = 1
> So how do I make glusterfs survive a node failure, which is the whole 
> point of all this?
>
It looks like the brick process on fw2 is not running, so when fw1 is 
down no brick is available and the whole replicated volume stalls. Can 
you run ps on fw2 to get the status of all the gluster processes and 
confirm that the brick process is up there?
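
Something along these lines on fw2 would show it. This is only a rough 
sketch: has_brick_process is a made-up helper name, not part of gluster, 
and it assumes the brick daemon shows up in ps as glusterfsd with the 
brick path (/gluster-fw2 from your volume info) on its command line.

```shell
# Hypothetical helper: reads a process listing on stdin and succeeds if a
# glusterfsd line mentions the given brick path.
# The [g] bracket keeps grep's own command line from matching itself.
has_brick_process() {
    grep '[g]lusterfsd' | grep -q -- "$1"
}

# List every gluster-related process (glusterd, glusterfsd, glusterfs).
ps -eo pid,args | grep '[g]luster' || echo "no gluster processes found"

# Check specifically for the brick daemon serving /gluster-fw2.
if ps -eo pid,args | has_brick_process /gluster-fw2; then
    echo "brick process for /gluster-fw2 is running"
else
    echo "no brick process for /gluster-fw2"
fi

# With glusterd running, the CLI reports per-brick state as well:
# gluster volume status firewall-scripts
```

If the brick line is missing while glusterd itself is up, that would 
match the "no subvolumes up" messages in your client log.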

Regards
Raghav