[Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

Greg Scott GregScott at infrasupport.com
Wed Jul 10 11:10:26 UTC 2013


Brian, I'm not ready to give up just yet.  

From Rejy:

>  Would not the mount option 'backupvolfile-server=<secondary server>' help
> at mount time, in the case of the primary server not being available?

Hmmm - this seems to be a step in the right direction.  On both nodes I did:

umount /firewall-scripts

Then on fw1:

[root@chicago-fw1 gregs]# mount -t glusterfs -o backupvolfile-server=192.168.253.2 192.168.253.1:/firewall-scripts /firewall-scripts

And on fw2:

[root@chicago-fw2 ~]# mount -t glusterfs -o backupvolfile-server=192.168.253.1 192.168.253.2:/firewall-scripts /firewall-scripts

In the test I just ran, each node still uses its local copy first.  For my application, I'm not too concerned about conflicts between the two copies of the directory, because /firewall-scripts will be read-mostly in production.  As part of my startup, the node with the lowest IP address takes itself offline for a few seconds so the other node detects that it's down and can assume the primary role.  That's what put me on to this Gluster behavior in the first place: fw2 could not find its script to take control, even though a copy of it was sitting right there on its local disk.

Anyway, this time, with the file system mounted as above, I took fw1 offline and ran "ls /firewall-scripts" from fw2.  This time fw2 waited several seconds and then showed me the directory listing instead of blowing up with an error.  That seems strange to me, since I told fw2 that fw1 is its backupvolfile-server and fw1 is the node that went offline.  So the behavior is definitely not intuitive.

One other detail that may be relevant: I take fw1 offline by inserting a firewall rule that does a REJECT on that interface.  That probably explains the "Connection refused" message in the log extract below.  I can try a different test, changing the rule to DROP so fw1 really is unreachable, and see what happens.
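
To make the REJECT versus DROP comparison concrete, here is a rough sketch of the two rules.  The chain (INPUT), the interface name eth1, and the helper name offline_rule are all assumptions for illustration, not my actual rule.  The function only prints the command so it can be reviewed before anyone runs it as root.

```shell
# Hypothetical interface name; substitute the interface that carries the
# 192.168.253.x replication traffic.
IFACE=eth1

# Print (do not run) the iptables rule for the chosen target.
# REJECT answers each packet with an ICMP error, so the peer fails fast
# (this is consistent with a "Connection refused" on the other node).
# DROP discards packets silently, so the peer only notices via its own timeouts.
offline_rule() {
  case "$1" in
    REJECT|DROP)
      printf 'iptables -I INPUT -i %s -j %s\n' "$IFACE" "$1"
      ;;
    *)
      echo "usage: offline_rule REJECT|DROP" >&2
      return 1
      ;;
  esac
}
```

Calling offline_rule DROP prints the DROP variant of the rule; replacing -I with -D in the printed command should remove the rule again once the test is done.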

The log on fw2 looks a little different this time.  This tail was taken after doing an ls from fw2.  Pranith, is this the log you mean?  If so, I can run the tests again and keep a tail -f going in a different window when the other node goes offline, so we catch the messages right at that event.  Would that be helpful?  I can send tarballs of the whole log file, but it's huge, and finding the key messages is like looking for a needle in a haystack.

[root@chicago-fw2 ~]# tail /var/log/glusterfs/firewall-scripts.log -f
[2013-07-10 10:37:59.446481] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (Connection reset by peer)
[2013-07-10 10:37:59.446558] W [socket.c:1962:__socket_proto_state_machine] 0-firewall-scripts-client-0: reading from socket failed. Error (Connection reset by peer), peer (192.168.253.1:49152)
[2013-07-10 10:37:59.447322] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x48) [0x7f8974409b78] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb8) [0x7f8974408028] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f8974407f4e]))) 0-firewall-scripts-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2013-07-10 10:37:33.563280 (xid=0x24x)
[2013-07-10 10:37:59.447378] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-firewall-scripts-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-07-10 10:37:59.447716] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x48) [0x7f8974409b78] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb8) [0x7f8974408028] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f8974407f4e]))) 0-firewall-scripts-client-0: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2013-07-10 10:37:35.949434 (xid=0x25x)
[2013-07-10 10:37:59.447754] W [client-handshake.c:276:client_ping_cbk] 0-firewall-scripts-client-0: timer must have expired
[2013-07-10 10:37:59.447821] I [client.c:2097:client_rpc_notify] 0-firewall-scripts-client-0: disconnected
[2013-07-10 10:38:09.963388] E [socket.c:2157:socket_connect_finish] 0-firewall-scripts-client-0: connection to 192.168.253.1:24007 failed (Connection refused)
[2013-07-10 10:38:09.963493] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:38:19.988428] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:38:53.044399] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:38:54.999683] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:38:58.010774] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:04.028362] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:07.033038] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:10.044094] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:16.060406] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:19.066521] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:22.077600] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:25.088684] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:28.099805] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:31.110840] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:34.121921] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:37.133003] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:40.144084] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:43.155168] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:46.166228] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:49.177270] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:52.188359] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:55.199451] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
^C
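
Since most of that tail is the same "readv failed" warning repeated every few seconds, one possible way to cut the haystack down is to print each distinct message only once.  This is just a sketch: dedupe_gluster_log is a made-up helper name, not a Gluster tool, and it assumes every log line starts with a "[timestamp] " prefix as in the extract above.

```shell
# Collapse repeated log messages: print a line only the first time its text
# (everything after the leading "[timestamp] ") is seen.
# dedupe_gluster_log is a hypothetical helper name, not part of Gluster.
dedupe_gluster_log() {
  awk '{
    i = index($0, "] ")                      # end of the leading [timestamp]
    key = (i > 0) ? substr($0, i + 2) : $0   # message text without timestamp
    if (!(key in seen)) { seen[key] = 1; print }
  }'
}
```

It reads from standard input, for example: dedupe_gluster_log < /var/log/glusterfs/firewall-scripts.log.  Lines that differ in anything besides the timestamp (such as the xid in the "forced unwinding frame" errors) are still printed separately.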

And the log from fw1 looks like this:

[root@chicago-fw1 gregs]# tail /var/log/glusterfs/firewall-scripts.log -f
[2013-07-10 10:36:19.708342] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-1: Connected to 192.168.253.2:49152, attached to remote volume '/gluster-fw2'.
[2013-07-10 10:36:19.708372] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-10 10:36:19.720679] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
[2013-07-10 10:36:19.721049] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-1: Server lk version = 1
[2013-07-10 10:36:19.721291] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
[2013-07-10 10:36:19.722390] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-firewall-scripts-replicate-0: added root inode
[2013-07-10 10:36:19.723259] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-0
[2013-07-10 10:37:47.242308] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-1: readv failed (Connection timed out)
[2013-07-10 10:37:47.242385] W [socket.c:1962:__socket_proto_state_machine] 0-firewall-scripts-client-1: reading from socket failed. Error (Connection timed out), peer (192.168.253.2:49152)
[2013-07-10 10:37:47.242462] I [client.c:2097:client_rpc_notify] 0-firewall-scripts-client-1: disconnected
^C


