[Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

Joe Julian joe at julianfamily.org
Thu Jul 11 19:47:08 UTC 2013


Ok, now I'm intrigued.

btw... when I read your initial email I was on my phone. I only got as 
far as the SELinux error before my ADHD got the better of me and I 
thought, "well, it says what the problem is right there." Sorry; 
otherwise I would have answered at the time.

As it turns out, reading further, the error you're seeing comes from 
glusterfsd.service (not glusterd.service), which shouldn't even be 
enabled unless you're trying to use legacy volfiles from 3.0. The 
"parsing the volfile failed" message was spurious, as you discovered.

As for your current problem...

Are your two machines perhaps connected via crossover cable?

The question comes down to this: when you're on 192.168.253.1 and shut 
down 192.168.253.2, what prevents .1 from being reached? Is it, perhaps, 
because the interface on .1 has itself gone offline? Check dmesg. See if 
you can ping the .1 address (when .2 is down) and see if you can telnet 
to port 24007 on .1.
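
If telnet isn't installed on those boxes, the same port check can be 
sketched in Python. This is only a rough sketch under my own assumptions: 
the helper name tcp_port_open and the 3 second timeout are made up for 
illustration and are not part of gluster. Port 24007 is the one mentioned 
above.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only proves that something
    accepts connections on the port, not that glusterd is healthy.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False
```

Run it on .1 while .2 is down, for example 
tcp_port_open("192.168.253.1", 24007). If that returns False on the box 
that owns the address, the interface itself is the likely suspect.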

On 07/11/2013 09:46 AM, Greg Scott wrote:
>> When you first mount your volume, look in the client log and see if it's connecting to both bricks.
>>   I suspect it's not and that the failure is related to firewall settings.
> Logs from both nodes below.  For this test, first I did "umount /firewall-scripts" on both nodes.  Then I did "mount -av" using the default parameters in my fstab file.  I did **not** turn on backupvolfile-server=<secondary server> for this test.  Then, in another window, I did "tail /var/log/glusterfs/firewall-scripts.log -f", and you can see the spot where I mounted my file system back up again.
>
> Note that everything works as expected when both nodes are online, so this suggests everyone can see everyone else when things are steady-state.   Also note that backupvolfile-server=<secondary server> changed the behavior - I documented this in an earlier post.
>
>> ...the failure is related to firewall settings.
> No way.   I’m wide open on the interface I’m using for heartbeat and glusterfs.  In my application, I take node fw1 offline by inserting a firewall rule and then getting rid of it a few seconds later.   For testing right now, I just insert the rule by hand, look at a bunch of stuff, then get rid of it later.    But since you brought it up, I cleaned out all firewall rules before doing and logging the mounts below.  Near as I can tell, it looks like everyone can see everyone else.  And the logs look the same to my eye as they did before I dropped all (not relevant) firewall rules.
>
> Log from fw1:
>
> [root@chicago-fw1 ~]#
> [root@chicago-fw1 ~]# tail /var/log/glusterfs/firewall-scripts.log -f
> [2013-07-11 15:51:54.423508] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-1: Connected to 192.168.253.2:49152, attached to remote volume '/gluster-fw2'.
> [2013-07-11 15:51:54.423576] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-1: Server and Client lk-version numbers are not same, reopening the fds
> [2013-07-11 15:51:54.440124] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
> [2013-07-11 15:51:54.440660] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-1: Server lk version = 1
> [2013-07-11 15:51:54.440886] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
> [2013-07-11 15:51:54.442235] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-firewall-scripts-replicate-0: added root inode
> [2013-07-11 15:51:54.443451] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-0
> [2013-07-11 16:21:22.729423] I [fuse-bridge.c:4583:fuse_thread_proc] 0-fuse: unmounting /firewall-scripts
> [2013-07-11 16:21:22.730976] W [glusterfsd.c:970:cleanup_and_exit] (-->/usr/lib64/libc.so.6(clone+0x6d) [0x7f7a69fee13d] (-->/usr/lib64/libpthread.so.0(+0x33c1607c53) [0x7f7a6a684c53] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7f7a6b372e35]))) 0-: received signum (15), shutting down
> [2013-07-11 16:21:22.731040] I [fuse-bridge.c:5212:fini] 0-fuse: Unmounting '/firewall-scripts'.
>
>
> Blank space - mount -av below.
>
> [2013-07-11 16:39:36.625696] I [glusterfsd.c:1878:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.0beta3 (/usr/sbin/glusterfs --volfile-id=/firewall-scripts --volfile-server=192.168.253.1 /firewall-scripts)
> [2013-07-11 16:39:36.640661] I [socket.c:3480:socket_init] 0-glusterfs: SSL support is NOT enabled
> [2013-07-11 16:39:36.640800] I [socket.c:3495:socket_init] 0-glusterfs: using system polling thread
> [2013-07-11 16:39:36.672416] I [socket.c:3480:socket_init] 0-firewall-scripts-client-1: SSL support is NOT enabled
> [2013-07-11 16:39:36.672539] I [socket.c:3495:socket_init] 0-firewall-scripts-client-1: using system polling thread
> [2013-07-11 16:39:36.674545] I [socket.c:3480:socket_init] 0-firewall-scripts-client-0: SSL support is NOT enabled
> [2013-07-11 16:39:36.674667] I [socket.c:3495:socket_init] 0-firewall-scripts-client-0: using system polling thread
> [2013-07-11 16:39:36.675015] I [client.c:2154:notify] 0-firewall-scripts-client-0: parent translators are ready, attempting connect on transport
> [2013-07-11 16:39:36.686253] I [client.c:2154:notify] 0-firewall-scripts-client-1: parent translators are ready, attempting connect on transport
> Given volfile:
> +------------------------------------------------------------------------------+
>    1: volume firewall-scripts-client-0
>    2:     type protocol/client
>    3:     option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
>    4:     option username de6eacd1-31bc-4bdb-a049-776cd840059e
>    5:     option transport-type tcp
>    6:     option remote-subvolume /gluster-fw1
>    7:     option remote-host 192.168.253.1
>    8: end-volume
>    9:
>   10: volume firewall-scripts-client-1
>   11:     type protocol/client
>   12:     option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
>   13:     option username de6eacd1-31bc-4bdb-a049-776cd840059e
>   14:     option transport-type tcp
>   15:     option remote-subvolume /gluster-fw2
>   16:     option remote-host 192.168.253.2
>   17: end-volume
>   18:
>   19: volume firewall-scripts-replicate-0
>   20:     type cluster/replicate
>   21:     subvolumes firewall-scripts-client-0 firewall-scripts-client-1
>   22: end-volume
>   23:
>   24: volume firewall-scripts-dht
>   25:     type cluster/distribute
>   26:     subvolumes firewall-scripts-replicate-0
>   27: end-volume
>   28:
>   29: volume firewall-scripts-write-behind
>   30:     type performance/write-behind
>   31:     subvolumes firewall-scripts-dht
>   32: end-volume
>   33:
>   34: volume firewall-scripts-read-ahead
>   35:     type performance/read-ahead
>   36:     subvolumes firewall-scripts-write-behind
>   37: end-volume
>   38:
>   39: volume firewall-scripts-io-cache
>   40:     type performance/io-cache
>   41:     subvolumes firewall-scripts-read-ahead
>   42: end-volume
>   43:
>   44: volume firewall-scripts-quick-read
>   45:     type performance/quick-read
>   46:     subvolumes firewall-scripts-io-cache
>   47: end-volume
>   48:
>   49: volume firewall-scripts-open-behind
>   50:     type performance/open-behind
>   51:     subvolumes firewall-scripts-quick-read
>   52: end-volume
>   53:
>   54: volume firewall-scripts-md-cache
>   55:     type performance/md-cache
>   56:     subvolumes firewall-scripts-open-behind
>   57: end-volume
>   58:
>   59: volume firewall-scripts
>   60:     type debug/io-stats
>   61:     option count-fop-hits off
>   62:     option latency-measurement off
>   63:     subvolumes firewall-scripts-md-cache
>   64: end-volume
>
> +------------------------------------------------------------------------------+
> [2013-07-11 16:39:36.698740] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-firewall-scripts-client-0: changing port to 49152 (from 0)
> [2013-07-11 16:39:36.698974] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
> [2013-07-11 16:39:36.711537] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-firewall-scripts-client-1: changing port to 49152 (from 0)
> [2013-07-11 16:39:36.711717] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-1: readv failed (No data available)
> [2013-07-11 16:39:36.723116] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2013-07-11 16:39:36.723521] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2013-07-11 16:39:36.723913] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-0: Connected to 192.168.253.1:49152, attached to remote volume '/gluster-fw1'.
> [2013-07-11 16:39:36.723995] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-0: Server and Client lk-version numbers are not same, reopening the fds
> [2013-07-11 16:39:36.724390] I [afr-common.c:3698:afr_notify] 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' came back up; going online.
> [2013-07-11 16:39:36.724601] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-0: Server lk version = 1
> [2013-07-11 16:39:36.724730] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-1: Connected to 192.168.253.2:49152, attached to remote volume '/gluster-fw2'.
> [2013-07-11 16:39:36.724788] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-1: Server and Client lk-version numbers are not same, reopening the fds
> [2013-07-11 16:39:36.737359] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
> [2013-07-11 16:39:36.739297] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-1: Server lk version = 1
> [2013-07-11 16:39:36.739486] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
> [2013-07-11 16:39:36.740672] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-firewall-scripts-replicate-0: added root inode
> [2013-07-11 16:39:36.741820] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-0
>
> And from fw2:
>
> [root@chicago-fw2 ~]# tail /var/log/glusterfs/firewall-scripts.log -f
> [2013-07-11 15:51:45.499012] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-1: Server and Client lk-version numbers are not same, reopening the fds
> [2013-07-11 15:51:45.512667] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
> [2013-07-11 15:51:45.513211] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-0: Server lk version = 1
> [2013-07-11 15:51:45.513416] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-1: Server lk version = 1
> [2013-07-11 15:51:45.513538] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
> [2013-07-11 15:51:45.515208] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-firewall-scripts-replicate-0: added root inode
> [2013-07-11 15:51:45.516512] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-1
> [2013-07-11 16:21:28.150710] I [fuse-bridge.c:4583:fuse_thread_proc] 0-fuse: unmounting /firewall-scripts
> [2013-07-11 16:21:28.154455] W [glusterfsd.c:970:cleanup_and_exit] (-->/usr/lib64/libc.so.6(clone+0x6d) [0x7fa599ad613d] (-->/usr/lib64/libpthread.so.0(+0x3c1b407c53) [0x7fa59a16cc53] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fa59ae5ae35]))) 0-: received signum (15), shutting down
> [2013-07-11 16:21:28.154503] I [fuse-bridge.c:5212:fini] 0-fuse: Unmounting '/firewall-scripts'.
>
>
> Blank space - this is where I did mount -av
>
> [2013-07-11 16:39:35.100584] I [glusterfsd.c:1878:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.0beta3 (/usr/sbin/glusterfs --volfile-id=/firewall-scripts --volfile-server=192.168.253.2 /firewall-scripts)
> [2013-07-11 16:39:35.113481] I [socket.c:3480:socket_init] 0-glusterfs: SSL support is NOT enabled
> [2013-07-11 16:39:35.113614] I [socket.c:3495:socket_init] 0-glusterfs: using system polling thread
> [2013-07-11 16:39:35.147118] I [socket.c:3480:socket_init] 0-firewall-scripts-client-1: SSL support is NOT enabled
> [2013-07-11 16:39:35.147313] I [socket.c:3495:socket_init] 0-firewall-scripts-client-1: using system polling thread
> [2013-07-11 16:39:35.149112] I [socket.c:3480:socket_init] 0-firewall-scripts-client-0: SSL support is NOT enabled
> [2013-07-11 16:39:35.149268] I [socket.c:3495:socket_init] 0-firewall-scripts-client-0: using system polling thread
> [2013-07-11 16:39:35.149390] I [client.c:2154:notify] 0-firewall-scripts-client-0: parent translators are ready, attempting connect on transport
> [2013-07-11 16:39:35.160491] I [client.c:2154:notify] 0-firewall-scripts-client-1: parent translators are ready, attempting connect on transport
> Given volfile:
> +------------------------------------------------------------------------------+
>    1: volume firewall-scripts-client-0
>    2:     type protocol/client
>    3:     option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
>    4:     option username de6eacd1-31bc-4bdb-a049-776cd840059e
>    5:     option transport-type tcp
>    6:     option remote-subvolume /gluster-fw1
>    7:     option remote-host 192.168.253.1
>    8: end-volume
>    9:
>   10: volume firewall-scripts-client-1
>   11:     type protocol/client
>   12:     option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
>   13:     option username de6eacd1-31bc-4bdb-a049-776cd840059e
>   14:     option transport-type tcp
>   15:     option remote-subvolume /gluster-fw2
>   16:     option remote-host 192.168.253.2
>   17: end-volume
>   18:
>   19: volume firewall-scripts-replicate-0
>   20:     type cluster/replicate
>   21:     subvolumes firewall-scripts-client-0 firewall-scripts-client-1
>   22: end-volume
>   23:
>   24: volume firewall-scripts-dht
>   25:     type cluster/distribute
>   26:     subvolumes firewall-scripts-replicate-0
>   27: end-volume
>   28:
>   29: volume firewall-scripts-write-behind
>   30:     type performance/write-behind
>   31:     subvolumes firewall-scripts-dht
>   32: end-volume
>   33:
>   34: volume firewall-scripts-read-ahead
>   35:     type performance/read-ahead
>   36:     subvolumes firewall-scripts-write-behind
>   37: end-volume
>   38:
>   39: volume firewall-scripts-io-cache
>   40:     type performance/io-cache
>   41:     subvolumes firewall-scripts-read-ahead
>   42: end-volume
>   43:
>   44: volume firewall-scripts-quick-read
>   45:     type performance/quick-read
>   46:     subvolumes firewall-scripts-io-cache
>   47: end-volume
>   48:
>   49: volume firewall-scripts-open-behind
>   50:     type performance/open-behind
>   51:     subvolumes firewall-scripts-quick-read
>   52: end-volume
>   53:
>   54: volume firewall-scripts-md-cache
>   55:     type performance/md-cache
>   56:     subvolumes firewall-scripts-open-behind
>   57: end-volume
>   58:
>   59: volume firewall-scripts
>   60:     type debug/io-stats
>   61:     option count-fop-hits off
>   62:     option latency-measurement off
>   63:     subvolumes firewall-scripts-md-cache
>   64: end-volume
>
> +------------------------------------------------------------------------------+
> [2013-07-11 16:39:35.173867] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-firewall-scripts-client-0: changing port to 49152 (from 0)
> [2013-07-11 16:39:35.174065] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-firewall-scripts-client-1: changing port to 49152 (from 0)
> [2013-07-11 16:39:35.174377] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
> [2013-07-11 16:39:35.185807] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-1: readv failed (No data available)
> [2013-07-11 16:39:35.197485] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2013-07-11 16:39:35.197740] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2013-07-11 16:39:35.198257] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-0: Connected to 192.168.253.1:49152, attached to remote volume '/gluster-fw1'.
> [2013-07-11 16:39:35.198346] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-0: Server and Client lk-version numbers are not same, reopening the fds
> [2013-07-11 16:39:35.198546] I [afr-common.c:3698:afr_notify] 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' came back up; going online.
> [2013-07-11 16:39:35.198759] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-1: Connected to 192.168.253.2:49152, attached to remote volume '/gluster-fw2'.
> [2013-07-11 16:39:35.198810] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-1: Server and Client lk-version numbers are not same, reopening the fds
> [2013-07-11 16:39:35.211534] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
> [2013-07-11 16:39:35.211921] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-1: Server lk version = 1
> [2013-07-11 16:39:35.212098] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-0: Server lk version = 1
> [2013-07-11 16:39:35.212234] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
> [2013-07-11 16:39:35.213421] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-firewall-scripts-replicate-0: added root inode
> [2013-07-11 16:39:35.214372] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-1
>
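
For checking whether a client log shows connections to both bricks, 
something like the following could save some squinting. It is a sketch 
only: the function name connected_bricks and the regular expression are 
my own, written against the "Connected to ..., attached to remote volume" 
lines quoted above, and other gluster versions may word that message 
differently.

```python
import re

# Matches the client_setvolume_cbk message format seen in the quoted logs.
CONNECTED = re.compile(
    r"Connected to (?P<host>\d+\.\d+\.\d+\.\d+):(?P<port>\d+), "
    r"attached to remote volume '(?P<brick>[^']+)'"
)


def connected_bricks(log_lines):
    """Return the set of (host, port, brick) tuples found in the log lines."""
    found = set()
    for line in log_lines:
        match = CONNECTED.search(line)
        if match:
            found.add(
                (match.group("host"), int(match.group("port")), match.group("brick"))
            )
    return found
```

Feed it the lines of /var/log/glusterfs/firewall-scripts.log written since 
the mount; a healthy two-brick replica should yield one entry per brick.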



