[Gluster-users] Node down and volumes unreachable

Tue Feb 18 20:44:18 UTC 2014

The log of that particular volume says:

[2014-02-18 09:43:17.136182] W [socket.c:410:__socket_keepalive] 0-socket: failed to set keep idle on socket 8
[2014-02-18 09:43:17.136285] W [socket.c:1876:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2014-02-18 09:43:18.343409] I [server-handshake.c:571:server_setvolume] 0-teoswitch_default_storage-server: accepted client from xxxxx55.domain.com-2075-2014/02/18-09:43:14:302234-teoswitch_default_storage-client-1-0 (version: 3.3.0)
[2014-02-18 09:43:21.356302] I [server-handshake.c:571:server_setvolume] 0-teoswitch_default_storage-server: accepted client from xxxxx54.domain.com-9651-2014/02/18-09:42:00:141779-teoswitch_default_storage-client-1-0 (version: 3.3.0)
[2014-02-18 10:38:26.488333] W [socket.c:195:__socket_rwv] 0-tcp.teoswitch_default_storage-server: readv failed (Connection timed out)
[2014-02-18 10:38:26.488431] I [server.c:685:server_rpc_notify] 0-teoswitch_default_storage-server: disconnecting connectionfrom xxxxx54.hexacta.com-9651-2014/02/18-09:42:00:141779-teoswitch_default_storage-client-1-0
[2014-02-18 10:38:26.488494] I [server-helpers.c:741:server_connection_put] 0-teoswitch_default_storage-server: Shutting down connection xxxxx54.hexacta.com-9651-2014/02/18-09:42:00:141779-teoswitch_default_storage-client-1-0
[2014-02-18 10:38:26.488541] I [server-helpers.c:629:server_connection_destroy] 0-teoswitch_default_storage-server: destroyed connection of xxxxx54.hexacta.com-9651-2014/02/18-09:42:00:141779-teoswitch_default_storage-client-1-0
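
To pick the warnings and errors out of a longer brick log like the one above, a minimal Python sketch follows. It assumes every entry has the layout seen in these excerpts, `[timestamp] LEVEL [file:line:function] domain: message`. The helper names `parse_line` and `warnings_and_errors` are hypothetical and not part of GlusterFS.

```python
import re

# Layout assumed from the excerpts in this thread:
# [timestamp] LEVEL [file:line:function] domain: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^\]]+)\]\s+(?P<domain>\S+):\s+(?P<msg>.*)$"
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None

def warnings_and_errors(lines):
    """Keep only entries logged at level W (warning) or E (error)."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] in {"W", "E"}]

# Two entries copied from the brick log above.
sample = [
    "[2014-02-18 10:38:26.488333] W [socket.c:195:__socket_rwv] "
    "0-tcp.teoswitch_default_storage-server: readv failed (Connection timed out)",
    "[2014-02-18 10:38:26.488494] I [server-helpers.c:741:server_connection_put] "
    "0-teoswitch_default_storage-server: Shutting down connection "
    "xxxxx54.hexacta.com-9651-2014/02/18-09:42:00:141779-teoswitch_default_storage-client-1-0",
]

for entry in warnings_and_errors(sample):
    print(entry["ts"], entry["level"], entry["msg"])
```

Entries that were wrapped across several lines, as in the quoted message further down, would need to be rejoined before they match this pattern.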

When I try to access the folder I get:

[root@hxteo55 ~]# ll /<path>/1001/voicemail/
ls: /<path>/1001/voicemail/: Input/output error

This is the volume info:

Volume Name: teoswitch_default_storage
Type: Distribute
Volume ID: 83c9d6f3-0288-4358-9fdc-b1d062cc8fca
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 12.12.123.54:/<path>/gluster/36779974/teoswitch_default_storage
Brick2: 12.12.123.55:/<path>/gluster/36779974/teoswitch_default_storage

Any ideas?

Marco Zanger
Phone 54 11 5299-5400 (int. 5501)
Clay 2954, C1426DLD, Buenos Aires, Argentina

-----Original Message-----
From: Vijay Bellur [mailto:vbellur at redhat.com] 
Sent: Tuesday, February 18, 2014 03:56 a.m.
To: Marco Zanger; gluster-users at gluster.org
Subject: Re: [Gluster-users] Node down and volumes unreachable

On 02/17/2014 11:19 PM, Marco Zanger wrote:
> Read/write operations hang for a long period of time (too long). I've
> seen them in that state (waiting) for something like 5 minutes, which
> makes every application fail when trying to read or write. These are the
> errors I found in the logs on server A, which is still accessible
> (B was down):
>
> etc-glusterfs-glusterd.vol.log
>
> ...
> [2014-01-31 07:56:49.780247] W 
> [socket.c:1512:__socket_proto_state_machine] 0-management: reading 
> from socket failed. Error (Connection timed out), peer 
> (<SERVER_B_IP>:24007)
> [2014-01-31 07:58:25.965783] E [socket.c:1715:socket_connect_finish] 
> 0-management: connection to <SERVER_B_IP>:24007 failed (No route to 
> host)
> [2014-01-31 08:59:33.923250] I 
> [glusterd-handshake.c:397:glusterd_set_clnt_mgmt_program] 0-: Using 
> Program glusterd mgmt, Num (1238433), Version (2)
> [2014-01-31 08:59:33.923289] I 
> [glusterd-handshake.c:403:glusterd_set_clnt_mgmt_program] 0-: Using Program Peer mgmt, Num (1238437), Version (2) ...
>
>
> glustershd.log
>
> [2014-01-27 12:07:03.644849] W 
> [socket.c:1512:__socket_proto_state_machine] 
> 0-teoswitch_custom_music-client-1: reading from socket failed. Error 
> (Connection timed out), peer (<SERVER_B_IP>:24010)
> [2014-01-27 12:07:03.644888] I [client.c:2090:client_rpc_notify] 
> 0-teoswitch_custom_music-client-1: disconnected
> [2014-01-27 12:09:35.553628] E [socket.c:1715:socket_connect_finish] 
> 0-teoswitch_greetings-client-1: connection to <SERVER_B_IP>:24011 
> failed (Connection timed out)
> [2014-01-27 12:10:13.588148] E [socket.c:1715:socket_connect_finish] 
> 0-license_path-client-1: connection to <SERVER_B_IP>:24013 failed 
> (Connection timed out)
> [2014-01-27 12:10:15.593699] E [socket.c:1715:socket_connect_finish] 
> 0-upload_path-client-1: connection to <SERVER_B_IP>:24009 failed 
> (Connection timed out)
> [2014-01-27 12:10:21.601670] E [socket.c:1715:socket_connect_finish] 
> 0-teoswitch_ivr_greetings-client-1: connection to <SERVER_B_IP>:24012 
> failed (Connection timed out)
> [2014-01-27 12:10:23.607312] E [socket.c:1715:socket_connect_finish] 
> 0-teoswitch_custom_music-client-1: connection to <SERVER_B_IP>:24010 
> failed (Connection timed out)
> [2014-01-27 12:11:21.866604] E [afr-self-heald.c:418:_crawl_proceed] 
> 0-teoswitch_ivr_greetings-replicate-0: Stopping crawl as < 2 children 
> are up
> [2014-01-27 12:11:21.867874] E [afr-self-heald.c:418:_crawl_proceed] 
> 0-teoswitch_greetings-replicate-0: Stopping crawl as < 2 children are 
> up
> [2014-01-27 12:11:21.868134] E [afr-self-heald.c:418:_crawl_proceed] 
> 0-teoswitch_custom_music-replicate-0: Stopping crawl as < 2 children 
> are up
> [2014-01-27 12:11:21.869417] E [afr-self-heald.c:418:_crawl_proceed] 
> 0-license_path-replicate-0: Stopping crawl as < 2 children are up
> [2014-01-27 12:11:21.869659] E [afr-self-heald.c:418:_crawl_proceed] 
> 0-upload_path-replicate-0: Stopping crawl as < 2 children are up
> [2014-01-27 12:12:53.948154] I 
> [client-handshake.c:1636:select_server_supported_programs] 
> 0-teoswitch_greetings-client-1: Using Program GlusterFS 3.3.0, Num 
> (1298437), Version (330)
> [2014-01-27 12:12:53.952894] I 
> [client-handshake.c:1433:client_setvolume_cbk] 
> 0-teoswitch_greetings-client-1: Connected to <SERVER_B_IP>:24011, 
> attached to remote volume
>
> nfs.log: there are lots of errors, but the one that recurs most is this:
>
> [2014-01-27 12:12:27.136033] E [socket.c:1715:socket_connect_finish] 
> 0-teoswitch_custom_music-client-1: connection to <SERVER_B_IP>:24010 
> failed (Connection timed out)
>
> Any ideas? From the logs I see nothing beyond confirmation that A cannot reach B, which makes sense since B is down. But A is not down, and its volume should still be accessible. Right?

Nothing very obvious from these logs.

Can you share relevant portions of the client log file? Usually the name of the mount point would be a part of the client log file.

-Vijay
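
For locating the client log file mentioned above, a small Python sketch of the usual naming convention may help. It assumes a FUSE mount whose log lives in the default directory `/var/log/glusterfs` and is named after the mount point with the slashes turned into dashes; a custom log file set at mount time or a different packaging layout would change this. The mount point `/mnt/example` and the helper `client_log_path` are hypothetical.

```python
import posixpath

def client_log_path(mount_point, log_dir="/var/log/glusterfs"):
    """Guess the FUSE client log file for a given mount point.

    Assumption: the log is named after the mount point, with the leading
    '/' dropped and every remaining '/' replaced by '-'.
    """
    name = posixpath.normpath(mount_point).strip("/").replace("/", "-")
    return posixpath.join(log_dir, name + ".log")

# Hypothetical mount point, for illustration only.
print(client_log_path("/mnt/example"))  # /var/log/glusterfs/mnt-example.log
```

If no file exists at the guessed path, listing the log directory for a name that resembles the mount point is a reasonable fallback.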