[Gluster-users] Node down and volumes unreachable

Mon Feb 17 17:49:05 UTC 2014

Read/write operations hang for a long time (too long). I've seen them stay in that waiting state for something like 5 minutes, which makes every application that tries to read or write fail. These are the errors I found in the logs on server A, which is still accessible (B was down):

etc-glusterfs-glusterd.vol.log

...
[2014-01-31 07:56:49.780247] W [socket.c:1512:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Connection timed out), peer (<SERVER_B_IP>:24007)
[2014-01-31 07:58:25.965783] E [socket.c:1715:socket_connect_finish] 0-management: connection to <SERVER_B_IP>:24007 failed (No route to host)
[2014-01-31 08:59:33.923250] I [glusterd-handshake.c:397:glusterd_set_clnt_mgmt_program] 0-: Using Program glusterd mgmt, Num (1238433), Version (2)
[2014-01-31 08:59:33.923289] I [glusterd-handshake.c:403:glusterd_set_clnt_mgmt_program] 0-: Using Program Peer mgmt, Num (1238437), Version (2)
...

glustershd.log

[2014-01-27 12:07:03.644849] W [socket.c:1512:__socket_proto_state_machine] 0-teoswitch_custom_music-client-1: reading from socket failed. Error (Connection timed out), peer (<SERVER_B_IP>:24010)
[2014-01-27 12:07:03.644888] I [client.c:2090:client_rpc_notify] 0-teoswitch_custom_music-client-1: disconnected
[2014-01-27 12:09:35.553628] E [socket.c:1715:socket_connect_finish] 0-teoswitch_greetings-client-1: connection to <SERVER_B_IP>:24011 failed (Connection timed out)
[2014-01-27 12:10:13.588148] E [socket.c:1715:socket_connect_finish] 0-license_path-client-1: connection to <SERVER_B_IP>:24013 failed (Connection timed out)
[2014-01-27 12:10:15.593699] E [socket.c:1715:socket_connect_finish] 0-upload_path-client-1: connection to <SERVER_B_IP>:24009 failed (Connection timed out)
[2014-01-27 12:10:21.601670] E [socket.c:1715:socket_connect_finish] 0-teoswitch_ivr_greetings-client-1: connection to <SERVER_B_IP>:24012 failed (Connection timed out)
[2014-01-27 12:10:23.607312] E [socket.c:1715:socket_connect_finish] 0-teoswitch_custom_music-client-1: connection to <SERVER_B_IP>:24010 failed (Connection timed out)
[2014-01-27 12:11:21.866604] E [afr-self-heald.c:418:_crawl_proceed] 0-teoswitch_ivr_greetings-replicate-0: Stopping crawl as < 2 children are up
[2014-01-27 12:11:21.867874] E [afr-self-heald.c:418:_crawl_proceed] 0-teoswitch_greetings-replicate-0: Stopping crawl as < 2 children are up
[2014-01-27 12:11:21.868134] E [afr-self-heald.c:418:_crawl_proceed] 0-teoswitch_custom_music-replicate-0: Stopping crawl as < 2 children are up
[2014-01-27 12:11:21.869417] E [afr-self-heald.c:418:_crawl_proceed] 0-license_path-replicate-0: Stopping crawl as < 2 children are up
[2014-01-27 12:11:21.869659] E [afr-self-heald.c:418:_crawl_proceed] 0-upload_path-replicate-0: Stopping crawl as < 2 children are up
[2014-01-27 12:12:53.948154] I [client-handshake.c:1636:select_server_supported_programs] 0-teoswitch_greetings-client-1: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2014-01-27 12:12:53.952894] I [client-handshake.c:1433:client_setvolume_cbk] 0-teoswitch_greetings-client-1: Connected to <SERVER_B_IP>:24011, attached to remote volume

nfs.log: there are lots of errors, but the one that recurs most often is this:

[2014-01-27 12:12:27.136033] E [socket.c:1715:socket_connect_finish] 0-teoswitch_custom_music-client-1: connection to <SERVER_B_IP>:24010 failed (Connection timed out)
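
To get an overview of which connections are failing and why, log excerpts like the ones above can be tallied with a small script. This is only a minimal sketch: it assumes the line layout seen in these excerpts (timestamp, level letter, source location, translator name, message), and the name `summarize_failures` and both regular expressions are illustrative, not part of GlusterFS.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpts in this message:
# [timestamp] LEVEL [file:line:function] translator: message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^\]]+)\]\s+(?P<xlator>[^:]+):\s+(?P<msg>.*)$"
)
# Matches messages such as "connection to HOST:PORT failed (REASON)".
FAIL_RE = re.compile(r"connection to (?P<peer>\S+) failed \((?P<reason>[^)]+)\)")


def summarize_failures(lines):
    """Count error-level connection failures per (translator, peer, reason)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "E":
            continue  # skip unparsable lines and non-error levels
        f = FAIL_RE.search(m.group("msg"))
        if f:
            key = (m.group("xlator"), f.group("peer"), f.group("reason"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < glustershd.log
    for key, n in summarize_failures(sys.stdin).most_common():
        print(n, *key)
```

Grouping by translator name (for example `0-upload_path-client-1`) shows at a glance whether every volume is failing against the same peer or only some of them.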

Any ideas? The logs only confirm that A cannot reach B, which makes sense since B is down. But A is not down, and its volumes should still be accessible, right?

Regards,
Marco

Marco Zanger
Phone 54 11 5299-5400 (int. 5501)
Clay 2954, C1426DLD, Buenos Aires, Argentina

-----Original Message-----
From: Vijay Bellur [mailto:vbellur at redhat.com] 
Sent: Monday, February 17, 2014 01:21 PM
To: Marco Zanger; gluster-users at gluster.org
Subject: Re: [Gluster-users] Node down and volumes unreachable

On 02/13/2014 08:06 PM, Marco Zanger wrote:
> Hi all,
>
> I'm experiencing a strange issue related to both distribute and 
> replicate volumes. The problem is this:
>
> I have two servers, A and B. Both share some replicate volumes and 
> distribute volumes, like this:
>
> Volume Name: upload_path
> Type: Replicate
> Volume ID: 15ca11e2-206e-414d-8299-3ae20c54bd8a
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: <IP-A>:<some_path>/upload_path
> Brick2: <IP-B>:<some_path>/upload_path
>
> Each server mounts the volume from itself, like this. On server A:
>
> glusterfs#<IP_A>:upload_path on <some_path>/upload_path type fuse
> (rw,default_permissions,allow_other,max_read=131072)
>
> I've used both glusterfs and nfs for my tests, but when server B is
> down (unreachable from A) we cannot access (neither read nor write) the
> volumes on A.

By inaccessible state, do you refer to read/write operations hanging or erroring out? Does it stay forever in this inaccessible state? If you check your client log files around the time server B is unreachable from A, there might be some clues around this behavior.

-Vijay
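
One way to follow the suggestion above of checking the client logs around the time B became unreachable is to filter entries by timestamp. The following is a minimal sketch, assuming the timestamp format seen in the log excerpts in this thread (`YYYY-MM-DD HH:MM:SS.ffffff` in square brackets at the start of each line); the function name `entries_near` and the default 5-minute window are illustrative choices, not anything provided by GlusterFS.

```python
from datetime import datetime, timedelta


def entries_near(lines, when, window_minutes=5):
    """Yield log lines whose timestamp is within window_minutes of `when`."""
    lo = when - timedelta(minutes=window_minutes)
    hi = when + timedelta(minutes=window_minutes)
    for line in lines:
        if not line.startswith("["):
            continue  # not a timestamped entry
        stamp = line[1:line.find("]")]
        try:
            ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
        except ValueError:
            continue  # bracketed text that is not a timestamp
        if lo <= ts <= hi:
            yield line


if __name__ == "__main__":
    import sys

    # Usage: python near.py "2014-01-27 12:10:00" < client.log
    moment = datetime.strptime(sys.argv[1], "%Y-%m-%d %H:%M:%S")
    for entry in entries_near(sys.stdin, moment):
        print(entry, end="")
```

The moment passed in has to be expressed in the same time zone the log file uses, which is worth checking before comparing it against the time the outage was observed.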