[Gluster-devel] AFR: machine crash hangs other mounts or transport endpoint not connected

Wed May 14 04:12:54 UTC 2008

Krishna Srinivas wrote:
> On Thu, May 8, 2008 at 9:19 PM, Gerry Reno <greno at verizon.net> wrote:
>   
>> Krishna Srinivas wrote:
>>
>>     
>>> Gerry,
>>>
>>> In your client spec "client-local" does not have any purpose right?
>>>
>>> This is your setup:
>>> server1 and server2 have /home/vmail/mailbrick as storage exports.
>>> on client you have an AFR which connects to server1 and server2.
>>> client mounts it on /home/vmail/mailstore
>>>
>>> Can you try mounting on command line instead of fstab?
>>> When you kill one of the servers, can you see if you see anything
>>> in the log files?
>>>
>>> Also mention "option transport-timeout 5" in the two "client/protocol"
>>> subvolumes. (so the timeout will be 5 secs)
>>>
>>> Thanks
>>> Krishna
>>>
>>  Two machines.
>>  Each machine has a server storage brick (/home/vmail/mailbrick)
>>  Each machine also has a client (/home/vmail/mailstore)
>>  If one of the machines either crashes or needs to be rebooted then it hangs
>> the client mount on the other machine.
>>
>>  I'll umount the mount from fstab and remount from command line and let you
>> know.
>>     
>
> Also mention "option transport-timeout 5" in the two "client/protocol"
> subvolumes. (so the timeout will be 5 secs)
>
>   
>>  Regards,
>>  Gerry
>>
>
>   
Ok, I ran some tests:
First, when I started, I noticed that 'df' showed two client mounts on
one machine and one client mount on the other.  I unmounted the clients
that had been mounted from fstab and changed client.vol to include
"option transport-timeout 5".  Then I started the clients from the
command line.  I saw one client mount on each machine.  I killed one
machine and the other machine still functioned.  I did this a couple of
times.  Then I left the timeout in the vol file and rebooted both
machines.  They both came back up, and df shows two client mounts on
both machines; ps shows two client processes on both machines.  I
killed one machine again and the other machine still functioned.  So I
was not able to recreate the hang.
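
For reference, the client.vol change described above might look roughly
like the text this small Python sketch prints.  It is an illustration
under assumptions, not Gerry's actual spec: the volume names client1,
client2 and afr are taken from the log excerpts in this message, the
host names server1 and server2 follow Krishna's description, and the
remote subvolume name "mailbrick", the "tcp/client" transport type and
the helper function names are hypothetical.  The stanzas use the
"protocol/client" type name, which is assumed to be what the thread
means by "client/protocol".

```python
# Sketch only: renders a GlusterFS 1.3-style client spec as text.
# Host names and the remote subvolume name are placeholders (assumptions).


def client_volume(name, remote_host, remote_subvolume, timeout_secs):
    """Return one protocol/client stanza with transport-timeout set."""
    lines = [
        f"volume {name}",
        "  type protocol/client",
        "  option transport-type tcp/client",
        f"  option remote-host {remote_host}",
        f"  option remote-subvolume {remote_subvolume}",
        f"  option transport-timeout {timeout_secs}",
        "end-volume",
    ]
    return "\n".join(lines)


def afr_volume(name, subvolumes):
    """Return a cluster/afr stanza replicating across the given subvolumes."""
    lines = [
        f"volume {name}",
        "  type cluster/afr",
        f"  subvolumes {' '.join(subvolumes)}",
        "end-volume",
    ]
    return "\n".join(lines)


def render_client_spec(timeout_secs=5):
    """Two client volumes (one per server) with an AFR volume on top."""
    stanzas = [
        client_volume("client1", "server1", "mailbrick", timeout_secs),
        client_volume("client2", "server2", "mailbrick", timeout_secs),
        afr_volume("afr", ["client1", "client2"]),
    ]
    return "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    print(render_client_spec(), end="")
```

The point of the sketch is only that the timeout option goes into each
of the two client volumes, not into the afr volume that sits on top of
them.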

I checked the logs, and both of them contain thousands of lines like
the following from the past few weeks:

2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(22)
2008-04-26 00:27:55 E [client-protocol.c:3742:client_opendir_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [afr_self_heal.c:290:afr_lds_opendir_cbk] afr: op_ret=-1 op_errno=107
2008-04-26 00:27:55 E [afr_self_heal.c:290:afr_lds_opendir_cbk] afr: op_ret=-1 op_errno=24
2008-04-26 00:27:55 E [fuse-bridge.c:459:fuse_entry_cbk] glusterfs-fuse: 11084: (34) /example.com/john => -1 (5)
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(34)
2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(34)
2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(34)

2008-04-25 19:47:47 E [afr.c:2018:afr_open_cbk] afr: (path=/example.com/john/dovecot-uidlist.lock child=client2) op_ret=-1 op_errno=2
2008-04-25 19:47:47 E [afr.c:2018:afr_open_cbk] afr: (path=/example.com/john/dovecot-uidlist.lock child=client1) op_ret=-1 op_errno=2
2008-04-25 19:47:47 E [fuse-bridge.c:692:fuse_fd_cbk] glusterfs-fuse: 5775: (12) /example.com/john/dovecot-uidlist.lock => -1 (2)

2008-04-25 13:09:02 W [fuse-bridge.c:402:fuse_entry_cbk] glusterfs-fuse: 3883: (34) /example.com/gerryreno/dovecot-keywords => 566935 Rehashing because st_nlink less than dentry maps
2008-04-25 13:09:02 E [fuse-bridge.c:1140:fuse_unlink] glusterfs-fuse: 3894: UNLINK /example.com/gerryreno/dovecot-uidlist (fuse_loc_fill() returned NULL inode)
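
With thousands of such lines, a tally per reporting function and per
errno is easier to read than the raw log.  The following Python sketch
shows one way to produce it.  The entry layout (timestamp, level,
[file:line:function], message) is inferred only from the excerpts
above, the function names are hypothetical, and the errno names are the
standard Linux values (2 ENOENT, 5 EIO, 24 EMFILE, 107 ENOTCONN,
111 ECONNREFUSED), hardcoded so the result does not depend on the
platform running the script.

```python
import re
from collections import Counter

# Linux errno values seen in the excerpts (hardcoded on purpose).
LINUX_ERRNO = {
    2: "ENOENT",
    5: "EIO",
    24: "EMFILE",
    107: "ENOTCONN",
    111: "ECONNREFUSED",
}

# Layout assumed from the excerpts:
#   DATE TIME LEVEL [file.c:line:function] message
ENTRY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Z]) "
    r"\[([^:\]]+):(\d+):([^\]]+)\] (.*)$"
)
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} ")
ERRNO_RE = re.compile(r"op_errno=(\d+)")

# A few entries from the excerpts, split across lines as mail wrapping does.
SAMPLE = [
    "2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] ",
    "client2: no proper reply from server, returning ENOTCONN",
    "2008-04-26 00:27:55 E [afr_self_heal.c:290:afr_lds_opendir_cbk] afr: ",
    "op_ret=-1 op_errno=107",
    "2008-04-26 00:27:55 E [afr_self_heal.c:290:afr_lds_opendir_cbk] afr: ",
    "op_ret=-1 op_errno=24",
]


def join_wrapped(lines):
    """Re-join entries that were split across several lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP_RE.match(line) or not entries:
            entries.append(line)  # a timestamp starts a new entry
        else:
            entries[-1] += " " + line  # continuation of the previous entry
    return entries


def summarize(lines):
    """Count entries per (level, function) and per op_errno name."""
    by_function = Counter()
    by_errno = Counter()
    for entry in join_wrapped(lines):
        match = ENTRY_RE.match(entry)
        if not match:
            continue  # skip anything that does not fit the assumed layout
        _, level, _, _, function, message = match.groups()
        by_function[(level, function)] += 1
        for code in ERRNO_RE.findall(message):
            by_errno[LINUX_ERRNO.get(int(code), code)] += 1
    return by_function, by_errno


if __name__ == "__main__":
    functions, errnos = summarize(SAMPLE)
    for (level, function), count in functions.most_common():
        print(level, function, count)
    for name, count in errnos.most_common():
        print("op_errno", name, count)
```

To run it over a real log, pass the open file to summarize() in place
of SAMPLE; the log file path depends on how the client was started.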

Anyway, I wasn't able to reproduce the hang with transport-timeout
set.  I'm still trying to work out why mounting from fstab produces two
client mounts, though.  That seems strange.

Regards,
Gerry