[Gluster-devel] Need sensible default value for detecting unclean client disconnects
Niels de Vos
ndevos at redhat.com
Tue May 20 11:30:24 UTC 2014
Hi all,
the last few days I've been looking at a problem [1] where a client
locks a file over a FUSE-mount, and a 2nd client tries to grab that lock
too. It is expected that the 2nd client gets blocked until the 1st
client releases the lock. This all works as long as the 1st client
cleanly releases the lock.
Whenever the 1st client crashes (like a kernel panic) or the network is
split and the 1st client is unreachable, the 2nd client may not get the
lock until the bricks detect that the connection to the 1st client is
dead. If there are pending replies, the bricks may need 15-20 minutes
until the re-transmissions of those replies have timed out.
The current default of 15-20 minutes is quite long for a fail-over
scenario. Relatively recently [2], the Linux kernel got
a TCP_USER_TIMEOUT socket option (similar to TCP_KEEPALIVE). This option
can be used to configure a per-socket timeout, instead of a system-wide
configuration through the net.ipv4.tcp_retries2 sysctl.
The default network.ping-timeout is set to 42 seconds. I'd like to
propose a network.tcp-timeout option that can be set per volume. This
option should then set TCP_USER_TIMEOUT for the socket, which causes
re-transmission failures to be fatal after the timeout has passed.
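If adopted, setting it would presumably follow the existing volume-option
syntax; note that network.tcp-timeout is only the name proposed in this
mail, not an existing option, and "myvol" is a placeholder volume name:

```shell
# Hypothetical: set the proposed per-volume TCP timeout (in seconds),
# analogous to how network.ping-timeout is configured today.
gluster volume set myvol network.tcp-timeout 300
```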
Now the remaining question, what shall be the default timeout in seconds
for this new network.tcp-timeout option? I'm currently thinking of
making it high enough (like 5 minutes) to prevent false positives.
Thoughts and comments welcome,
Niels
1 https://bugzilla.redhat.com/show_bug.cgi?id=1099460
2 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c7