[Gluster-devel] Re: Timeout settings and self-healing ? (WAS: HA failover test unsuccessful (inaccessible mountpoint))

Fri Apr 25 23:06:15 UTC 2008

On Wed, Apr 23, 2008 at 3:47 AM, Guido Smit <guido at comlog.nl> wrote:

> Krishna,
>
> I did the test. I killed glusterfsd on one server.
> All tests (ls, df, cp) worked like it should. I didn't even notice any
> difference. Unplugging the cable however, blocked all operations and finally
> after a few minutes
> the transport endpoint message appears.
>
The problem with TCP/IP is that when you unplug the cable, no event is
delivered to the application's poll() on the socket. The driver internally
keeps retrying, and only after a long time (around 10+ minutes when we
tested) do we get an error saying "no route to host". But when the
application dies on the server, or the server shuts down, the connected nodes
get a notification, so everything goes smoothly. Hence the delay when the
network cable is unplugged.
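
The difference between the two cases can be seen with a small Python sketch.
This is only an illustration, not GlusterFS code: the helper name
wait_for_event and the 200 ms timeout are made up, and a local socket pair
cannot really model a pulled cable, so a peer that simply stays silent stands
in for it. It needs a platform that provides select.poll() (e.g. Linux).

```python
import select
import socket


def wait_for_event(sock, timeout_ms):
    """Poll a connected socket; return 'data', 'closed' or 'timeout'."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    events = poller.poll(timeout_ms)
    if not events:
        # Nothing was reported: from poll()'s point of view a silent peer
        # looks the same as an idle but healthy connection.
        return "timeout"
    # Peek so that pending data is not consumed; an empty read means EOF.
    data = sock.recv(4096, socket.MSG_PEEK)
    return "data" if data else "closed"


# Case 1: the peer closes its end (as when the server process is killed).
# The kernel reports this immediately.
a, b = socket.socketpair()
b.close()
killed_result = wait_for_event(a, 200)

# Case 2: the peer stays silent (stand-in for an unplugged cable).
# poll() only returns because of our own timeout.
c, d = socket.socketpair()
silent_result = wait_for_event(c, 200)

print(killed_result, silent_result)
```

Without the timeout argument, the second poll() would wait until the kernel
itself gave up on the connection.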

We came up with a workaround for managing this delay: the
'transport-timeout' option, which times out each request after a certain
time. The default is now 108 seconds. We kept it this high because some
applications that use mandatory locks (which block a write until the lock is
freed) can easily take 1+ minutes to release their locks. Users have the
option to set 'transport-timeout' (in the protocol/client volume), so they
can tune it to the I/O time of their apps.
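
As an illustration, a client volume with a shorter timeout might look like
the fragment below. The volume name, host address, remote subvolume name and
the 30-second value are made up, and the transport-type line assumes the
volume spec syntax of current releases, so check it against the documentation
for the version you run.

```
volume client1
  type protocol/client
  option transport-type tcp/client   # assumed syntax for this release
  option remote-host 192.168.0.10    # hypothetical server address
  option remote-subvolume brick      # hypothetical exported volume name
  option transport-timeout 30        # seconds; default is 108
end-volume
```

A value shorter than the time your applications hold locks would make
requests time out while the server is still healthy, so pick it with the
slowest expected operation in mind.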

In our test setups, requests timed out exactly after the configured
transport-timeout, every time. So we could not reproduce the indefinite
freeze.

Regards,
Amar

-- 
Amar Tumballi
Gluster/GlusterFS Hacker
[bulde on #gluster/irc.gnu.org]
http://www.zresearch.com - Commoditizing Super Storage!