[Gluster-devel] Spurious disconnections / connectivity loss
Stephan von Krawczynski
skraw at ithnet.com
Sat Jan 30 11:08:29 UTC 2010
On Fri, 29 Jan 2010 18:41:10 +0000
Gordan Bobic <gordan at bobich.net> wrote:
> I'm seeing things like this in the logs, coupled with things locking up
> for a while until the timeout is complete:
>
> [2010-01-29 18:29:01] E
> [client-protocol.c:415:client_ping_timer_expired] home2: Server
> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
> [2010-01-29 18:29:01] E
> [client-protocol.c:415:client_ping_timer_expired] home2: Server
> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
>
> The thing is, I know for a fact that there is no network outage of any
> sort. All the machines are on a local gigabit ethernet, and there is no
> connectivity loss observed anywhere else. ssh sessions going to the
> machines that are supposedly "not responding" remain alive and well,
> with no lag.
What you're seeing here is exactly what made us increase the ping-timeout to
120 seconds.
To us it is obvious that the keep-alive strategy does not cope with even minimal
packet loss. You can see packet loss on _every_ network (read the docs of your
switch carefully). Our impression is that the implemented strategy does not
take into account that a lost ping packet is no proof of a disconnected
server, only a hint to take a closer look.
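As an illustration of the "hint, not proof" idea, here is a minimal sketch in
Python. It is not GlusterFS code; the function name and the threshold of three
misses are assumptions made up for the example. A single lost ping only
increments a counter, and the server is declared disconnected only after
several pings in a row go unanswered.

```python
def should_disconnect(ping_results, max_consecutive_misses=3):
    """Decide whether a peer should be treated as disconnected.

    ping_results: iterable of booleans, True if that ping was answered.
    max_consecutive_misses: hypothetical threshold, not a GlusterFS option.
    """
    misses = 0
    for answered in ping_results:
        if answered:
            # Any reply proves the server is alive, so start counting again.
            misses = 0
        else:
            # A lost ping is only a hint; look closer before giving up.
            misses += 1
            if misses >= max_consecutive_misses:
                return True
    return False
```

With this rule, occasional isolated losses such as
[True, False, True, False, True] never trigger a disconnect, while three
unanswered pings in a row do.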
--
Regards,
Stephan