[Gluster-devel] Spurious disconnections / connectivity loss
Gordan Bobic
gordan at bobich.net
Sun Jan 31 00:29:55 UTC 2010
Stephan von Krawczynski wrote:
> On Fri, 29 Jan 2010 18:41:10 +0000
> Gordan Bobic <gordan at bobich.net> wrote:
>
>> I'm seeing things like this in the logs, coupled with things locking up
>> for a while until the timeout is complete:
>>
>> [2010-01-29 18:29:01] E
>> [client-protocol.c:415:client_ping_timer_expired] home2: Server
>> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
>> [2010-01-29 18:29:01] E
>> [client-protocol.c:415:client_ping_timer_expired] home2: Server
>> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
>>
>> The thing is, I know for a fact that there is no network outage of any
>> sort. All the machines are on a local gigabit ethernet, and there is no
>> connectivity loss observed anywhere else. ssh sessions going to the
>> machines that are supposedly "not responding" remain alive and well,
>> with no lag.
>
> What you're seeing here is exactly what made us increase the ping-timeout to
> 120.
> To us it is obvious that the keep alive strategy does not cope with minimal
> packet loss. On _every_ network you can see packet loss (read the docs of your
> switch carefully). We had the impression that the strategy implemented is not
> aware of the fact that a lost ping packet is no proof for a disconnected
> server but only a hint for a closer look.
It sounds like there need to be more heartbeats per minute. 1 packet per 10
seconds might be a good figure to start with, but I cannot see that even
1 packet / second would be harmful unless the number of nodes gets very
large. Disconnection should then be triggered only after some threshold
number (certainly > 1) of those are lost in a row. Are there options to
tune such parameters in the volume spec file?
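
To make the idea concrete, here is a rough sketch of the "N missed in a row"
rule. This is not GlusterFS code; the class, method names, and default values
are all made up for illustration. It only shows that a single lost ping
resets harmlessly on the next reply, while a run of consecutive losses
triggers the disconnect.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical consecutive-loss detector; names and defaults are assumptions."""
    max_missed: int = 3     # consecutive unanswered pings tolerated (assumed value)
    missed: int = 0         # current run of unanswered pings
    connected: bool = True

    def on_reply(self) -> None:
        # Any reply proves the server is alive, so the loss streak resets.
        self.missed = 0

    def on_timeout(self) -> None:
        # One ping went unanswered; disconnect only once the streak
        # reaches the threshold.
        self.missed += 1
        if self.missed >= self.max_missed:
            self.connected = False


def simulate(replies, max_missed=3):
    """Feed a sequence of ping outcomes (True = reply seen) to the monitor
    and return whether the connection is still considered up afterwards."""
    mon = HeartbeatMonitor(max_missed=max_missed)
    for got_reply in replies:
        if got_reply:
            mon.on_reply()
        else:
            mon.on_timeout()
        if not mon.connected:
            break
    return mon.connected
```

With a threshold of 3, `simulate([True, False, True, False, False, True])`
stays connected because no three losses are adjacent, whereas
`simulate([True, False, False, False])` disconnects. At 1 ping per second that
would detect a dead server in about 3 seconds, without dropping the
connection over one lost packet.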
Gordan