[Gluster-devel] Spurious disconnections / connectivity loss

Sun Jan 31 00:29:55 UTC 2010

Stephan von Krawczynski wrote:
> On Fri, 29 Jan 2010 18:41:10 +0000
> Gordan Bobic <gordan at bobich.net> wrote:
> 
>> I'm seeing things like this in the logs, coupled with things locking up 
>> for a while until the timeout is complete:
>>
>> [2010-01-29 18:29:01] E 
>> [client-protocol.c:415:client_ping_timer_expired] home2: Server 
>> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
>> [2010-01-29 18:29:01] E 
>> [client-protocol.c:415:client_ping_timer_expired] home2: Server 
>> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
>>
>> The thing is, I know for a fact that there is no network outage of any 
>> sort. All the machines are on a local gigabit ethernet, and there is no 
>> connectivity loss observed anywhere else. ssh sessions going to the 
>> machines that are supposedly "not responding" remain alive and well, 
>> with no lag.
> 
> What you're seeing here is exactly what made us increase the ping-timeout to
> 120.
> To us it is obvious that the keep-alive strategy does not cope with minimal
> packet loss. On _every_ network you can see packet loss (read the docs of your
> switch carefully). We had the impression that the strategy implemented is not
> aware of the fact that a lost ping packet is no proof of a disconnected
> server, only a hint to take a closer look.

It sounds like there need to be more heartbeats per minute. 1 packet per 
10 seconds might be a good figure to start with, but I cannot see that 
even 1 packet per second would be harmful unless the number of nodes 
gets very large. Disconnection should be triggered only after some 
threshold number (certainly > 1) of those are lost in a row. Are there 
options to tune such parameters in the volume spec file?
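
To make the idea concrete, here is a minimal sketch in Python of the 
"disconnect only after N heartbeats are lost in a row" rule. It is not 
GlusterFS code, and the names (HeartbeatMonitor, miss_threshold, record) 
are made up for illustration; they are not volume spec options. The 
default threshold of 3 is just an assumed starting value.

```python
class HeartbeatMonitor:
    """Track heartbeat outcomes for one peer (hypothetical helper).

    The peer is treated as disconnected only after `miss_threshold`
    heartbeats in a row go unanswered. A single lost packet is only
    counted, not acted upon.
    """

    def __init__(self, miss_threshold=3):
        if miss_threshold < 1:
            raise ValueError("miss_threshold must be at least 1")
        self.miss_threshold = miss_threshold
        self.consecutive_misses = 0
        self.connected = True

    def record(self, reply_received):
        """Feed in the result of one heartbeat.

        Returns True while the peer is still considered connected.
        Once the threshold is hit the state latches to disconnected;
        a real client would start its reconnect logic at that point.
        """
        if reply_received:
            # Any reply clears the run of misses.
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
            if self.consecutive_misses >= self.miss_threshold:
                self.connected = False
        return self.connected


if __name__ == "__main__":
    monitor = HeartbeatMonitor(miss_threshold=3)
    # One isolated loss is tolerated; three losses in a row are not.
    outcomes = [True, False, True, False, False, False]
    print([monitor.record(ok) for ok in outcomes])
```

With one heartbeat every 10 seconds and a threshold of 3, a dead server 
would be noticed after roughly 30 seconds, while an isolated lost packet 
would cause no disconnection at all.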

Gordan