[Gluster-users] Exact purpose of network.ping-timeout

Fri Dec 29 00:08:45 UTC 2017

The ping-timeout is long (42 seconds) because re-establishing fds and locks can be a very expensive operation. With an average MTBF of 45000 hours for a server, even just a replica 2 would see a server failure, and therefore one 42 second interruption, roughly every 2.6 years. That is still 6 nines of uptime.
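
For anyone who wants to check that arithmetic, here is a rough sketch. It assumes the two servers fail independently (so the volume sees a failure twice as often as a single server does) and that each failure costs exactly one ping-timeout of blocked I/O. The function name and the 365.25-day year are my own choices for illustration, not anything from Gluster.

```python
import math

HOURS_PER_YEAR = 24 * 365.25  # assumption: average year length


def outage_stats(mtbf_hours, servers, timeout_seconds):
    """Return (years between failures, availability, number of nines)."""
    # Assumption: independent failures, so any-server failures occur
    # `servers` times as often as failures of one server.
    hours_between_failures = mtbf_hours / servers
    years_between_failures = hours_between_failures / HOURS_PER_YEAR
    # Fraction of time spent blocked waiting for the ping-timeout to expire.
    unavailability = timeout_seconds / (hours_between_failures * 3600)
    availability = 1 - unavailability
    nines = -math.log10(unavailability)
    return years_between_failures, availability, nines


years, availability, nines = outage_stats(45000, 2, 42)
print(f"{years:.1f} years between failures")
print(f"availability {availability:.7%}")
print(f"about {math.floor(nines)} nines")
```

With the numbers above this gives about 2.6 years between failures and six nines. Lowering the timeout to 10 seconds only improves that figure slightly, which is why the cost of re-establishing fds and locks dominates the trade-off.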

On December 27, 2017 3:17:01 AM PST, Omar Kohl <omar.kohl at iternity.com> wrote:
>Hi,
>
>> If you set it to 10 seconds, and a node goes down, you'll see a 10
>second freeze in all I/O for the volume.
>
>Exactly! ONLY 10 seconds instead of the default 42 seconds :-)
>
>As I said before the problem with the 42 seconds is that a Windows
>Samba Client will disconnect (and therefore interrupt any read/write
>operation) after waiting for about 25 seconds. So 42 seconds is too
>high. In this case it would therefore make more sense to reduce the
>ping-timeout, right?
>
>Has anyone done any performance measurements on what the implications
>of a low ping-timeout are? What are the costs of "triggering heals all
>the time"?
>
>On a related note I found the
>extras/hook-scripts/start/post/S29CTDBsetup.sh script that mounts a
>CTDB (Samba) share and explicitly sets the ping-timeout to 10 seconds.
>There is a comment saying: "Make sure ping-timeout is not default for
>CTDB volume". Unfortunately there is no explanation in the script, in
>the commit or in the Gerrit review history
>(https://review.gluster.org/#/c/7569/,
>https://review.gluster.org/#/c/8007/) for WHY you make sure
>ping-timeout is not default. Can anyone tell me the reason?
>
>Kind regards,
>Omar
>
>-----Original Message-----
>From: gluster-users-bounces at gluster.org
>[mailto:gluster-users-bounces at gluster.org] On Behalf Of
>lemonnierk at ulrar.net
>Sent: Tuesday, 26 December 2017 22:05
>To: gluster-users at gluster.org
>Subject: Re: [Gluster-users] Exact purpose of network.ping-timeout
>
>Hi,
>
>It's just the delay for which a node can stop responding before being
>marked as down.
>Basically that's how long a node can go down before a heal becomes
>necessary to bring it back.
>
>If you set it to 10 seconds, and a node goes down, you'll see a 10
>second freeze in all I/O for the volume. That's why you don't want it
>too high (having a 2 minute freeze on I/O for example would be pretty
>bad, depending on what you host), but you don't want it too low either
>(to avoid triggering heals all the time).
>
>You can configure it because it depends on what you host. You might be
>okay with a freeze of a few minutes to avoid a heal, or you might not
>care about heals at all and prefer a very low value to avoid freezes.
>The default value should work pretty well for most things though.
>On Tue, Dec 26, 2017 at 01:11:48PM +0000, Omar Kohl wrote:
>> Hi,
>> 
>> I have a question regarding the "ping-timeout" option. I have been
>researching its purpose for a few days and it is not completely clear
>to me, especially since the Gluster community apparently strongly
>discourages changing, or at least decreasing, this value!
>> 
>> Assuming that I set ping-timeout to 10 seconds (instead of the
>default 42) this would mean that if I have a network outage of 11
>seconds then Gluster internally would have to re-allocate some
>resources that it freed after the 10 seconds, correct? But apart from
>that there are no negative implications, are there? For instance if I'm
>copying files during the network outage then those files will continue
>copying after those 11 seconds.
>> 
>> This means that the only purpose of ping-timeout is to save those
>extra resources that are used by "short" network outages. Is that
>correct?
>> 
>> If I am confident that my network will not have many 11 second
>outages and if they do occur I am willing to incur those extra costs
>due to resource allocation is there any reason not to set ping-timeout
>to 10 seconds?
>> 
>> The problem I have with a long ping-timeout is that the Windows Samba
>Client disconnects after 25 seconds. So if one of the nodes of a
>Gluster cluster shuts down ungracefully then the Samba Client
>disconnects and the file that was being copied is incomplete on the
>server. These "costs" seem to be much higher than the potential costs
>of those Gluster resource re-allocations. But it is hard to estimate
>because there is no clear documentation of what exactly those Gluster
>costs are.
>> 
>> In general I would be very interested in a comprehensive explanation
>of ping-timeout and the up- and downsides of setting high or low values
>for it.
>> 
>> Kind regards,
>> Omar
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-users
