[Gluster-users] Exact purpose of network.ping-timeout

Wed Jan 10 06:28:05 UTC 2018

Can this get into an 'FAQ' document somewhere? This is one of the major
questions asked all the time.

Regards,
Amar

On Wed, Jan 10, 2018 at 10:56 AM, Raghavendra Gowdappa <rgowdapp at redhat.com>
wrote:

> Sorry about the delayed response. Had to dig into the history to answer
> various "why"s.
>
> ----- Original Message -----
> > From: "Omar Kohl" <omar.kohl at iternity.com>
> > To: gluster-users at gluster.org
> > Sent: Tuesday, December 26, 2017 6:41:48 PM
> > Subject: [Gluster-users] Exact purpose of network.ping-timeout
> >
> > Hi,
> >
> > I have a question regarding the "ping-timeout" option. I have been
> > researching its purpose for a few days and it is still not completely
> > clear to me, especially since the Gluster community apparently strongly
> > discourages changing, or at least decreasing, this value!
> >
> > Assuming that I set ping-timeout to 10 seconds (instead of the default
> > 42) this would mean that if I have a network outage of 11 seconds then
> > Gluster internally would have to re-allocate some resources that it
> > freed after the 10 seconds, correct? But apart from that there are no
> > negative implications, are there? For instance if I'm copying files
> > during the network outage then those files will continue copying after
> > those 11 seconds.
> >
> > This means that the only purpose of ping-timeout is to save those extra
> > resources that are used by "short" network outages. Is that correct?
>
> The basic purpose of the ping-timer/heartbeat is to identify an
> unresponsive brick. Unresponsiveness can have various causes, such as:
> * A deadlocked server. We no longer see many instances of deadlocked
> bricks/servers.
> * Slow execution of fops in the brick stack, e.g.
>     - due to lock contention. There have been some efforts to fix the lock
> contention on the brick stack.
>     - a bad backend OS/filesystem. The Posix health checker was an effort
> to fix this.
>     - not enough threads for execution, etc.
>   Note that ideally it's not the job of the ping framework to identify this
> scenario. Following the same reasoning, we've shielded the processing of
> ping requests on bricks from the cost of executing requests to the
> Glusterfs Program.
>
> * Ungraceful shutdown of network connections, e.g.
>     - hard shutdown of the machine/container/VM running the brick
>     - physically pulling out the network cable
>   Basically all those scenarios where TCP/IP doesn't get a chance to
> inform the other end that it is going down. Note that some of the
> scenarios of ungraceful network shutdown can be identified using
> TCP_KEEPALIVE and TCP_USER_TIMEOUT [1]. However, at the time when the
> heartbeat mechanism was introduced in Glusterfs, TCP_KEEPALIVE couldn't
> identify all the ungraceful network shutdown scenarios and
> TCP_USER_TIMEOUT was yet to be implemented in the Linux kernel. One
> scenario which TCP_KEEPALIVE couldn't identify was the exact scenario
> TCP_USER_TIMEOUT aims to solve - identifying a hard network shutdown when
> data is in transit. However, there might be other limitations in
> TCP_KEEPALIVE which we need to test out before retiring the heartbeat
> mechanism in favor of TCP_KEEPALIVE and TCP_USER_TIMEOUT.
>
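As a rough illustration of the two kernel mechanisms mentioned above, the
sketch below enables keepalive probing and TCP_USER_TIMEOUT on a socket from
Python. The helper name and all numeric values are made up for this example
and are not Gluster settings. On Linux the keepalive switch is spelled
SO_KEEPALIVE, and any option the platform lacks is simply skipped.

```python
import socket


def enable_dead_peer_detection(sock, idle_s=20, interval_s=2, probes=9,
                               user_timeout_ms=42000):
    """Ask the kernel to notice a silently dead TCP peer.

    Hypothetical helper; the default values are illustrative only.
    Returns the option values read back from the socket.
    """
    # Keepalive probes are sent only while the connection is otherwise idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {
        "SO_KEEPALIVE": sock.getsockopt(socket.SOL_SOCKET,
                                        socket.SO_KEEPALIVE),
    }

    tcp_options = {
        # Idle seconds before the first keepalive probe.
        "TCP_KEEPIDLE": idle_s,
        # Seconds between successive probes.
        "TCP_KEEPINTVL": interval_s,
        # Unanswered probes before the connection is dropped.
        "TCP_KEEPCNT": probes,
        # Milliseconds that transmitted data may stay unacknowledged
        # before the kernel closes the connection.
        "TCP_USER_TIMEOUT": user_timeout_ms,
    }
    for name, value in tcp_options.items():
        option = getattr(socket, name, None)
        if option is None:
            continue  # not available on this platform
        sock.setsockopt(socket.IPPROTO_TCP, option, value)
        applied[name] = sock.getsockopt(socket.IPPROTO_TCP, option)
    return applied
```

Calling the helper on a freshly created TCP socket returns a dict of the
options the kernel accepted, which makes it easy to see what is supported on
a given machine.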
> The next interesting question would be why we need to identify an
> unresponsive brick. The reasons are:
> * To replace/fix any problems the brick might have
> * Almost all of the cluster translators - DHT, AFR, EC - wait for a
> response from all of their children - either success or failure - before
> sending the response back to the application. This means one or more
> slow/unresponsive bricks can increase the latencies of fops/syscalls even
> though the other bricks are responsive and healthy. However, there are
> ongoing efforts to minimize the effect of a few slow/unresponsive bricks
> [2]. I think the principles of [2] can be applied to DHT and AFR too.
>
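The latency effect described above can be sketched with a toy model. It
assumes a translator that answers only after every child has replied, and it
treats a hung child (marked None) as failing exactly when the ping timer
expires, which is a simplification of the real timing. The function name and
the sample latencies are invented for illustration.

```python
def fop_latency_wait_for_all(child_latencies_s, ping_timeout_s=42):
    """Latency seen by the application when all children must answer.

    A latency of None stands for a child that never replies; in this
    simplified model its fop fails once ping_timeout_s has elapsed.
    """
    effective = [
        ping_timeout_s if latency is None else latency
        for latency in child_latencies_s
    ]
    # The slowest child decides when the response can be sent back.
    return max(effective)


if __name__ == "__main__":
    print(fop_latency_wait_for_all([0.004, 0.005, 0.006]))  # all healthy
    print(fop_latency_wait_for_all([0.004, 0.005, 2.5]))    # one slow brick
    print(fop_latency_wait_for_all([0.004, 0.005, None]))   # one hung brick
```

In this model a single hung brick stretches the fop to the full
ping-timeout, which is the cost of a high value, while the healthy bricks
contribute nothing to the delay.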
> Some recent discussions on the necessity of ping framework in glusterfs
> can be found at [3].
>
> Having given all the above reasons for the existence of the ping framework,
> it's also important that the ping framework doesn't bring down an otherwise
> healthy connection (false positives). The reasons are:
> * As pointed out by Joe Julian in another mail on this thread, each
> connection carries some state on bricks, like locks/open-fds, which is
> cleaned up on a disconnect. So, disconnects (even those followed by quick
> reconnects) are not completely transparent to the application. Though the
> presence of HA layers like EC/AFR mitigates this problem to some extent,
> we still don't have a lock healing implementation in place. So, once a
> quorum number of AFR/EC children have gone down (though maybe not all at
> once), locks are no longer held on bricks.
> * All the fops that are in transit in the time window starting from the
> time of disconnect till a successful reconnect are failed by rpc/transport
> layer. So, based on the configuration of volumes (whether AFR/EC/DHT
> prevent these errors from being seen by application), this *may* result in
> application seeing the error.
>
> IOW, disconnects are not lightweight and we need to avoid them whenever
> possible. Since the action on ping-timer expiry is to disconnect the
> connection, we suggest not setting very low values, to avoid spurious
> disconnections.
>
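To make the trade-off concrete, here is a toy heartbeat model applied to the
numbers from the original question (an 11 second outage, a timeout of 10
versus the default 42). It is not Gluster's rpc code: the class, the
function and the one-second polling step are all assumptions made for this
sketch.

```python
class PingTimer:
    """Toy heartbeat: disconnect when no pong arrives within timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_pong_s = 0
        self.connected = True

    def on_pong(self, now_s):
        # A pong on a live connection restarts the timer.
        if self.connected:
            self.last_pong_s = now_s

    def tick(self, now_s):
        # Expiry tears the connection down; it is not restored here.
        if self.connected and now_s - self.last_pong_s >= self.timeout_s:
            self.connected = False
        return self.connected


def survives_outage(timeout_s, outage_s, step_s=1):
    """Return True if the connection outlives an outage of outage_s."""
    timer = PingTimer(timeout_s)
    now_s = 0
    while now_s <= outage_s:
        # No pongs arrive while the network is down.
        if not timer.tick(now_s):
            return False
        now_s += step_s
    timer.on_pong(now_s)  # network is back, the next pong gets through
    return timer.connected


if __name__ == "__main__":
    print(survives_outage(timeout_s=42, outage_s=11))
    print(survives_outage(timeout_s=10, outage_s=11))
```

With a timeout of 42 the outage passes unnoticed, while with 10 the
connection is torn down, and with it the locks, open fds and in-flight fops
described above.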
> [1] http://man7.org/linux/man-pages/man7/tcp.7.html
> [2] https://github.com/gluster/glusterfs/issues/366
> [3] http://lists.gluster.org/pipermail/gluster-devel/2017-January/051938.html
>
> >
> > If I am confident that my network will not have many 11 second outages
> > and if they do occur I am willing to incur those extra costs due to
> > resource allocation is there any reason not to set ping-timeout to 10
> > seconds?
> >
> > The problem I have with a long ping-timeout is that the Windows Samba
> > Client disconnects after 25 seconds. So if one of the nodes of a Gluster
> > cluster shuts down ungracefully then the Samba Client disconnects and the
> > file that was being copied is incomplete on the server. These "costs"
> > seem to be much higher than the potential costs of those Gluster resource
> > re-allocations. But it is hard to estimate because there is no clear
> > documentation of what exactly those Gluster costs are.
> >
> > In general I would be very interested in a comprehensive explanation of
> > ping-timeout and the up- and downsides of setting high or low values for
> > it.
> >
> > Kind regards,
> > Omar
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://lists.gluster.org/mailman/listinfo/gluster-users
> >

-- 
Amar Tumballi (amarts)