[Gluster-devel] fail-over taking too long when a node reboots

Wed Jul 27 11:49:15 UTC 2016

On Wed, Jul 27, 2016 at 12:40:58PM +0530, Pranith Kumar Karampuri wrote:
> Hi,
>      Does anyone have a complete understanding of the keepalive timeout
> vs. the TCP User Timeout (UTO) options? For both afr and EC, when the
> server reboots it takes 42 seconds for the fops to fail with ENOTCONN
> (saved_frames_unwind()). I am wondering if there is any way to reduce this
> time by tuning these two options. As per our earlier research on this
> (I think it was kp who did that), keepalive was not getting triggered when
> there were fops in progress, and he saw quite a few game-dev forums talk
> about this problem too. It seems there is a newer timeout, the TCP User
> Timeout, which addresses this. I am wondering if any of you have
> experience with this and can suggest more meaningful defaults for these
> timeouts. I think at the moment the default is 42 seconds.

http://review.gluster.org/8065 might be related? More details are in
https://www.gluster.org/pipermail/gluster-devel/2014-May/040755.html and
https://bugzilla.redhat.com/show_bug.cgi?id=1129787
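
For anyone who wants to experiment with the two mechanisms side by side,
here is a minimal sketch, assuming a Linux host and Python 3.6 or later. It
is not Gluster code. The helper name configure_timeouts and the numbers
(10 s idle, 2 s interval, 3 probes) are made up for illustration and are not
suggested defaults. Keepalive probes are only sent on an idle connection,
while TCP_USER_TIMEOUT bounds how long transmitted data may stay
unacknowledged, which is the case with fops in flight.

```python
import socket


def configure_timeouts(sock, idle_s=10, interval_s=2, probes=3,
                       user_timeout_ms=None):
    """Enable TCP keepalive and TCP_USER_TIMEOUT on a TCP socket (Linux).

    All default values are illustrative placeholders, not recommendations.
    Returns the user timeout (in milliseconds) that was applied.
    """
    # Keepalive: after idle_s seconds without traffic, send a probe every
    # interval_s seconds and give up after `probes` unanswered probes.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    # Assumption: unless told otherwise, match the user timeout to the
    # worst-case keepalive detection time so both paths fail together.
    if user_timeout_ms is None:
        user_timeout_ms = (idle_s + interval_s * probes) * 1000

    # User timeout: maximum time sent data may remain unacknowledged
    # before the kernel drops the connection.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT,
                    user_timeout_ms)
    return user_timeout_ms
```

Calling configure_timeouts(sock) on a freshly created TCP socket, before
connect(), applies all of the options. The values can be read back with
getsockopt() to confirm what the kernel accepted.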

HTH,
Niels