[Gluster-devel] Re: [bug #19614] System crashes when node fails even with xfr

Fri May 11 19:37:54 UTC 2007

On Sat, 12 May 2007, Anand Avati wrote:

> Brent,
> you have observed the reconnection logic correctly. This effect
> 'crept in' after introducing the non-blocking TCP connect
> functionality, which pushes the connect to the background if it takes
> more than N usecs (the current I/O request is returned as failed if the
> connect() didn't succeed in that shot). By the time the second I/O
> request comes, the connect will have succeeded and the call goes
> through.
>
> This can be 'fixed' by tuning the "N usecs" (currently hardcoded in
> the transport code, but I want to make it configurable from the spec
> soon). But the flip side of making this "N" large is that if the
> server is really dead for a long time, all I/O on the dead transport
> will be blocked for that period, which can accumulate into quite a
> bad experience.
>

Cool.  I agree that the time should be quite short (in case nodes are 
still down, that gives you access to what is available without a delay for 
each and every request), but it would be nice if it waited a minimal 
period for a reconnect to work.  User-configurable would be nice.  It 
would help in my mysterious disconnect case (where all machines are 
running fine, it's just that the client/server briefly disconnect, 
disrupting the current I/O).  It could also help on bad network links. 
It's probably not that important in real disconnect cases, though, where a 
machine may be down or rebooting.
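
To make the trade-off concrete, here is a minimal sketch of a
non-blocking connect with a wait budget. It is not the GlusterFS
transport code; the function name connect_with_budget and the parameter
budget_usecs (standing in for the "N usecs" above) are hypothetical, and
it assumes a POSIX-style system. A small budget returns "pending" quickly
and leaves the connect running in the background; a large one holds the
caller for up to that long when the server is really gone.

```python
import errno
import select
import socket


def connect_with_budget(host, port, budget_usecs):
    """Start a non-blocking TCP connect and wait at most budget_usecs.

    Returns (sock, state), where state is one of:
      "connected" - the connect finished within the budget
      "pending"   - still in progress; the caller fails the current
                    request and checks the socket again later
      "failed"    - the connect was refused or errored; sock is None
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)

    # connect_ex returns an errno value instead of raising.
    err = sock.connect_ex((host, port))
    if err == 0:
        return sock, "connected"
    if err not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        return None, "failed"

    # The socket becomes writable once the connect attempt resolves,
    # whether it succeeded or failed.
    _, writable, _ = select.select([], [sock], [], budget_usecs / 1_000_000)
    if not writable:
        return sock, "pending"

    # SO_ERROR holds the final result of the asynchronous connect.
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        sock.close()
        return None, "failed"
    return sock, "connected"
```

With something like this, a brief disconnect would cost the caller at
most budget_usecs on the first request rather than an outright failure,
as long as the server comes back within that window.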

> But then, that's not the end of it. The reconnection logic is being
> redesigned so that reconnection is done proactively when a connection
> dies (not when I/O is triggered).

Sounds good.  Maybe both could work together?

Thanks,

Brent