[Gluster-devel] Re: [bug #19614] System crashes when node fails even with xfr
Brent A Nelson
brent at phys.ufl.edu
Fri May 11 19:37:54 UTC 2007
On Sat, 12 May 2007, Anand Avati wrote:
> Brent,
> you have observed the reconnection logic correctly. This effect
> crept in after we introduced the non-blocking TCP connect
> functionality, which pushes the connect to the background if it takes
> more than N usecs (the current I/O request is returned as failed if
> the connect() didn't succeed in that attempt). By the time the second
> I/O request comes, the connect will have succeeded and the call goes
> through.
>
> This can be 'fixed' by tuning the "N usecs" (currently hardcoded in
> the transport code, but I want to make it configurable from the spec
> soon). But the flip side of making this "N" large is that if the
> server is really dead for a long time, all I/O on the dead transport
> will be blocked for that period, which can accumulate into quite an
> unpleasant experience.
>
Cool. I agree that the time should be quite short (in case nodes are
still down, that gives you access to what is available without a delay for
each and every request), but it would be nice if it waited a minimal
period for a reconnect to work. Making it user-configurable would be nice. It
would help in my mysterious disconnect case (where all machines are
running fine, it's just that the client/server briefly disconnect,
disrupting the current I/O). It could also help on bad network links.
It's probably not that important in real disconnect cases, though, where a
machine may be down or rebooting.
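
To make the trade-off concrete, here is a rough Python sketch of the idea as I understand it: start a non-blocking connect, wait at most a short grace period for it to finish, and otherwise leave it pending in the background. This is not the actual GlusterFS transport code; `connect_with_grace` and `grace_usecs` are made-up names for illustration only.

```python
import errno
import select
import socket


def connect_with_grace(addr, grace_usecs):
    """Start a non-blocking TCP connect and wait at most grace_usecs for it.

    Hypothetical helper, not part of GlusterFS. Returns (sock, connected):
      (sock, True)   the connect finished within the grace period
      (sock, False)  the connect is still pending in the background
      (None, False)  the connect failed outright
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)

    # connect_ex() returns an errno value instead of raising.
    err = sock.connect_ex(addr)
    if err == 0:
        return sock, True
    if err not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        return None, False

    # A pending connect becomes writable once it succeeds or fails.
    # select() takes its timeout in seconds, so convert from microseconds.
    _, writable, _ = select.select([], [sock], [], grace_usecs / 1_000_000)
    if not writable:
        # Grace period expired; the caller fails this I/O request and
        # can check the same socket again on the next one.
        return sock, False

    # SO_ERROR tells us whether the finished connect actually succeeded.
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        sock.close()
        return None, False
    return sock, True
```

With a small `grace_usecs`, a dead server costs each request almost nothing, but a brief disconnect fails the first I/O request. With a larger value, the first request after a brief disconnect has a chance to go through, at the price of blocking that long whenever the server really is down.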
> But then, that's not the end of it. The reconnection logic is being
> redesigned so that reconnection is done proactively when a connection
> dies (not when I/O is triggered).
Sounds good. Maybe both could work together?
Thanks,
Brent