[Gluster-devel] Re: [bug #19614] System crashes when node fails even with xfr

Anand Avati avati at zresearch.com
Fri May 11 18:58:48 UTC 2007


Brent,
 you have observed the reconnection logic right. This effect has
crept in after introducing the non-blocking tcp connect
functionality, which pushes the connect to the background if it takes
more than N usecs (the current I/O request is returned as failed if
the connect() didn't succeed in that shot). By the time the second
I/O request comes, the connect will have succeeded and the call goes
through.
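
To make that concrete, here is a minimal sketch of the pattern with
plain POSIX sockets -- illustrative only, not the actual GlusterFS
transport code; the timeout argument stands in for the hardcoded
"N usecs":

  #include <sys/socket.h>
  #include <fcntl.h>
  #include <poll.h>
  #include <errno.h>

  /* Returns 0 if connected within timeout_usecs, -1 otherwise.  On
   * timeout the connect keeps completing in the background, so a
   * later I/O request can find the socket already connected. */
  static int
  try_connect_nonblocking (int sock, struct sockaddr *addr,
                           socklen_t len, long timeout_usecs)
  {
          int flags = fcntl (sock, F_GETFL, 0);
          fcntl (sock, F_SETFL, flags | O_NONBLOCK);

          if (connect (sock, addr, len) == 0)
                  return 0;           /* connected in one shot */
          if (errno != EINPROGRESS)
                  return -1;          /* hard failure */

          /* give the connect up to "N usecs" to finish */
          struct pollfd pfd = { .fd = sock, .events = POLLOUT };
          if (poll (&pfd, 1, timeout_usecs / 1000) <= 0)
                  return -1;          /* still in progress: this I/O
                                         fails, the next one succeeds */

          int err = 0;
          socklen_t errlen = sizeof (err);
          getsockopt (sock, SOL_SOCKET, SO_ERROR, &err, &errlen);
          return (err == 0) ? 0 : -1;
  }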

This can be 'fixed' by tuning the "N usecs" value in the transport
code (currently hardcoded, but I want to make it configurable from
the spec file soon). The flip side of making "N" large is that if the
server really is dead for a long time, all I/O on the dead transport
will block for that period, which can accumulate into quite an
inconvenience.
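
Once it is configurable, the spec-file entry might look something
like this (the option name is purely illustrative, since the value is
still hardcoded as of now):

  volume client
    type protocol/client
    option transport-type tcp/client
    option remote-host server1
    option remote-subvolume brick
    # hypothetical knob for the "N usecs" connect timeout:
    option connect-timeout-usecs 500000
  end-volume

A larger value would mean fewer spuriously failed requests after a
server restart, at the price of longer blocking when the server
really is down.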

But then, that's not the end of it. The reconnection logic is being
redesigned so that reconnection happens proactively, as soon as a
connection dies, rather than only when I/O is triggered.
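
The shape of that proactive scheme would be roughly the following --
again just a sketch with illustrative names, using a detached thread
where the real code will use the transport's event/timer machinery:

  #include <pthread.h>
  #include <unistd.h>

  struct transport {
          int   connected;                        /* guarded by a lock
                                                     in real code */
          int (*do_connect) (struct transport *); /* re-dials the peer */
  };

  static void *
  reconnect_thread (void *arg)
  {
          struct transport *trans = arg;

          /* keep retrying in the background until the peer is back,
           * instead of waiting for the next I/O request to trigger
           * the reconnect */
          while (!trans->connected) {
                  if (trans->do_connect (trans) == 0) {
                          trans->connected = 1;
                          break;
                  }
                  sleep (1);    /* back off between attempts */
          }
          return NULL;
  }

  /* called from the event loop the moment a connection is seen dead */
  static void
  on_disconnect (struct transport *trans)
  {
          pthread_t tid;

          trans->connected = 0;
          pthread_create (&tid, NULL, reconnect_thread, trans);
          pthread_detach (tid);
  }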

thanks,
avati


2007/5/11, Brent A Nelson <brent at phys.ufl.edu>:
> I haven't seen this issue since before patch-164.  If it makes sense to
> you that that patch might have fixed it, you can probably consider the
> failed reconnect (requiring kill -9 to glusterfs) closed.  I think I would
> have run across it by now if the issue was still present.  I just tried a
> quick test, and it did fine (except for the issue discussed in the next
> two paragraphs).
>
> Reconnect has been very nice ever since, except for the other issue I
> described (which has been there since the initial reconnect patches, I
> believe): after a disconnect (say, kill and restart glusterfsd), the
> client may not reconnect until the next I/O attempt (which is reasonable).
> That next I/O attempt (such as a df or ls) will trigger a reconnect, but
> the I/O that triggered it will get an error rather than waiting for the
> reconnect to complete, issuing the I/O request, and getting the valid
> result.  The next I/O attempt will work fine.
>
> So, it seems that, if there's an I/O request when nodes that would affect
> the I/O are in a disconnected state, reconnect should be given a moment to
> succeed before returning an error for the I/O.  If the reconnect succeeds,
> go ahead and do the I/O and return the result of it.
>
> Or perhaps there's a better way to handle it?
>
> Thanks,
>
> Brent
>
> On Fri, 11 May 2007, Krishna Srinivas wrote:
>
> > Hi Brent,
> >
> > Did you see that problem again? What kind of setup were
> > you using? I am not sure which part of the code might
> > have caused the problem. Further details regarding the setup
> > will help.
> >
> > Thanks
> > Krishna
> >
> > On 5/8/07, Brent A Nelson <brent at phys.ufl.edu> wrote:
> >> I just had two nodes go down (not due to GlusterFS).  The nodes were
> >> mirrors of each other for multiple GlusterFS filesystems (all unify on top
> >> of afr), so the GlusterFS clients were understandably unhappy (one of the
> >> filesystems was 100% served by these two nodes, others were only
> >> fractionally served by the two nodes).  However, when the two server nodes
> >> were brought back up, some of the client glusterfs processes recovered,
> >> while others had to be kill -9'ed so the filesystems could be remounted
> >> (they were blocking df and ls commands).
> >>
> >> I don't know if it's related to the bug below or not, but it looks like
> >> client reconnect after failure isn't 100%...
> >>
> >> This was from a tla checkout from yesterday.
> >>
> >> Thanks,
> >>
> >> Brent
> >>
> >> On Mon, 7 May 2007, Krishna Srinivas wrote:
> >>
> >> > Hi Avati,
> >> >
> >> > There was a bug - when the 1st node went down, it would cause
> >> > problems. This bug might be the same; the bug reporter has
> >> > not given enough details to confirm, though. We can move the
> >> > bug to the unreproducible or fixed state.
> >> >
> >> > Krishna
> >> >
> >> > On 5/6/07, Anand Avati <INVALID.NOREPLY at gnu.org> wrote:
> >> >>
> >> >> Update of bug #19614 (project gluster):
> >> >>
> >> >>                 Severity:              3 - Normal => 5 - Blocker
> >> >>              Assigned to:                    None => krishnasrinivas
> >> >>
> >> >>     _______________________________________________________
> >> >>
> >> >> Follow-up Comment #1:
> >> >>
> >> >> krishna,
> >> >>   can you confirm if this bug is still lurking?
> >> >>
> >> >>     _______________________________________________________
> >> >>
> >> >> Reply to this item at:
> >> >>
> >> >>   <http://savannah.nongnu.org/bugs/?19614>
> >> >>
> >> >> _______________________________________________
> >> >>   Message sent via/by Savannah
> >> >>   http://savannah.nongnu.org/
> >> >>
> >> >>
> >> >
> >> >
> >> > _______________________________________________
> >> > Gluster-devel mailing list
> >> > Gluster-devel at nongnu.org
> >> > http://lists.nongnu.org/mailman/listinfo/gluster-devel
> >> >
> >>
> >
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>


-- 
Anand V. Avati




