[Gluster-devel] Re: [bug #19614] System crashes when node fails even with xfr

Brent A Nelson brent at phys.ufl.edu
Fri May 11 16:50:52 UTC 2007


I haven't seen this issue since before patch-164.  If it makes sense to 
you that that patch might have fixed it, you can probably consider the 
failed-reconnect bug (the one requiring kill -9 of glusterfs) closed.  I 
think I would have run across it by now if it were still present.  I just 
ran a quick test, and it did fine (except for the issue discussed in the 
next two paragraphs).

Reconnect has been very nice ever since, except for the other issue I 
described (which has been there since the initial reconnect patches, I 
believe): after a disconnect (say, killing and restarting glusterfsd), the 
client may not reconnect until the next I/O attempt (which is reasonable). 
That next I/O attempt (such as a df or ls) triggers the reconnect, but the 
I/O that triggered it gets an error instead of waiting for the reconnect 
to complete, issuing the request, and returning the valid result.  The 
I/O attempt after that works fine.

So, it seems that when an I/O request arrives while nodes it depends on 
are in a disconnected state, the reconnect should be given a moment to 
succeed before the I/O returns an error.  If the reconnect succeeds, the 
I/O should go ahead and its result should be returned as usual.

Or perhaps there's a better way to handle it?

Thanks,

Brent

On Fri, 11 May 2007, Krishna Srinivas wrote:

> Hi Brent,
>
> Did you see that problem again? What kind of setup
> were you using? I am not sure which part of the code might
> have caused the problem. Further details regarding the setup
> will help.
>
> Thanks
> Krishna
>
> On 5/8/07, Brent A Nelson <brent at phys.ufl.edu> wrote:
>> I just had two nodes go down (not due to GlusterFS).  The nodes were
>> mirrors of each other for multiple GlusterFS filesystems (all unify on top
>> of afr), so the GlusterFS clients were understandably unhappy (one of the
>> filesystems was 100% served by these two nodes, others were only
>> fractionally served by the two nodes).  However, when the two server nodes
>> were brought back up, some of the client glusterfs processes recovered,
>> while others had to be kill -9'ed so the filesystems could be remounted
>> (they were blocking df and ls commands).
>> 
>> I don't know if it's related to the bug below or not, but it looks like
>> client reconnect after failure isn't 100%...
>> 
>> This was from a tla checkout from yesterday.
>> 
>> Thanks,
>> 
>> Brent
>> 
>> On Mon, 7 May 2007, Krishna Srinivas wrote:
>> 
>> > Hi Avati,
>> >
>> > There was a bug - when the 1st node went down, it would cause
>> > a problem. This bug might be the same, though the bug reporter
>> > has not given enough details to confirm. We can move the
>> > bug to unreproducible or fixed state.
>> >
>> > Krishna
>> >
>> > On 5/6/07, Anand Avati <INVALID.NOREPLY at gnu.org> wrote:
>> >>
>> >> Update of bug #19614 (project gluster):
>> >>
>> >>                 Severity:              3 - Normal => 5 - Blocker
>> >>              Assigned to:                    None => krishnasrinivas
>> >>
>> >>     _______________________________________________________
>> >>
>> >> Follow-up Comment #1:
>> >>
>> >> krishna,
>> >>   can you confirm if this bug is still lurking?
>> >>
>> >>     _______________________________________________________
>> >>
>> >> Reply to this item at:
>> >>
>> >>   <http://savannah.nongnu.org/bugs/?19614>
>> >>
>> >> _______________________________________________
>> >>   Message sent via/by Savannah
>> >>   http://savannah.nongnu.org/
>> >>
>> >>
>> >
>> >
>> > _______________________________________________
>> > Gluster-devel mailing list
>> > Gluster-devel at nongnu.org
>> > http://lists.nongnu.org/mailman/listinfo/gluster-devel
>> >
>> 
>




