[Gluster-devel] Upstream smoke test failures

Poornima Gurusiddaiah pgurusid at redhat.com
Tue Nov 22 11:48:51 UTC 2016


Hi, 

I have posted a fix for the hang in read: http://review.gluster.org/15901 
I think it will fix the issue reported here. Please check the commit message of the patch 
for more details. 

Regards, 
Poornima 
----- Original Message -----

> From: "Nithya Balachandran" <nbalacha at redhat.com>
> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Tuesday, November 22, 2016 3:23:59 AM
> Subject: Re: [Gluster-devel] Upstream smoke test failures

> On 22 November 2016 at 13:09, Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:

> > ----- Original Message -----
> > > From: "Vijay Bellur" <vbellur at redhat.com>
> > > To: "Nithya Balachandran" <nbalacha at redhat.com>
> > > Cc: "Gluster Devel" <gluster-devel at gluster.org>
> > > Sent: Wednesday, November 16, 2016 9:41:12 AM
> > > Subject: Re: [Gluster-devel] Upstream smoke test failures
> > >
> > > On Tue, Nov 15, 2016 at 8:40 AM, Nithya Balachandran
> > > <nbalacha at redhat.com> wrote:
> > > >
> > > > On 15 November 2016 at 18:55, Vijay Bellur <vbellur at redhat.com> wrote:
> > > >>
> > > >> On Mon, Nov 14, 2016 at 10:34 PM, Nithya Balachandran
> > > >> <nbalacha at redhat.com> wrote:
> > > >> >
> > > >> > On 14 November 2016 at 21:38, Vijay Bellur <vbellur at redhat.com> wrote:
> > > >> >>
> > > >> >> I would prefer that we disable dbench only if we have an owner for
> > > >> >> fixing the problem and re-enabling it as part of smoke tests. Running
> > > >> >> dbench seamlessly on gluster has worked for a long while and if it is
> > > >> >> failing today, we need to address this regression asap.
> > > >> >>
> > > >> >> Does anybody have more context or clues on why dbench is failing now?
> > > >> >>
> > > >> > While I agree that it needs to be looked at asap, leaving it in until
> > > >> > we get an owner seems rather pointless, as all it does is hold up
> > > >> > various patches and waste machine time. Re-triggering it multiple
> > > >> > times so that it eventually passes does not add anything to the
> > > >> > regression test process or validate the patch, as we know there is a
> > > >> > problem.
> > > >> >
> > > >> > I would vote for removing it and assigning someone to look at it
> > > >> > immediately.
> > > >> >
> > > >>
> > > >> From the debugging done so far, can we identify an owner to whom this
> > > >> can be assigned? I looked around for related discussions and could
> > > >> figure out that we are looking to get statedumps. Do we have more
> > > >> information/context beyond this?
> > > >>
> > > > I have updated the BZ
> > > > (https://bugzilla.redhat.com/show_bug.cgi?id=1379228) with info from
> > > > the last failure - looks like hangs in write-behind and read-ahead.
> > > >
> > >
> > > I spent some time on this today and it does look like write-behind is
> > > absorbing READs without performing any WIND/UNWIND actions. I have
> > > attached a statedump from a slave that had the dbench problem (thanks,
> > > Nigel!) to the above bug.
> > >
> > > Snip from statedump:
> > >
> > > [global.callpool.stack.2]
> > > stack=0x7fd970002cdc
> > > uid=0
> > > gid=0
> > > pid=31884
> > > unique=37870
> > > lk-owner=0000000000000000
> > > op=READ
> > > type=1
> > > cnt=2
> > >
> > > [global.callpool.stack.2.frame.1]
> > > frame=0x7fd9700036ac
> > > ref_count=0
> > > translator=patchy-read-ahead
> > > complete=0
> > > parent=patchy-readdir-ahead
> > > wind_from=ra_page_fault
> > > wind_to=FIRST_CHILD (fault_frame->this)->fops->readv
> > > unwind_to=ra_fault_cbk
> > >
> > > [global.callpool.stack.2.frame.2]
> > > frame=0x7fd97000346c
> > > ref_count=1
> > > translator=patchy-readdir-ahead
> > > complete=0
> > >
> > > Note that the frame which was wound from ra_page_fault() to
> > > write-behind is not yet complete and write-behind has not progressed
> > > the call. There are several callstacks with a similar signature in
> > > statedump.
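
For context, the fields above come from glusterfs's frame bookkeeping: wind_from/wind_to/unwind_to are recorded when a frame is wound to a child xlator, and complete flips to 1 only when the corresponding callback unwinds it. A minimal, hedged sketch of that pattern follows (illustrative only, not the real read-ahead code; the _sketch names and parameters are assumptions):

    /* Illustrative sketch only -- not the actual performance/read-ahead
     * source.  It shows the wind/unwind pair behind the statedump fields
     * above: wind_from/wind_to/unwind_to are recorded when the frame is
     * wound, and complete stays 0 until the callback unwinds it. */

    #include "glusterfs.h"
    #include "xlator.h"

    static int32_t
    ra_fault_cbk_sketch (call_frame_t *frame, void *cookie, xlator_t *this,
                         int32_t op_ret, int32_t op_errno,
                         struct iovec *vector, int32_t count,
                         struct iatt *stbuf, struct iobref *iobref,
                         dict_t *xdata)
    {
            /* Unwinding here is what would mark the frame complete=1 in a
             * statedump; a READ that hangs below never reaches this point. */
            STACK_UNWIND_STRICT (readv, frame, op_ret, op_errno,
                                 vector, count, stbuf, iobref, xdata);
            return 0;
    }

    static void
    ra_page_fault_sketch (call_frame_t *fault_frame, fd_t *fd,
                          size_t size, off_t offset)
    {
            /* Wind the READ to the xlator below read-ahead (write-behind in
             * the graph discussed here).  The frame stays at complete=0
             * until ra_fault_cbk_sketch unwinds it. */
            STACK_WIND (fault_frame, ra_fault_cbk_sketch,
                        FIRST_CHILD (fault_frame->this),
                        FIRST_CHILD (fault_frame->this)->fops->readv,
                        fd, size, offset, 0, NULL);
    }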

> > I think the culprit here is read-ahead, not write-behind. If the read fop
> > had been dropped in write-behind, we should've seen a frame associated with
> > write-behind (complete=0 for a frame associated with an xlator indicates
> > the frame was not unwound from _that_ xlator), but I didn't see any. The
> > empty request queues in wb_inode also corroborate this hypothesis.
> 
> We have seen both. See comment #17 in
> https://bugzilla.redhat.com/show_bug.cgi?id=1379228.

> regards,
> Nithya

> > Karthick Subrahmanya is working on a similar issue reported by a user.
> > However, we've not made much progress till now.
> 

> > >
> > > In write-behind's readv implementation, we stub READ fops and enqueue
> > > them in the relevant inode context. Once enqueued, the stub resumes
> > > when the appropriate set of conditions is met in write-behind. This is
> > > not happening now, and I am not certain whether:
> > >
> > > - READ fops are languishing in a queue and not being resumed, or
> > > - READ fops are prematurely dropped from a queue without being wound
> > >   or unwound.
> > >
> > > When I gdb'd into the client process and examined the inode contexts
> > > for write-behind, I found all queues to be empty. This seems to
> > > indicate that the latter reason is more plausible, but I have not yet
> > > found a code path to account for this possibility.
> > >
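
To make the stub-and-enqueue flow described above concrete, here is a minimal sketch under assumed names (wb_inode_sketch_t, its todo list and the _sketch helpers are inventions for illustration, not the actual write-behind source):

    /* Illustrative sketch of "stub the READ and enqueue it in the inode
     * context"; not the real write-behind implementation.  The struct and
     * helper names below are assumptions made for this example. */

    #include <errno.h>

    #include "xlator.h"
    #include "call-stub.h"

    typedef struct {
            gf_lock_t        lock;
            struct list_head todo;  /* stubbed fops waiting to be resumed */
    } wb_inode_sketch_t;

    static int32_t
    wb_readv_helper_sketch (call_frame_t *frame, xlator_t *this, fd_t *fd,
                            size_t size, off_t offset, uint32_t flags,
                            dict_t *xdata)
    {
            /* Runs only when the parked stub is resumed: simply wind the
             * READ down to the next xlator. */
            STACK_WIND (frame, default_readv_cbk, FIRST_CHILD (this),
                        FIRST_CHILD (this)->fops->readv,
                        fd, size, offset, flags, xdata);
            return 0;
    }

    static int32_t
    wb_readv_sketch (call_frame_t *frame, xlator_t *this, fd_t *fd,
                     size_t size, off_t offset, uint32_t flags,
                     dict_t *xdata, wb_inode_sketch_t *wb_inode)
    {
            call_stub_t *stub = NULL;

            /* Park the READ as a stub in the inode context. */
            stub = fop_readv_stub (frame, wb_readv_helper_sketch,
                                   fd, size, offset, flags, xdata);
            if (!stub) {
                    STACK_UNWIND_STRICT (readv, frame, -1, ENOMEM,
                                         NULL, 0, NULL, NULL, NULL);
                    return 0;
            }

            LOCK (&wb_inode->lock);
            {
                    list_add_tail (&stub->list, &wb_inode->todo);
            }
            UNLOCK (&wb_inode->lock);

            /* Later, once pending writes allow it, each queued stub must be
             * taken off the list and call_resume()'d.  A stub that leaves
             * the queue without being resumed (or unwound with an error)
             * leaves the application's read hanging, which matches the
             * statedump signature seen here. */
            return 0;
    }
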
> > > One approach to proceed further is to add more logs in write-behind to
> > > get a better understanding of the problem. I will try that out
> > > sometime later this week. We are also considering disabling
> > > write-behind for smoke tests in the interim after a trial run (with
> > > write-behind disabled) later in the day.
> > >
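
If the logging route is taken, the instrumentation meant here might look roughly like the fragment below (a sketch only; wb_inode->todo is the assumed queue name from the earlier sketch, and the messages and fields are placeholders):

    /* Fragment to drop into the enqueue path (sketch): */
    gf_log (this->name, GF_LOG_INFO,
            "READ enqueued: gfid=%s offset=%lld size=%zu",
            uuid_utoa (fd->inode->gfid), (long long) offset, size);

    /* ... and into the resume path, just before resuming the stub: */
    gf_log (this->name, GF_LOG_INFO,
            "READ resumed: gfid=%s todo-empty=%d",
            uuid_utoa (fd->inode->gfid),
            list_empty (&wb_inode->todo));
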
> > > Thanks,
> > > Vijay
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel at gluster.org
> > > http://www.gluster.org/mailman/listinfo/gluster-devel
> > >
> 

> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel