[Gluster-devel] bug-857330/normal.t failure
Kaushal M
kshlmster at gmail.com
Thu May 22 12:34:29 UTC 2014
Thanks Justin, I found the problem. The VM can be deleted now.
It turns out there was more than enough time for the rebalance to complete,
but we hit a race that caused a command to fail.
The particular test that failed waits for the rebalance to finish. It does
this by running a 'gluster volume rebalance <> status' command and checking
the result. The EXPECT_WITHIN function re-runs this command until we get a
match, the command fails, or the timeout expires.
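To make that mechanism concrete, here is a rough sketch of how such a polling
check works. This is only an illustrative approximation of EXPECT_WITHIN (the
real helper lives in tests/include.rc and differs in detail); the
wait_for_rebalance name, the one-second poll interval and the "patchy" default
volume name are my own assumptions:

#!/bin/bash
# Illustrative approximation of the EXPECT_WITHIN-style check used by the
# test; not the actual implementation from tests/include.rc.
V0=${V0:-patchy}   # volume name; "patchy" is assumed as the test default

wait_for_rebalance () {
    local timeout=$1 pattern=$2 volume=$3
    local end=$((SECONDS + timeout))
    while [ "$SECONDS" -lt "$end" ]; do
        local out
        # If the status command itself fails (as it did in the race
        # described below), give up instead of retrying.
        out=$(gluster volume rebalance "$volume" status) || return 1
        echo "$out" | grep -q "$pattern" && return 0   # pattern matched
        sleep 1
    done
    return 1   # timed out without a match
}

wait_for_rebalance 300 "completed" "$V0"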
For a rebalance status command, glusterd sends a request to the rebalance
process (as a brick_op) to get the latest stats. It did the same in this
case as well. But while glusterd was waiting for the reply, the rebalance
completed and the process stopped itself. This closed the rpc connection
between glusterd and the rebalance process, which caused all pending
requests to be unwound as failures, which in turn led to the command
failing.
I cannot think of a way to avoid this race from within glusterd. For this
particular test, we could avoid the 'rebalance status' command by checking
the rebalance process state directly, using its pid etc. I don't
particularly like that approach, as I think I used the 'rebalance status'
command for a reason. But I currently cannot recall that reason, and if I
cannot come up with it soon, I wouldn't mind changing the test to avoid
'rebalance status'.
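For reference, a pid-based check could look roughly like the sketch below.
The pid file path is an assumption on my part (glusterd keeps per-volume
rebalance state under its working directory, but the exact layout may
differ between versions), so treat this as an illustration of the idea
rather than a drop-in replacement for the test:

#!/bin/bash
# Sketch: wait for rebalance by watching its process directly instead of
# asking glusterd via 'rebalance status'.
# NOTE: the pid file location below is assumed, not verified.
V0=${V0:-patchy}
PIDFILE="/var/lib/glusterd/vols/$V0/rebalance.pid"

rebalance_running () {
    [ -f "$PIDFILE" ] || return 1        # no pid file -> not running
    local pid
    pid=$(cat "$PIDFILE")
    kill -0 "$pid" 2>/dev/null           # is the process still alive?
}

# Wait up to 300 seconds for the rebalance process to exit on its own.
for _ in $(seq 1 300); do
    rebalance_running || break
    sleep 1
done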
~kaushal
On Thu, May 22, 2014 at 5:22 PM, Justin Clift <justin at gluster.org> wrote:
> On 22/05/2014, at 12:32 PM, Kaushal M wrote:
> > I haven't yet. But I will.
> >
> > Justin,
> > Can I take a peek inside the VM?
>
> Sure.
>
> IP: 23.253.57.20
> User: root
> Password: foobar123
>
> The stdout log from the regression test is in /tmp/regression.log.
>
> The GlusterFS git repo is in /root/glusterfs. Um, you should be
> able to find everything else pretty easily.
>
> Btw, this is just a temp VM, so feel free to do anything you want
> with it. When you're finished with it let me know so I can delete
> it. :)
>
> + Justin
>
>
> > ~kaushal
> >
> >
> > On Thu, May 22, 2014 at 4:53 PM, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
> > Kaushal,
> > The rebalance status command seems to be failing sometimes. I sent a mail
> > about such a spurious failure earlier today. Did you get a chance to look
> > at the logs and confirm that the rebalance didn't fail and it is indeed a
> > timeout?
> >
> > Pranith
> > ----- Original Message -----
> > > From: "Kaushal M" <kshlmster at gmail.com>
> > > To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > > Cc: "Justin Clift" <justin at gluster.org>, "Gluster Devel" <
> gluster-devel at gluster.org>
> > > Sent: Thursday, May 22, 2014 4:40:25 PM
> > > Subject: Re: [Gluster-devel] bug-857330/normal.t failure
> > >
> > > The test is waiting for rebalance to finish. This is a rebalance with
> some
> > > actual data so it could have taken a long time to finish. I did set a
> > > pretty high timeout, but it seems like it's not enough for the new VMs.
> > >
> > > Possible options are,
> > > - Increase this timeout further
> > > - Reduce the amount of data. Currently this is 100 directories with 10
> > > files each of size between 10-500KB
> > >
> > > ~kaushal
> > >
> > >
> > > On Thu, May 22, 2014 at 3:59 PM, Pranith Kumar Karampuri <
> > > pkarampu at redhat.com> wrote:
> > >
> > > > Kaushal, who has more context about these, is CCed. Keep the setup
> > > > until he responds so that he can take a look.
> > > >
> > > > Pranith
> > > > ----- Original Message -----
> > > > > From: "Justin Clift" <justin at gluster.org>
> > > > > To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > > > > Cc: "Gluster Devel" <gluster-devel at gluster.org>
> > > > > Sent: Thursday, May 22, 2014 3:54:46 PM
> > > > > Subject: bug-857330/normal.t failure
> > > > >
> > > > > Hi Pranith,
> > > > >
> > > > > Ran a few VMs with your Gerrit CR 7835 applied, and in "DEBUG"
> > > > > mode (I think).
> > > > >
> > > > > One of the VMs had a failure in bug-857330/normal.t:
> > > > >
> > > > > Test Summary Report
> > > > > -------------------
> > > > > ./tests/basic/rpm.t                (Wstat: 0 Tests: 0 Failed: 0)
> > > > >   Parse errors: Bad plan. You planned 8 tests but ran 0.
> > > > > ./tests/bugs/bug-857330/normal.t   (Wstat: 0 Tests: 24 Failed: 1)
> > > > >   Failed test: 13
> > > > > Files=230, Tests=4369, 5407 wallclock secs ( 2.13 usr 1.73 sys + 941.82 cusr 645.54 csys = 1591.22 CPU)
> > > > > Result: FAIL
> > > > >
> > > > > Seems to be this test:
> > > > >
> > > > > COMMAND="volume rebalance $V0 status"
> > > > > PATTERN="completed"
> > > > > EXPECT_WITHIN 300 $PATTERN get-task-status
> > > > >
> > > > > Is this one on your radar already?
> > > > >
> > > > > Btw, this VM is still online. Can give you access to retrieve logs
> > > > > if useful.
> > > > >
> > > > > + Justin
> > > > >
> > > > > --
> > > > > Open Source and Standards @ Red Hat
> > > > >
> > > > > twitter.com/realjustinclift
> > > > >
> > > > >
> > > > _______________________________________________
> > > > Gluster-devel mailing list
> > > > Gluster-devel at gluster.org
> > > > http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> > > >
> > >
> >
>
> --
> Open Source and Standards @ Red Hat
>
> twitter.com/realjustinclift
>
>