[Gluster-infra] [Gluster-devel] NetBSD tests not running to completion.

Pranith Kumar Karampuri pkarampu at redhat.com
Fri Jan 8 10:55:36 UTC 2016

On 01/08/2016 03:25 PM, Emmanuel Dreyfus wrote:
> On Fri, Jan 08, 2016 at 03:18:02PM +0530, Pranith Kumar Karampuri wrote:
>> Does the cleanup script need to be manually executed on the NetBSD
>> machine?
> You can run the script manually, but if the goal is to restore a
> misbehaving machine, rebooting is probably the fastest way to sort
> the issue.
> While thinking about it, I suspect there may be some benefit
> in rebooting the machine if the regression does not finish
> within a sane amount of time.

Rebooting whenever a single test leads to a crash may not be a good idea. 
We need a reliable way to detect that the mount hung because of a crash, 
and to execute this cleanup script when that situation happens. So the 
question is: can we detect this state?
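One possible detector, sketched below: a mount whose backing daemon has
crashed usually blocks any df/stat on the mount point, so a time-bounded
probe can distinguish a hung mount from a healthy one. This is only a
sketch under assumptions: the mount path, the timeout value, the cleanup
script path, and the availability of timeout(1) on the NetBSD slaves are
all guesses, not something established in this thread.

```shell
# Hedged sketch, not from the thread: probe whether a mount point still
# responds. A FUSE mount whose daemon crashed typically blocks df forever,
# so we bound the probe with timeout(1) (assumed to be available).
check_mount() {
    mnt=$1
    secs=${2:-10}
    # Exit 0 if df answers within $secs seconds; non-zero if it hangs
    # or errors out (missing path, broken mount, ...).
    timeout "$secs" df "$mnt" >/dev/null 2>&1
}

# Example wiring (hypothetical cleanup path -- adjust to the real script):
# check_mount /mnt/glusterfs || sh /opt/qa/cleanup.sh
```

A watchdog cron job could run such a probe periodically and trigger the
cleanup (or a reboot) only when the probe actually times out, instead of
rebooting on every failed test.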

>>> First step could be to parse jenkins logs and find which tests fail or hang
>>> most often in NetBSD regression
>> This work is under way. I will have to change some of the scripts I wrote to
>> get this information.
> Great.
>> To avoid duplication of work, are there any tests that you are
>> already investigating? If not, that is the first thing I will try to find out.
> No, I have not started investigating yet because I have no idea where
> I should look. Your input will be very valuable.
Since we don't have the script yet, I did this manually:

Here are the results for the last 15-20 runs:

Test                                              Number of times it happened
tests/basic/afr/arbiter-statfs.t (bad status 1)   5
tests/basic/afr/self-heal.t                       1
tests/basic/afr/entry-self-heal.t                 1
tests/basic/quota-nfs.t                           2
The following happened 4 times; this seems different from the failures 
above. One example:

+ '/opt/qa/build.sh'
Build timed out (after 300 minutes). Marking the build as failed.
Build was aborted
Finished: FAILURE

The following happened 4 times. One example:

ERROR: Connection was broken: java.io.IOException: Unexpected EOF
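A manual tally like the one above could be scripted once we have the
console logs downloaded. The sketch below is mine, not from the thread:
it assumes locally saved Jenkins console logs and the "bad status"
wording seen in the regression output; the helper name and log layout
are assumptions to adjust.

```shell
# Hedged sketch: tally "bad status" failures per test file across saved
# Jenkins console logs. The log naming and line format are assumptions.
count_bad_status() {
    # Failure lines are assumed to look like:
    #   ./tests/basic/afr/arbiter-statfs.t: bad status 1
    grep -h 'bad status' "$@" 2>/dev/null \
        | sed 's/:.*$//' \
        | sort | uniq -c | sort -rn
}

# Usage (hypothetical file names): count_bad_status consoleText-*.log
```

The sorted counts would directly reproduce a table like the one above
without the manual pass over 15-20 runs.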

I can take a look at why the tests are failing (on Sunday, not today :-) ). 
Could you look into why the timeouts and 'Connection broken' errors are 
happening?

Once we find out what happened, the first goal is to detect and repair it 
automatically. If we can't, let us write up a wiki page or something that 
explains how to proceed when this happens.

