[Gluster-infra] regression machines reporting slowly ? here is the reason ...

Niels de Vos ndevos at redhat.com
Sun Apr 24 13:59:40 UTC 2016


On Sun, Apr 24, 2016 at 04:22:55PM +0530, Prasanna Kalever wrote:
> On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur <vbellur at redhat.com> wrote:
> > On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever <pkalever at redhat.com> wrote:
> >> Hi all,
> >>
> >> I noticed that our regression machines are reporting back really
> >> slowly, especially CentOS and Smoke.
> >>
> >> I found that most of the slaves are marked offline; could this be the
> >> biggest reason?
> >>
> >>
> >
> > Regression machines are scheduled to be offline if there are no active
> > jobs. I wonder if the slowness is related to LVM or related factors as
> > detailed in a recent thread?
> >
> 
> Sorry, the previous mail was sent incomplete (blame some Gmail shortcut)
> 
> Hi Vijay,
> 
> Honestly, I was not aware that the machines move to the offline state
> by themselves; I only knew that they go idle.
> Thanks for sharing that information. But we still need to reclaim most
> of the machines. Here are the reasons why each of them is offline.

Well, slaves go offline when idle, and should be woken up when needed.
However, it seems that Jenkins fails to connect to many slaves :-/

I've rebooted:

 - slave46
 - slave28
 - slave26
 - slave25
 - slave24
 - slave23
 - slave21

These all seem to have come up correctly after clicking the 'Launch slave
agent' button on the slave's status page.

Remember that anyone with a Jenkins account can reboot VMs. This is most
often sufficient to get them working again. Just go to
https://build.gluster.org/job/reboot-vm/ , log in and press some buttons.
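
To pick out which slaves are worth rebooting without opening every status
page, something like the rough sketch below might help. It assumes node
data shaped like the "computer" array that Jenkins exposes at
/computer/api/json (fields displayName, offline, offlineCauseReason). The
sample entries and the helper names classify() and reboot_candidates() are
made up for illustration, and matching on the reason text is an assumption
based on the strings quoted in this thread, not something checked against
our Jenkins.

```python
import json

# Hypothetical sample, shaped like the "computer" array of /computer/api/json.
# The reason strings are copied from the status lists quoted in this thread.
SAMPLE = json.loads("""
{"computer": [
  {"displayName": "slave20.cloud.gluster.org", "offline": false,
   "offlineCauseReason": ""},
  {"displayName": "slave46.cloud.gluster.org", "offline": true,
   "offlineCauseReason": "This node is offline because Jenkins failed to launch the slave agent on it."},
  {"displayName": "slave21.cloud.gluster.org", "offline": true,
   "offlineCauseReason": "This node is offline because Jenkins failed to launch the slave agent on it."},
  {"displayName": "slave27.cloud.gluster.org", "offline": true,
   "offlineCauseReason": "Disconnected by rastar : rastar taking this down for pranith. Needed for debugging with tar issue."},
  {"displayName": "slave32.cloud.gluster.org", "offline": true,
   "offlineCauseReason": "idle"}
]}
""")

# Assumption: a failed agent launch is recognisable by this substring.
LAUNCH_FAILED = "failed to launch the slave agent"


def classify(node):
    """Return 'online', 'idle', 'launch-failed' or 'disconnected' for a node."""
    if not node["offline"]:
        return "online"
    reason = node.get("offlineCauseReason", "")
    if LAUNCH_FAILED in reason:
        return "launch-failed"
    if reason.strip().lower() == "idle":
        return "idle"
    # Anything else is treated as taken offline on purpose by somebody.
    return "disconnected"


def reboot_candidates(data):
    """Names of nodes whose agent failed to launch, sorted alphabetically."""
    return sorted(
        node["displayName"]
        for node in data["computer"]
        if classify(node) == "launch-failed"
    )


if __name__ == "__main__":
    for name in reboot_candidates(SAMPLE):
        print(name)
```

Slaves that are merely idle, or that somebody disconnected on purpose for
debugging, are left out deliberately; only the failed launches are listed
as reboot candidates.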

One slave is in a weird state; maybe one of the tests overwrote the ssh
key?

    [04/24/16 06:48:02] [SSH] Opening SSH connection to slave29.cloud.gluster.org:22.
    ERROR: Failed to authenticate as jenkins. Wrong password. (credentialId:c31bff89-36c0-4f41-aed8-7c87ba53621e/method:password)
    [04/24/16 06:48:04] [SSH] Authentication failed.
    hudson.AbortException: Authentication failed.
    	at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1217)
    	at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
    	at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    	at java.lang.Thread.run(Thread.java:745)
    [04/24/16 06:48:04] Launch failed - cleaning up connection
    [04/24/16 06:48:05] [SSH] Connection closed.

Leaving slave29 as is, maybe one of our admins can have a look and see
if it needs reprovisioning.

Cheers,
Niels

> 
> 
> CentOS slaves:     Only 2/14 slaves are online [1]
> 
> slave20.cloud.gluster.org (online)
> slave21.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave22.cloud.gluster.org (online)
> slave23.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave24.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave25.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave26.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave27.cloud.gluster.org [Offline Reason: Disconnected by rastar :
> rastar taking this down for pranith. Needed for debugging with tar
> issue.  Apr 20, 2016 3:44:14 AM]
> slave28.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave29.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> 
> slave32.cloud.gluster.org [Offline Reason: idle]
> slave33.cloud.gluster.org [Offline Reason: idle]
> slave34.cloud.gluster.org [Offline Reason: idle]
> 
> slave46.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> 
> 
> 
> 
> Smoke slaves:      Only 2/15 slaves are online [2]
> 
> slave20.cloud.gluster.org (online)
> slave21.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave22.cloud.gluster.org (online)
> slave23.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave24.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave25.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave26.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave27.cloud.gluster.org [Offline Reason: Disconnected by rastar :
> rastar taking this down for pranith. Needed for debugging with tar
> issue.  Apr 20, 2016 3:44:14 AM]
> slave28.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave29.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> 
> slave32.cloud.gluster.org [Offline Reason: idle]
> slave33.cloud.gluster.org [Offline Reason: idle]
> slave34.cloud.gluster.org [Offline Reason: idle]
> 
> slave46.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> slave47.cloud.gluster.org [Offline Reason: idle]
> 
> 
> 
> 
> NetBSD slaves:     Only 6/11 slaves are online [3]
> 
> nbslave71.cloud.gluster.org (online)
> nbslave72.cloud.gluster.org [Offline Reason: This node is offline
> because Jenkins failed to launch the slave agent on it.]
> nbslave74.cloud.gluster.org [Offline Reason: Disconnected by kaushal
> Mar 21, 2016 10:59:43 PM]
> nbslave75.cloud.gluster.org (online)
> nbslave77.cloud.gluster.org (online)
> nbslave79.cloud.gluster.org (online)
> 
> nbslave7c.cloud.gluster.org (online)
> nbslave7g.cloud.gluster.org [Offline Reason: Disconnected by rastar :
> anoop is using this to debug netbsd related issue Mar 29, 2016 2:27:20
> AM]
> nbslave7h.cloud.gluster.org [Offline Reason: Disconnected by kaushal
> Apr 13, 2016 3:15:06 AM]
> nbslave7i.cloud.gluster.org [Offline Reason: Disconnected by jdarcy :
> Consistently generating spurious failures due to ping timeouts. This
> costs people *hours* for a platform nobody uses except as a test for
> perfused. Feb 27, 2016 9:09:09 PM]
> nbslave7j.cloud.gluster.org (online)
> 
> 
> Summary:
> 
> For CentOS regressions: 9/14 slaves were completely down [not just idle]
> For Smoke: 9/15 slaves were completely down
> For NetBSD regressions: 5/11 slaves were completely down.
> 
> IIRC, the CentOS regression and Smoke jobs share the same machines, so
> 9 (CR+S) + 5 (NR) = 14 slaves were down. In total (CentOS [+ Smoke]
> + NetBSD), 14/26 machines were down [not just idle].
> 
> 
> 
> [1] https://build.gluster.org/label/rackspace_regression_2gb/
> [2] https://build.gluster.org/label/smoke_tests/
> [3] https://build.gluster.org/label/netbsd7_regression/
> 
> Thanks,
> --
> Prasanna
> 
> 
> > -Vijay