[Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

Prasanna Kalever pkalever at redhat.com
Sun Apr 24 15:56:49 UTC 2016


On Sun, Apr 24, 2016 at 7:29 PM, Niels de Vos <ndevos at redhat.com> wrote:
> On Sun, Apr 24, 2016 at 04:22:55PM +0530, Prasanna Kalever wrote:
>> On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur <vbellur at redhat.com> wrote:
>> > On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever <pkalever at redhat.com> wrote:
>> >> Hi all,
>> >>
>> >> I noticed our regression machines are reporting back really slowly,
>> >> especially CentOS and Smoke.
>> >>
>> >> I found that most of the slaves are marked offline; could this be the
>> >> biggest reason?
>> >>
>> >>
>> >
>> > Regression machines are scheduled to be offline if there are no active
>> > jobs. I wonder if the slowness is related to LVM or related factors as
>> > detailed in a recent thread?
>> >
>>
>> Sorry, the previous mail was sent incomplete (blame some Gmail shortcut)
>>
>> Hi Vijay,
>>
>> Honestly, I was not aware that the machines move to the offline state
>> by themselves; I only knew that they go to the idle state.
>> Thanks for sharing that information. But we still need to reclaim most
>> of the machines. Here are the reasons why each of them is offline.
>
> Well, slaves go offline, and should be woken up when needed.
> However it seems that Jenkins fails to connect to many slaves :-/
>
> I've rebooted:
>
>  - slave46
>  - slave28
>  - slave26
>  - slave25
>  - slave24
>  - slave23
>  - slave21
>
> These all seem to have come up correctly after clicking the 'Launch slave
> agent' button on the slave's status page.
>
> Remember that anyone with a Jenkins account can reboot VMs. This is most
> often sufficient to get them working again. Just go to
> https://build.gluster.org/job/reboot-vm/ , log in and press some buttons.
>
> One slave is in a weird status, maybe one of the tests overwrote the ssh
> key?
>
>     [04/24/16 06:48:02] [SSH] Opening SSH connection to slave29.cloud.gluster.org:22.
>     ERROR: Failed to authenticate as jenkins. Wrong password. (credentialId:c31bff89-36c0-4f41-aed8-7c87ba53621e/method:password)
>     [04/24/16 06:48:04] [SSH] Authentication failed.
>     hudson.AbortException: Authentication failed.
>         at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1217)
>         at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
>         at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>     [04/24/16 06:48:04] Launch failed - cleaning up connection
>     [04/24/16 06:48:05] [SSH] Connection closed.
>
> Leaving slave29 as is, maybe one of our admins can have a look and see
> if it needs reprovisioning.

That's really cool, Niels, thank you!

It would be helpful if somebody with Jenkins login permissions could
reboot the NetBSD slave nbslave72.cloud.gluster.org.

The NetBSD slaves listed below were marked offline intentionally; this
is a reminder in case someone forgot to bring them back online.
(Please ignore if they are still needed for other jobs or have issues.)


Kaushal :
nbslave74.cloud.gluster.org on Mar 21, 2016 10:59:43 PM
nbslave7h.cloud.gluster.org on Apr 13, 2016 3:15:06 AM


Raghavendra Talur:
nbslave7g.cloud.gluster.org on Mar 29, 2016 2:27:20 AM


Jeff Darcy:
nbslave7i.cloud.gluster.org on Feb 27, 2016 9:09:09 PM


Thanks,
--
Prasanna

>
> Cheers,
> Niels
>
>>
>>
>> CentOS slaves:     Only 2 of 14 slaves are online [1]
>>
>> slave20.cloud.gluster.org (online)
>> slave21.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave22.cloud.gluster.org (online)
>> slave23.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave24.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave25.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave26.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave27.cloud.gluster.org [Offline Reason: Disconnected by rastar :
>> rastar taking this down for pranith. Needed for debugging with tar
>> issue.  Apr 20, 2016 3:44:14 AM]
>> slave28.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave29.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>>
>> slave32.cloud.gluster.org [Offline Reason: idle]
>> slave33.cloud.gluster.org [Offline Reason: idle]
>> slave34.cloud.gluster.org [Offline Reason: idle]
>>
>> slave46.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>>
>>
>>
>>
>> Smoke slaves:      Only 2 of 15 slaves are online [2]
>>
>> slave20.cloud.gluster.org (online)
>> slave21.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave22.cloud.gluster.org (online)
>> slave23.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave24.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave25.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave26.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave27.cloud.gluster.org [Offline Reason: Disconnected by rastar :
>> rastar taking this down for pranith. Needed for debugging with tar
>> issue.  Apr 20, 2016 3:44:14 AM]
>> slave28.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave29.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>>
>> slave32.cloud.gluster.org [Offline Reason: idle]
>> slave33.cloud.gluster.org [Offline Reason: idle]
>> slave34.cloud.gluster.org [Offline Reason: idle]
>>
>> slave46.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave47.cloud.gluster.org [Offline Reason: idle]
>>
>>
>>
>>
>> NetBSD slaves:     Only 6 of 11 are online [3]
>>
>> nbslave71.cloud.gluster.org (online)
>> nbslave72.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> nbslave74.cloud.gluster.org [Offline Reason: Disconnected by kaushal
>> Mar 21, 2016 10:59:43 PM]
>> nbslave75.cloud.gluster.org (online)
>> nbslave77.cloud.gluster.org (online)
>> nbslave79.cloud.gluster.org (online)
>>
>> nbslave7c.cloud.gluster.org (online)
>> nbslave7g.cloud.gluster.org [Offline Reason: Disconnected by rastar :
>> anoop is using this to debug netbsd related issue Mar 29, 2016 2:27:20
>> AM]
>> nbslave7h.cloud.gluster.org [Offline Reason: Disconnected by kaushal
>> Apr 13, 2016 3:15:06 AM]
>> nbslave7i.cloud.gluster.org [Offline Reason: Disconnected by jdarcy :
>> Consistently generating spurious failures due to ping timeouts. This
>> costs people *hours* for a platform nobody uses except as a test for
>> perfused. Feb 27, 2016 9:09:09 PM]
>> nbslave7j.cloud.gluster.org (online)
>>
>>
>> Summary:
>>
>> For CentOS regressions: 9/14 slaves were completely down [not just idle]
>> For Smoke: 9/15 slaves were completely down
>> For NetBSD regressions: 5/11 slaves were completely down.
>>
>> IIRC, the CentOS regression and Smoke jobs share the same machines, so
>> 9 (CR+S) + 5 (NR) = 14 slaves were down. In total (CentOS [+ Smoke]
>> + NetBSD), 14/26 machines were down [not just idle].
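
As a quick sanity check on the totals above, here is a minimal Python
sketch that re-derives them from the node lists in this mail. The
dictionary layout and the names POOLS, pool_counts and tally are purely
illustrative (they are not part of Jenkins or of any of our scripts);
"down" means a launch failure or a manual disconnect, as opposed to idle.

```python
# Node states transcribed from the listings in this thread.
# "down" = agent launch failed or node manually disconnected;
# "idle" = offline only because no job was scheduled.
CENTOS = {
    "online": {"slave20", "slave22"},
    "idle": {"slave32", "slave33", "slave34"},
    "down": {"slave21", "slave23", "slave24", "slave25", "slave26",
             "slave27", "slave28", "slave29", "slave46"},
}
# Smoke runs on the same machines, plus slave47 (idle).
SMOKE = {
    "online": set(CENTOS["online"]),
    "idle": CENTOS["idle"] | {"slave47"},
    "down": set(CENTOS["down"]),
}
NETBSD = {
    "online": {"nbslave71", "nbslave75", "nbslave77", "nbslave79",
               "nbslave7c", "nbslave7j"},
    "idle": set(),
    "down": {"nbslave72", "nbslave74", "nbslave7g", "nbslave7h",
             "nbslave7i"},
}
POOLS = {"centos": CENTOS, "smoke": SMOKE, "netbsd": NETBSD}


def pool_counts(pool):
    """Return (down, total) for a single pool."""
    total = len(pool["online"] | pool["idle"] | pool["down"])
    return len(pool["down"]), total


def tally(pools):
    """Return (down, total) across all pools, counting shared machines once."""
    down, nodes = set(), set()
    for pool in pools.values():
        down |= pool["down"]
        nodes |= pool["online"] | pool["idle"] | pool["down"]
    return len(down), len(nodes)


if __name__ == "__main__":
    for name, pool in POOLS.items():
        print(name, "%d/%d down" % pool_counts(pool))
    print("total %d/%d down" % tally(POOLS))
```

Using sets means the machines shared between the CentOS regression and
Smoke labels are counted only once in the overall total.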
>>
>>
>>
>> [1] https://build.gluster.org/label/rackspace_regression_2gb/
>> [2] https://build.gluster.org/label/smoke_tests/
>> [3] https://build.gluster.org/label/netbsd7_regression/
>>
>> Thanks,
>> --
>> Prasanna
>>
>>
>> > -Vijay
>> _______________________________________________
>> Gluster-infra mailing list
>> Gluster-infra at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-infra

