[Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

Michael Scherer mscherer at redhat.com
Mon Apr 25 11:21:17 UTC 2016


On Monday, 25 April 2016 at 13:09 +0200, Niels de Vos wrote:
> On Mon, Apr 25, 2016 at 11:58:56AM +0200, Michael Scherer wrote:
> > On Monday, 25 April 2016 at 11:26 +0200, Michael Scherer wrote:
> > > On Monday, 25 April 2016 at 11:12 +0200, Niels de Vos wrote:
> > > > On Mon, Apr 25, 2016 at 10:43:13AM +0200, Michael Scherer wrote:
> > > > > On Sunday, 24 April 2016 at 15:59 +0200, Niels de Vos wrote:
> > > > > > On Sun, Apr 24, 2016 at 04:22:55PM +0530, Prasanna Kalever wrote:
> > > > > > > On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur <vbellur at redhat.com> wrote:
> > > > > > > > On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever <pkalever at redhat.com> wrote:
> > > > > > > >> Hi all,
> > > > > > > >>
> > > > > > > >> Noticed our regression machines are reporting back really slowly,
> > > > > > > >> especially CentOS and Smoke.
> > > > > > > >>
> > > > > > > >> I found that most of the slaves are marked offline; could this be the
> > > > > > > >> biggest reason?
> > > > > > > >>
> > > > > > > >>
> > > > > > > >
> > > > > > > > Regression machines are scheduled to be offline if there are no active
> > > > > > > > jobs. I wonder if the slowness is related to LVM or related factors as
> > > > > > > > detailed in a recent thread?
> > > > > > > >
> > > > > > > 
> > > > > > > Sorry, the previous mail was sent incomplete (blame some Gmail shortcut)
> > > > > > > 
> > > > > > > Hi Vijay,
> > > > > > > 
> > > > > > > Honestly, I was not aware that the machines move to an offline state
> > > > > > > by themselves; I only knew that they go idle. Thanks for sharing that
> > > > > > > information. But we still need to reclaim most of the machines. Here
> > > > > > > are the reasons why each of them is offline.
> > > > > > 
> > > > > > Well, slaves go offline, and should be woken up when needed.
> > > > > > However, it seems that Jenkins fails to connect to many slaves :-/
> > > > > > 
> > > > > > I've rebooted:
> > > > > > 
> > > > > >  - slave46
> > > > > >  - slave28
> > > > > >  - slave26
> > > > > >  - slave25
> > > > > >  - slave24
> > > > > >  - slave23
> > > > > >  - slave21
> > > > > > 
> > > > > > These all seem to have come up correctly after clicking the 'Launch slave
> > > > > > agent' button on the slave's status page.
> > > > > > 
> > > > > > Remember that anyone with a Jenkins account can reboot VMs. This is most
> > > > > > often sufficient to get them working again. Just go to
> > > > > > https://build.gluster.org/job/reboot-vm/ , log in and press some buttons.
> > > > > > 
> > > > > > One slave is in a weird state; maybe one of the tests overwrote the SSH
> > > > > > key?
> > > > > > 
> > > > > >     [04/24/16 06:48:02] [SSH] Opening SSH connection to slave29.cloud.gluster.org:22.
> > > > > >     ERROR: Failed to authenticate as jenkins. Wrong password. (credentialId:c31bff89-36c0-4f41-aed8-7c87ba53621e/method:password)
> > > > > >     [04/24/16 06:48:04] [SSH] Authentication failed.
> > > > > >     hudson.AbortException: Authentication failed.
> > > > > >     	at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1217)
> > > > > >     	at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
> > > > > >     	at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
> > > > > >     	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > > > > >     	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > > > > >     	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > > > >     	at java.lang.Thread.run(Thread.java:745)
> > > > > >     [04/24/16 06:48:04] Launch failed - cleaning up connection
> > > > > >     [04/24/16 06:48:05] [SSH] Connection closed.
> > > > > > 
> > > > > > Leaving slave29 as is, maybe one of our admins can have a look and see
> > > > > > if it needs reprovisioning.
> > > > > 
> > > > > It seems slave29 was reinstalled and/or slightly damaged; it was no longer
> > > > > in the salt configuration, but I could connect as root.
> > > > > 
> > > > > It should work better now, but please tell me if anything is incorrect
> > > > > with it.
> > > > 
> > > > Hmm, not really. Launching the Jenkins slave agent on it through the
> > > > web UI still fails in the same way:
> > > > 
> > > >   https://build.gluster.org/computer/slave29.cloud.gluster.org/log
> > > > 
> > > > Maybe the "jenkins" user on the slave has the wrong password?
> > > 
> > > So, first it seems that it had the wrong host key, so I changed that.
> > > 
> > > I am looking at what is wrong, so do not put it offline :)
> > 
> > It turns out the script to update the /etc/hosts file was not run, so the
> > slave was using the wrong IP.
> > 
> > Can we agree on getting rid of it now, since there is no need for it
> > anymore?
> 
> I guess so, DNS should be stable now, right?

We can give it a try; since the likely issue was on the iweb side, it
should be fixed by now.
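
Before dropping the script entirely, something like the following could be
used to audit which slaves still carry a hardcoded /etc/hosts entry that
disagrees with DNS. This is a minimal sketch in Python, assuming the
dnspython package is installed; the ".cloud.gluster.org" filter is only a
guess at the slave naming, not something the script currently enforces:

    #!/usr/bin/env python
    # Minimal sketch: flag /etc/hosts entries for Jenkins slaves that no
    # longer match what DNS returns, so stale overrides can be cleaned up.
    import dns.resolver  # from the dnspython package

    def hosts_entries(path="/etc/hosts"):
        """Yield (hostname, ip) pairs from the hosts file, skipping comments."""
        with open(path) as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if not line:
                    continue
                parts = line.split()
                ip, names = parts[0], parts[1:]
                for name in names:
                    yield name, ip

    def main():
        for name, ip in hosts_entries():
            # Only look at the build slaves; the pattern is an assumption.
            if ".cloud.gluster.org" not in name:
                continue
            try:
                answers = dns.resolver.query(name, "A")
            except Exception as exc:
                print("%s: DNS lookup failed (%s)" % (name, exc))
                continue
            dns_ips = set(rdata.address for rdata in answers)
            if ip not in dns_ips:
                print("%s: /etc/hosts says %s, DNS says %s"
                      % (name, ip, ", ".join(sorted(dns_ips))))

    if __name__ == "__main__":
        main()

If that prints nothing on a slave, its hardcoded entries already match DNS
and removing them should be safe.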

> > (then I will also remove the /etc/rax-reboot file from the various
> > slaves, and maybe replace it with an Ansible-based system)
> 
> rax-reboot is only needed on build.gluster.org, none of the other
> machines have the API key to execute reboots of VMs in Rackspace.

Mhh, I guess I was looking at the wrong terminal.

But still, I would prefer to have it managed in a different way, so I
will think about something.
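
For the reboot side, whatever replaces /etc/rax-reboot could also simply
drive the existing reboot-vm job through Jenkins' remote build API instead
of clicking through the web UI. A minimal sketch in Python, assuming the
job is parameterized with the name of the VM; the "VM_NAME" parameter and
the user/API-token auth are assumptions, not how the job is actually
configured:

    #!/usr/bin/env python
    # Minimal sketch: queue a build of the reboot-vm job from a script.
    # Assumes a parameterized job with a "VM_NAME" parameter (hypothetical)
    # and a Jenkins username/API-token pair for basic auth.
    import requests

    JENKINS = "https://build.gluster.org"

    def reboot_slave(vm_name, user, api_token):
        """Ask Jenkins to run the reboot-vm job for the given slave."""
        url = "%s/job/reboot-vm/buildWithParameters" % JENKINS
        resp = requests.post(
            url,
            auth=(user, api_token),
            params={"VM_NAME": vm_name},  # hypothetical parameter name
        )
        resp.raise_for_status()
        # On success Jenkins answers 201 and a Location header pointing at
        # the queue item for the scheduled build.
        return resp.headers.get("Location")

    if __name__ == "__main__":
        print(reboot_slave("slave29.cloud.gluster.org",
                           "someuser", "some-api-token"))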

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

