All highmem jobs are stuck in queue #2971

Closed
NotMyFault opened this issue Jun 2, 2022 · 15 comments

@NotMyFault
Member

Service(s)

ci.jenkins.io

Summary

Currently, several ATH jobs are building on highmem nodes that all appear to be stuck. No new high-memory node is coming online either, so new ATH jobs are stuck in the build queue while the existing ones are stalled.

There is one high-memory node in the "launching" phase, but it does not appear to be making any progress.

Reproduction steps

No response

@github-actions github-actions bot added the ci.jenkins.io and triage labels Jun 2, 2022
@MarkEWaite

Also shown by the check agent availability job on ci.jenkins.io.

@dduportal dduportal added this to the infra-team-sync-2022-06-07 milestone Jun 3, 2022
@dduportal dduportal self-assigned this Jun 3, 2022
@dduportal dduportal removed the triage label Jun 3, 2022
@dduportal
Contributor

  • Set Jenkins in "safe mode" (builds are queued but not executed)
  • Had to manually stop all the running builds: they were all stuck trying to contact their executors. It's weird because I expected the timeout to kick in: smells like a bug :(
  • For each stopped build, triggered a new build (the one that was queued)
  • Restarted Jenkins (a rough sketch of these steps via the REST API follows this list)
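The sketch below is a minimal illustration of those recovery steps using the Jenkins REST API from Python. The controller URL, credentials, and job path are placeholders, a CSRF crumb may also be required, and the actual list of stuck builds was handled manually; this is not the exact procedure the infra team ran.

```python
"""Minimal sketch of the manual recovery steps via the Jenkins REST API.

Assumptions: the credentials and the job path below are placeholders,
and a CSRF crumb header may be needed on a real controller.
"""
import requests

JENKINS_URL = "https://ci.jenkins.io"      # assumed target controller
AUTH = ("admin-user", "api-token")         # placeholder credentials

# 1. Quiet down: queued builds stay queued, nothing new starts.
requests.post(f"{JENKINS_URL}/quietDown", auth=AUTH)

# 2. Abort builds that are stuck contacting their (gone) executors.
#    The (job path, build number) pairs are hypothetical examples.
stuck_builds = [("job/example-folder/job/example-job", 123)]
for job_path, number in stuck_builds:
    requests.post(f"{JENKINS_URL}/{job_path}/{number}/stop", auth=AUTH)

# 3. Re-trigger each stopped build (build parameters omitted here).
for job_path, _ in stuck_builds:
    requests.post(f"{JENKINS_URL}/{job_path}/build", auth=AUTH)

# 4. Leave quiet-down mode around the controller restart.
requests.post(f"{JENKINS_URL}/cancelQuietDown", auth=AUTH)
```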

@dduportal
Contributor

dduportal commented Jun 3, 2022

Another weird thing: the EC2 highmem agents are not being started by the scheduler. Trying to spawn a few manually.

Nevermind: they are started as expected.

@dduportal
Contributor

Side note: there have been issues with repo.jenkins-ci.org during the past 9 hours:

(screenshot taken 2022-06-03 at 07:57)

@dduportal
Contributor

OH: I missed the new features of the azure-vm-agents plugin around VM retention (the "Idle retention" strategy was the only one available last time I checked) and the ability to limit the number of VMs per template kind (a quick counting sketch follows the list below).

(screenshot taken 2022-06-03 at 08:49)

  • We should switch to the "one shot" retention strategy on all our controllers
  • That explains the "suspended" state
  • We might want to update the Azure VM config to have only one Azure cloud (requires a Puppet template overhaul), so there is only one GC
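As a rough illustration of the per-template cap mentioned above, the sketch below counts the agents Jenkins currently knows about, grouped by an assumed template-name prefix. The naming convention is an assumption for illustration only; the real limit is enforced by the azure-vm-agents plugin itself, this only shows how one could eyeball current counts against a cap.

```python
"""Hedged sketch: count Jenkins agents per assumed template-name prefix."""
from collections import Counter
import requests

JENKINS_URL = "https://ci.jenkins.io"

resp = requests.get(
    f"{JENKINS_URL}/computer/api/json?tree=computer[displayName]"
)
resp.raise_for_status()

counts = Counter()
for node in resp.json()["computer"]:
    # Assumed naming convention: template prefix is everything before
    # the first dash (e.g. "highmem-abc123" -> "highmem").
    prefix = node["displayName"].split("-")[0]
    counts[prefix] += 1

for prefix, count in counts.most_common():
    print(f"{prefix}: {count} agent(s)")
```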

@timja
Member

timja commented Jun 3, 2022

Those features have been there forever, although I need to check that I didn't break the "once" one by reconnecting.

@dduportal
Contributor

I wonder if Olivier and I missed it because of the UI last year, but I'm 101% sure we did not find it. I'm not sure it would fix the root issue here, but it would help.

@dduportal
Contributor

@timja Weird: there are a lot of Azure agents on ci.j, but I don't see them in the Azure portal.
The logs say "reconnect", which feels weird, as if the Azure VM GC is not picking them up.
I've tried "save cloud config" in the UI, but nothing changed, so I'm going to (see the cross-check sketch after this list):

  • stop the puppet agent
  • apply the "once" retention strategy manually
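The cross-check mentioned above could look roughly like the sketch below: compare the agent names Jenkins reports with the VM names the Azure compute API returns. The subscription id and resource group names are placeholders, and non-Azure agents (for example the EC2 highmem ones) would need to be filtered out by name; this mimics the manual "portal vs. ci.j" comparison, not the plugin's own garbage collector.

```python
"""Hedged sketch: compare Jenkins agent names with VMs visible in Azure."""
import requests
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

JENKINS_URL = "https://ci.jenkins.io"
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
RESOURCE_GROUP = "example-agents-resource-group"          # placeholder

# Agent names as Jenkins sees them (includes non-Azure agents, which
# would need filtering by template prefix in practice).
computers = requests.get(
    f"{JENKINS_URL}/computer/api/json?tree=computer[displayName]"
).json()["computer"]
jenkins_agents = {c["displayName"] for c in computers}

# VM names as Azure sees them.
compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
azure_vms = {vm.name for vm in compute.virtual_machines.list(RESOURCE_GROUP)}

# Agents Jenkins still tracks but whose VM is gone: cleanup candidates.
print("Ghost agents:", sorted(jenkins_agents - azure_vms))
```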

@timja
Member

timja commented Jun 3, 2022

I'm wondering if this had some unintended consequences:
jenkinsci/azure-vm-agents-plugin#359

@timja
Member

timja commented Jun 3, 2022

Can revert if needed

@dduportal
Contributor

Can revert if needed

I would revert the plugin on ci.jenkins.io instead.

Would you need any logs or debug info to help track the plugin's behavior?

@timja
Member

timja commented Jun 3, 2022

I would revert the plugin on ci.jenkins.io instead.

I can reproduce an issue with the one-shot strategy anyway. I don't have time to fix it properly right now, so I'm going to revert for now.

@dduportal
Contributor

First step of the "retention" cleanup to "once" applied in jenkins-infra/jenkins-infra#2200.

@dduportal
Contributor

Closing as the queue has been cleaned up and agents are now in better shape.
