What Operating System are you using (both controller, and any agents involved in the problem)?
rocky-linux-9-optimized-gcp
Reproduction steps
Launch a large number of Jenkins agents on spot VMs so that some of them get preempted shortly after they are created.
Expected Results
The preempted agent should be detected and cleaned up & jobs should be assigned to another/new agent.
Actual Results
Jenkins loops this error message every 5s for several hours until it detects the agent as dead.
2024-06-25 14:17:33.041+0000 [id=133766] INFO c.g.j.p.c.ComputeEngineCloud#log: Waiting for SSH to come up. Sleeping 5.
2024-06-25 14:17:38.051+0000 [id=133636] INFO c.g.j.p.c.ComputeEngineCloud#log: Failed to connect via ssh: 404 Not Found
GET https://compute.googleapis.com/compute/v1/projects/XXX/zones/europe-west3-a/instances/YYY
{
"code" : 404,
"errors" : [ {
"domain" : "global",
"message" : "The resource 'projects/XXX/zones/europe-west3-a/instances/YYY' was not found",
"reason" : "notFound"
} ],
"message" : "The resource 'projects/XXX/zones/europe-west3-a/instances/YYY' was not found"
}
Anything else?
It looks like a race condition to me, triggered when the agent is preempted early in its startup.
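To illustrate the behaviour I would expect, here is a minimal sketch of a wait loop that gives up as soon as the instance lookup reports that the VM no longer exists, instead of sleeping and retrying. This is not the plugin's actual code: `SshWaitSketch`, `Probe`, `Result`, and `waitForSsh` are hypothetical names, and the probe stands in for whatever combination of the instance lookup and SSH attempt the plugin really performs.

```java
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the plugin.
final class SshWaitSketch {
    /** Outcome of one (assumed) probe of the instance. */
    enum Probe { SSH_READY, NOT_READY, INSTANCE_GONE }

    /** Final outcome of the wait loop. */
    enum Result { CONNECTED, INSTANCE_GONE, TIMED_OUT }

    /**
     * Polls the probe up to maxAttempts times, sleeping between attempts.
     * A missing instance (e.g. a 404 from the lookup) ends the loop at once,
     * because retrying cannot bring a deleted VM back.
     */
    static Result waitForSsh(Supplier<Probe> probe, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Probe outcome = probe.get();
            if (outcome == Probe.SSH_READY) {
                return Result.CONNECTED;
            }
            if (outcome == Probe.INSTANCE_GONE) {
                return Result.INSTANCE_GONE; // caller should remove the agent
            }
            try {
                Thread.sleep(sleepMillis); // transient failure: wait and retry
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Result.TIMED_OUT; // treat interruption as giving up
            }
        }
        return Result.TIMED_OUT;
    }
}
```

With a loop like this, a caller that receives `INSTANCE_GONE` could remove the agent immediately and let the job be rescheduled, rather than logging the same 404 every 5 seconds.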
Are you interested in contributing a fix?
As a workaround, a script like the one below can be run periodically to find and remove such agents:
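import jenkins.model.Jenkins

// Remove agents that are stuck connecting to a GCE instance that no longer exists.
for (node in Jenkins.instance.nodes) {
    def computer = node.toComputer()
    // Skip nodes that have no computer or are not in the connecting state.
    if (computer == null || !computer.isConnecting()) continue
    // Only touch agents that hold a job and whose log shows the 404 from the instance lookup.
    if (computer.countBusy() > 0 &&
            computer.getLog().contains("Failed to connect via ssh: 404 Not Found")) {
        Jenkins.instance.removeNode(node)
    }
}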
Yes, the management of preempted VMs is heavily bugged...
I think your ticket is related to #407, and there are other similar issues, such as #310.
For the moment, preemptible VMs are hardly usable with this plugin. This is a shame, as there does not seem to be any active developer here anymore 😞