Preempted VMs are sometimes not properly detected #469

bgp-sz · 2024-06-25T14:51:29Z

Jenkins and plugins versions report

Environment

Jenkins: 2.452.2
OS: Linux - 5.14.0-362.24.1.el9_3.cloud.0.6.x86_64
Java: 17.0.11 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)
---
active-directory:2.35
analysis-model-api:12.3.3
ansicolor:1.0.4
ant:497.v94e7d9fffa_b_9
antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
apache-httpcomponents-client-5-api:5.3.1-1.0
asm-api:9.7-33.v4d23ef79fcc8
atlassian-bitbucket-server-integration:4.0.0
authentication-tokens:1.113.v81215a_241826
badge:1.13
basic-branch-build-strategies:81.v05e333931c7d
bitbucket:241.v6d24a_57f9359
bitbucket-approval-filter:1.0.3
bitbucket-scm-trait-commit-skip:0.4.0
blueocean:1.27.13
blueocean-bitbucket-pipeline:1.27.13
blueocean-commons:1.27.13
blueocean-config:1.27.13
blueocean-core-js:1.27.13
blueocean-dashboard:1.27.13
blueocean-display-url:2.4.2
blueocean-events:1.27.13
blueocean-git-pipeline:1.27.13
blueocean-github-pipeline:1.27.13
blueocean-i18n:1.27.13
blueocean-jwt:1.27.13
blueocean-personalization:1.27.13
blueocean-pipeline-api-impl:1.27.13
blueocean-pipeline-editor:1.27.13
blueocean-pipeline-scm-api:1.27.13
blueocean-rest:1.27.13
blueocean-rest-impl:1.27.13
blueocean-web:1.27.13
bootstrap5-api:5.3.3-1
bouncycastle-api:2.30.1.78.1-233.vfdcdeb_0a_08a_a_
branch-api:2.1169.va_f810c56e895
build-blocker-plugin:166.vc82fc20b_a_ed6
build-timeout:1.32
buildtriggerbadge:251.vdf6ef853f3f5
byte-buddy-api:1.14.17-21.v290f2a_ff9732
caffeine-api:3.1.8-133.v17b_1ff2e0599
checks-api:2.2.0
cloudbees-bitbucket-branch-source:886.v44cf5e4ecec5
cloudbees-folder:6.928.v7c780211d66e
command-launcher:107.v773860566e2e
commons-lang3-api:3.14.0-76.vda_5591261cfe
commons-text-api:1.12.0-119.v73ef73f2345d
compact-columns:1.185.vf3851b_4d31fe
config-file-provider:973.vb_a_80ecb_9a_4d0
configuration-as-code:1810.v9b_c30a_249a_4c
configuration-as-code-groovy:1.1
copyartifact:746.vd2a_674fb_4f6f
credentials:1337.v60b_d7b_c7b_c9f
credentials-binding:677.vdc9d38cb_254d
cron_column:1.7
data-tables-api:2.0.8-1
database:247.v244b_d85f086d
database-mysql:63.va_0596d2b_1438
declarative-pipeline-migration-assistant:1.6.4
declarative-pipeline-migration-assistant-api:1.6.4
display-url-api:2.204.vf6fddd8a_8b_e9
docker-commons:439.va_3cb_0a_6a_fb_29
docker-traceability:1.2
docker-workflow:580.vc0c340686b_54
downstream-build-cache:1.7
durable-task:555.v6802fe0f0b_82
echarts-api:5.5.0-1
eddsa-api:0.3.0-4.v84c6f0f4969e
email-ext:1814.v404722f34263
envinject:2.908.v66a_774b_31d93
envinject-api:1.199.v3ce31253ed13
extended-read-permission:53.v6499940139e5
external-monitor-job:215.v2e88e894db_f8
extra-columns:1.26
favorite:2.208.v91d65b_7792a_c
file-operations:214.v2e7dc7f25757
flyway-api:9.22.3-75.vfdfb_f75a_a_9b_e
font-awesome-api:6.5.2-1
forensics-api:2.4.0
generic-webhook-trigger:2.2.1
git:5.2.2
git-client:5.0.0
git-server:126.v0d945d8d2b_39
github:1.39.0
github-api:1.318-461.v7a_c09c9fa_d63
github-branch-source:1789.v5b_0c0cea_18c3
google-compute-engine:4.575.v6969b_7c435eb_
google-metadata-plugin:0.5
google-oauth-plugin:1.330.vf5e86021cb_ec
google-storage-plugin:1.360.v6ca_38618b_41f
gradle:2.12
groovy-postbuild:228.vcdb_cf7265066
gson-api:2.11.0-41.v019fcf6125dc
handy-uri-templates-2-api:2.1.8-30.v7e777411b_148
hashicorp-vault-plugin:368.v48134f694db_f
htmlpublisher:1.34
http_request:1.18
instance-identity:185.v303dc7c645f9
ionicons-api:74.v93d5eb_813d5f
jackson2-api:2.17.0-379.v02de8ec9f64c
jakarta-activation-api:2.1.3-1
jakarta-mail-api:2.1.3-1
javadoc:243.vb_b_503b_b_45537
javax-activation-api:1.2.0-7
javax-mail-api:1.6.2-10
jaxb:2.3.9-1
jdk-tool:73.vddf737284550
jenkins-design-language:1.27.13
jersey2-api:2.42-147.va_28a_44603b_d5
jira:3.13
jjwt-api:0.11.5-112.ve82dfb_224b_a_d
job-dsl:1.87
jobConfigHistory:1229.v3039470161a_d
joda-time-api:2.12.7-29.v5a_b_e3a_82269a_
jquery3-api:3.7.1-2
jsch:0.2.16-86.v42e010d9484b_
json-api:20240303-41.v94e11e6de726
json-path-api:2.9.0-58.v62e3e85b_a_655
junit:1265.v65b_14fa_f12f0
junit-sql-storage:324.v90e2a_a_a_a_0dd7
ldap:725.v3cb_b_711b_1a_ef
lockable-resources:1255.vf48745da_35d0
logstash:2.5.0218.v0a_ff8fefc12b_
mailer:472.vf7c289a_4b_420
matrix-auth:3.2.2
matrix-project:832.va_66e270d2946
maven-plugin:3.23
mercurial:1260.vdfb_723cdcc81
metrics:4.2.21-451.vd51df8df52ec
mina-sshd-api-common:2.12.1-113.v4d3ea_5eb_7f72
mina-sshd-api-core:2.12.1-113.v4d3ea_5eb_7f72
monitoring:1.99.0
mysql-api:8.4.0-31.va_b_5ce7933762
oauth-credentials:0.653.v14cf2088e950
oic-auth:4.284.v0cc21de03d37
okhttp-api:4.11.0-172.vda_da_1feeb_c6e
opentelemetry:3.1215.vc9db_a_0b_34c2a_
pam-auth:1.11
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-config-history:1.6
pipeline-github-lib:61.v629f2cc41d83
pipeline-graph-analysis:216.vfd8b_ece330ca_
pipeline-groovy-lib:727.ve832a_9244dfa_
pipeline-input-step:495.ve9c153f6067b_
pipeline-milestone-step:119.vdfdc43fc3b_9a_
pipeline-model-api:2.2198.v41dd8ef6dd56
pipeline-model-definition:2.2198.v41dd8ef6dd56
pipeline-model-extensions:2.2198.v41dd8ef6dd56
pipeline-stage-step:312.v8cd10304c27a_
pipeline-stage-tags-metadata:2.2198.v41dd8ef6dd56
pipeline-timeline:1.0.3
pipeline-utility-steps:2.16.2
plain-credentials:182.v468b_97b_9dcb_8
plugin-usage-plugin:4.5
plugin-util-api:4.1.0
postgresql-api:42.7.2-40.v76d376d65c77
prism-api:1.29.0-15
pubsub-light:1.18
rebuild:332.va_1ee476d8f6d
remote-file:1.24
resource-disposer:0.23
role-strategy:727.vd344b_eec783d
scm-api:690.vfc8b_54395023
scm-filter-branch-pr:148.v0b_5f06e8b_c84
script-security:1341.va_2819b_414686
scriptler:348.v5d461e205da_a_
show-build-parameters:1.0
simple-theme-plugin:176.v39740c03a_a_f5
snakeyaml-api:2.2-111.vc6598e30cc65
sonar:2.17.2
sse-gateway:1.27
ssh-agent:367.vf9076cd4ee21
ssh-credentials:337.v395d2403ccd4
ssh-slaves:2.968.v6f8823c91de4
sshd:3.330.vc866a_8389b_58
startup-trigger-plugin:2.9.4
structs:337.v1b_04ea_4df7c8
swarm:3.46
timestamper:1.27
token-macro:400.v35420b_922dcb_
trilead-api:2.147.vb_73cc728a_32e
uno-choice:2.8.3
validating-string-parameter:183.v3748e79b_9737
variant:60.v7290fc0eb_b_cd
view-job-filters:382.vdf2d5e3f02f0
warnings-ng:11.3.0
workflow-aggregator:596.v8c21c963d92d
workflow-api:1316.v33eb_726c50b_a_
workflow-basic-steps:1058.vcb_fc1e3a_21a_9
workflow-cps:3903.v48a_8836749e9
workflow-durable-task-step:1353.v1891a_b_01da_18
workflow-job:1400.v7fd111b_ec82f
workflow-multibranch:783.787.v50539468395f
workflow-scm-step:427.v4ca_6512e7df1
workflow-step-api:657.v03b_e8115821b_
workflow-support:907.v6713a_ed8a_573
ws-cleanup:0.46
yet-another-build-visualizer:1.16

What Operating System are you using (both controller, and any agents involved in the problem)?

rocky-linux-9-optimized-gcp

Reproduction steps

Provision a large number of Jenkins agents on spot VMs so that some of them get preempted shortly after they are created.

Expected Results

The preempted agent should be detected and cleaned up, and its jobs should be assigned to another (or a new) agent.

Actual Results

Jenkins repeats the following error message every 5 seconds for several hours before it finally detects the agent as dead.

2024-06-25 14:17:33.041+0000 [id=133766]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Waiting for SSH to come up. Sleeping 5.
2024-06-25 14:17:38.051+0000 [id=133636]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Failed to connect via ssh: 404 Not Found
GET https://compute.googleapis.com/compute/v1/projects/XXX/zones/europe-west3-a/instances/YYY
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "The resource 'projects/XXX/zones/europe-west3-a/instances/YYY' was not found",
    "reason" : "notFound"
  } ],
  "message" : "The resource 'projects/XXX/zones/europe-west3-a/instances/YYY' was not found"
}
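
To illustrate the behaviour I would expect, here is a minimal sketch of a wait loop that treats a 404 on the instance lookup as final instead of retrying it like an ordinary "SSH not up yet" failure. This is not the plugin's actual code: the class `SshWaitSketch`, the enums `Probe` and `Outcome`, and the method `waitForSsh` are all hypothetical names made up for this example, and the real connection attempt is replaced by a pre-recorded sequence of results.

```java
import java.util.Iterator;
import java.util.List;

public class SshWaitSketch {
    /** Result of one connection attempt (hypothetical names, not plugin API). */
    enum Probe { CONNECTED, NOT_READY, INSTANCE_NOT_FOUND }

    /** Final outcome of the wait loop. */
    enum Outcome { CONNECTED, INSTANCE_GONE, TIMED_OUT }

    /**
     * Consumes attempt results until the agent connects, the instance is
     * reported missing (404), or maxAttempts is exhausted. A 404 ends the
     * loop immediately rather than being retried.
     */
    static Outcome waitForSsh(Iterator<Probe> attempts, int maxAttempts) {
        for (int i = 0; i < maxAttempts && attempts.hasNext(); i++) {
            Probe p = attempts.next();
            if (p == Probe.CONNECTED) {
                return Outcome.CONNECTED;
            }
            if (p == Probe.INSTANCE_NOT_FOUND) {
                return Outcome.INSTANCE_GONE;
            }
            // NOT_READY: a real loop would sleep here (5 s in the log above) and retry.
        }
        return Outcome.TIMED_OUT;
    }

    public static void main(String[] args) {
        // Simulated preemption: the second lookup already returns 404.
        List<Probe> preempted =
                List.of(Probe.NOT_READY, Probe.INSTANCE_NOT_FOUND, Probe.NOT_READY);
        System.out.println(waitForSsh(preempted.iterator(), 10)); // prints INSTANCE_GONE
    }
}
```

With a distinct `INSTANCE_GONE` outcome, the caller could terminate the agent and reschedule its jobs right away instead of waiting for a timeout.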

Anything else?

It looks like a race condition to me that occurs when the agent is preempted early in its startup.

Are you interested in contributing a fix?

As a workaround, a script like the one below can be run periodically to find and remove such agents:

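import jenkins.model.Jenkins

// Remove agents that are stuck connecting because their VM no longer exists (404).
for (node in Jenkins.get().getNodes()) {
    def computer = node.toComputer()
    // Skip nodes that have no computer, are not mid-connection, or have no busy executors.
    if (computer == null || !computer.isConnecting() || computer.countBusy() == 0) {
        continue
    }
    if (computer.getLog().contains("Failed to connect via ssh: 404 Not Found")) {
        Jenkins.get().removeNode(node)
    }
}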
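
The decision the workaround makes can also be written as a small pure function, which makes it easy to check without a running controller. The sketch below is only an illustration: `PreemptedAgentCheck` and `shouldRemove` are hypothetical names, and the three inputs stand in for the values the script reads from the agent (connecting state, busy executor count, and agent log text).

```java
public class PreemptedAgentCheck {
    /** Log marker taken from the error message quoted in this report. */
    static final String MARKER = "Failed to connect via ssh: 404 Not Found";

    /**
     * Mirrors the workaround's three conditions: the agent is still
     * connecting, at least one executor is busy, and the agent log
     * contains the 404 marker.
     */
    static boolean shouldRemove(boolean connecting, int busyExecutors, String agentLog) {
        return connecting
                && busyExecutors > 0
                && agentLog != null
                && agentLog.contains(MARKER);
    }

    public static void main(String[] args) {
        String log = "Waiting for SSH to come up. Sleeping 5.\n" + MARKER + "\n";
        System.out.println(shouldRemove(true, 1, log));  // prints true
        System.out.println(shouldRemove(false, 1, log)); // prints false
    }
}
```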


guilim · 2024-07-31T10:42:53Z

Yes, the handling of preempted VMs is heavily bugged...
I think your ticket is related to #407, but there are other issues as well, such as #310.

For the moment, preemptible VMs are hardly usable with this plugin, which is a shame, as there does not seem to be any active developer here anymore 😞

bgp-sz added the bug label Jun 25, 2024

bgp-sz mentioned this issue Jun 26, 2024

Long wait times to provision a VM in GCP #463

Open

Artmorse mentioned this issue Nov 28, 2024

Terminate the instance when 404 occured. #489

Merged

