Skip to content

fix(kubernetes): retry WaitStep when container terminated state not yet finalized#6672

Merged
6543 merged 3 commits into
woodpecker-ci:mainfrom
simonckemper:fix/waitstep-terminated-nil-retry
May 30, 2026
Merged

fix(kubernetes): retry WaitStep when container terminated state not yet finalized#6672
6543 merged 3 commits into
woodpecker-ci:mainfrom
simonckemper:fix/waitstep-terminated-nil-retry

Conversation

@simonckemper

Copy link
Copy Markdown
Contributor

Problem

Kubelet sets pod.Status.Phase = Succeeded before finalizing containerStatuses[0].state.terminated. When the informer sees the phase change and WaitStep calls Get(), the container status may still show Terminated == nil, causing a hard error:

no terminated state found for container wp-XXX/wp-XXX

This is a known race in the Kubernetes API server/kubelet eventually-consistent model. The window is normally milliseconds but widens to seconds under load (apiserver latency spikes, ResourceQuota admission storms, node pressure).

Fix

Wrap the post-informer Get() + Terminated == nil check in backoff.Retry with exponential backoff (200ms initial, 5s max interval, 15s total budget). This mirrors the retry pattern already used for TailStep log stream recovery (#5550).

Key design decisions:

  • 404 returns backoff.Permanent: no retry needed — the pod was GC'd after the informer already observed completion. Returns ExitCode=0, Exited=true (existing behavior from v3.15.0).
  • No container statuses is retriable: apiserver may return the pod before status is populated.
  • Terminated == nil is retriable: kubelet is eventually consistent; a few hundred ms almost always resolves the race.

Testing

  • go vet ./pipeline/backend/kubernetes/ passes
  • Binary builds cleanly with CGO_ENABLED=0
  • Deployed in production at RootReal CI (52-service Rust monorepo) as v3.15.0-waitstep-retry

Evidence

This race has caused 15+ false pipeline failures in our CI over the past months. The most affected steps are fast-completing ones (cargo lock guard, audit steps) that finish in 50-75 seconds and hit the GC window with high probability. After deploying the patched agent with this retry, no further no terminated state found errors have been observed.

Related

…et finalized

Kubelet sets pod.Status.Phase = Succeeded before finalizing
containerStatuses[0].state.terminated. When the informer sees the
phase change and WaitStep calls Get(), the container status may still
show Terminated==nil, causing a hard error:

  no terminated state found for container wp-XXX/wp-XXX

This is especially likely under load (apiserver latency spikes,
ResourceQuota admission storms) where the window between phase
transition and container status finalization widens from milliseconds
to seconds.

Fix: wrap the post-informer Get() + terminated check in a backoff.Retry
loop using the same configuration as TailStep (backoff.NewExponentialBackOff
+ maxRetryDuration). 404 returns backoff.Permanent (no retry — pod
was GC'd after known-successful completion via the podDeleted handler).

Evidence: 15+ pipeline failures in RootReal CI over several months.
Fast-completing steps (cargo lock guard, audit) are most affected
because they finish inside the GC cleanup window with high probability.

Upstream: PR woodpecker-ci#5550 established the backoff.Retry pattern for TailStep;
this applies the same pattern to WaitStep.
@simonckemper simonckemper force-pushed the fix/waitstep-terminated-nil-retry branch from 02516f6 to 7afe32c Compare May 29, 2026 15:47
@codecov

codecov Bot commented May 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.58%. Comparing base (2034874) to head (39f8fc7).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
pipeline/backend/kubernetes/kubernetes.go 0.00% 18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6672      +/-   ##
==========================================
- Coverage   42.59%   42.58%   -0.01%     
==========================================
  Files         436      436              
  Lines       29106    29115       +9     
==========================================
+ Hits        12399    12400       +1     
- Misses      15596    15606      +10     
+ Partials     1111     1109       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread pipeline/backend/kubernetes/kubernetes.go Outdated
Comment thread pipeline/backend/kubernetes/kubernetes.go Outdated
@6543 6543 enabled auto-merge (squash) May 30, 2026 10:28
@6543 6543 merged commit a765cb8 into woodpecker-ci:main May 30, 2026
7 checks passed
@woodpecker-bot woodpecker-bot mentioned this pull request May 30, 2026
1 task
@6543 6543 added bug Something isn't working backend/kubernetes labels May 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend/kubernetes bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants