fix(kubernetes): retry WaitStep when container terminated state not yet finalized #6672
Merged
6543 merged 3 commits on May 30, 2026
…et finalized

Kubelet sets pod.Status.Phase = Succeeded before finalizing containerStatuses[0].state.terminated. When the informer sees the phase change and WaitStep calls Get(), the container status may still show Terminated == nil, causing a hard error:

no terminated state found for container wp-XXX/wp-XXX

This is especially likely under load (apiserver latency spikes, ResourceQuota admission storms), where the window between phase transition and container status finalization widens from milliseconds to seconds.

Fix: wrap the post-informer Get() + terminated check in a backoff.Retry loop using the same configuration as TailStep (backoff.NewExponentialBackOff + maxRetryDuration). 404 returns backoff.Permanent (no retry: the pod was GC'd after known-successful completion via the podDeleted handler).

Evidence: 15+ pipeline failures in RootReal CI over several months. Fast-completing steps (cargo lock guard, audit) are most affected because they finish inside the GC cleanup window with high probability.

Upstream: PR woodpecker-ci#5550 established the backoff.Retry pattern for TailStep; this applies the same pattern to WaitStep.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6672      +/-   ##
==========================================
- Coverage   42.59%   42.58%   -0.01%
==========================================
  Files         436      436
  Lines       29106    29115       +9
==========================================
+ Hits        12399    12400       +1
- Misses      15596    15606      +10
+ Partials     1111     1109       -2
6543 reviewed May 30, 2026
Co-authored-by: 6543 <6543@obermui.de>
6543 approved these changes May 30, 2026
Problem
Kubelet sets pod.Status.Phase = Succeeded before finalizing containerStatuses[0].state.terminated. When the informer sees the phase change and WaitStep calls Get(), the container status may still show Terminated == nil, causing a hard error:

no terminated state found for container wp-XXX/wp-XXX

This is a known race in the eventually consistent model shared by the Kubernetes API server and kubelet. The window is normally milliseconds but widens to seconds under load (apiserver latency spikes, ResourceQuota admission storms, node pressure).
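To make the race concrete, here is a minimal sketch of the check that fails. The types are trimmed-down, hypothetical stand-ins for the Kubernetes pod status types, not the real client-go structs, and the `exitCode` helper and the pod/container composition of the error message are assumptions for illustration only.

```go
package main

import "fmt"

// terminated is a hypothetical stand-in for the container's terminated state.
type terminated struct{ ExitCode int32 }

// containerStatus mirrors only the fields the check needs.
type containerStatus struct {
	Name       string
	Terminated *terminated // nil until kubelet finalizes the container state
}

// pod is a simplified stand-in for the pod object returned by Get().
type pod struct {
	Name              string
	Phase             string
	ContainerStatuses []containerStatus
}

// exitCode reads the exit code of the first container. It fails when the
// terminated state has not been published yet, even if Phase is Succeeded.
func exitCode(p pod) (int32, error) {
	if len(p.ContainerStatuses) == 0 {
		return 0, fmt.Errorf("no container status for pod %s", p.Name)
	}
	cs := p.ContainerStatuses[0]
	if cs.Terminated == nil {
		return 0, fmt.Errorf("no terminated state found for container %s/%s", p.Name, cs.Name)
	}
	return cs.Terminated.ExitCode, nil
}

func main() {
	// Phase already says Succeeded, but the container state is not finalized.
	racy := pod{
		Name:              "wp-XXX",
		Phase:             "Succeeded",
		ContainerStatuses: []containerStatus{{Name: "wp-XXX"}},
	}
	_, err := exitCode(racy)
	fmt.Println("racy read:", err)

	// A moment later kubelet publishes the terminated state and the read succeeds.
	racy.ContainerStatuses[0].Terminated = &terminated{ExitCode: 0}
	code, err := exitCode(racy)
	fmt.Println("finalized read:", code, err)
}
```

The same pod yields an error on the first read and a clean exit code on the second, which is why treating the first result as final produces a false failure.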
Fix
Wrap the post-informer Get() + Terminated == nil check in backoff.Retry with exponential backoff (200ms initial, 5s max interval, 15s total budget). This mirrors the retry pattern already used for TailStep log stream recovery (#5550).

Key design decisions:
- A 404 returns backoff.Permanent: no retry is needed because the pod was GC'd after the informer already observed completion. Returns ExitCode=0, Exited=true (existing behavior from v3.15.0).
- Terminated == nil is retriable: kubelet is eventually consistent; a few hundred ms almost always resolves the race.

Testing
- go vet ./pipeline/backend/kubernetes/ passes
- CGO_ENABLED=0
- v3.15.0-waitstep-retry

Evidence
This race has caused 15+ false pipeline failures in our CI over the past months. The most affected steps are fast-completing ones (cargo lock guard, audit steps) that finish in 50-75 seconds and hit the GC window with high probability. After deploying the patched agent with this retry, no further "no terminated state found" errors have been observed.

Related
- backoff.Retry for TailStep log streaming (same pattern)
- ctx.Done() to WaitStep)