Add back-off retry for pod log streaming to kubernetes backend#5550
Conversation
372900d to
f3508c6
Compare
|
Thanks. I cannot tell anything about how this fixes the issue as I don't know much a out k8s, but just one note: for backoffs we use github.com/cenkalti/backoff/v5 at other points already. No need to reimplement it. |
🙏 modified to use the mentioned library for the backoff logic -> 86b7bba Basically this adds resiliency to Woodpecker Kubernetes backend so it doesn't fail steps so easily when there are temporary issues with fetching the log stream (e.g. in our case the mentioned issue with Woodpecker tailing the logs before the certificates for the kubelet are valid, and that error causing Woodpecker to consider the step as failed) |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5550 +/- ##
=======================================
Coverage 19.51% 19.52%
=======================================
Files 416 416
Lines 39560 39566 +6
=======================================
+ Hits 7720 7725 +5
- Misses 31143 31144 +1
Partials 697 697 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…et finalized Kubelet sets pod.Status.Phase = Succeeded before finalizing containerStatuses[0].state.terminated. When the informer sees the phase change and WaitStep calls Get(), the container status may still show Terminated==nil, causing a hard error: no terminated state found for container wp-XXX/wp-XXX This is especially likely under load (apiserver latency spikes, ResourceQuota admission storms) where the window between phase transition and container status finalization widens from milliseconds to seconds. Fix: wrap the post-informer Get() + terminated check in a backoff.Retry loop using the same configuration as TailStep (backoff.NewExponentialBackOff + maxRetryDuration). 404 returns backoff.Permanent (no retry — pod was GC'd after known-successful completion via the podDeleted handler). Evidence: 15+ pipeline failures in RootReal CI over several months. Fast-completing steps (cargo lock guard, audit) are most affected because they finish inside the GC cleanup window with high probability. Upstream: PR woodpecker-ci#5550 established the backoff.Retry pattern for TailStep; this applies the same pattern to WaitStep.
…et finalized (#6672) ## Problem Kubelet sets `pod.Status.Phase = Succeeded` before finalizing `containerStatuses[0].state.terminated`. When the informer sees the phase change and `WaitStep` calls `Get()`, the container status may still show `Terminated == nil`, causing a hard error: ``` no terminated state found for container wp-XXX/wp-XXX ``` This is a known race in the Kubernetes API server/kubelet eventually-consistent model. The window is normally milliseconds but widens to seconds under load (apiserver latency spikes, ResourceQuota admission storms, node pressure). ## Fix Wrap the post-informer `Get()` + `Terminated == nil` check in `backoff.Retry` with exponential backoff (200ms initial, 5s max interval, 15s total budget). This mirrors the retry pattern already used for `TailStep` log stream recovery (#5550).
Problem
Woodpecker CI fails with
when trying to read container logs from newly provisioned Kubernetes worker nodes. This occurs because:
This can be reproduced by ensuring that the Woodpecker step causes creation of new Kubernetes worker node (we are using Karpenter that handles it), where from time to time it fails with the error.
https://repost.aws/questions/QUK8WLbLYlSs2lKw__7h-uOQ/eks-remote-error-tls-internal-error-when-running-kubectl-logs-command led to us to understand the issue better, and the fact that when provisioning a new worker node, the
kubectl get csr --watchshows 15s period of time when the CSR is pendingsequenceDiagram participant WP as Woodpecker CI participant K8s as Kubernetes API participant Node as New Worker Node participant Kubelet as Kubelet WP->>K8s: Create build pod K8s->>Node: Schedule pod Node->>Kubelet: Start container Note over Node: CSR not yet approved WP->>Kubelet: GET /containerLogs (too early!) Kubelet-->>WP: TLS: internal error Note over Node: CSR gets approved laterSolution
Added retry logic with exponential backoff to the
TailStepfunction in the Kubernetes backend to improve resiliency when fetching log streams. This prevents Woodpecker from marking steps as failed due to temporary issues (such as our specific issue with newly provisioned nodes where kubelet certificates are not yet valid when Woodpecker is already trying to tail the logs).