Add back-off retry for pod log streaming to kubernetes backend by henkka · Pull Request #5550 · woodpecker-ci/woodpecker

henkka · 2025-09-25T13:06:18Z

Problem

Woodpecker CI fails with

rpc error: code = Unknown desc = workflow finished with error Get "https://10.0.9.147:10250/containerLogs/...": remote error: tls: internal error

when trying to read container logs from newly provisioned Kubernetes worker nodes. This occurs because:

New worker nodes require CSR (Certificate Signing Request) approval before kubelet can serve TLS requests
Woodpecker attempts to read logs immediately
Race condition: log collection happens faster than certificate provisioning

This can be reproduced by having a Woodpecker step trigger the creation of a new Kubernetes worker node (we use Karpenter, which handles this); the step then fails intermittently with the error above.

https://repost.aws/questions/QUK8WLbLYlSs2lKw__7h-uOQ/eks-remote-error-tls-internal-error-when-running-kubectl-logs-command helped us understand the issue better: when a new worker node is provisioned, kubectl get csr --watch shows a 15s period during which the CSR is pending:

csr-d5fmg   0s      kubernetes.io/kubelet-serving   system:node:<node>  <none>              Pending
csr-d5fmg   15s     kubernetes.io/kubelet-serving   system:node:<node>   <none>              Approved
csr-d5fmg   15s     kubernetes.io/kubelet-serving   system:node:<node>   <none>              Approved,Issued

sequenceDiagram
    participant WP as Woodpecker CI
    participant K8s as Kubernetes API
    participant Node as New Worker Node
    participant Kubelet as Kubelet
    
    WP->>K8s: Create build pod
    K8s->>Node: Schedule pod
    Node->>Kubelet: Start container
    Note over Node: CSR not yet approved
    WP->>Kubelet: GET /containerLogs (too early!)
    Kubelet-->>WP: TLS: internal error
    Note over Node: CSR gets approved later

Solution

Added retry logic with exponential backoff to the TailStep function in the Kubernetes backend to improve resiliency when fetching log streams. This prevents Woodpecker from marking steps as failed due to temporary issues (such as our specific issue with newly provisioned nodes where kubelet certificates are not yet valid when Woodpecker is already trying to tail the logs).
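
As an illustration of the idea only, here is a minimal retry-with-exponential-backoff sketch in Go that uses nothing but the standard library. It is not the code from this PR (which, after review, uses github.com/cenkalti/backoff/v5). The names `openFunc`, `retryWithBackoff`, `flakyOpen`, and `demo`, as well as all durations, are assumptions made up for this example; `flakyOpen` merely simulates a kubelet that answers with a TLS error until its serving certificate is issued.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// openFunc is a hypothetical stand-in for the call that opens a pod log stream.
type openFunc func(ctx context.Context) (string, error)

// retryWithBackoff calls open until it succeeds, the time budget is spent,
// or ctx is cancelled. The wait doubles after each failure, capped at maxInterval.
// It returns the stream, the number of attempts made, and the final error.
func retryWithBackoff(ctx context.Context, open openFunc, initial, maxInterval, budget time.Duration) (string, int, error) {
	deadline := time.Now().Add(budget)
	interval := initial
	for attempt := 1; ; attempt++ {
		stream, err := open(ctx)
		if err == nil {
			return stream, attempt, nil
		}
		// Give up if the next wait would end after the deadline.
		if time.Now().Add(interval).After(deadline) {
			return "", attempt, fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		select {
		case <-ctx.Done():
			return "", attempt, ctx.Err()
		case <-time.After(interval):
		}
		interval *= 2
		if interval > maxInterval {
			interval = maxInterval
		}
	}
}

// flakyOpen simulates a kubelet that fails with a TLS error for the
// first `failures` calls and then serves the log stream.
func flakyOpen(failures int) openFunc {
	calls := 0
	return func(_ context.Context) (string, error) {
		calls++
		if calls <= failures {
			return "", errors.New("remote error: tls: internal error")
		}
		return "log stream", nil
	}
}

// demo runs the retry loop with short, arbitrary durations.
func demo(failures int) (string, int, error) {
	return retryWithBackoff(context.Background(), flakyOpen(failures),
		5*time.Millisecond, 40*time.Millisecond, 300*time.Millisecond)
}

func main() {
	stream, attempts, err := demo(3)
	fmt.Println(stream, attempts, err)
}
```

With three simulated failures the loop succeeds on the fourth attempt; if the failures outlast the budget, the last error is returned wrapped, so the step would still fail, just not on the first transient error.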

qwerty287 · 2025-09-25T16:04:38Z

Thanks. I cannot say anything about how this fixes the issue as I don't know much about k8s, but just one note: for backoffs we already use github.com/cenkalti/backoff/v5 in other places. No need to reimplement it.

henkka · 2025-09-25T16:53:39Z

> Thanks. I cannot say anything about how this fixes the issue as I don't know much about k8s, but just one note: for backoffs we already use github.com/cenkalti/backoff/v5 in other places. No need to reimplement it.

🙏 modified to use the mentioned library for the backoff logic -> 86b7bba

Basically, this adds resiliency to the Woodpecker Kubernetes backend so that it doesn't fail steps so easily on temporary problems with fetching the log stream (in our case, Woodpecker tailing the logs before the kubelet's certificates are valid, and that error causing Woodpecker to consider the step failed).

codecov · 2025-09-25T18:37:51Z

Codecov Report

❌ Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 19.52%. Comparing base (388557d) to head (86b7bba).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
pipeline/backend/kubernetes/kubernetes.go	0.00%	13 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #5550   +/-   ##
=======================================
  Coverage   19.51%   19.52%           
=======================================
  Files         416      416           
  Lines       39560    39566    +6     
=======================================
+ Hits         7720     7725    +5     
- Misses      31143    31144    +1     
  Partials      697      697


…et finalized

Kubelet sets pod.Status.Phase = Succeeded before finalizing containerStatuses[0].state.terminated. When the informer sees the phase change and WaitStep calls Get(), the container status may still show Terminated==nil, causing a hard error:

no terminated state found for container wp-XXX/wp-XXX

This is especially likely under load (apiserver latency spikes, ResourceQuota admission storms), where the window between phase transition and container status finalization widens from milliseconds to seconds.

Fix: wrap the post-informer Get() + terminated check in a backoff.Retry loop using the same configuration as TailStep (backoff.NewExponentialBackOff + maxRetryDuration). 404 returns backoff.Permanent (no retry: the pod was GC'd after known-successful completion via the podDeleted handler).

Evidence: 15+ pipeline failures in RootReal CI over several months. Fast-completing steps (cargo lock guard, audit) are most affected because they finish inside the GC cleanup window with high probability.

Upstream: PR woodpecker-ci#5550 established the backoff.Retry pattern for TailStep; this applies the same pattern to WaitStep.
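
The distinction between a retryable "not yet terminated" result and a permanent "pod is gone" result can be sketched with the standard library alone. This is not the follow-up commit's code, which according to its message relies on backoff.Retry and backoff.Permanent; the names `permanent`, `retry`, `fakeGet`, `waitExitCode`, the sentinel errors, and the try limit of 5 are all assumptions invented for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors for this sketch.
var (
	errNotFound      = errors.New("pod not found")
	errNotTerminated = errors.New("no terminated state found for container")
)

// permanent marks an error that must not be retried.
type permanent struct{ err error }

func (p permanent) Error() string { return p.err.Error() }
func (p permanent) Unwrap() error { return p.err }

// retry runs op up to maxTries times with a doubling delay between tries.
// It stops immediately when op returns a permanent error and then returns
// the wrapped error. It reports the exit code, the tries used, and the error.
func retry(op func() (int, error), maxTries int, delay time.Duration) (int, int, error) {
	var lastErr error
	for try := 1; try <= maxTries; try++ {
		code, err := op()
		if err == nil {
			return code, try, nil
		}
		var p permanent
		if errors.As(err, &p) {
			return 0, try, p.err
		}
		lastErr = err
		if try < maxTries {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return 0, maxTries, lastErr
}

// fakeGet simulates fetching the pod: the terminated state becomes visible
// on call number readyOn, unless the pod is gone (a 404 in the real API).
func fakeGet(readyOn int, gone bool) func() (int, error) {
	calls := 0
	return func() (int, error) {
		calls++
		if gone {
			return 0, permanent{err: errNotFound}
		}
		if calls < readyOn {
			return 0, errNotTerminated
		}
		return 0, nil // exit code 0 once the terminated state is visible
	}
}

// waitExitCode wires the fake into the retry loop with arbitrary limits.
func waitExitCode(readyOn int, gone bool) (int, int, error) {
	return retry(fakeGet(readyOn, gone), 5, time.Millisecond)
}

func main() {
	fmt.Println(waitExitCode(3, false)) // terminated state visible on the third Get
	fmt.Println(waitExitCode(1, true))  // pod gone: stops after one try
}
```

Treating the not-found case as permanent keeps the loop from burning its whole budget on a pod that will never come back.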

…et finalized (#6672)

## Problem

Kubelet sets `pod.Status.Phase = Succeeded` before finalizing `containerStatuses[0].state.terminated`. When the informer sees the phase change and `WaitStep` calls `Get()`, the container status may still show `Terminated == nil`, causing a hard error:

```
no terminated state found for container wp-XXX/wp-XXX
```

This is a known race in the Kubernetes API server/kubelet eventually-consistent model. The window is normally milliseconds but widens to seconds under load (apiserver latency spikes, ResourceQuota admission storms, node pressure).

## Fix

Wrap the post-informer `Get()` + `Terminated == nil` check in `backoff.Retry` with exponential backoff (200ms initial, 5s max interval, 15s total budget). This mirrors the retry pattern already used for `TailStep` log stream recovery (#5550).
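
To get a feel for what "200ms initial, 5s max interval, 15s total budget" means in practice, the following sketch lists the waits such a schedule would produce. It is a simplified model and not the library's implementation: it assumes a growth multiplier of 1.5 (the default in github.com/cenkalti/backoff; the commit may configure something else), ignores the random jitter the library applies, and counts only sleep time against the budget. The function names are made up for this example.

```go
package main

import (
	"fmt"
	"time"
)

// schedule lists the waits between attempts for an exponential backoff that
// starts at initial, grows by multiplier, is capped at maxInterval, and stops
// once the next wait would push the summed waits past budget.
// Jitter and the time spent in the attempts themselves are ignored.
func schedule(initial, maxInterval, budget time.Duration, multiplier float64) []time.Duration {
	var waits []time.Duration
	var total time.Duration
	wait := initial
	for total+wait <= budget {
		waits = append(waits, wait)
		total += wait
		wait = time.Duration(float64(wait) * multiplier)
		if wait > maxInterval {
			wait = maxInterval
		}
	}
	return waits
}

// waitStepWaits applies the numbers quoted in the commit message;
// the multiplier of 1.5 is an assumption.
func waitStepWaits() []time.Duration {
	return schedule(200*time.Millisecond, 5*time.Second, 15*time.Second, 1.5)
}

func main() {
	var total time.Duration
	for i, w := range waitStepWaits() {
		total += w
		fmt.Printf("wait %d: %v (cumulative %v)\n", i+1, w, total)
	}
}
```

Under these assumptions the early retries come quickly, which suits a race window that is normally milliseconds wide, while the 5s cap only comes into play near the end of the budget.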

fix(k8s): add retry logic with exponential backoff for pod log streaming

f3508c6

henkka force-pushed the k8s-tail-logs branch from 372900d to f3508c6 Compare September 25, 2025 13:08

qwerty287 added bug Something isn't working backend/kubernetes labels Sep 25, 2025

fix: use github.com/cenkalti/backoff/v5 library for backoff logic

86b7bba

xoxys approved these changes Sep 25, 2025

View reviewed changes

xoxys enabled auto-merge (squash) September 25, 2025 18:29

xoxys changed the title ~~fix(k8s): add retry logic with exponential backoff for pod log streaming~~ Add back-off retry for pod log streaming to kubernetes backend Sep 25, 2025

xoxys merged commit a3c3846 into woodpecker-ci:main Sep 25, 2025
8 of 9 checks passed

woodpecker-bot mentioned this pull request Sep 25, 2025

🎉 Release 3.10.0 #5443

Merged

1 task

henkka deleted the k8s-tail-logs branch September 25, 2025 18:41

BrewTestBot mentioned this pull request Sep 28, 2025

woodpecker-cli 3.10.0 Homebrew/homebrew-core#246107

Merged

simonckemper mentioned this pull request May 29, 2026

fix(kubernetes): retry WaitStep when container terminated state not yet finalized #6672

Merged

xoxys merged 2 commits into woodpecker-ci:main from henkka:k8s-tail-logs