Add back-off retry for pod log streaming to kubernetes backend #5550

Merged: xoxys merged 2 commits into woodpecker-ci:main from henkka:k8s-tail-logs on Sep 25, 2025

@henkka

@henkka henkka commented Sep 25, 2025

Contributor

Problem

Woodpecker CI fails with

rpc error: code = Unknown desc = workflow finished with error Get "https://10.0.9.147:10250/containerLogs/...": remote error: tls: internal error

when trying to read container logs from newly provisioned Kubernetes worker nodes. This occurs because:

  • New worker nodes require CSR (Certificate Signing Request) approval before kubelet can serve TLS requests
  • Woodpecker attempts to read logs immediately
  • Race condition: log collection happens faster than certificate provisioning

To reproduce, run a Woodpecker step that forces a new Kubernetes worker node to be created (we use Karpenter for node provisioning). The step then fails intermittently with the error above.

https://repost.aws/questions/QUK8WLbLYlSs2lKw__7h-uOQ/eks-remote-error-tls-internal-error-when-running-kubectl-logs-command helped us understand the issue better. When a new worker node is provisioned, kubectl get csr --watch shows a 15s period during which the CSR is pending:

csr-d5fmg   0s    kubernetes.io/kubelet-serving   system:node:<node>   <none>   Pending
csr-d5fmg   15s   kubernetes.io/kubelet-serving   system:node:<node>   <none>   Approved
csr-d5fmg   15s   kubernetes.io/kubelet-serving   system:node:<node>   <none>   Approved,Issued

sequenceDiagram
    participant WP as Woodpecker CI
    participant K8s as Kubernetes API
    participant Node as New Worker Node
    participant Kubelet as Kubelet
    
    WP->>K8s: Create build pod
    K8s->>Node: Schedule pod
    Node->>Kubelet: Start container
    Note over Node: CSR not yet approved
    WP->>Kubelet: GET /containerLogs (too early!)
    Kubelet-->>WP: TLS: internal error
    Note over Node: CSR gets approved later

Solution

Added retry logic with exponential backoff to the TailStep function in the Kubernetes backend to improve resiliency when fetching log streams. This prevents Woodpecker from marking steps as failed due to temporary issues (such as our specific issue with newly provisioned nodes where kubelet certificates are not yet valid when Woodpecker is already trying to tail the logs).

@qwerty287

Contributor

Thanks. I cannot tell anything about how this fixes the issue as I don't know much about k8s, but just one note: for backoffs we already use github.com/cenkalti/backoff/v5 in other places. No need to reimplement it.

@qwerty287 qwerty287 added bug Something isn't working backend/kubernetes labels Sep 25, 2025
@henkka

henkka commented Sep 25, 2025

Contributor Author

> Thanks. I cannot tell anything about how this fixes the issue as I don't know much about k8s, but just one note: for backoffs we already use github.com/cenkalti/backoff/v5 in other places. No need to reimplement it.

🙏 modified to use the mentioned library for the backoff logic -> 86b7bba

In short, this makes the Woodpecker Kubernetes backend more resilient: a temporary failure to fetch the log stream no longer fails the step right away. In our case, Woodpecker was tailing the logs before the kubelet's certificates were valid, and that error caused the step to be marked as failed.

@xoxys xoxys enabled auto-merge (squash) September 25, 2025 18:29
@xoxys xoxys changed the title fix(k8s): add retry logic with exponential backoff for pod log streaming Add back-off retry for pod log streaming to kubernetes backend Sep 25, 2025
@codecov

codecov Bot commented Sep 25, 2025


Codecov Report

❌ Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 19.52%. Comparing base (388557d) to head (86b7bba).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pipeline/backend/kubernetes/kubernetes.go 0.00% 13 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5550   +/-   ##
=======================================
  Coverage   19.51%   19.52%           
=======================================
  Files         416      416           
  Lines       39560    39566    +6     
=======================================
+ Hits         7720     7725    +5     
- Misses      31143    31144    +1     
  Partials      697      697           


@xoxys xoxys merged commit a3c3846 into woodpecker-ci:main Sep 25, 2025
8 of 9 checks passed
@woodpecker-bot woodpecker-bot mentioned this pull request Sep 25, 2025
1 task
@henkka henkka deleted the k8s-tail-logs branch September 25, 2025 18:41
simonckemper added a commit to simonckemper/woodpecker that referenced this pull request May 29, 2026
…et finalized

Kubelet sets pod.Status.Phase = Succeeded before finalizing
containerStatuses[0].state.terminated. When the informer sees the
phase change and WaitStep calls Get(), the container status may still
show Terminated==nil, causing a hard error:

  no terminated state found for container wp-XXX/wp-XXX

This is especially likely under load (apiserver latency spikes,
ResourceQuota admission storms) where the window between phase
transition and container status finalization widens from milliseconds
to seconds.

Fix: wrap the post-informer Get() + terminated check in a backoff.Retry
loop using the same configuration as TailStep (backoff.NewExponentialBackOff
+ maxRetryDuration). 404 returns backoff.Permanent (no retry — pod
was GC'd after known-successful completion via the podDeleted handler).

Evidence: 15+ pipeline failures in RootReal CI over several months.
Fast-completing steps (cargo lock guard, audit) are most affected
because they finish inside the GC cleanup window with high probability.

Upstream: PR woodpecker-ci#5550 established the backoff.Retry pattern for TailStep;
this applies the same pattern to WaitStep.
6543 pushed a commit that referenced this pull request May 30, 2026
…et finalized (#6672)

## Problem

Kubelet sets `pod.Status.Phase = Succeeded` before finalizing `containerStatuses[0].state.terminated`. When the informer sees the phase change and `WaitStep` calls `Get()`, the container status may still show `Terminated == nil`, causing a hard error:

```
no terminated state found for container wp-XXX/wp-XXX
```

This is a known race in the Kubernetes API server/kubelet eventually-consistent model. The window is normally milliseconds but widens to seconds under load (apiserver latency spikes, ResourceQuota admission storms, node pressure).

## Fix

Wrap the post-informer `Get()` + `Terminated == nil` check in `backoff.Retry` with exponential backoff (200ms initial, 5s max interval, 15s total budget). This mirrors the retry pattern already used for `TailStep` log stream recovery (#5550).

Labels

backend/kubernetes bug Something isn't working

3 participants