Skip to content

[eval] fix(agent): workflow runner use shutdown context#5

Open
Uzay-G wants to merge 8 commits into
eval/base-pr-6021from
eval/upstream-pr-6021
Open

[eval] fix(agent): workflow runner use shutdown context#5
Uzay-G wants to merge 8 commits into
eval/base-pr-6021from
eval/upstream-pr-6021

Conversation

@Uzay-G

@Uzay-G Uzay-G commented Feb 21, 2026

Copy link
Copy Markdown

Mirror of woodpecker-ci#6021 (MERGED) for Orpheus review evaluation.

Upstream: woodpecker-ci#6021


Original PR description:

What was happening

When a workflow/run was canceled from the UI or API, the server correctly marked the run as Canceled, but the runner machines continued executing job steps.

Specifically:

  • The cancel signal was received by the runner
  • The workflow state was updated on the server
  • Pipeline execution on the runner was not interrupted
  • Tests and other long-running steps kept running and consuming runner capacity

This resulted in:

  • Wasted compute time
  • Runner capacity remaining blocked
  • Confusing UX (UI shows Canceled, but jobs keep running)

Related issue

Closes woodpecker-ci#5925

Related PRs

This change complements previous cancellation-related fixes:

While those PRs focus on step and backend cleanup, this PR ensures that workflow cancellation is properly propagated to the runner execution context.

What this PR changes

This PR ensures that canceling a workflow immediately stops execution on the runner.

  • Propagates server-side cancel events to the runner workflow context
  • Cancels the workflow execution context as soon as a cancel signal is received
  • Ensures pipeline execution and backend steps are interrupted
  • Normalizes cancellation handling so the workflow consistently ends as Canceled

Why this approach

In Woodpecker, the workflow context is the single source of truth for execution lifecycle.

Previously, receiving a cancel event did not reliably cancel the workflow execution context, allowing pipeline steps to continue running.
By explicitly canceling the workflow context when a cancel signal is received:

  • The pipeline runtime receives context.Done()
  • Backend implementations (Docker, Kubernetes, etc.) can terminate running steps
  • Runner capacity is freed promptly

This aligns runner behavior with what the UI reports and avoids wasted resources.

Implementation ove

@orpheus-by-fulcrum-dev-uzay-g

Copy link
Copy Markdown

Review started.

@orpheus-by-fulcrum-dev-uzay-g orpheus-by-fulcrum-dev-uzay-g Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All checks passed.

Expected behavior

  • When a workflow is canceled from the UI/API, the server sends a cancel signal through the queue that propagates to the agent
  • The agent's Wait() goroutine receives the cancel error, sets the canceled flag, and cancels the workflow context
  • Pipeline execution stops at the next stage boundary or backend operation (StartStep, WaitStep, DestroyStep)
  • SIGTERM also cancels the workflow context, stopping execution
  • The canceled flag uses atomic.Bool for thread-safe access between the cancel goroutine and the main goroutine
  • The WorkflowState.Canceled field is removed; cancellation is now determined by the queue error state and normalized error handling
  • All three binaries (server, agent, CLI) build and start correctly with the refactored import paths

What happens

  • ✅ Cancellation propagates correctly: queue.ErrorAtOnce(ErrCancel)queue.Wait() returns error → agent cancel goroutine fires → canceled.Store(true)workflowCancel() → pipeline sees context.Done() → returns ErrCancel
  • ✅ Server starts, agent connects via gRPC and polls for workflows; health check returns 204
  • ✅ SIGTERM path works: direct cancel() call stops pipeline execution immediately
  • ✅ Normal completion path unaffected: queue.Done() causes Wait() to return nil, no false cancellation
  • ✅ All 11 queue tests, 2 pipeline tests, 1 RPC test, and 17 server/pipeline tests pass with no regressions
Detailed evidence

Setup

export PATH="/usr/local/go/bin:$HOME/go/bin:$PATH"
export GOPATH="$HOME/go"
export GOMODCACHE="$HOME/go/pkg/mod"

Build

All three binaries compile successfully:

$ make build-agent
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build ... -o dist/woodpecker-agent
$ make build-server
CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build ... -o dist/woodpecker-server
$ make build-cli
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build ... -o dist/woodpecker-cli

$ ls -la dist/
-rwxr-xr-x woodpecker-agent  (52MB)
-rwxr-xr-x woodpecker-server (53MB)
-rwxr-xr-x woodpecker-cli    (65MB)

Server + Agent start

$ WOODPECKER_HOST=http://localhost:8000 \
  WOODPECKER_AGENT_SECRET=dev-agent-secret \
  WOODPECKER_GITEA=true \
  WOODPECKER_FORGE_URL=http://localhost:3000 \
  WOODPECKER_FORGE_CLIENT=dummy-client \
  WOODPECKER_FORGE_SECRET=dummy-secret \
  WOODPECKER_DATABASE_DATASOURCE=/tmp/woodpecker.sqlite \
  ./dist/woodpecker-server &

$ curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/healthz
204

$ WOODPECKER_SERVER=localhost:9000 \
  WOODPECKER_AGENT_SECRET=dev-agent-secret \
  WOODPECKER_BACKEND=local \
  ./dist/woodpecker-agent &

Agent output:
{"message":"agent registered with ID 2"}
{"message":"starting Woodpecker agent with version 'next-860fc34c1d' and backend 'local' using platform 'linux/amd64' running up to 1 pipelines in parallel"}
{"message":"polling new steps"}
{"message":"request next execution"}

Agent connects, registers, and starts polling for work.

Queue cancellation demo

Go program exercising the exact cancel path from the PR (queue → Wait → cancel):

$ go run /tmp/demo_cancel.go

1. Task pushed to queue
2. Task polled by agent: ID=workflow-123
3. Wait started (listening for cancel)
4. Canceling task via ErrorAtOnce (simulating server cancel)...
5. ErrorAtOnce completed
6. Wait returned with error: queue: task canceled
   -> This is correct! The cancel signal was received.

DEMO RESULT: Cancellation propagation works correctly.
  - Queue.ErrorAtOnce() closes the task's done channel with ErrCancel
  - Queue.Wait() unblocks and returns the error
  - Agent's cancel goroutine would see err != nil
  - Agent calls cancel() on workflow context
  - Pipeline stops at next stage boundary or backend call

--- Testing normal completion path ---
7. Task polled: ID=workflow-456
8. Wait returned nil for normal completion - correct!

ALL DEMO SCENARIOS PASSED

E2E cancel with context propagation

Go program simulating the full runner path (queue → Wait → cancel → context.Done → pipeline stops):

$ go run /tmp/demo_e2e_cancel.go

Task polled by agent: cancel-test-1
Cancel listener started (like runner goroutine)
Pipeline started (blocking on context)...

=== Server-side cancel (simulating UI cancel) ===
ErrorAtOnce sent with ErrCancel
Cancel signal received! err=queue: task canceled

Pipeline stopped with: Canceled
Canceled flag: true
Final error: Canceled

SUCCESS: Cancellation propagated correctly through the full path:
  Server cancel -> queue.ErrorAtOnce(ErrCancel)
  -> queue.Wait() returns error
  -> canceled.Store(true)
  -> workflowCancel() called
  -> pipeline sees context.Done() -> returns ErrCancel
  -> normalized to ErrCancel + canceled=true

=== Testing agent SIGTERM path ===
Task polled: sigterm-test-1
Simulating SIGTERM -> calling workflowCancel()
Pipeline stopped with ErrCancel after SIGTERM - correct!

ALL E2E CANCELLATION SCENARIOS PASSED

Test suite results

$ go test ./server/queue/... -count=1
ok  go.woodpecker-ci.org/woodpecker/v3/server/queue    5.294s

$ go test ./pipeline -count=1
ok  go.woodpecker-ci.org/woodpecker/v3/pipeline    0.007s

$ go test ./rpc/... -count=1
ok  go.woodpecker-ci.org/woodpecker/v3/rpc    0.020s

$ go test ./server/pipeline/... -count=1
ok  go.woodpecker-ci.org/woodpecker/v3/server/pipeline    0.019s
ok  go.woodpecker-ci.org/woodpecker/v3/server/pipeline/stepbuilder    0.106s

All tests pass including TestFifoCancel, TestFifoWait, TestFifoErrors, and all server pipeline tests.

System status verification

$ curl -s http://localhost:8000/version
{"source":"https://github.com/woodpecker-ci/woodpecker","version":"next-860fc34c1d"}

$ # Agent registered and polling
Agent 2: backend=local, platform=linux/amd64, version=next-860fc34c1d, capacity=1
Workers: 1, Pending: 0, Running: 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants