Fix auto-pilot stall when belt dispatcher is cancelled by stranske · Pull Request #1665 · stranske/Workflows

stranske · 2026-02-26T15:10:31Z

Summary

Adds retry logic (up to 3 attempts with 15s verification) to the capability-check belt dispatch so a single cancelled run doesn't strand an issue
Adds re-dispatch in the branch-check backoff loop so the auto-pilot can self-heal instead of passively waiting and then stalling

Context

Observed on stranske/Counter_Risk#34: the auto-pilot dispatched the belt, but the GitHub Actions run was silently cancelled before receiving a runner (empty logs, null cancel_actor). No retry existed, so the keepalive loop waited through 5 exponential-backoff cycles (~30 min total), then stalled with needs-human.

Changes

templates/consumer-repo/.github/workflows/agents-auto-pilot.yml:

Capability-check step (~line 1889): Single createWorkflowDispatch call replaced with a 3-attempt loop. After each dispatch, waits 15s and checks listWorkflowRuns for a queued/in_progress run. Retries if the run was cancelled.
Branch-check backoff loop (~line 2298): Before sleeping, checks for an active dispatcher run. If none is found, re-dispatches the belt. Logs the re-dispatch in the issue comment.
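
The capability-check retry described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: `dispatchWithVerification` and the injected `dispatch`, `listRuns`, and `sleep` callbacks are hypothetical names, standing in for the `createWorkflowDispatch` and `listWorkflowRuns` calls whose exact parameters are not shown in this description. The sketch assumes every attempt, including the last, is verified.

```javascript
// Hypothetical sketch: dispatch, wait, then verify that a run is actually alive.
// `dispatch`, `listRuns`, and `sleep` are injected so the loop can be exercised
// without calling the GitHub API.
const ACTIVE_STATUSES = new Set(['queued', 'in_progress']);

async function dispatchWithVerification({
  dispatch,
  listRuns,
  sleep,
  maxAttempts = 3,
  verifyDelayMs = 15000,
}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    await dispatch();
    await sleep(verifyDelayMs);
    const runs = await listRuns();
    if (runs.some((run) => ACTIVE_STATUSES.has(run.status))) {
      // A live run exists, so the happy path exits the loop immediately.
      return { ok: true, attempts: attempt };
    }
    // No live run was found: assume the dispatch was cancelled and try again.
  }
  return { ok: false, attempts: maxAttempts };
}
```

Injecting the three callbacks keeps the retry policy separate from the API client, which makes the cap of 3 attempts easy to check in isolation.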

Test plan

YAML validates (yaml.safe_load)
JS syntax passes (node --check in async wrapper)
Retry loop caps at 3 attempts and doesn't retry on the final attempt
Branch-check re-dispatch only fires when no dispatcher run is alive
Existing happy-path (dispatch succeeds first try) exits loop immediately

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

GitHub Actions ::warning:: commands truncate/mangle multi-line content. Emit a short annotation message and print full npm stderr in a collapsible ::group:: instead, so logs stay readable. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
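
A minimal sketch of that logging pattern, assuming a hypothetical helper named `formatNpmFailure` and an arbitrary 200-character cap on the annotation (neither appears in the PR). It relies only on the standard `::warning::`, `::group::`, and `::endgroup::` workflow commands.

```javascript
// Hypothetical helper: build the log lines for a failed npm command.
// A one-line ::warning:: annotation is followed by the full stderr inside a
// collapsible ::group:: block.
function formatNpmFailure(summary, stderr) {
  // Annotations should stay on one line, so collapse whitespace and cap the
  // length (the 200-character cap is an assumption for illustration).
  const short = summary.replace(/\s+/g, ' ').trim().slice(0, 200);
  return [
    `::warning::${short}`,
    '::group::npm stderr',
    stderr.trimEnd(),
    '::endgroup::',
  ];
}

const exampleLines = formatNpmFailure(
  'npm install failed\n(attempt 1)',
  'npm ERR! code ERESOLVE\nnpm ERR! could not resolve\n'
);
```

Printing `exampleLines.join('\n')` from a workflow step would produce one short annotation plus a folded group holding the multi-line stderr.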

…ation fixes

Mirror the main setup-api-client changes into the consumer-repo template to prevent template drift:

- Exponential backoff retry (3 attempts, 5s/10s) for transient npm errors
- --legacy-peer-deps fallback on first failure
- Short ::warning:: annotations with full stderr in collapsible ::group::
- Pin lru-cache@10.4.3 (was ^10.0.0)

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
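
The 5s/10s schedule mentioned above is what a doubling backoff with a 5-second base yields for three attempts. The sketch below only illustrates that arithmetic; `backoffDelayMs` and `retryDelays` are hypothetical names, and the real action is a shell step rather than JavaScript.

```javascript
// Delay after the Nth failed attempt (1-based): 5s, then 10s, doubling each time.
function backoffDelayMs(failedAttempt, baseMs = 5000) {
  return baseMs * 2 ** (failedAttempt - 1);
}

// With maxAttempts tries there are maxAttempts - 1 waits between them.
function retryDelays(maxAttempts = 3, baseMs = 5000) {
  const delays = [];
  for (let attempt = 1; attempt < maxAttempts; attempt += 1) {
    delays.push(backoffDelayMs(attempt, baseMs));
  }
  return delays;
}

const schedule = retryDelays(); // [5000, 10000]
```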

…feguard

Three changes to reusable-codex-run.yml to prevent work loss on timeout:

1. Pre-timeout watchdog: A background timer fires 5 minutes before max_runtime_minutes, committing and pushing any uncommitted work so it survives the job cancellation. Killed automatically if Codex finishes before the timer fires.
2. Robust parser import: Replace sys.path-based import of codex_jsonl_parser with importlib.util.spec_from_file_location. Consumer repos (e.g. Counter_Risk) have their own tools/ package with __init__.py that shadows the Workflows tools/ on sys.path, causing "No module named 'tools.codex_jsonl_parser'".
3. Commit step always runs: Add if: always() to the "Commit and push changes" step so uncommitted work is captured even on non-zero exit codes (the watchdog handles timeout, this handles failures).

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

parseCheckboxStates() and mergeCheckboxStates() only matched top-level checkboxes (^- \[), ignoring indented sub-tasks ( - \[). When PR Meta regenerated the PR body from the issue, auto-reconciled sub-task checkboxes were silently reverted to unchecked. This caused the keepalive loop to stall with rounds_without_task_completion: 8 despite the agent completing real work — PR #256 had 5 tasks auto-checked then immediately un-checked on every push. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
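
To illustrate the matching change, here is a minimal sketch of a parser that accepts checkboxes at any indentation. It is not the repository's `parseCheckboxStates()`; the function name `parseCheckboxStatesSketch`, the regular expression, and the choice of a `Map` keyed by task text are all assumptions made for the example.

```javascript
// Match "- [ ] text" or "- [x] text" at any indentation level.
// A pattern anchored as ^- \[ would only accept items that start in column zero.
const CHECKBOX_RE = /^(\s*)- \[([ xX])\] (.+)$/;

// Hypothetical parser: returns a Map of task text -> checked (boolean).
function parseCheckboxStatesSketch(body) {
  const states = new Map();
  for (const line of body.split(/\r?\n/)) {
    const match = CHECKBOX_RE.exec(line);
    if (match) {
      states.set(match[3].trim(), match[2].toLowerCase() === 'x');
    }
  }
  return states;
}

const parsed = parseCheckboxStatesSketch(
  ['- [x] Parent task', '  - [x] Indented sub-task', '  - [ ] Pending sub-task'].join('\n')
);
```

With the leading `(\s*)` in place, the two indented sub-tasks are captured alongside the parent, so a later merge step has their checked state available instead of silently resetting them.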

- P1: Add fetch/rebase before watchdog push to avoid non-fast-forward rejection when another workflow updates the branch during the run. Includes one retry with re-fetch/rebase and merge fallback.
- P2: Export watchdog-saved in on.workflow_call.outputs so callers of the reusable workflow can observe the signal.
- Copilot: Add git fetch before checking FETCH_HEAD to ensure it exists and is current (actions/checkout doesn't set FETCH_HEAD).
- Copilot: Initialize watchdog-saved=false before background subshell so downstream consumers always get a defined value.

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

Update WORKFLOW_OUTPUTS.md to include the new watchdog-saved output from reusable-codex-run.yml, fixing the test_reusable_workflow_outputs_documented test. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

The body scan in extractIssueNumberFromPull was treating patterns like "Run #2615" as issue references, causing the Upsert PR body sections check to fail with a 404 when trying to fetch non-existent issues. Add a preceding-word filter to skip #NNN when preceded by common non-issue words (run, attempt, step, job, check, task, version, v). Add 12 unit tests covering the extraction logic. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

… to Claude runner

Closes the three remaining feature gaps between the Claude and Codex runners identified in issue #1646:

1. **Session analysis (LLM-powered)**: Reuses analyze_codex_session.py which auto-detects Claude's plain-text session log (data_source=summary) and feeds it through the same LLM analysis pipeline for structured task completion assessment. Outputs feed into the keepalive loop.
2. **Completion checkpoint comment**: Posts a PR comment summarizing completed tasks and acceptance criteria using the shared post_completion_comment.js script. Supports both claude-prompt*.md and codex-prompt*.md file names.
3. **Error diagnostics**: Adds GITHUB_STEP_SUMMARY with error table, creates a diagnostics artifact (JSON + agent output), and posts a structured PR comment on non-transient failures with recovery guidance and log links. Uses a distinct  marker.

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

Claude runner (reusable-claude-run.yml):

- Fix shell quoting of completed-tasks JSON by using env vars instead of inline ${{ }} expansion, which breaks on apostrophes in task names
- Declare OPENAI_API_KEY and CLAUDE_API_STRANSKE in workflow_call.secrets so callers can pass them (matches Codex runner)
- Use printf instead of echo when writing PR body to disk to avoid mangling of -n/-e prefixes or backslashes
- Add info log when falling back to codex-prompt file

Codex runner (reusable-codex-run.yml):

- Gate watchdog-saved=true on actual push success instead of emitting it unconditionally after push attempts that may have both failed
- Use a fired-flag file so the watchdog kill only terminates the background process if it's still sleeping (hasn't started its commit/push work yet)

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

All four conflicts were in reusable-codex-run.yml watchdog code where our branch has the fired-flag and push-success-gating improvements vs the unchanged main version. Kept our (HEAD) version for all. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

- Remove "task" from the non-issue prefix filter in extractIssueNumberFromPull so "Task #123" is correctly treated as an issue reference (flagged by Codex on PAEM sync PR)
- Make --legacy-peer-deps retry conditional on ERESOLVE/peer-dep errors instead of only firing on the first attempt (flagged by Copilot on TMP sync PR)
- Add test for "Task #N" being treated as a valid issue ref

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
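
The preceding-word filter, with "task" removed from the skip list, could look something like the sketch below. It is an illustration only: `extractIssueNumberFromText`, the regular expression, and the first-match-wins behaviour are assumptions, not the actual body of extractIssueNumberFromPull.

```javascript
// Words that, when directly before "#NNN", suggest the number is not an issue.
// "task" is deliberately absent so that "Task #123" still counts as a reference.
const NON_ISSUE_PREFIXES = new Set(['run', 'attempt', 'step', 'job', 'check', 'version', 'v']);

// Hypothetical scan: return the first "#NNN" not preceded by a skip word, else null.
function extractIssueNumberFromText(text) {
  const pattern = /(?:([A-Za-z]+)\s*)?#(\d+)\b/g;
  for (const match of text.matchAll(pattern)) {
    const prefix = (match[1] || '').toLowerCase();
    if (NON_ISSUE_PREFIXES.has(prefix)) continue; // e.g. "Run #2615"
    return Number(match[2]);
  }
  return null;
}
```

For example, `extractIssueNumberFromText('Run #2615 for Task #123')` skips the run number and returns 123, while a body containing only "Run #2615" yields `null` rather than a request for an issue that does not exist.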

The label sync workflow (maint-69-sync-labels.yml) has been failing since Feb 2 because npm install -g js-yaml installs to the global prefix which actions/github-script can't resolve. Install locally so Node's module resolution finds it in node_modules/. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

Two changes to prevent the issue stranske/Counter_Risk#34 scenario where a single cancelled belt dispatcher run strands an issue:

1. Capability-check step: dispatch the belt up to 3 times with a 15s verification window after each attempt. If the dispatched run is not queued/in_progress, retry. This catches silent cancellations before the auto-pilot moves on.
2. Branch-check loop: on the 2nd+ backoff iteration, check whether any belt dispatcher run is still active. If not, re-dispatch the belt before sleeping. This makes the loop self-healing instead of passively waiting for a run that was already cancelled.

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
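
The branch-check half of this change can be sketched as a small guard that runs before each sleep. The names `ensureDispatcherAlive`, `listRuns`, `dispatch`, and `log` are hypothetical and injected for illustration; the real step lives inline in the workflow's script.

```javascript
// Hypothetical guard called at the top of each backoff iteration (1-based).
// Returns what it did so the caller can mention a re-dispatch in its comment.
async function ensureDispatcherAlive({ iteration, listRuns, dispatch, log }) {
  if (iteration < 2) {
    return 'skipped'; // the first iteration trusts the original dispatch
  }
  const runs = await listRuns();
  const alive = runs.some(
    (run) => run.status === 'queued' || run.status === 'in_progress'
  );
  if (alive) {
    return 'alive';
  }
  await dispatch();
  log(`No active dispatcher run on iteration ${iteration}; re-dispatched the belt.`);
  return 'redispatched';
}
```

Returning a status string rather than a boolean lets the caller distinguish "nothing to do yet" from "a run is still alive", which is useful when writing the issue comment.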

stranske-keepalive · 2026-02-26T15:10:48Z

⚠️ Action Required: Unable to determine source issue for PR #1665. The PR title, branch name, or body must contain the issue number (e.g. #123, branch: issue-123, or the hidden marker ).

agents-workflows-bot · 2026-02-26T15:13:08Z

Automated Status Summary

Head SHA: 7e006e6
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / guard
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job	Result	Logs
(no jobs reported)	⏳ pending	—

Coverage Overview

Coverage history entries: 1

Coverage Trend

Metric	Value
Current	93.12%
Baseline	85.00%
Delta	+8.12%
Minimum	70.00%
Status	✅ Pass

Top Coverage Hotspots (lowest coverage)

File	Coverage	Missing
`src/cli_parser.py`	81.8%	4
`src/percentile_calculator.py`	95.0%	1
`src/aggregator.py`	95.0%	2
`src/__init__.py`	100.0%	0
`src/ndjson_parser.py`	100.0%	0

Updated automatically; will refresh on subsequent CI/Docker completions.

Keepalive checklist

Scope

No scope information available

Tasks

No tasks defined

Acceptance criteria

No acceptance criteria defined

agents-workflows-bot · 2026-02-26T15:13:52Z

🤖 Keepalive Loop Status

PR #1665 | Agent: Codex | Iteration 0/5

Current State

Metric	Value
Iteration progress	[----------] 0/5
Action	wait (missing-agent-label)
Disposition	skipped (transient)
Gate	success
Tasks	0/7 complete
Timeout	45 min (default)
Timeout usage	3m elapsed (7%, 42m remaining)
Keepalive	❌ disabled
Autofix	❌ disabled

🔍 Failure Classification

agents-workflows-bot · 2026-02-26T15:13:52Z

Keepalive Work Log

Time (UTC)	Agent	Action	Result	Files	Progress	Commit	Gate
2026-02-26 15:13:52	Codex	wait (missing-agent-label-transient)	skipped	—	0/7	—	success
2026-02-26 16:31:43	Codex	wait (missing-agent-label-transient)	skipped	—	0/7	—	success
2026-02-26 18:21:42	Codex	wait (missing-agent-label-transient)	skipped	—	0/7	—	success

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a11836372

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

templates/consumer-repo/.github/workflows/agents-auto-pilot.yml

Copilot

Pull request overview

This PR addresses auto-pilot stalls caused by silently cancelled belt dispatcher runs. It adds retry logic with verification to the capability-check dispatch step, and implements self-healing re-dispatch in the branch-check backoff loop.

Changes:

Add 3-attempt retry loop with 15-second verification for belt dispatcher in capability-check step
Add re-dispatch logic in branch-check backoff loop to recover from cancelled dispatcher runs
Improve npm peer dependency handling in setup-api-client to detect specific error patterns
Remove "task" from issue number extraction skip list to allow "Task #N" references
Enhance Claude runner with LLM analysis, completion checkpoints, and error diagnostics
Improve Codex watchdog to avoid race conditions when killing watchdog processes
Fix npm install in maint-69-sync-labels.yml to use local instead of global installation

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
templates/consumer-repo/.github/workflows/agents-auto-pilot.yml	Adds retry/verification for capability-check dispatch and re-dispatch in branch-check loop
.github/workflows/reusable-codex-run.yml	Improves watchdog flag file handling to prevent race conditions during timeout
.github/workflows/reusable-claude-run.yml	Adds LLM task analysis, completion checkpoint comments, and enhanced error reporting
.github/actions/setup-api-client/action.yml	Smarter peer dependency conflict detection using error pattern matching
templates/consumer-repo/.github/actions/setup-api-client/action.yml	Same peer dependency handling improvement (synced)
.github/scripts/agents_pr_meta_keepalive.js	Removes "task" from skip list to allow "Task #N" issue references
templates/consumer-repo/.github/scripts/agents_pr_meta_keepalive.js	Same change (synced)
.github/scripts/tests/agents-pr-meta-keepalive.test.js	Adds test verifying "Task #N" is treated as valid issue reference
.github/workflows/maint-69-sync-labels.yml	Changes npm install from global to local installation

templates/consumer-repo/.github/workflows/agents-auto-pilot.yml

.github/workflows/maint-69-sync-labels.yml

templates/consumer-repo/.github/workflows/agents-auto-pilot.yml

After a progress-review action, rounds_without_task_completion was never reset. The next keepalive trigger would re-evaluate, find the counter still above threshold, enter review again, increment the counter, and repeat, permanently trapping the loop in review mode with no agent work ever running again. This affected all 4 agent PRs (#266, #267, #268, #269), which each stalled at progress-review-N with uncompleted tasks.

Fix:

1. keepalive_loop.js summary function: reset rounds_without_task_completion to 0 after a review action, so the next evaluate triggers a run instead of another review. The review already provided course-correction feedback; the agent needs a chance to act on it.
2. agents-keepalive-loop.yml: add progress-review as a dependency of the summary job so the state update waits for the review to complete before persisting the reset counter.

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
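
The effect of the reset can be shown with a small state-update sketch. Only the "reset to 0 after a review" rule comes from the commit message; the function names, the threshold-based decision, and the other two branches are assumptions added to make the example complete.

```javascript
// Hypothetical counter update performed after each keepalive round.
function nextRoundsWithoutCompletion(previous, { action, tasksCompleted }) {
  if (action === 'review') return 0; // review gave feedback; let the agent run next
  if (tasksCompleted > 0) return 0;  // assumption: real progress also clears the counter
  return previous + 1;               // assumption: an unproductive round increments it
}

// Assumed decision rule: switch to review once the counter reaches the threshold.
function chooseAction(rounds, threshold) {
  return rounds >= threshold ? 'review' : 'run';
}

// A stalled PR at 8 rounds enters review, then returns to running afterwards.
const threshold = 4; // illustrative value, not taken from the repository
const afterReview = nextRoundsWithoutCompletion(8, { action: 'review', tasksCompleted: 0 });
const nextAction = chooseAction(afterReview, threshold);
```

Without the first branch, `afterReview` would stay at or above the threshold and `chooseAction` would keep returning `'review'`, which is the trap the commit describes.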

Fixes from inline code review on PR #1665:

1. Scope dispatcher run verification to the current dispatch by filtering runs created after the dispatch timestamp (dispatchedAt). Previously an old successful run for a different issue would falsely satisfy the check, causing retries to stop early.
2. Verify all dispatch attempts including the final one. Previously the last attempt assumed success without checking, creating a false-positive path when the last run was also cancelled.
3. On verification errors (catch block), continue to the next retry attempt instead of optimistically breaking out of the loop. Transient API errors no longer mask failed dispatches.
4. Scope the branch-check re-dispatch to recent runs (last 30 minutes) instead of any active run. An unrelated dispatcher run for a different issue no longer suppresses re-dispatch.
5. Apply all auto-pilot changes to both .github/workflows/ and templates/consumer-repo/.github/workflows/ per sync conventions.
6. Use --no-save --no-package-lock for npm install in maint-69-sync-labels.yml per repo conventions.

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
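
Items 1 and 4 both come down to filtering runs by creation time before checking their status. A minimal sketch, assuming run objects carry the `status` and `created_at` fields returned by the workflow-runs API; the helper names `findRunForDispatch` and `hasRecentActiveRun` are hypothetical.

```javascript
const ACTIVE = new Set(['queued', 'in_progress']);

// Item 1: only accept a live run created at or after the dispatch timestamp.
function findRunForDispatch(runs, dispatchedAt) {
  const cutoff = dispatchedAt.getTime();
  const hit = runs.find(
    (run) => ACTIVE.has(run.status) && new Date(run.created_at).getTime() >= cutoff
  );
  return hit || null;
}

// Item 4: only treat a live run as relevant if it started in the last 30 minutes.
function hasRecentActiveRun(runs, now, windowMs = 30 * 60 * 1000) {
  const cutoff = now.getTime() - windowMs;
  return runs.some(
    (run) => ACTIVE.has(run.status) && new Date(run.created_at).getTime() >= cutoff
  );
}

const sampleRuns = [
  { status: 'in_progress', created_at: '2026-02-26T14:00:00Z' }, // older, unrelated run
  { status: 'queued', created_at: '2026-02-26T15:10:40Z' },      // run from this dispatch
];
const matched = findRunForDispatch(sampleRuns, new Date('2026-02-26T15:10:31Z'));
```

A real implementation may want a few seconds of tolerance on the cutoff to allow for clock skew between the runner and the API; that detail is left out here.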

Keep --no-save --no-package-lock flags on npm install per repo conventions. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

claude and others added 14 commits February 25, 2026 21:13

chore: sync template scripts

6145c5e

docs: add watchdog-saved to workflow outputs reference

0b9a46c

Update WORKFLOW_OUTPUTS.md to include the new watchdog-saved output from reusable-codex-run.yml, fixing the test_reusable_workflow_outputs_documented test. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

Copilot AI review requested due to automatic review settings February 26, 2026 15:10

stranske temporarily deployed to agent-high-privilege February 26, 2026 15:10 — with GitHub Actions Inactive

Copilot started reviewing on behalf of stranske February 26, 2026 15:11 View session

chatgpt-codex-connector bot reviewed Feb 26, 2026

Copilot AI reviewed Feb 26, 2026

stranske temporarily deployed to agent-high-privilege February 26, 2026 16:28 — with GitHub Actions Inactive

claude added 2 commits February 26, 2026 16:48

Merge main: resolve conflict in maint-69-sync-labels.yml

7874710

Keep --no-save --no-package-lock flags on npm install per repo conventions. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

stranske temporarily deployed to agent-high-privilege February 26, 2026 18:18 — with GitHub Actions Inactive

stranske merged commit b848d60 into main Feb 26, 2026
40 checks passed

stranske deleted the claude/fix-task-completion-concerns-I1gRT branch February 26, 2026 18:25

This was referenced Feb 26, 2026

Follow-up: address sync PR review feedback #1666

Closed

Tighten dispatch verification and drop unnecessary summary dep #1667

Merged

stranske merged 17 commits into main from claude/fix-task-completion-concerns-I1gRT
3 participants