fix: pre-timeout watchdog, robust parser import, and always-commit safeguard by stranske · Pull Request #1659 · stranske/Workflows

stranske · 2026-02-25T21:58:21Z

Summary

Pre-timeout watchdog: A background timer fires 5 minutes before max_runtime_minutes, committing and pushing any uncommitted work so it survives job cancellation. Automatically killed if Codex finishes normally.
Robust parser import: Replaced the sys.path-based import of tools.codex_jsonl_parser with importlib.util.spec_from_file_location so that a consumer repo's own tools/ package cannot shadow the Workflows parser.
Commit step if: always(): The commit step now runs even on non-zero exit codes, capturing uncommitted work on failures.
New watchdog-saved output: Downstream jobs can detect when the watchdog saved work before timeout.
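
The robust parser import in the summary can be illustrated with a minimal Python sketch. It loads a module from an explicit file path, so whatever tools/ package sits first on sys.path is never consulted. The helper name load_module_from_path, the module alias workflows_codex_jsonl_parser, and the parse_line function in the throwaway file are assumptions for illustration; the real parser's API is not shown in this PR.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_module_from_path(module_name: str, file_path: Path):
    """Load a module from an explicit file path, bypassing sys.path lookup."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build import spec for {file_path}")
    module = importlib.util.module_from_spec(spec)
    # Register under a unique name so it cannot collide with a consumer's tools package.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


# Demo: a throwaway file stands in for the real codex_jsonl_parser.py (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    parser_file = Path(tmp) / "codex_jsonl_parser.py"
    parser_file.write_text("def parse_line(line):\n    return line.strip()\n")
    parser = load_module_from_path("workflows_codex_jsonl_parser", parser_file)
    demo_result = parser.parse_line('  {"type": "message"}\n')
```

Because the module is resolved by path rather than by package name, the lookup does not depend on the order of entries in sys.path.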

Context

Run #2615 timed out after 45 minutes with zero commits pushed. The Codex agent's work was entirely lost because the commit step never ran. This PR ensures work is preserved regardless of how the job terminates.

Test plan

Verify pre-timeout watchdog fires 5 minutes before max_runtime_minutes is reached
Verify watchdog is killed cleanly when Codex finishes before timeout
Verify codex_jsonl_parser import succeeds in consumer repos with their own tools/ package
Verify commit step runs on non-zero exit codes
Verify watchdog-saved output propagates to downstream jobs

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

GitHub Actions ::warning:: commands truncate/mangle multi-line content. Emit a short annotation message and print full npm stderr in a collapsible ::group:: instead, so logs stay readable. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

…ation fixes

Mirror the main setup-api-client changes into the consumer-repo template to prevent template drift:

- Exponential backoff retry (3 attempts, 5s/10s) for transient npm errors
- --legacy-peer-deps fallback on first failure
- Short ::warning:: annotations with full stderr in collapsible ::group::
- Pin lru-cache@10.4.3 (was ^10.0.0)

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
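
The retry behaviour listed in this commit message can be sketched in Python as follows. This is an illustration only: the real logic lives in the action's shell step, and how that step combines the --legacy-peer-deps fallback with the backoff delays is an assumption here. The run and sleep callables are hypothetical injection points that keep the sketch testable without calling npm.

```python
import time
from typing import Callable, List, Sequence


def install_with_retry(
    run: Callable[[Sequence[str]], bool],
    base_cmd: Sequence[str] = ("npm", "install"),
    delays: Sequence[int] = (5, 10),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the command up to len(delays) + 1 times, backing off between attempts."""
    cmd: List[str] = list(base_cmd)
    for attempt in range(len(delays) + 1):
        if run(cmd):
            return True
        if attempt == 0:
            # Assumed behaviour: after the first failure, retry with the fallback flag.
            cmd.append("--legacy-peer-deps")
        if attempt < len(delays):
            sleep(delays[attempt])
    return False


# Demo with a fake runner that succeeds on the third attempt.
calls: List[List[str]] = []
waits: List[float] = []


def fake_run(cmd: Sequence[str]) -> bool:
    calls.append(list(cmd))
    return len(calls) == 3


ok = install_with_retry(fake_run, sleep=waits.append)
```

With the default delays, three attempts are made with waits of 5 and 10 seconds between them, matching the "3 attempts, 5s/10s" description.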

…feguard

Three changes to reusable-codex-run.yml to prevent work loss on timeout:

1. Pre-timeout watchdog: A background timer fires 5 minutes before max_runtime_minutes, committing and pushing any uncommitted work so it survives the job cancellation. Killed automatically if Codex finishes before the timer fires.
2. Robust parser import: Replace sys.path-based import of codex_jsonl_parser with importlib.util.spec_from_file_location. Consumer repos (e.g. Counter_Risk) have their own tools/ package with __init__.py that shadows the Workflows tools/ on sys.path, causing "No module named 'tools.codex_jsonl_parser'".
3. Commit step always runs: Add if: always() to the "Commit and push changes" step so uncommitted work is captured even on non-zero exit codes (the watchdog handles timeout, this handles failures).

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
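
The watchdog pattern described in this commit (a background timer that saves work unless the main task finishes first) can be sketched in Python. The workflow itself implements it as a background shell process, so this is an analogy rather than the PR's code. The names run_with_watchdog, work, and save are hypothetical, and the returned boolean plays the role of the watchdog-saved output.

```python
import threading
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def run_with_watchdog(
    work: Callable[[], T],
    save: Callable[[], None],
    delay_seconds: float,
) -> Tuple[T, bool]:
    """Run work(); if it is still running after delay_seconds, call save() once."""
    saved = threading.Event()

    def on_fire() -> None:
        save()
        saved.set()

    timer = threading.Timer(delay_seconds, on_fire)
    timer.daemon = True  # never keep the process alive just for the watchdog
    timer.start()
    try:
        result = work()
    finally:
        # Equivalent of killing the watchdog when the main task finishes first.
        timer.cancel()
    return result, saved.is_set()


saves: List[str] = []
# Work finishes well before the timer: the watchdog is cancelled and never saves.
fast = run_with_watchdog(lambda: "done", lambda: saves.append("fast"), delay_seconds=5.0)
# Work outlasts the timer: the watchdog fires and saves while work is still running.
slow = run_with_watchdog(
    lambda: time.sleep(0.3) or "done", lambda: saves.append("slow"), delay_seconds=0.05
)
```

In the workflow the save action is a commit and push, and the job is cancelled shortly after; here the work function simply returns.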

stranske-keepalive · 2026-02-25T22:01:06Z

Automated Status Summary

Head SHA: f0e3f1d
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / guard
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job	Result	Logs
(no jobs reported)	⏳ pending	—

Coverage Overview

Coverage history entries: 1

Coverage Trend

Metric	Value
Current	93.12%
Baseline	85.00%
Delta	+8.12%
Minimum	70.00%
Status	✅ Pass

Top Coverage Hotspots (lowest coverage)

File	Coverage	Missing
`src/cli_parser.py`	81.8%	4
`src/percentile_calculator.py`	95.0%	1
`src/aggregator.py`	95.0%	2
`src/__init__.py`	100.0%	0
`src/ndjson_parser.py`	100.0%	0

Updated automatically; will refresh on subsequent CI/Docker completions.

Keepalive checklist

Scope

No scope information available

Tasks

No tasks defined

Acceptance criteria

No acceptance criteria defined

stranske-keepalive · 2026-02-25T22:01:56Z

🤖 Keepalive Loop Status

PR #1659 | Agent: Codex | Iteration 0/5

Current State

Metric	Value
Iteration progress	[----------] 0/5
Action	wait (missing-agent-label)
Disposition	skipped (transient)
Gate	success
Tasks	0/9 complete
Timeout	45 min (default)
Timeout usage	3m elapsed (7%, 42m remaining)
Keepalive	❌ disabled
Autofix	❌ disabled

🔍 Failure Classification

stranske-keepalive · 2026-02-25T22:01:57Z

Keepalive Work Log

Time (UTC)	Agent	Action	Result	Files	Progress	Commit	Gate
2026-02-25 22:01:57	Codex	wait (missing-agent-label-transient)	skipped	—	0/9	—	success
2026-02-26 01:38:19	Codex	wait (missing-agent-label-transient)	skipped	—	0/9	—	success
2026-02-26 01:45:07	Codex	wait (missing-agent-label-transient)	skipped	—	0/9	—	failure
2026-02-26 01:49:16	Codex	wait (missing-agent-label-transient)	skipped	—	0/9	—	success
2026-02-26 02:07:37	Codex	wait (missing-agent-label-transient)	skipped	—	0/9	—	success

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c36adc1781

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

Open a pull request for review
Mark a draft as ready
Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

.github/workflows/reusable-codex-run.yml

Copilot

Pull request overview

This PR introduces three defensive mechanisms to prevent work loss when Codex agent jobs time out or fail, addressing the specific issue from run #2615 where 45 minutes of work was lost:

Changes:

Pre-timeout watchdog that commits and pushes work 5 minutes before the job timeout
Robust parser import using importlib.util to avoid consumer repo package shadowing
Always-run commit step (if: always()) to capture work even when Codex exits with errors
Enhanced npm retry logic with exponential backoff and proper error logging using ::group:: blocks

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`.github/workflows/reusable-codex-run.yml`	Adds pre-timeout watchdog background process, robust parser import, watchdog-saved output, and `if: always()` on commit step
`.github/actions/setup-api-client/action.yml`	Improves npm error logging by using `::group::` blocks instead of interpolating multi-line errors into warnings
`templates/consumer-repo/.github/actions/setup-api-client/action.yml`	Mirrors the npm error logging improvements for consumer repos

Comments suppressed due to low confidence (1)

.github/workflows/reusable-codex-run.yml:949

The watchdog starts only if WATCHDOG_DELAY > 60 seconds (line 949), which requires max_runtime_minutes > 6. This reasonably avoids watchdogs that fire too soon, but there is an edge case: for jobs with max_runtime_minutes between 6 and 11, the watchdog fires at most 5 minutes (GRACE_MIN) into the run.

For example, with max_runtime_minutes = 7, the watchdog fires after (7 - 5) * 60 = 120 seconds (2 minutes). That still leaves the full 5 minutes before the limit, but it gives Codex only 2 minutes of work before the save. The warning message "5m before 7m limit" is accurate, yet it can be confusing when the watchdog fires just 2 minutes into the run.

Consider either:

Adjusting the threshold check to ensure a minimum useful fire time (e.g., if [ "$WATCHDOG_DELAY" -gt 300 ] so the watchdog never fires in the first 5 minutes)
Or documenting that for jobs shorter than ~11 minutes the watchdog fires proportionally earlier in the run

          WATCHDOG_DELAY=$(( (MAX_RUNTIME_MIN - GRACE_MIN) * 60 ))
          if [ "$WATCHDOG_DELAY" -gt 60 ]; then
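
The threshold arithmetic in the snippet above can be worked through for a few values of max_runtime_minutes. The Python below only mirrors the two shell lines quoted from the workflow; GRACE_MIN = 5 and the 60-second threshold come from the review comment, and the function names are hypothetical.

```python
GRACE_MIN = 5            # minutes reserved before the job limit (from the review comment)
MIN_DELAY_SECONDS = 60   # the watchdog starts only when the delay exceeds this


def watchdog_delay_seconds(max_runtime_min: int, grace_min: int = GRACE_MIN) -> int:
    """Mirror of WATCHDOG_DELAY=$(( (MAX_RUNTIME_MIN - GRACE_MIN) * 60 ))."""
    return (max_runtime_min - grace_min) * 60


def watchdog_starts(max_runtime_min: int) -> bool:
    """Mirror of: if [ "$WATCHDOG_DELAY" -gt 60 ]."""
    return watchdog_delay_seconds(max_runtime_min) > MIN_DELAY_SECONDS


# (max_runtime_minutes, delay in seconds, watchdog started?)
table = [(m, watchdog_delay_seconds(m), watchdog_starts(m)) for m in (6, 7, 10, 11, 45)]
for row in table:
    print(row)
```

A 6-minute job gets no watchdog (delay of exactly 60 seconds), a 7-minute job gets one that fires after 120 seconds, and the default 45-minute job gets one that fires after 2400 seconds.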

.github/workflows/reusable-codex-run.yml

parseCheckboxStates() and mergeCheckboxStates() only matched top-level checkboxes (^- \[), ignoring indented sub-tasks ( - \[). When PR Meta regenerated the PR body from the issue, auto-reconciled sub-task checkboxes were silently reverted to unchecked. This caused the keepalive loop to stall with rounds_without_task_completion: 8 despite the agent completing real work — PR #256 had 5 tasks auto-checked then immediately un-checked on every push. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
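
The difference between matching only top-level checkboxes and also matching indented sub-tasks can be shown with a small Python sketch. The real parseCheckboxStates() and mergeCheckboxStates() are JavaScript functions whose bodies are not shown here, so the patterns and the helper parse_checkbox_states below are assumptions that only illustrate the idea.

```python
import re
from typing import Dict, Pattern

# Anchored at column 0: indented sub-tasks never match.
TOP_LEVEL_ONLY = re.compile(r"^- \[( |x|X)\] (.+)$")
# Optional leading whitespace: sub-tasks match as well.
ANY_INDENT = re.compile(r"^\s*- \[( |x|X)\] (.+)$")


def parse_checkbox_states(body: str, pattern: Pattern[str]) -> Dict[str, bool]:
    """Map each checkbox label to True (checked) or False (unchecked)."""
    states: Dict[str, bool] = {}
    for line in body.splitlines():
        match = pattern.match(line)
        if match:
            states[match.group(2).strip()] = match.group(1).lower() == "x"
    return states


body = "- [x] Parent task\n  - [x] Sub-task A\n  - [ ] Sub-task B\n"
top_level_states = parse_checkbox_states(body, TOP_LEVEL_ONLY)
all_states = parse_checkbox_states(body, ANY_INDENT)
```

With the top-level pattern, the state of "Sub-task A" is never recorded, so a regenerated body would lose its checked state, which is the kind of silent revert the commit message describes.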

- P1: Add fetch/rebase before watchdog push to avoid non-fast-forward rejection when another workflow updates the branch during the run. Includes one retry with re-fetch/rebase and merge fallback.
- P2: Export watchdog-saved in on.workflow_call.outputs so callers of the reusable workflow can observe the signal.
- Copilot: Add git fetch before checking FETCH_HEAD to ensure it exists and is current (actions/checkout doesn't set FETCH_HEAD).
- Copilot: Initialize watchdog-saved=false before background subshell so downstream consumers always get a defined value.

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
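
The P1 item (fetch and rebase before pushing, one retry, merge as a fallback) might look roughly like the Python sketch below. The exact ordering in the workflow's shell code is not shown in this PR, so the control flow is an assumption. The git parameter is a hypothetical callable that runs a git subcommand and returns True on success, which keeps the sketch runnable without a repository.

```python
from typing import Callable, List, Tuple

Git = Callable[..., bool]


def push_with_rebase(git: Git, branch: str, retries: int = 1) -> bool:
    """Fetch, rebase (or merge as a fallback), then push; retry once on rejection."""
    for _ in range(retries + 1):
        git("fetch", "origin", branch)
        if not git("rebase", f"origin/{branch}"):
            # Assumed fallback: abandon the rebase and merge the remote branch instead.
            git("rebase", "--abort")
            if not git("merge", "--no-edit", f"origin/{branch}"):
                git("merge", "--abort")
                return False
        if git("push", "origin", f"HEAD:{branch}"):
            return True
    return False


# Demo: the first push is rejected (branch moved), the retry succeeds.
calls: List[Tuple[str, ...]] = []
push_results = iter([False, True])


def fake_git(*args: str) -> bool:
    calls.append(args)
    if args[0] == "push":
        return next(push_results)
    return True


pushed = push_with_rebase(fake_git, "feature")
```

Re-fetching inside the loop matters because the branch may move again between the first rejected push and the retry.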

Update WORKFLOW_OUTPUTS.md to include the new watchdog-saved output from reusable-codex-run.yml, fixing the test_reusable_workflow_outputs_documented test. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6

The body scan in extractIssueNumberFromPull was treating patterns like "Run #2615" as issue references, causing the Upsert PR body sections check to fail with a 404 when trying to fetch non-existent issues. Add a preceding-word filter to skip #NNN when preceded by common non-issue words (run, attempt, step, job, check, task, version, v). Add 12 unit tests covering the extraction logic. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
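
A preceding-word filter of the kind described above can be sketched in Python. The real extractIssueNumberFromPull is JavaScript and its implementation is not shown here, so the function extract_issue_number and its regular expressions are assumptions; only the list of skipped words comes from the commit message.

```python
import re
from typing import Optional

# Words that, when directly before "#NNN", mark it as something other than an issue.
NON_ISSUE_WORDS = {"run", "attempt", "step", "job", "check", "task", "version", "v"}


def extract_issue_number(body: str) -> Optional[int]:
    """Return the first #NNN that is not preceded by a known non-issue word."""
    for match in re.finditer(r"#(\d+)", body):
        preceding = re.search(r"([A-Za-z]+)\s*$", body[: match.start()])
        if preceding and preceding.group(1).lower() in NON_ISSUE_WORDS:
            continue
        return int(match.group(1))
    return None


only_run = extract_issue_number("Run #2615 timed out after 45 minutes")
run_then_issue = extract_issue_number("Run #2615 timed out; fixes #123")
bare_reference = extract_issue_number("#77 tracks the follow-up")
```

Here "Run #2615" is skipped, so a body that mentions only a workflow run yields no issue number instead of triggering a lookup for a non-existent issue.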

stranske-keepalive · 2026-02-26T02:04:44Z

⚠️ Action Required: Unable to determine source issue for PR #1659. The PR title, branch name, or body must contain the issue number (e.g. #123, branch: issue-123, or the hidden marker ).

claude added 3 commits February 25, 2026 21:13

Copilot AI review requested due to automatic review settings February 25, 2026 21:58

stranske temporarily deployed to agent-high-privilege February 25, 2026 21:58 — with GitHub Actions Inactive

Copilot started reviewing on behalf of stranske February 25, 2026 21:58

chatgpt-codex-connector bot reviewed Feb 25, 2026

Copilot AI reviewed Feb 25, 2026

stranske mentioned this pull request Feb 25, 2026

fix: detect saved work on cancelled keepalive runs stranske/Counter_Risk#255

Merged

44 tasks

stranske temporarily deployed to agent-high-privilege February 26, 2026 01:35 — with GitHub Actions Inactive

github-actions bot and others added 2 commits February 26, 2026 01:35

chore: sync template scripts

6145c5e

stranske temporarily deployed to agent-high-privilege February 26, 2026 01:42 — with GitHub Actions Inactive

docs: add watchdog-saved to workflow outputs reference

0b9a46c

stranske temporarily deployed to agent-high-privilege February 26, 2026 01:46 — with GitHub Actions Inactive

stranske temporarily deployed to agent-high-privilege February 26, 2026 02:04 — with GitHub Actions Inactive

stranske merged commit 610f5d6 into main Feb 26, 2026
40 checks passed

stranske deleted the claude/fix-task-completion-concerns-I1gRT branch February 26, 2026 02:09

stranske merged 8 commits into main from claude/fix-task-completion-concerns-I1gRT

3 participants
