Conversation
GitHub Actions ::warning:: commands truncate/mangle multi-line content. Emit a short annotation message and print full npm stderr in a collapsible ::group:: instead, so logs stay readable. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
…ation fixes Mirror the main setup-api-client changes into the consumer-repo template to prevent template drift: - Exponential backoff retry (3 attempts, 5s/10s) for transient npm errors - --legacy-peer-deps fallback on first failure - Short ::warning:: annotations with full stderr in collapsible ::group:: - Pin lru-cache@10.4.3 (was ^10.0.0) https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
…feguard Three changes to reusable-codex-run.yml to prevent work loss on timeout: 1. Pre-timeout watchdog: A background timer fires 5 minutes before max_runtime_minutes, committing and pushing any uncommitted work so it survives the job cancellation. Killed automatically if Codex finishes before the timer fires. 2. Robust parser import: Replace sys.path-based import of codex_jsonl_parser with importlib.util.spec_from_file_location. Consumer repos (e.g. Counter_Risk) have their own tools/ package with __init__.py that shadows the Workflows tools/ on sys.path, causing "No module named 'tools.codex_jsonl_parser'". 3. Commit step always runs: Add if: always() to the "Commit and push changes" step so uncommitted work is captured even on non-zero exit codes (the watchdog handles timeout, this handles failures). https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
Automated Status SummaryHead SHA: f0e3f1d
Coverage Overview
Coverage Trend
Top Coverage Hotspots (lowest coverage)
Updated automatically; will refresh on subsequent CI/Docker completions. Keepalive checklistScopeNo scope information available Tasks
Acceptance criteria
|
🤖 Keepalive Loop StatusPR #1659 | Agent: Codex | Iteration 0/5 Current State
🔍 Failure Classification| Error type | infrastructure | |
Keepalive Work Log (click to expand)
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c36adc1781
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR introduces three defensive mechanisms to prevent work loss when Codex agent jobs timeout or fail, addressing the specific issue from run #2615 where 45 minutes of work was lost:
Changes:
- Pre-timeout watchdog that commits and pushes work 5 minutes before the job timeout
- Robust parser import using
importlib.utilto avoid consumer repo package shadowing - Always-run commit step (
if: always()) to capture work even when Codex exits with errors - Enhanced npm retry logic with exponential backoff and proper error logging using
::group::blocks
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
.github/workflows/reusable-codex-run.yml |
Adds pre-timeout watchdog background process, robust parser import, watchdog-saved output, and if: always() on commit step |
.github/actions/setup-api-client/action.yml |
Improves npm error logging by using ::group:: blocks instead of interpolating multi-line errors into warnings |
templates/consumer-repo/.github/actions/setup-api-client/action.yml |
Mirrors the npm error logging improvements for consumer repos |
Comments suppressed due to low confidence (1)
.github/workflows/reusable-codex-run.yml:949
- The watchdog is only started if
WATCHDOG_DELAY > 60seconds (line 949), meaning it requiresmax_runtime_minutes > 6. While this is reasonable to avoid watchdogs that fire too soon, there's an edge case where jobs withmax_runtime_minutesbetween 6 and 11 minutes will have a watchdog that fires in less than 5 minutes (the intended GRACE_MIN).
For example, with max_runtime_minutes = 7, the watchdog fires after (7 - 5) * 60 = 120 seconds (2 minutes), giving only 5 minutes remaining, not the expected 5 minutes of grace period. This could be confusing since the warning message states "5m before 7m limit" but actually fires at 2m.
Consider either:
- Adjusting the threshold check to ensure at least a minimum useful grace period (e.g.,
if [ "$WATCHDOG_DELAY" -gt 300 ]for 5-minute minimum fire time) - Or documenting that jobs shorter than ~11 minutes will have proportionally shorter grace periods
WATCHDOG_DELAY=$(( (MAX_RUNTIME_MIN - GRACE_MIN) * 60 ))
if [ "$WATCHDOG_DELAY" -gt 60 ]; then
parseCheckboxStates() and mergeCheckboxStates() only matched top-level checkboxes (^- \[), ignoring indented sub-tasks ( - \[). When PR Meta regenerated the PR body from the issue, auto-reconciled sub-task checkboxes were silently reverted to unchecked. This caused the keepalive loop to stall with rounds_without_task_completion: 8 despite the agent completing real work — PR #256 had 5 tasks auto-checked then immediately un-checked on every push. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
- P1: Add fetch/rebase before watchdog push to avoid non-fast-forward rejection when another workflow updates the branch during the run. Includes one retry with re-fetch/rebase and merge fallback. - P2: Export watchdog-saved in on.workflow_call.outputs so callers of the reusable workflow can observe the signal. - Copilot: Add git fetch before checking FETCH_HEAD to ensure it exists and is current (actions/checkout doesn't set FETCH_HEAD). - Copilot: Initialize watchdog-saved=false before background subshell so downstream consumers always get a defined value. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
Update WORKFLOW_OUTPUTS.md to include the new watchdog-saved output from reusable-codex-run.yml, fixing the test_reusable_workflow_outputs_documented test. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
The body scan in extractIssueNumberFromPull was treating patterns like "Run #2615" as issue references, causing the Upsert PR body sections check to fail with a 404 when trying to fetch non-existent issues. Add a preceding-word filter to skip #NNN when preceded by common non-issue words (run, attempt, step, job, check, task, version, v). Add 12 unit tests covering the extraction logic. https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
Summary
max_runtime_minutes, committing and pushing any uncommitted work so it survives job cancellation. Automatically killed if Codex finishes normally.sys.path-basedfrom tools.codex_jsonl_parserwithimportlib.util.spec_from_file_locationto avoid consumer repotools/package shadowing the Workflows parser.if: always(): The commit step now runs even on non-zero exit codes, capturing uncommitted work on failures.watchdog-savedoutput: Downstream jobs can detect when the watchdog saved work before timeout.Context
Run #2615 timed out after 45 minutes with zero commits pushed. The Codex agent's work was entirely lost because the commit step never ran. This PR ensures work is preserved regardless of how the job terminates.
Test plan
max_runtime_minutesis exceededcodex_jsonl_parserimport succeeds in consumer repos with their owntools/packagewatchdog-savedoutput propagates to downstream jobshttps://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6