fix: prevent keepalive from stopping without fixing CI failures #1651
…loop

- Use nullish coalescing (`??`) instead of logical OR (`||`) for `tasksTotal` in work-log table rows so that 0 displays as "0" instead of "?"
- Use `previousState?.iteration ?? iteration` instead of bare `iteration` in the `rounds_without_task_completion` recalculation to stay consistent with the "current persisted iteration" rule (line 2739-2741)

Both fixes address review feedback from Copilot on Counter_Risk PR #234.

https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
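The difference between the two operators in the commit above can be illustrated with a small sketch. The helper names `formatTasksCell` and `currentPersistedIteration` are hypothetical and are not taken from `keepalive_loop.js`; only `tasksTotal`, `previousState`, and `iteration` come from the commit message.

```javascript
// Hypothetical helper: renders the "done/total" cell of a work-log row
// both ways so the behaviors can be compared side by side.
function formatTasksCell(tasksDone, tasksTotal) {
  return {
    // `||` treats every falsy value (including 0) as missing.
    withOr: `${tasksDone}/${tasksTotal || '?'}`,
    // `??` only falls back when the value is null or undefined.
    withNullish: `${tasksDone}/${tasksTotal ?? '?'}`,
  };
}

// Hypothetical helper: prefers the persisted iteration, even when it is 0,
// and falls back to the in-memory value only when no state was persisted.
function currentPersistedIteration(previousState, iteration) {
  return previousState?.iteration ?? iteration;
}

const zeroTotal = formatTasksCell(0, 0);
const missingTotal = formatTasksCell(0, undefined);
console.log(zeroTotal);    // { withOr: '0/?', withNullish: '0/0' }
console.log(missingTotal); // { withOr: '0/?', withNullish: '0/?' }
console.log(currentPersistedIteration({ iteration: 0 }, 3)); // 0
console.log(currentPersistedIteration(undefined, 3));        // 3
```

Both operators agree when the value is missing; they differ only for legitimate falsy values such as 0, which is the case the commit targets.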
When all tasks were complete but Gate was failing (e.g., lint-ruff), the keepalive loop would stop after `complete_gate_failure_rounds` reached its max, without giving fix attempts a fair chance. Three interacting issues caused this:

1. The `complete-gate-failure-max` check fired BEFORE the fix classification logic in the decision tree, blocking fix attempts once the counter reached the max.
2. Transient gate states (cancelled) incremented the counter even though they don't represent actual fix failures, consuming the fix budget with infrastructure noise.
3. The `consecutive_fix_rounds` counter was reset on wait/stop actions, losing track of prior fix attempts.

Changes:

- Restructure evaluate decision tree: handle cancelled gates first (without consuming fix budget), then try fix before stopping when all tasks complete and gate failing
- Only increment `complete_gate_failure_rounds` on actual agent execution rounds (fix/run/conflict), not on wait/skip/stop
- Preserve `consecutive_fix_rounds` across wait/stop/defer actions (only reset on non-fix agent execution)
- Increase default `completeGateFailureMax` from 2 to 3, allowing 2 fix attempts before stopping
- Add 10 focused tests for counter behavior

Fixes the issue seen in Counter_Risk PR #235 where the agent completed all 27 tasks but stopped with `complete-gate-failure-max` despite lint-ruff failures that could have been fixed.

https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
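The reordered decision tree described in this commit might look roughly like the sketch below. It is not the code from `keepalive_loop.js`: the function name `evaluateSketch`, the input field names, the `maxFixRounds` budget, and every reason string other than `complete-gate-failure-max` and `gate-cancelled-transient` are assumptions made for illustration.

```javascript
// Illustrative sketch only; field names and most reason strings are assumed.
function evaluateSketch({
  gateConclusion,                // assumed values: 'success' | 'failure' | 'cancelled'
  allTasksComplete,
  failureIsFixable = false,      // assumed result of the fix classification step
  consecutiveFixRounds = 0,
  maxFixRounds = 2,              // hypothetical fix budget
  completeGateFailureRounds = 0,
  completeGateFailureMax = 3,    // default raised from 2 to 3 in this PR
}) {
  // 1. Cancelled gates are transient: wait without consuming any budget.
  if (gateConclusion === 'cancelled') {
    return { action: 'wait', reason: 'gate-cancelled-transient' };
  }

  if (allTasksComplete && gateConclusion === 'failure') {
    // 2. Try a fix BEFORE looking at the stop condition.
    if (failureIsFixable && consecutiveFixRounds < maxFixRounds) {
      return { action: 'fix', reason: 'fix-gate-failure' };
    }
    // 3. Stop only after fix attempts are no longer available.
    if (completeGateFailureRounds >= completeGateFailureMax) {
      return { action: 'stop', reason: 'complete-gate-failure-max' };
    }
    return { action: 'wait', reason: 'gate-failure-pending' };
  }

  if (allTasksComplete) {
    return { action: 'stop', reason: 'tasks-complete' };
  }
  return { action: 'run', reason: 'tasks-remaining' };
}

// Counter already at max, but a fix has not been tried yet: fix wins.
const fixFirst = evaluateSketch({
  gateConclusion: 'failure',
  allTasksComplete: true,
  failureIsFixable: true,
  completeGateFailureRounds: 3,
});
console.log(fixFirst.action); // 'fix'
```

Placing the fix branch ahead of the max check is the point of the sketch: under the old ordering, the same input would have stopped with `complete-gate-failure-max` before any fix was dispatched.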
Automated Status Summary
Head SHA: 34eee18
Coverage Overview
Coverage Trend
Top Coverage Hotspots (lowest coverage)
Updated automatically; will refresh on subsequent CI/Docker completions.
Keepalive checklist
Scope
Tasks
Acceptance criteria
🤖 Keepalive Loop Status
PR #1651 | Agent: Codex | Iteration 0/5
Current State
🔍 Failure Classification
| Error type | infrastructure |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f7b8adbff8
Pull request overview
This PR fixes a critical issue where the keepalive loop stopped without attempting to fix CI failures when all tasks were complete but the gate was failing (as occurred in PR #235). The fix restructures the decision tree to dispatch fix attempts before stopping, prevents transient gate states from consuming the fix budget, and increases the default failure threshold to allow more fix attempts.
Changes:
- Restructured the `evaluateKeepaliveLoop` decision tree to attempt fixes before stopping when all tasks are complete but the gate is failing
- Updated counter logic to only increment `complete_gate_failure_rounds` on actual agent-execution rounds (fix/run/conflict), not on transient wait/skip/stop actions
- Increased default `completeGateFailureMax` from 2 to 3, allowing 2 fix attempts instead of 1
- Added comprehensive test coverage for counter behavior across different action types
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/keepalive-gate-failure-counter.test.js | New focused test file validating counter increment/preserve/reset behavior for all action types (10 tests) |
| .github/scripts/keepalive_loop.js | Core logic changes: restructured decision tree in evaluateKeepaliveLoop, updated counter logic in both evaluateKeepaliveLoop and updateKeepaliveLoopSummary, increased default max from 2 to 3 |
| templates/consumer-repo/.github/scripts/keepalive_loop.js | Template sync of changes (incomplete - missing critical decision tree restructuring in evaluateKeepaliveLoop) |
1. Fix infinite wait loop for non-fixable gate failures: remove the `isAgentExecution` requirement from the counter increment. The `gateActuallyFailed` check already filters transient states (cancelled/pending), so the counter advances on every genuine failure round regardless of action type.
2. Sync template `keepalive_loop.js` with all main file changes:
   - Restructured evaluate decision tree (cancelled → allComplete → remaining)
   - Updated counter logic (increment on actual failure, preserve on non-success)
   - Fix round preservation across wait/stop/defer actions
   - Default `completeGateFailureMax` 2 → 3
3. Add test for the non-fixable wait+failure scenario; update stop-action test expectation to match new counter semantics.

https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
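The revised counter semantics from this follow-up commit can be sketched as a pure update function. This is an illustration under assumptions, not the code in `keepalive_loop.js`: the function name `nextCounters`, the shape of its arguments, and the reset-on-success rule are hypothetical. Only the counter names, `gateActuallyFailed`, and the action names come from the commit messages.

```javascript
// Actions that count as agent execution, per the commit messages.
const AGENT_ACTIONS = new Set(['fix', 'run', 'conflict']);

// Hypothetical pure update of the two counters for one keepalive round.
function nextCounters(prev, { action, gateConclusion, allTasksComplete }) {
  // Cancelled and pending gates are transient, so they are not real failures.
  const gateActuallyFailed = gateConclusion === 'failure';

  let completeGateFailureRounds = prev.complete_gate_failure_rounds ?? 0;
  if (gateConclusion === 'success') {
    completeGateFailureRounds = 0; // assumption: success clears the counter
  } else if (allTasksComplete && gateActuallyFailed) {
    // Advances on every genuine failure round, regardless of action type,
    // so a non-fixable failure cannot wait forever.
    completeGateFailureRounds += 1;
  } // otherwise (transient state): preserve the previous value

  let consecutiveFixRounds = prev.consecutive_fix_rounds ?? 0;
  if (action === 'fix') {
    consecutiveFixRounds += 1;
  } else if (AGENT_ACTIONS.has(action)) {
    consecutiveFixRounds = 0; // reset only on non-fix agent execution
  } // wait/stop/defer: preserve the previous value

  return {
    complete_gate_failure_rounds: completeGateFailureRounds,
    consecutive_fix_rounds: consecutiveFixRounds,
  };
}

const prev = { complete_gate_failure_rounds: 1, consecutive_fix_rounds: 1 };
console.log(nextCounters(prev, { action: 'wait', gateConclusion: 'failure', allTasksComplete: true }));
// { complete_gate_failure_rounds: 2, consecutive_fix_rounds: 1 }
console.log(nextCounters(prev, { action: 'wait', gateConclusion: 'cancelled', allTasksComplete: true }));
// { complete_gate_failure_rounds: 1, consecutive_fix_rounds: 1 }
```

Keeping the update free of side effects makes each action type easy to cover with a focused test, which matches the style of the counter tests the PR describes.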
The cancelled gate test expected 'gate-cancelled' but the code now returns 'gate-cancelled-transient' for non-rate-limit cancellations. Updated the assertion to match. https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
Automated Status Summary
Scope
Tasks
Acceptance criteria
Acceptance criteria section missing from source issue.
Head SHA: cc5b478
Latest Runs: ✅ success — Gate
Required: gate: ✅ success
| Workflow / Job | Result | Logs |
|----------------|--------|------|
| Agents PR meta manager | ❔ in progress | View run |
| CI Autofix Loop | ✅ success | View run |
| Copilot code review | ✅ success | View run |
| Gate | ✅ success | View run |
| Health 40 Sweep | ✅ success | View run |
| Health 44 Gate Branch Protection | ✅ success | View run |
| Health 45 Agents Guard | ✅ success | View run |
| Health 50 Security Scan | ✅ success | View run |
| Maint 52 Validate Workflows | ✅ success | View run |
| PR 11 - Minimal invariant CI | ✅ success | View run |
| Selftest CI | ✅ success | View run |
Head SHA: beebaeb
Latest Runs: ⏳ queued — Gate
Required: gate: ⏳ queued
| Workflow / Job | Result | Logs |
|----------------|--------|------|
| Agents PR meta manager | ❔ in progress | View run |
| CI Autofix Loop | ❔ in progress | View run |
| Copilot code review | ❔ in progress | View run |
| Gate | ⏳ queued | View run |
| Health 40 Sweep | ✅ success | View run |
| Health 44 Gate Branch Protection | ❔ in progress | View run |
| Health 45 Agents Guard | ✅ success | View run |
| Health 50 Security Scan | ❔ in progress | View run |
| Maint 52 Validate Workflows | ✅ success | View run |
| PR 11 - Minimal invariant CI | ✅ success | View run |
| Selftest CI | ❔ in progress | View run |
Head SHA: 87ad246
Latest Runs: ❔ in progress — Agents PR meta manager
Required: gate: ⏸️ not started