fix: prevent keepalive from stopping without fixing CI failures by stranske · Pull Request #1651 · stranske/Workflows

stranske · 2026-02-24T17:51:10Z

Source: Issue #235

Automated Status Summary

Scope

Scope section missing from source issue.

Tasks

Tasks section missing from source issue.

Acceptance criteria

Head SHA: 87ad246
Latest Runs: ❔ in progress — Agents PR meta manager
Required: gate: ⏸️ not started

Workflow / Job	Result	Logs
Agents PR meta manager	❔ in progress	View run

…loop

- Use nullish coalescing (??) instead of logical OR (||) for tasksTotal in work-log table rows so that 0 displays as "0" instead of "?"
- Use previousState?.iteration ?? iteration instead of bare iteration in rounds_without_task_completion recalculation to stay consistent with the "current persisted iteration" rule (line 2739-2741)

Both fixes address review feedback from Copilot on Counter_Risk PR #234.

https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
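The difference between the two operators in the commit message above can be illustrated with a minimal sketch. The helper names (`totalCellWithOr`, `totalCellWithNullish`, `persistedIteration`) are hypothetical and are not taken from keepalive_loop.js; only the `??` and `?.` expressions mirror the ones quoted in the commit.

```javascript
// Hypothetical helpers for illustration only; not the actual keepalive_loop.js code.

// Logical OR treats every falsy value (including 0) as missing.
function totalCellWithOr(tasksTotal) {
  return String(tasksTotal || '?');
}

// Nullish coalescing falls back only on null or undefined, so 0 survives.
function totalCellWithNullish(tasksTotal) {
  return String(tasksTotal ?? '?');
}

// Prefer the persisted iteration when a previous state exists,
// otherwise fall back to the iteration passed in.
function persistedIteration(previousState, iteration) {
  return previousState?.iteration ?? iteration;
}

console.log(totalCellWithOr(0));              // ?
console.log(totalCellWithNullish(0));         // 0
console.log(totalCellWithNullish(undefined)); // ?
console.log(persistedIteration({ iteration: 2 }, 3)); // 2
console.log(persistedIteration(null, 3));             // 3
```

Note that a persisted iteration of 0 is also kept by `??`, which a `||` fallback would silently replace.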

When all tasks were complete but Gate was failing (e.g., lint-ruff), the keepalive loop would stop after `complete_gate_failure_rounds` reached its max, without giving fix attempts a fair chance. Three interacting issues caused this:

1. The `complete-gate-failure-max` check fired BEFORE the fix classification logic in the decision tree, blocking fix attempts once the counter reached the max.
2. Transient gate states (cancelled) incremented the counter even though they don't represent actual fix failures, consuming the fix budget with infrastructure noise.
3. The `consecutive_fix_rounds` counter was reset on wait/stop actions, losing track of prior fix attempts.

Changes:

- Restructure evaluate decision tree: handle cancelled gates first (without consuming fix budget), then try fix before stopping when all tasks complete and gate failing
- Only increment complete_gate_failure_rounds on actual agent execution rounds (fix/run/conflict), not on wait/skip/stop
- Preserve consecutive_fix_rounds across wait/stop/defer actions (only reset on non-fix agent execution)
- Increase default completeGateFailureMax from 2 to 3, allowing 2 fix attempts before stopping
- Add 10 focused tests for counter behavior

Fixes the issue seen in Counter_Risk PR #235 where the agent completed all 27 tasks but stopped with complete-gate-failure-max despite lint-ruff failures that could have been fixed.

https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
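The reordered decision tree described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the real `evaluateKeepaliveLoop`: the function name `decideAction`, the input field names, and every reason string except `complete-gate-failure-max` are hypothetical, and the exact interplay between the fix check and the max check in the actual script may differ.

```javascript
// Sketch of the decision order only; field names and most reason strings are assumptions.
function decideAction(state) {
  const {
    gateConclusion,            // 'success' | 'failure' | 'cancelled' | ...
    allTasksComplete,
    failureIsFixable,          // result of a (hypothetical) failure classification
    completeGateFailureRounds,
    completeGateFailureMax = 3, // default raised from 2 to 3 in this PR
  } = state;

  // 1. Cancelled gates are handled first and never consume the fix budget.
  if (gateConclusion === 'cancelled') {
    return { action: 'wait', reason: 'gate-cancelled' };
  }

  // 2. All tasks complete but gate failing: try a fix before stopping.
  if (allTasksComplete && gateConclusion === 'failure') {
    const budgetLeft = completeGateFailureRounds < completeGateFailureMax;
    if (failureIsFixable && budgetLeft) {
      return { action: 'fix', reason: 'fix-gate-failure' };
    }
    if (!budgetLeft) {
      return { action: 'stop', reason: 'complete-gate-failure-max' };
    }
    return { action: 'wait', reason: 'gate-failing-not-fixable' };
  }

  // 3. Remaining cases.
  if (allTasksComplete) {
    return { action: 'stop', reason: 'tasks-complete' };
  }
  return { action: 'run', reason: 'tasks-remaining' };
}

console.log(decideAction({
  gateConclusion: 'failure',
  allTasksComplete: true,
  failureIsFixable: true,
  completeGateFailureRounds: 2,
}).action); // fix
```

The point of the ordering is that the stop branch is reached only after the fix branch has been considered, which is the reverse of the behavior listed as issue 1.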

stranske-keepalive · 2026-02-24T17:53:41Z

Automated Status Summary

Head SHA: 34eee18
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / guard
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job	Result	Logs
(no jobs reported)	⏳ pending	—

Coverage Overview

Coverage history entries: 1

Coverage Trend

Metric	Value
Current	93.12%
Baseline	85.00%
Delta	+8.12%
Minimum	70.00%
Status	✅ Pass

Top Coverage Hotspots (lowest coverage)

File	Coverage	Missing
`src/cli_parser.py`	81.8%	4
`src/percentile_calculator.py`	95.0%	1
`src/aggregator.py`	95.0%	2
`src/__init__.py`	100.0%	0
`src/ndjson_parser.py`	100.0%	0

Updated automatically; will refresh on subsequent CI/Docker completions.

Keepalive checklist

Scope

Scope section missing from source issue.

Tasks

Tasks section missing from source issue.

Acceptance criteria

Acceptance criteria section missing from source issue.

stranske-keepalive · 2026-02-24T17:54:34Z

🤖 Keepalive Loop Status

PR #1651 | Agent: Codex | Iteration 0/5

Current State

Metric	Value
Iteration progress	[----------] 0/5
Action	wait (missing-agent-label)
Disposition	skipped (transient)
Gate	success
Tasks	0/34 complete
Timeout	45 min (default)
Timeout usage	3m elapsed (7%, 42m remaining)
Keepalive	❌ disabled
Autofix	❌ disabled

🔍 Failure Classification

stranske-keepalive · 2026-02-24T17:54:35Z

Keepalive Work Log

Time (UTC)	Agent	Action	Result	Files	Progress	Commit	Gate
2026-02-24 17:54:35	Codex	wait (missing-agent-label-transient)	skipped	—	3/8	—	failure
2026-02-24 18:19:37	Codex	wait (missing-agent-label-transient)	skipped	—	0/34	—	failure
2026-02-24 18:42:19	Codex	wait (missing-agent-label-transient)	skipped	—	0/34	—	success

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7b8adbff8

.github/scripts/keepalive_loop.js

templates/consumer-repo/.github/scripts/keepalive_loop.js

Copilot

Pull request overview

This PR fixes a critical issue where the keepalive loop stopped without attempting to fix CI failures when all tasks were complete but the gate was failing (as occurred in PR #235). The fix restructures the decision tree to dispatch fix attempts before stopping, prevents transient gate states from consuming the fix budget, and increases the default failure threshold to allow more fix attempts.

Changes:

- Restructured evaluateKeepaliveLoop decision tree to attempt fixes before stopping when all tasks are complete but gate is failing
- Updated counter logic to only increment complete_gate_failure_rounds on actual agent-execution rounds (fix/run/conflict), not on transient wait/skip/stop actions
- Increased default completeGateFailureMax from 2 to 3, allowing 2 fix attempts instead of 1
- Added comprehensive test coverage for counter behavior across different action types

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
tests/keepalive-gate-failure-counter.test.js	New focused test file validating counter increment/preserve/reset behavior for all action types (10 tests)
.github/scripts/keepalive_loop.js	Core logic changes: restructured decision tree in `evaluateKeepaliveLoop`, updated counter logic in both `evaluateKeepaliveLoop` and `updateKeepaliveLoopSummary`, increased default max from 2 to 3
templates/consumer-repo/.github/scripts/keepalive_loop.js	Template sync of changes (incomplete - missing critical decision tree restructuring in `evaluateKeepaliveLoop`)

templates/consumer-repo/.github/scripts/keepalive_loop.js

1. Fix infinite wait loop for non-fixable gate failures: remove `isAgentExecution` requirement from counter increment. The `gateActuallyFailed` check already filters transient states (cancelled/pending), so the counter advances on every genuine failure round regardless of action type.
2. Sync template keepalive_loop.js with all main file changes:
   - Restructured evaluate decision tree (cancelled → allComplete → remaining)
   - Updated counter logic (increment on actual failure, preserve on non-success)
   - Fix round preservation across wait/stop/defer actions
   - Default completeGateFailureMax 2 → 3
3. Add test for the non-fixable wait+failure scenario; update stop-action test expectation to match new counter semantics.

https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
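The counter semantics described in this follow-up commit can be sketched as below. It is a minimal illustration, assuming a helper named `updateCounters` and camelCase field names that are not taken from the script; the reset-to-zero on a successful gate is also an assumption, since the commit message only states "increment on actual failure, preserve on non-success".

```javascript
// Illustrative sketch; `updateCounters` and its field names are hypothetical.
const TRANSIENT_GATE_STATES = new Set(['cancelled', 'pending']);
const NON_FIX_AGENT_ACTIONS = new Set(['run', 'conflict']);

function updateCounters(prev, round) {
  const { action, gateConclusion, allTasksComplete } = round;
  let { completeGateFailureRounds, consecutiveFixRounds } = prev;

  // Transient states are filtered here, so no isAgentExecution check is needed.
  const gateActuallyFailed =
    gateConclusion !== 'success' && !TRANSIENT_GATE_STATES.has(gateConclusion);

  if (gateConclusion === 'success') {
    completeGateFailureRounds = 0; // assumption: a green gate clears the counter
  } else if (allTasksComplete && gateActuallyFailed) {
    completeGateFailureRounds += 1; // advances regardless of action type
  } // otherwise (cancelled/pending): preserve the previous value

  if (action === 'fix') {
    consecutiveFixRounds += 1;
  } else if (NON_FIX_AGENT_ACTIONS.has(action)) {
    consecutiveFixRounds = 0; // only a non-fix agent execution resets
  } // wait/stop/defer: preserve

  return { completeGateFailureRounds, consecutiveFixRounds };
}

// Non-fixable failure: three wait rounds still reach a max of 3,
// so the loop can stop instead of waiting forever.
let counters = { completeGateFailureRounds: 0, consecutiveFixRounds: 0 };
for (let i = 0; i < 3; i += 1) {
  counters = updateCounters(counters, {
    action: 'wait', gateConclusion: 'failure', allTasksComplete: true,
  });
}
console.log(counters.completeGateFailureRounds); // 3
```

Under these assumptions a cancelled or pending round leaves both counters untouched, which matches the stated goal of not spending the fix budget on infrastructure noise.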

The cancelled gate test expected 'gate-cancelled' but the code now returns 'gate-cancelled-transient' for non-rate-limit cancellations. Updated the assertion to match. https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL

claude added 2 commits February 24, 2026 16:28

Copilot AI review requested due to automatic review settings February 24, 2026 17:51

chore: sync template scripts

7e5a6df

Copilot started reviewing on behalf of stranske February 24, 2026 17:51

chatgpt-codex-connector bot reviewed Feb 24, 2026

.github/scripts/keepalive_loop.js (outdated, resolved)

templates/consumer-repo/.github/scripts/keepalive_loop.js (resolved)

Copilot AI reviewed Feb 24, 2026

templates/consumer-repo/.github/scripts/keepalive_loop.js (resolved)

templates/consumer-repo/.github/scripts/keepalive_loop.js (resolved)

stranske temporarily deployed to agent-standard February 24, 2026 18:16 — with GitHub Actions Inactive

fix: update test to expect gate-cancelled-transient reason

87ad246

stranske temporarily deployed to agent-standard February 24, 2026 18:39 — with GitHub Actions Inactive

stranske merged commit 458ec28 into main Feb 24, 2026
1915 of 1942 checks passed

stranske deleted the claude/debug-keepalive-workflow-88W3o branch February 24, 2026 20:30

stranske merged 5 commits into main from claude/debug-keepalive-workflow-88W3o

3 participants
