fix: prevent keepalive from stopping without fixing CI failures #1651

Merged
stranske merged 5 commits into main from claude/debug-keepalive-workflow-88W3o
Feb 24, 2026
@stranske stranske commented Feb 24, 2026

Source: Issue #235

Automated Status Summary

Scope

  • Scope section missing from source issue.

Tasks

  • Tasks section missing from source issue.

Acceptance criteria

  • Acceptance criteria section missing from source issue.

Head SHA: cc5b478
Latest Runs: ✅ success — Gate
Required: gate: ✅ success

| Workflow / Job | Result | Logs |
|----------------|--------|------|
| Agents PR meta manager | ❔ in progress | View run |
| CI Autofix Loop | ✅ success | View run |
| Copilot code review | ✅ success | View run |
| Gate | ✅ success | View run |
| Health 40 Sweep | ✅ success | View run |
| Health 44 Gate Branch Protection | ✅ success | View run |
| Health 45 Agents Guard | ✅ success | View run |
| Health 50 Security Scan | ✅ success | View run |
| Maint 52 Validate Workflows | ✅ success | View run |
| PR 11 - Minimal invariant CI | ✅ success | View run |
| Selftest CI | ✅ success | View run |

Head SHA: beebaeb
Latest Runs: ⏳ queued — Gate
Required: gate: ⏳ queued

| Workflow / Job | Result | Logs |
|----------------|--------|------|
| Agents PR meta manager | ❔ in progress | View run |
| CI Autofix Loop | ❔ in progress | View run |
| Copilot code review | ❔ in progress | View run |
| Gate | ⏳ queued | View run |
| Health 40 Sweep | ✅ success | View run |
| Health 44 Gate Branch Protection | ❔ in progress | View run |
| Health 45 Agents Guard | ✅ success | View run |
| Health 50 Security Scan | ❔ in progress | View run |
| Maint 52 Validate Workflows | ✅ success | View run |
| PR 11 - Minimal invariant CI | ✅ success | View run |
| Selftest CI | ❔ in progress | View run |

Head SHA: 87ad246
Latest Runs: ❔ in progress — Agents PR meta manager
Required: gate: ⏸️ not started

| Workflow / Job | Result | Logs |
|----------------|--------|------|
| Agents PR meta manager | ❔ in progress | View run |

…loop

- Use nullish coalescing (??) instead of logical OR (||) for tasksTotal
  in work-log table rows so that 0 displays as "0" instead of "?"
- Use previousState?.iteration ?? iteration instead of bare iteration
  in rounds_without_task_completion recalculation to stay consistent
  with the "current persisted iteration" rule (line 2739-2741)

Both fixes address review feedback from Copilot on Counter_Risk PR #234.

https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
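
The difference between `??` and `||` described in this commit can be illustrated with a minimal sketch. The helper names `formatTasksCell` and `persistedIteration` are hypothetical and are not taken from keepalive_loop.js; only the operators and the `previousState?.iteration ?? iteration` expression come from the commit message.

```javascript
// Hypothetical helper: renders the "done/total" cell of a work-log row
// with both operators so the difference is visible side by side.
function formatTasksCell(tasksDone, tasksTotal) {
  return {
    // `||` falls back on any falsy value, so a legitimate 0 becomes "?".
    withOr: `${tasksDone}/${tasksTotal || '?'}`,
    // `??` falls back only on null or undefined, so 0 is kept.
    withNullish: `${tasksDone}/${tasksTotal ?? '?'}`,
  };
}

// Hypothetical helper: prefer the persisted iteration when one exists,
// including a persisted value of 0, else use the in-memory iteration.
function persistedIteration(previousState, iteration) {
  return previousState?.iteration ?? iteration;
}

const cell = formatTasksCell(0, 0);
console.log(cell.withOr);      // 0/?
console.log(cell.withNullish); // 0/0
console.log(persistedIteration({ iteration: 0 }, 3)); // 0
console.log(persistedIteration(undefined, 3));        // 3
```
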
When all tasks were complete but Gate was failing (e.g., lint-ruff),
the keepalive loop would stop after `complete_gate_failure_rounds`
reached its max, without giving fix attempts a fair chance. Three
interacting issues caused this:

1. The `complete-gate-failure-max` check fired BEFORE the fix
   classification logic in the decision tree, blocking fix attempts
   once the counter reached the max.

2. Transient gate states (cancelled) incremented the counter even
   though they don't represent actual fix failures, consuming the
   fix budget with infrastructure noise.

3. The `consecutive_fix_rounds` counter was reset on wait/stop
   actions, losing track of prior fix attempts.

Changes:
- Restructure evaluate decision tree: handle cancelled gates first
  (without consuming fix budget), then try fix before stopping when
  all tasks complete and gate failing
- Only increment complete_gate_failure_rounds on actual agent
  execution rounds (fix/run/conflict), not on wait/skip/stop
- Preserve consecutive_fix_rounds across wait/stop/defer actions
  (only reset on non-fix agent execution)
- Increase default completeGateFailureMax from 2 to 3, allowing
  2 fix attempts before stopping
- Add 10 focused tests for counter behavior

Fixes the issue seen in Counter_Risk PR #235 where the agent
completed all 27 tasks but stopped with complete-gate-failure-max
despite lint-ruff failures that could have been fixed.

https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
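
The reordered decision tree described above might look roughly like the sketch below. This is an illustration under stated assumptions, not the code in keepalive_loop.js: the function name `decideAction`, the input field names, the `fixRoundsMax` budget, and every reason string other than `complete-gate-failure-max` and `gate-cancelled-transient` are hypothetical.

```javascript
// Sketch only: shows the ordering (cancelled, then fix, then stop).
function decideAction({
  gateConclusion,               // 'success' | 'failure' | 'cancelled' | 'pending'
  allTasksComplete,
  failureIsFixable = false,     // hypothetical output of the fix classification
  consecutiveFixRounds = 0,
  completeGateFailureRounds = 0,
  completeGateFailureMax = 3,   // new default named in the commit message
  fixRoundsMax = 2,             // hypothetical fix budget
}) {
  // 1. Cancelled gates are transient: wait without consuming any budget.
  if (gateConclusion === 'cancelled') {
    return { action: 'wait', reason: 'gate-cancelled-transient' };
  }

  // 2. All tasks complete and gate failing: try a fix BEFORE the max check.
  if (allTasksComplete && gateConclusion === 'failure') {
    if (failureIsFixable && consecutiveFixRounds < fixRoundsMax) {
      return { action: 'fix', reason: 'fix-gate-failure' };
    }
    if (completeGateFailureRounds >= completeGateFailureMax) {
      return { action: 'stop', reason: 'complete-gate-failure-max' };
    }
    return { action: 'wait', reason: 'gate-failing' };
  }

  // 3. Remaining cases, heavily simplified for the sketch.
  if (allTasksComplete) {
    return gateConclusion === 'success'
      ? { action: 'stop', reason: 'tasks-complete' }
      : { action: 'wait', reason: 'gate-pending' };
  }
  return { action: 'run', reason: 'tasks-remaining' };
}

// The counter is already at the max, but a fixable failure still gets a fix round.
console.log(decideAction({
  gateConclusion: 'failure',
  allTasksComplete: true,
  failureIsFixable: true,
  completeGateFailureRounds: 3,
}).action); // fix
```

With the old ordering, the same input would have hit the max check first and stopped, which matches the failure mode reported in the commit message.
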
Copilot AI review requested due to automatic review settings February 24, 2026 17:51
stranske-keepalive bot commented Feb 24, 2026

Automated Status Summary

Head SHA: 34eee18
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / guard
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

| Workflow / Job | Result | Logs |
|----------------|--------|------|
| (no jobs reported) | ⏳ pending | |

Coverage Overview

  • Coverage history entries: 1

Coverage Trend

| Metric | Value |
|--------|-------|
| Current | 93.12% |
| Baseline | 85.00% |
| Delta | +8.12% |
| Minimum | 70.00% |
| Status | ✅ Pass |

Top Coverage Hotspots (lowest coverage)

| File | Coverage | Missing |
|------|----------|---------|
| src/cli_parser.py | 81.8% | 4 |
| src/percentile_calculator.py | 95.0% | 1 |
| src/aggregator.py | 95.0% | 2 |
| src/__init__.py | 100.0% | 0 |
| src/ndjson_parser.py | 100.0% | 0 |

Updated automatically; will refresh on subsequent CI/Docker completions.


Keepalive checklist

Scope

  • Scope section missing from source issue.

Tasks

  • Tasks section missing from source issue.

Acceptance criteria

  • Acceptance criteria section missing from source issue.

stranske-keepalive bot commented Feb 24, 2026

🤖 Keepalive Loop Status

PR #1651 | Agent: Codex | Iteration 0/5

Current State

| Metric | Value |
|--------|-------|
| Iteration progress | [----------] 0/5 |
| Action | wait (missing-agent-label) |
| Disposition | skipped (transient) |
| Gate | success |
| Tasks | 0/34 complete |
| Timeout | 45 min (default) |
| Timeout usage | 3m elapsed (7%, 42m remaining) |
| Keepalive | ❌ disabled |
| Autofix | ❌ disabled |

🔍 Failure Classification

| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

stranske-keepalive bot commented Feb 24, 2026

Keepalive Work Log
# Time (UTC) Agent Action Result Files Tasks Progress Commit Gate
0 2026-02-24 17:54:35 Codex wait (missing-agent-label-transient) skipped 0 3/8 failure
0 2026-02-24 18:19:37 Codex wait (missing-agent-label-transient) skipped 0 0/34 failure
0 2026-02-24 18:42:19 Codex wait (missing-agent-label-transient) skipped 0 0/34 success


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7b8adbff8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


Copilot AI left a comment


Pull request overview

This PR fixes a critical issue where the keepalive loop stopped without attempting to fix CI failures when all tasks were complete but the gate was failing (as occurred in PR #235). The fix restructures the decision tree to dispatch fix attempts before stopping, prevents transient gate states from consuming the fix budget, and increases the default failure threshold to allow more fix attempts.

Changes:

  • Restructured evaluateKeepaliveLoop decision tree to attempt fixes before stopping when all tasks are complete but gate is failing
  • Updated counter logic to only increment complete_gate_failure_rounds on actual agent-execution rounds (fix/run/conflict), not on transient wait/skip/stop actions
  • Increased default completeGateFailureMax from 2 to 3, allowing 2 fix attempts instead of 1
  • Added comprehensive test coverage for counter behavior across different action types

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| tests/keepalive-gate-failure-counter.test.js | New focused test file validating counter increment/preserve/reset behavior for all action types (10 tests) |
| .github/scripts/keepalive_loop.js | Core logic changes: restructured decision tree in evaluateKeepaliveLoop, updated counter logic in both evaluateKeepaliveLoop and updateKeepaliveLoopSummary, increased default max from 2 to 3 |
| templates/consumer-repo/.github/scripts/keepalive_loop.js | Template sync of changes (incomplete - missing critical decision tree restructuring in evaluateKeepaliveLoop) |

1. Fix infinite wait loop for non-fixable gate failures: remove
   `isAgentExecution` requirement from counter increment — the
   `gateActuallyFailed` check already filters transient states
   (cancelled/pending), so the counter advances on every genuine
   failure round regardless of action type.

2. Sync template keepalive_loop.js with all main file changes:
   - Restructured evaluate decision tree (cancelled → allComplete → remaining)
   - Updated counter logic (increment on actual failure, preserve on non-success)
   - Fix round preservation across wait/stop/defer actions
   - Default completeGateFailureMax 2 → 3

3. Add test for the non-fixable wait+failure scenario; update
   stop-action test expectation to match new counter semantics.

https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
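
The counter semantics after this follow-up commit can be sketched as below. The function name `updateCounters`, the shape of its inputs, the definition of `gateActuallyFailed` as a `'failure'` conclusion, and the reset of the failure counter on gate success are assumptions for illustration. The commit message states only that the counter increments on actual failure, is preserved on non-success, and that fix rounds survive wait/stop/defer actions.

```javascript
// Sketch only: one update step for the two counters discussed in this PR.
function updateCounters(prev, { action, gateConclusion, allTasksComplete }) {
  // Assumption: only a 'failure' conclusion is a genuine failure, so
  // transient states such as cancelled or pending never advance the counter.
  const gateActuallyFailed = gateConclusion === 'failure';

  let completeGateFailureRounds = prev.complete_gate_failure_rounds ?? 0;
  if (gateConclusion === 'success') {
    completeGateFailureRounds = 0; // assumption: success clears the counter
  } else if (allTasksComplete && gateActuallyFailed) {
    // No isAgentExecution requirement: a wait round on a genuine failure
    // still counts, so non-fixable failures cannot wait forever.
    completeGateFailureRounds += 1;
  }

  let consecutiveFixRounds = prev.consecutive_fix_rounds ?? 0;
  if (action === 'fix') {
    consecutiveFixRounds += 1;
  } else if (action === 'run' || action === 'conflict') {
    consecutiveFixRounds = 0; // reset only on non-fix agent execution
  }
  // wait/stop/defer fall through and preserve consecutiveFixRounds.

  return {
    complete_gate_failure_rounds: completeGateFailureRounds,
    consecutive_fix_rounds: consecutiveFixRounds,
  };
}

const prev = { complete_gate_failure_rounds: 1, consecutive_fix_rounds: 1 };
const next = updateCounters(prev, {
  action: 'wait',
  gateConclusion: 'failure',
  allTasksComplete: true,
});
console.log(next.complete_gate_failure_rounds); // 2
console.log(next.consecutive_fix_rounds);       // 1
```
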
The cancelled gate test expected 'gate-cancelled' but the code now
returns 'gate-cancelled-transient' for non-rate-limit cancellations.
Updated the assertion to match.

https://claude.ai/code/session_012WnYCcttvFEY3FETnhVcNL
@stranske stranske merged commit 458ec28 into main Feb 24, 2026
1915 of 1942 checks passed
@stranske stranske deleted the claude/debug-keepalive-workflow-88W3o branch February 24, 2026 20:30