Skip to content

fix: detect saved work on cancelled keepalive runs#255

Merged
stranske merged 4 commits intomainfrom
claude/fix-task-completion-concerns-I1gRT
Feb 26, 2026
Merged

fix: detect saved work on cancelled keepalive runs#255
stranske merged 4 commits intomainfrom
claude/fix-task-completion-concerns-I1gRT

Conversation

@stranske
Copy link
Copy Markdown
Owner

@stranske stranske commented Feb 25, 2026

Source: Issue #249

Automated Status Summary

Scope

PR #228 (issue #227) was merged with all 42 task checkboxes marked complete and the in-process verifier returning PASS. However, post-merge verification by both OpenAI (gpt-5.2, 83% confidence) and Anthropic (claude-sonnet-4-5, 95% confidence) returned FAIL, identifying concrete gaps in the implementation. This follow-up addresses the remaining unmet acceptance criteria.

Root Cause Analysis

The keepalive loop logs reveal a cascading failure across multiple systems:

  1. Codex agent checked PR body checkboxes aggressively. The .agents/issue-227-ledger.yml file shows only task-01 (of 42) was marked done in the structured ledger. Yet the PR body went from 42 unchecked → 0 unchecked over 14 iterations. The agent edited the PR body checkboxes directly without updating the ledger, bypassing the structured tracking that was designed to prevent exactly this.

  2. autoReconcileTasks amplified the problem. Each keepalive iteration ran an LLM-based auto-reconciliation step that auto-checked ~1 additional task per run based on loose commit-to-task matching. Over 14 iterations this added up. The reconciler used "high-confidence" matching that was insufficiently discriminating — e.g., a commit touching mapping_diff.py matched multiple task descriptions simultaneously.

  3. cascadeParentCheckboxes may have inflated counts. The keepalive loop's cascade logic automatically checks all indented child checkboxes when a parent is checked. If the agent or reconciler checked a section-level parent, all sub-tasks under it would cascade to checked.

  4. Codex-as-verifier was self-grading. Iteration 14 ran a verify-acceptance prompt using the same Codex agent that did the implementation work. Despite the prompt explicitly saying "Do NOT trust checkbox states as evidence of completion," the verifier returned success. This is a known LLM bias: the agent that produced the work is predisposed to confirm its own output.

  5. The keepalive loop trusted checkbox state as the primary completion signal. When all 42 boxes were checked AND the verifier returned success AND the CI gate was green, the loop issued stop (tasks-complete). There was no independent mechanical verification of the acceptance criteria.

  6. Scope violation was not mechanically enforced. The acceptance criterion "PR diff contains only files matching these patterns" was a text checkbox, not an automated check. The agent could (and did) check it off despite the diff containing 15+ out-of-scope files.

Context for Agent

Related Issues/PRs

Context for Agent

Related Issues/PRs

Tasks

Registry-First Resolution for Clearing House

  • Create resolve_clearing_house() function in src/counter_risk/normalize.py that returns NameResolution with source field using registry-first lookup before _CLEARING_HOUSE_FALLBACK_MAPPINGS
  • Update normalize_clearing_house() in src/counter_risk/normalize.py to delegate internally to resolve_clearing_house()
  • Add test in tests/test_normalization_registry_first.py verifying resolve_clearing_house() returns source='registry' when name exists in registry
  • Add test in tests/test_normalization_registry_first.py verifying resolve_clearing_house() returns source='fallback' when name is not in registry
  • Add test in tests/test_normalization_registry_first.py verifying resolve_clearing_house() consults registry before checking fallback mappings
  • Add test in tests/test_normalization_registry_first.py verifying resolve_clearing_house() handles missing or empty registry files without raising exceptions

Public API Source Attribution

  • Create normalize_counterparty_with_source() function in src/counter_risk/normalize.py that wraps resolve_counterparty() and returns NameResolution
  • Update docstring for normalize_counterparty_with_source() documenting the source field in returned NameResolution object
  • Update pipeline/run.py reconciliation logic to call normalize_counterparty_with_source() instead of resolve_counterparty() directly
  • Add unit test in tests/test_normalization_registry_first.py verifying normalize_counterparty_with_source() returns object with accessible .source attribute

Missing Config File

  • Create config/name_registry.yml with minimal valid registry containing at least the entries used by existing test fixtures
  • Add smoke test in tests/test_mapping_diff_report_cli.py that imports mapping_diff_report CLI and runs it against config/name_registry.yml without raising exceptions

Testing Gaps - CLI Parameter Passing

  • Add unit test in tests/test_mapping_diff_report_cli.py that mocks generate_mapping_diff_report and verifies CLI forwards correct registry_path parameter
  • Add unit test in tests/test_mapping_diff_report_cli.py that mocks generate_mapping_diff_report and verifies CLI forwards correct output_format parameter

Testing Gaps - End-to-End Report Sections

  • Create test fixture file tests/fixtures/unmapped_names.csv with known unmapped counterparty names
  • Create test fixture file tests/fixtures/fallback_mapped_names.csv with known fallback-mapped counterparty names
  • Add integration test in tests/test_mapping_diff_report_cli.py that runs mapping_diff_report CLI with fixtures and captures stdout
  • Add assertion in integration test verifying UNMAPPED section header appears in captured output
  • Add assertion in integration test verifying FALLBACK_MAPPED section header appears in captured output
  • Add assertion in integration test verifying SUGGESTIONS section header appears in captured output

Testing Gaps - Pipeline Integration

  • Create fixture file tests/fixtures/name_registry_before.yml representing initial registry state for pipeline testing
  • Create fixture file tests/fixtures/name_registry_after.yml representing updated registry state with additional mappings
  • Add integration test in tests/test_normalization_registry_first.py that loads pipeline with name_registry_before.yml
  • Add assertion verifying pipeline reconciliation uses registry-first resolution by checking NameResolution.source values
  • Add assertion verifying pipeline produces different NameResolution.source values when run with name_registry_after.yml

Acceptance criteria

Registry-First Resolution

  • normalize_clearing_house() consults the name registry before _CLEARING_HOUSE_FALLBACK_MAPPINGS and the resolution path records source as registry or fallback
  • resolve_clearing_house() exists in src/counter_risk/normalize.py and returns a NameResolution object with source field set to registry or fallback
  • Calling normalize_counterparty_with_source().source succeeds without AttributeError

Config File

  • config/name_registry.yml exists in repository root
  • config/name_registry.yml parses without YAML syntax errors
  • load_name_registry('config/name_registry.yml') executes without raising exceptions

Testing - CLI

  • Running mapping_diff_report --registry config/name_registry.yml completes with exit code 0
  • Unit test in tests/test_mapping_diff_report_cli.py mocks generate_mapping_diff_report and asserts it receives expected parameters
  • All tests in tests/test_mapping_diff_report_cli.py pass

Testing - Report Sections

  • Integration test runs mapping_diff_report CLI with fixtures and captures output to string
  • Test assertions verify string contains UNMAPPED substring
  • Test assertions verify string contains FALLBACK_MAPPED substring
  • Test assertions verify string contains SUGGESTIONS substring

Testing - Pipeline Integration

  • Integration test in tests/test_normalization_registry_first.py runs pipeline reconciliation with two different registry files
  • Test assertions verify at least one NameResolution.source value differs between the two pipeline runs
  • All tests in tests/test_normalization_registry_first.py pass

Scope Constraint

  • PR diff contains only files matching these patterns: src/counter_risk/normalize.py, src/counter_risk/cli/*, src/counter_risk/reports/*, config/name_registry.yml, tests/test_*registry*.py, tests/test_*mapping_diff*.py, tests/fixtures/*.yml, tests/fixtures/*.csv, pyproject.toml, README.md, docs/*.md

Overall

  • All new tests pass in CI
  • All existing tests continue to pass in CI

Head SHA: 5982550
Latest Runs: ✅ success — Gate
Required: gate: ✅ success

Workflow / Job Result Logs
.github/workflows/autofix.yml ❌ failure View run
Agents PR Event Hub ⏭️ skipped View run
Agents PR Meta ❔ in progress View run
Dependabot Auto-merge ⏭️ skipped View run
Gate ✅ success View run
Health 45 Agents Guard ✅ success View run

When a keepalive run is cancelled (typically timeout), agent outputs are
lost. This adds branch-level commit detection: when runResult is
'cancelled' and agent outputs are empty, the summary function checks the
PR branch for recent commits (pre-timeout checkpoints, codex-keepalive
commits) to determine if work was saved.

If saved work is detected, the summary shows "Timed out (work saved)"
instead of "Cancelled" and informs the user that the next iteration will
continue from where the agent left off.

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
Copilot AI review requested due to automatic review settings February 25, 2026 21:58
@stranske-keepalive
Copy link
Copy Markdown
Contributor

⚠️ Action Required: Unable to determine source issue for PR #255. The PR title, branch name, or body must contain the issue number (e.g. #123, branch: issue-123, or the hidden marker ).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the keepalive loop summary logic so cancelled (typically timed-out) runs can detect whether the agent/watchdog saved work via recent commits on the PR branch, and surfaces that in the PR summary comment UX.

Changes:

  • Add branch commit inspection on cancelled runs to infer whether work was saved despite missing agent outputs.
  • Introduce a new summary reason (agent-run-cancelled-with-saved-work) and adjust the cancelled-run status messaging accordingly.
  • Add guidance text in the summary comment when saved work is detected.
Comments suppressed due to low confidence (1)

.github/scripts/keepalive_loop.js:3390

  • The savedWork signal is set when any recent commit matches (pre-timeout checkpoint, codex-keepalive, or apply updates), but the note text claims specifically that the pre-timeout watchdog committed work. This can be misleading when the matching commit was created by the agent itself. Consider rewording the note to be attribution-agnostic, or tightening detection to only watchdog-specific commit messages if you want to keep this wording.
            '**Note:** The pre-timeout watchdog committed work before the job was cancelled.',
            'The next keepalive iteration will continue from where the agent left off.',
          );

@stranske-keepalive
Copy link
Copy Markdown
Contributor

stranske-keepalive bot commented Feb 25, 2026

🤖 Keepalive Loop Status

PR #255 | Agent: Codex | Iteration 0/5

Current State

Metric Value
Iteration progress [----------] 0/5
Action wait (missing-agent-label)
Disposition skipped (transient)
Gate success
Tasks 42/44 complete
Timeout 45 min (default)
Timeout usage 7m elapsed (17%, 38m remaining)
Keepalive ❌ disabled
Autofix ❌ disabled

🔍 Failure Classification

| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

@stranske-keepalive
Copy link
Copy Markdown
Contributor

stranske-keepalive bot commented Feb 25, 2026

Keepalive Work Log (click to expand)
# Time (UTC) Agent Action Result Files Tasks Progress Commit Gate
0 2026-02-25 22:05:40 Codex wait (missing-agent-label-transient) skipped 0 0/6 success
0 2026-02-25 23:24:02 Codex wait (missing-agent-label-transient) skipped +4 42/44 failure
0 2026-02-26 00:29:15 Codex wait (missing-agent-label-transient) skipped 0 42/44 success

Three issues identified by independent verification of issue #239 work:

1. pipeline/run.py now calls normalize_counterparty_with_source() instead
   of resolve_counterparty() directly, fulfilling the public API source
   attribution task. The test monkeypatch is updated to match.

2. Added two missing tests for resolve_clearing_house() handling missing
   and empty registry files without raising exceptions.

3. Fixed test_mapping_diff_report_unreadable_registry_exits_nonzero which
   failed when running as root (chmod(0) is a no-op for root; the empty
   entries list triggers a validation error instead of a permission error).

All 21 tests in the target files pass. Full suite: 787 passed, 5 skipped
(6 pre-existing sitecustomize failures unrelated to these changes).

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
1. Remove unused resolve_counterparty import from pipeline/run.py (F401
   lint failure in Gate CI).

2. Fix fetchPullRequestCached call: use options object ({ github,
   context, prNumber, core }) instead of positional args. The previous
   call would always return null, preventing saved-work detection.

3. Use previousState.updated_at instead of current_iteration_at for the
   since parameter in branch commit detection. loadKeepaliveState calls
   applyIterationTracking which resets current_iteration_at to "now",
   making it useless as a time range bound for listCommits.

https://claude.ai/code/session_01JhCWWDJG8PqwaSbVPCGfm6
@stranske stranske merged commit 23e5647 into main Feb 26, 2026
487 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants