Skip to content

Agent belt for #227#228

Merged
stranske merged 18 commits intomainfrom
codex/issue-227
Feb 24, 2026
Merged

Agent belt for #227#228
stranske merged 18 commits intomainfrom
codex/issue-227

Conversation

@agents-workflows-bot
Copy link
Copy Markdown
Contributor

@agents-workflows-bot agents-workflows-bot bot commented Feb 23, 2026

Source: Issue #227

Automated Status Summary

Scope

PR #208 addressed issue #48 but verification identified concerns (verdict: FAIL). This follow-up addresses the remaining gaps: add a stable CLI entrypoint, implement a deterministic mapping diff report generator, ensure registry-first name resolution is wired into normalization/reconciliation with mapping source attribution, and add unit/integration tests plus documentation.

Context for Agent

Related Issues/PRs

Tasks

CLI Entrypoint

  • Add mapping_diff_report console script entrypoint to pyproject.toml under [project.scripts] pointing to <package>.cli.mapping_diff_report:main
  • Create src/<package>/cli/mapping_diff_report.py with argument parser that supports --help flag
  • Implement error handling that exits non-zero and writes single-line stderr message including config/name_registry.yml path when registry is missing or unreadable
  • Wire the CLI main function to call the report generator and write output to stdout
  • Add exit code logic to return zero on success and non-zero on fatal errors

Report Generator

  • Create src/<package>/reports/mapping_diff.py with a callable report generator function signature that accepts registry path and input sources
  • Implement registry loading logic within the report generator using existing config patterns
  • Implement input scanning logic that identifies unmapped names and fallback-mapped names from normalization and reconciliation sources
  • Implement deterministic UNMAPPED section generation that lists raw input names not present in registry
  • Implement deterministic FALLBACK_MAPPED section generation that lists input names resolved via fallback with their canonical names
  • Implement deterministic SUGGESTIONS section generation that provides canonical name suggestions for every unmapped entry using title-case transformation

Registry-First Resolution

  • Identify the specific normalization and reconciliation functions that perform counterparty name resolution
  • Modify the name resolution logic to consult the registry for direct canonical and alias lookups before applying hardcoded mappings
  • Extend the return type or object from name resolution functions to include a source field indicating registry or fallback origin
  • Update all call sites of the modified resolution functions to handle the new source field in the return value

Unit Tests

  • Add unit test in tests/test_mapping_diff_report_cli.py that verifies mapping_diff_report --help exits with status zero and prints usage text containing mapping_diff_report string
  • Add unit test that verifies missing config/name_registry.yml causes non-zero exit and single-line stderr message containing the registry path
  • Add unit test that verifies unreadable config/name_registry.yml causes non-zero exit and appropriate stderr message
  • Add unit test that verifies deterministic output against fixed fixtures contains all three required sections with expected content

Integration Tests

  • Add integration test in tests/test_normalization_registry_first.py using name_registry_before.yml that verifies at least one fixture input resolves via fallback and appears in FALLBACK_MAPPED section
  • Add integration test using name_registry_after.yml with same inputs that verifies previously fallback-mapped name now resolves via registry
  • Add integration test that captures logs when using name_registry_after.yml and asserts no warning messages contain the previously fallback-mapped raw name
  • Add integration test that verifies mapping_diff_report output changes between before and after registry states for the same input set

Fixtures & Documentation

  • Create tests/fixtures/name_registry_before.yml with at least one missing alias that will trigger fallback resolution
  • Create tests/fixtures/name_registry_after.yml with the previously missing alias added
  • Add normalization/reconciliation input fixtures required for integration tests
  • Update fixture loading in tests to use explicit fixture selection via temp working dir or dependency injection (not real config/name_registry.yml)
  • Update documentation (README.md or docs page) to include ordered workflow: (1) edit config/name_registry.yml, (2) run mapping_diff_report, (3) interpret UNMAPPED, FALLBACK_MAPPED, SUGGESTIONS sections

Acceptance criteria

CLI Entrypoint

  • pyproject.toml defines a [project.scripts] console entrypoint named mapping_diff_report
  • Running mapping_diff_report --help exits with status code 0 and prints usage text that includes the string mapping_diff_report
  • If config/name_registry.yml is missing or unreadable, mapping_diff_report exits non-zero and writes a single-line error message to stderr that includes config/name_registry.yml

Report Generator

  • src/<package>/reports/mapping_diff.py exists and can be imported without performing IO at import time
  • With fixed fixture registry + fixed fixture normalization/reconciliation inputs, mapping_diff_report output is deterministic and contains three labeled sections: UNMAPPED, FALLBACK_MAPPED, and SUGGESTIONS
  • The UNMAPPED section lists each input name not present in the registry fixture one per line and prints the raw input name exactly as encountered
  • The FALLBACK_MAPPED section lists each input name resolved by fallback logic (not registry alias) and includes both the raw input name and resolved canonical name on each line
  • The SUGGESTIONS section includes a non-empty suggested canonical name for every entry in UNMAPPED, and each suggestion line follows the format <raw_input_name> -> <suggested_canonical_name> where suggested_canonical_name is generated using title-case transformation

Registry-First Resolution

  • Normalization/reconciliation code consults the name registry before any hardcoded/fallback mappings and records mapping source as registry or fallback per mapped name
  • Integration scenario A: using tests/fixtures/name_registry_before.yml, at least one fixture input resolves via fallback and mapping_diff_report lists it under FALLBACK_MAPPED
  • Integration scenario B: using tests/fixtures/name_registry_after.yml (same inputs), the previously fallback-mapped name resolves via registry and does not appear in FALLBACK_MAPPED or UNMAPPED in mapping_diff_report output
  • When using tests/fixtures/name_registry_after.yml, the normalization/reconciliation run emits no warning log messages containing the previously fallback-mapped raw name

Documentation & Scope

  • A documentation file (README.md or a docs page) contains an explicit ordered workflow with the literal steps: (1) edit config/name_registry.yml, (2) run mapping_diff_report, (3) interpret UNMAPPED, FALLBACK_MAPPED, and SUGGESTIONS sections
  • The PR branch diff contains only files matching these patterns: src/<package>/cli/*, src/<package>/reports/*, src/<package>/name_registry.py, tests/test_*registry*.py, tests/test_*mapping_diff*.py, tests/fixtures/name_registry*.yml, config/name_registry.yml, pyproject.toml (scripts section only), README.md or docs/*.md

@agents-workflows-bot
Copy link
Copy Markdown
Contributor Author

Agent worker (codex) activated for branch codex/issue-227.

@codex start
Focus on this task first: task-01 — Add mapping_diff_report console script entrypoint to pyproject.toml under [project.scripts] pointing to <package>.cli.mapping_diff_report:main

Implement only this task in your first commit.
Ensure the code compiles and existing tests pass before moving on.
The keepalive loop will assign subsequent tasks after this one is complete.

@agents-workflows-bot
Copy link
Copy Markdown
Contributor Author

agents-workflows-bot bot commented Feb 23, 2026

🤖 Keepalive Loop Status

PR #228 | Agent: Codex | Iteration 5+9 🚀 extended

Current State

Metric Value
Iteration progress [##########] 5/5 5 base + 9 extended = 14 total
Action stop (tasks-complete)
Agent status ✅ ALL TASKS COMPLETE
Gate success
Tasks 42/42 complete
Timeout 45 min (default)
Timeout usage 7m elapsed (18%, 38m remaining)
Keepalive ✅ enabled
Autofix ❌ disabled

🔍 Failure Classification

| Error type | infrastructure |
| Error category | unknown |
| Suggested recovery | Capture logs and context; retry once and escalate if the issue persists. |

@stranske stranske added the agent:retry Retry agent keepalive loop label Feb 24, 2026
@stranske-keepalive stranske-keepalive bot removed the agent:retry Retry agent keepalive loop label Feb 24, 2026
@stranske stranske added the agent:retry Retry agent keepalive loop label Feb 24, 2026
@stranske stranske added agent:retry Retry agent keepalive loop and removed agent:retry Retry agent keepalive loop labels Feb 24, 2026
@stranske-keepalive stranske-keepalive bot removed the agent:retry Retry agent keepalive loop label Feb 24, 2026
@stranske stranske added agent:retry Retry agent keepalive loop and removed agent:retry Retry agent keepalive loop labels Feb 24, 2026
@stranske-keepalive stranske-keepalive bot removed the agent:retry Retry agent keepalive loop label Feb 24, 2026
@stranske stranske merged commit 8224105 into main Feb 24, 2026
38 checks passed
@stranske stranske added the verify:compare Runs verifier comparison mode after merge label Feb 24, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Provider Comparison Report

Provider Summary

Provider Model Verdict Confidence Summary
openai gpt-5.2 FAIL 83% Core requested functionality is largely present: a mapping_diff_report console script is added, CLI supports --help and prints deterministic UNMAPPED/FALLBACK_MAPPED/SUGGESTIONS sections, and count...
anthropic claude-sonnet-4-5-20250929 FAIL 95% FAIL - Critical scope violation and incomplete implementation. This PR conflates two separate features: (1) the name registry mapping diff report tool (issue #227) and (2) extensive PPT generation...
📋 Full Provider Details (click to expand)

openai

  • Model: gpt-5.2
  • Verdict: FAIL
  • Confidence: 83%
  • Scores:
    • Correctness: 6.0/10
    • Completeness: 4.0/10
    • Quality: 7.0/10
    • Testing: 7.0/10
    • Risks: 6.0/10
  • Summary: Core requested functionality is largely present: a mapping_diff_report console script is added, CLI supports --help and prints deterministic UNMAPPED/FALLBACK_MAPPED/SUGGESTIONS sections, and counterparty resolution is registry-first with source attribution used in reconciliation warnings. Tests cover help, missing/unreadable registry behavior, deterministic output, and before/after registry fixture scenarios. However, the PR fails the documented acceptance due to major scope violations (many unrelated pipeline/PPT/manifest changes outside allowed file patterns) and incomplete registry-first application (clearing house normalization not registry-backed). There are also runtime risks around cached registry lookups and exception-type assumptions for CLI error formatting.
  • Concerns:
    • Scope/AC violation: PR diff includes many unrelated pipeline/PPT/manifest files and tests (e.g., src/counter_risk/pipeline/, release.spec, multiple tests/pipeline/, .agents/*). Acceptance requires the branch diff contain only a narrow set of registry/report/CLI/test/doc files; this PR changes 34 files well beyond the allowed patterns.
    • Registry-first resolution is only implemented for counterparty normalization via resolve_counterparty(); clearing house normalization remains fallback-only. The acceptance criteria explicitly call for normalization/reconciliation name resolution to consult the registry first; the implementation does not extend this to normalize_clearing_house, leaving part of name resolution outside registry-first semantics.
    • Potential correctness risk: resolve_counterparty() uses an LRU-cached alias lookup keyed by resolved registry path string; if the registry file contents change at runtime (same path), cached values may become stale, causing mapping_diff_report and pipeline reconciliation to report outdated results unless the process restarts.
    • mapping_diff_report CLI error handling assumes generate_mapping_diff_report raises ValueError for missing/unreadable registry; this is dependent on load_name_registry behavior. If load_name_registry raises a different exception type for unreadable files (e.g., OSError/PermissionError), the CLI may exit with a traceback rather than the required single-line stderr message.

anthropic

  • Model: claude-sonnet-4-5-20250929
  • Verdict: FAIL
  • Confidence: 95%
  • Scores:
    • Correctness: 4.0/10
    • Completeness: 3.0/10
    • Quality: 6.0/10
    • Testing: 5.0/10
    • Risks: 4.0/10
  • Summary: FAIL - Critical scope violation and incomplete implementation. This PR conflates two separate features: (1) the name registry mapping diff report tool (issue [Follow-up] Implement a new CLI command (or add a console_scri (PR #208) #227) and (2) extensive PPT generation infrastructure (issue [Follow-up] Modify the PPT generation code to conditionally cr (PR #185) #222). The acceptance criteria explicitly restrict the PR to registry/report/CLI files, but 15+ files fall outside this scope including pipeline/manifest_schema.py, pipeline/ppt_validation.py, config.py PPT flags, and extensive PPT-focused tests. The .agents/ ledger files confirm this is work from two different issues merged into one PR. Even focusing only on the mapping diff report feature: the implementation is functionally incomplete. The required config/name_registry.yml file is not present in the diff, making registry-first resolution dependent on external state. The normalize_counterparty public API discards the source attribution that mapping_diff_report depends on. Testing gaps exist around CLI argument handling and deterministic output format verification. The PPT-related changes (manifest schema, validation, run folder outputs, config flags) represent a complete separate feature with 500+ lines of new code and tests that should have been submitted as a separate PR per the acceptance criteria's explicit file pattern restrictions. Recommendation: Split this PR into two separate PRs - one for issue [Follow-up] Implement a new CLI command (or add a console_scri (PR #208) #227 (mapping diff report) containing only the allowed file patterns, and one for issue [Follow-up] Modify the PPT generation code to conditionally cr (PR #185) #222 (PPT generation) containing the pipeline infrastructure changes. The mapping diff PR should add the missing config/name_registry.yml file and address the source attribution API design issue.
  • Concerns:
    • CRITICAL SCOPE VIOLATION: PR includes extensive PPT generation infrastructure (manifest_schema.py, ppt_validation.py, ppt_naming.py changes, run_folder_outputs.py, config.py PPT flags) that is completely unrelated to the name registry mapping diff report feature described in acceptance criteria
    • CRITICAL SCOPE VIOLATION: Acceptance criterion explicitly requires 'The PR branch diff contains only files matching these patterns: src//cli/, src//reports/, src//name_registry.py, tests/test_registry.py, tests/test_mapping_diff.py, tests/fixtures/name_registry*.yml, config/name_registry.yml, pyproject.toml (scripts section only), README.md or docs/*.md' - but PR includes 15+ files outside this scope including pipeline/manifest_schema.py, pipeline/ppt_validation.py, pipeline/ppt_naming.py, pipeline/run_folder_outputs.py, config.py changes beyond scripts, and extensive test files for PPT features
    • CRITICAL SCOPE VIOLATION: Two agent ledger files (.agents/issue-222-ledger.yml, .agents/issue-227-ledger.yml) suggest this PR conflates work from issue [Follow-up] Modify the PPT generation code to conditionally cr (PR #185) #222 (PPT generation) with issue [Follow-up] Implement a new CLI command (or add a console_scri (PR #208) #227 (mapping diff report)
    • Acceptance criterion violation: 'config/name_registry.yml' is listed as an allowed file but is NOT present in the diff - registry-first resolution depends on this file existing
    • Implementation concern: normalize.py uses lru_cache on _load_alias_lookup with string path parameter, which will cache based on string equality rather than resolved path identity - could cause cache misses for equivalent paths expressed differently
    • Implementation concern: resolve_counterparty returns NameResolution with source='unmapped' when no registry or fallback match exists, but normalize_counterparty (the public API) discards this information and returns the normalized name - callers cannot distinguish unmapped from registry-mapped names
    • Testing gap: No test verifies that mapping_diff_report CLI actually calls generate_mapping_diff_report with correct parameters from command-line arguments
    • Testing gap: No test verifies the deterministic output format requirement with all three sections present in a single run
    • Testing gap: Integration tests in test_normalization_registry_first.py verify resolve_counterparty behavior but do not verify the full pipeline integration with actual normalization/reconciliation data sources
    • Code quality: _iter_names_from_payload in mapping_diff.py has complex nested recursion with collect_strings flag that is difficult to reason about - the logic for when to collect strings vs recurse is not immediately clear
    • Documentation gap: README.md workflow is extremely terse (3 lines) and does not explain what UNMAPPED/FALLBACK_MAPPED/SUGGESTIONS sections mean or how to interpret them
    • Risk: The extensive PPT-related changes (71 lines in run.py, new validation module, manifest schema) introduce significant surface area for bugs unrelated to the stated PR objective

Agreement

  • Verdict: FAIL (all providers)
  • Completeness: scores within 1 point (avg 3.5/10, range 3.0-4.0)
  • Quality: scores within 1 point (avg 6.5/10, range 6.0-7.0)

Disagreement

Dimension openai anthropic
Correctness 6.0/10 4.0/10
Testing 7.0/10 5.0/10
Risks 6.0/10 4.0/10

Unique Insights

  • openai: Scope/AC violation: PR diff includes many unrelated pipeline/PPT/manifest files and tests (e.g., src/counter_risk/pipeline/, release.spec, multiple tests/pipeline/, .agents/*). Acceptance requires the branch diff contain only a narrow set of registry/report/CLI/test/doc files; this PR changes 34 files well beyond the allowed patterns.; Registry-first resolution is only implemented for counterparty normalization via resolve_counterparty(); clearing house normalization remains fallback-only. The acceptance criteria explicitly call for normalization/reconciliation name resolution to consult the registry first; the implementation does not extend this to normalize_clearing_house, leaving part of name resolution outside registry-first semantics.; Potential correctness risk: resolve_counterparty() uses an LRU-cached alias lookup keyed by resolved registry path string; if the registry file contents change at runtime (same path), cached values may become stale, causing mapping_diff_report and pipeline reconciliation to report outdated results unless the process restarts.; mapping_diff_report CLI error handling assumes generate_mapping_diff_report raises ValueError for missing/unreadable registry; this is dependent on load_name_registry behavior. If load_name_registry raises a different exception type for unreadable files (e.g., OSError/PermissionError), the CLI may exit with a traceback rather than the required single-line stderr message.
  • anthropic: CRITICAL SCOPE VIOLATION: PR includes extensive PPT generation infrastructure (manifest_schema.py, ppt_validation.py, ppt_naming.py changes, run_folder_outputs.py, config.py PPT flags) that is completely unrelated to the name registry mapping diff report feature described in acceptance criteria; CRITICAL SCOPE VIOLATION: Acceptance criterion explicitly requires 'The PR branch diff contains only files matching these patterns: src//cli/, src//reports/, src//name_registry.py, tests/test_registry.py, tests/test_mapping_diff.py, tests/fixtures/name_registry*.yml, config/name_registry.yml, pyproject.toml (scripts section only), README.md or docs/*.md' - but PR includes 15+ files outside this scope including pipeline/manifest_schema.py, pipeline/ppt_validation.py, pipeline/ppt_naming.py, pipeline/run_folder_outputs.py, config.py changes beyond scripts, and extensive test files for PPT features; CRITICAL SCOPE VIOLATION: Two agent ledger files (.agents/issue-222-ledger.yml, .agents/issue-227-ledger.yml) suggest this PR conflates work from issue [Follow-up] Modify the PPT generation code to conditionally cr (PR #185) #222 (PPT generation) with issue [Follow-up] Implement a new CLI command (or add a console_scri (PR #208) #227 (mapping diff report); Acceptance criterion violation: 'config/name_registry.yml' is listed as an allowed file but is NOT present in the diff - registry-first resolution depends on this file existing; Implementation concern: normalize.py uses lru_cache on _load_alias_lookup with string path parameter, which will cache based on string equality rather than resolved path identity - could cause cache misses for equivalent paths expressed differently; Implementation concern: resolve_counterparty returns NameResolution with source='unmapped' when no registry or fallback match exists, but normalize_counterparty (the public API) discards this information and returns the normalized name - callers cannot distinguish unmapped from registry-mapped names; Testing gap: No test verifies that mapping_diff_report CLI actually calls generate_mapping_diff_report with correct parameters from command-line arguments; Testing gap: No test verifies the deterministic output format requirement with all three sections present in a single run; Testing gap: Integration tests in test_normalization_registry_first.py verify resolve_counterparty behavior but do not verify the full pipeline integration with actual normalization/reconciliation data sources; Code quality: _iter_names_from_payload in mapping_diff.py has complex nested recursion with collect_strings flag that is difficult to reason about - the logic for when to collect strings vs recurse is not immediately clear; Documentation gap: README.md workflow is extremely terse (3 lines) and does not explain what UNMAPPED/FALLBACK_MAPPED/SUGGESTIONS sections mean or how to interpret them; Risk: The extensive PPT-related changes (71 lines in run.py, new validation module, manifest schema) introduce significant surface area for bugs unrelated to the stated PR objective

🔍 LangSmith Traces

stranske pushed a commit that referenced this pull request Feb 24, 2026
Addresses 6 root causes identified in PR #228 post-mortem where the
coding agent claimed 42/42 tasks complete when multiple acceptance
criteria were unmet:

Fix 1 - Require verification PASS before stopping:
  The stop decision now requires the verifier to return PASS. If
  verification fails, the agent is re-run to fix gaps (up to 2 attempts).
  Previously, verification was attempted once and ignored on failure.

Fix 2 - Raise confidence thresholds in analyzeTaskCompletion:
  Keyword match threshold raised from 0.35 to 0.50 for HIGH confidence.
  Now requires 2+ matching words (not just percentage) to avoid
  single-word false positives. fileMatch tightened to require 2+ keywords
  or explicit file references. commitMatch requires 2+ substantive words.

Fix 3 - Gate cascade logic for acceptance criteria:
  cascadeParentCheckboxes now detects acceptance criteria section headings
  and disables cascading within them. Each acceptance criterion must be
  independently checked — a checked parent no longer auto-checks children
  in acceptance sections.

Fix 5 - Different verifier context:
  Verification steps now switch to the alternate agent (codex→claude or
  claude→codex) to avoid the structural problem where the same model that
  produced the work also verifies it. Configurable via verifier_agent.

Fix 6 - Mechanical scope enforcement:
  New extractScopePatterns/validateScopeCompliance functions parse file
  patterns from the scope section and validate the PR diff against them.
  Scope violations block the tasks-complete stop decision. The verifier
  prompt now includes a mandatory Scope Check section.

Fix 7 - Separate task/acceptance criteria tracking:
  Tasks and acceptance criteria are now counted independently. The stop
  decision requires BOTH allTasksDone AND allCriteriaMet. Auto-reconciliation
  only operates on task checkboxes, never acceptance criteria.

Also fixes pre-existing duplicate fixAttemptMax declaration.

https://claude.ai/code/session_01VtzHmRoYTL2kcxaacDgSqQ
@stranske stranske mentioned this pull request Feb 25, 2026
88 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent:codex Assigns Codex agent to issue autofix Triggers autofix on PR from:codex verify:compare Runs verifier comparison mode after merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants