
v0.4.7 — FC-3 vocab cache similarity gate + purpose rewrite #13

Merged
jinhongkuan merged 1 commit into main from chore/bump-v0.4.7 on Apr 14, 2026

Conversation

@jinhongkuan

Contributor

Summary

  • FC-3a — Vocab cache BM25 cross-match. lookup_vocab_cache now returns (symbols, matched_query_text). handle_ingest computes Jaccard over non-stopword 4+ char tokens and rejects hits below 0.5, falling through to fresh grounding.
  • FC-3b — Stale purpose field. _validate_cached_regions accepts current_description and rewrites reused regions' purpose so labels match the current intent.
  • Deterministic Jaccard chosen over embeddings to keep the critical indexing path LLM-free (git-for-specs.md invariant).

Witnessed live on Accountable 2026-04-14: a "Stripe payment-link fallback" decision inherited 8 bogus regions from an earlier "weekly bulletin page" ingest because both descriptions shared incidental tokens like "page" and "link".
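
A minimal sketch of the similarity gate described above, assuming a regex tokenizer and a small illustrative stopword list. The helper names, the stopword set, and the two sample descriptions are hypothetical; only the 4+ character rule and the 0.5 threshold come from this PR.

```python
import re

# Hypothetical stopword list; the list used by handle_ingest is not shown in this PR.
STOPWORDS = frozenset({"with", "from", "that", "this", "into", "when", "have"})
VOCAB_SIMILARITY_THRESHOLD = 0.5  # value stated in the PR description


def content_tokens(text):
    """Lowercase alphanumeric tokens of 4+ characters, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) >= 4 and w not in STOPWORDS}


def jaccard(a, b):
    """|A & B| / |A | B| over content tokens; 0.0 when both sides are empty."""
    ta, tb = content_tokens(a), content_tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def accept_cache_hit(current, matched_query_text):
    """True when the cached query is similar enough to reuse its symbols."""
    return jaccard(current, matched_query_text) >= VOCAB_SIMILARITY_THRESHOLD


# Hypothetical descriptions modeled on the incident above.
stripe = "Stripe payment-link fallback for the checkout page"
bulletin = "Weekly bulletin page with a link to the archive"
print(round(jaccard(stripe, bulletin), 3))   # 0.222 (2 shared tokens out of 9)
print(accept_cache_hit(stripe, bulletin))    # False, so fall through to fresh grounding
```

With these inputs only "page" and "link" overlap, so the hit is rejected even though a BM25 match on those tokens would have succeeded.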

Test plan

  • 13 new unit + integration tests in tests/test_fc3_vocab_cache_similarity.py
  • test_vocab_cache.py updated for new tuple return shape
  • Full v0.4.7 regression sweep: 106 passed in 30.63s
  • Manual: re-run the Accountable Stripe vs bulletin ingest pair and confirm no cross-wiring

🤖 Generated with Claude Code

Fix witnessed cross-contamination where the vocab cache's BM25 @0@ operator
matched two unrelated intents sharing only incidental tokens, and the stale
`purpose` field on reused regions labeled them with the original intent's
text. Observed on Accountable 2026-04-14: a "Stripe payment-link fallback"
decision inherited 8 bogus regions from an earlier "weekly bulletin page"
ingest.

- lookup_vocab_cache now returns (symbols, matched_query_text)
- handle_ingest computes Jaccard over non-stopword 4+ char tokens and
  rejects cache hits below _VOCAB_SIMILARITY_THRESHOLD (0.5), falling
  through to fresh grounding via ground_mappings
- _validate_cached_regions accepts current_description and rewrites every
  returned region's purpose field
- 13 new tests in test_fc3_vocab_cache_similarity.py; test_vocab_cache
  updated for the new tuple return shape
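
The purpose rewrite can be pictured roughly as follows. This is a sketch under assumptions: the `Region` shape and the `relabel_reused_regions` helper are hypothetical stand-ins, and the real `_validate_cached_regions` also performs validation that is not shown here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Region:
    # Hypothetical record shape; the cached-region structure is not shown in this PR.
    file_path: str
    symbol: str
    purpose: str


def relabel_reused_regions(regions, current_description):
    """Return copies of reused regions whose purpose matches the current intent."""
    return [replace(r, purpose=current_description) for r in regions]


cached = [Region("billing/links.py", "build_link", "weekly bulletin page")]
fresh = relabel_reused_regions(cached, "Stripe payment-link fallback")
print(fresh[0].purpose)
```

Returning copies instead of mutating the cached records keeps the cache entry intact for the query that originally produced it.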

Deterministic Jaccard was chosen over embeddings to keep the critical
indexing path LLM-free (git-for-specs.md invariant).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 14, 2026


Warning

Rate limit exceeded

@jinhongkuan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 35 minutes and 4 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 35 minutes and 4 seconds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 58feb517-0dbe-4dc1-834e-e3581ecef215

📥 Commits

Reviewing files that changed from the base of the PR and between 6736aee and 3458e76.

📒 Files selected for processing (8)
  • CHANGELOG.md
  • events/team_adapter.py
  • handlers/ingest.py
  • ledger/adapter.py
  • ledger/queries.py
  • pyproject.toml
  • tests/test_fc3_vocab_cache_similarity.py
  • tests/test_vocab_cache.py

@jinhongkuan jinhongkuan merged commit 5869446 into main Apr 14, 2026
1 check passed
@jinhongkuan jinhongkuan deleted the chore/bump-v0.4.7 branch April 14, 2026 23:04
Knapp-Kevin added a commit to Knapp-Kevin/bicameral-mcp that referenced this pull request Apr 29, 2026
…uage line categorizers + call_site_extractor

QOR-process Phase 4 implementation, layer 2 of 5. Plan v3 PASS at
META_LEDGER BicameralAI#13, chain hash 21ac210f.

## Production files (12 new, all under 250-LOC razor)

### Drift classifier core
- ``codegenome/drift_classifier.py`` (187 LOC) — entry function
  ``classify_drift`` weighted-score per BicameralAI#61 spec:
    signature_unchanged * 0.30 + neighbors_jaccard * 0.25 +
    diff_lines_cosmetic * 0.30 + no_new_calls * 0.15
  Verdict: >=0.80 cosmetic, <=0.30 semantic, otherwise uncertain.
  Per-signal helpers: ``_signal_signature``, ``_signal_neighbors`` (with
  0.95 jaccard threshold), ``_signal_diff_lines``, ``_signal_no_new_calls``.
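
The weighted score and verdict bands quoted above can be sketched as below. The weights and the 0.80 / 0.30 cut-offs are taken from the commit message; the function names and the sample signal values are illustrative assumptions, and each signal is assumed to be already normalized to [0, 1].

```python
# Weights as listed in the commit message.
WEIGHTS = {
    "signature_unchanged": 0.30,
    "neighbors_jaccard": 0.25,
    "diff_lines_cosmetic": 0.30,
    "no_new_calls": 0.15,
}


def drift_score(signals):
    """Weighted sum of the four per-signal values."""
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())


def verdict(score):
    """>= 0.80 cosmetic, <= 0.30 semantic, anything in between uncertain."""
    if score >= 0.80:
        return "cosmetic"
    if score <= 0.30:
        return "semantic"
    return "uncertain"


# Hypothetical signal values for three kinds of change.
docstring_add = dict.fromkeys(WEIGHTS, 1.0)
logic_removal = dict.fromkeys(WEIGHTS, 0.0)
signature_change = {
    "signature_unchanged": 0.0,
    "neighbors_jaccard": 1.0,
    "diff_lines_cosmetic": 0.2,
    "no_new_calls": 1.0,
}

for name, signals in [("docstring_add", docstring_add),
                      ("logic_removal", logic_removal),
                      ("signature_change", signature_change)]:
    print(name, verdict(drift_score(signals)))
```

In this sketch the signature change scores 0.25 + 0.06 + 0.15 = 0.46, which lands in the uncertain band and therefore stays in front of the caller.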

### Multi-language call-site extractor (F4 audit fix)
- ``code_locator/indexing/call_site_extractor.py`` (121 LOC) — sibling
  of ``symbol_extractor.py``. Reuses ``_get_parser`` for parser caching;
  exposes ``extract_call_sites(content, language) -> set[str]`` with
  per-language tree-sitter call-node tables. Last-identifier extraction
  for member-access expressions (``obj.method()`` → ``method``).
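
To illustrate the last-identifier rule without a tree-sitter dependency, here is a Python-only sketch that swaps in the standard-library `ast` module. It is not the multi-language extractor from this commit; `extract_call_sites_py` is a hypothetical name, and returning an empty set on unparseable input is an assumption modeled on the failure modes listed in the tests.

```python
import ast


def extract_call_sites_py(content):
    """Return the names of called functions, keeping only the last identifier."""
    try:
        tree = ast.parse(content)
    except SyntaxError:
        return set()  # unparseable input yields no call sites
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):         # plain call: foo()
                names.add(func.id)
            elif isinstance(func, ast.Attribute):  # member access: obj.method()
                names.add(func.attr)
    return names


source = "result = client.fetch(url)\nprint(len(result))\n"
print(sorted(extract_call_sites_py(source)))  # ['fetch', 'len', 'print']
```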

### Diff categorizer (split per O3)
- ``codegenome/diff_categorizer.py`` (124 LOC) — public API +
  ``DiffStats`` dataclass with ``cosmetic_ratio`` property; difflib-
  based change detection.
- ``codegenome/_diff_dispatch.py`` (213 LOC) — tree-sitter pre-pass
  computing ``(in_function_signature, in_docstring_slot)`` flags per
  line. Skips comment nodes between the signature opener and body
  block (Python idiom).
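
A rough sketch of a difflib-based `DiffStats` with a `cosmetic_ratio` property. The field names, the toy "blank or `#` comment" rule, and the choice to treat an empty diff as fully cosmetic are assumptions; the real categorizer also uses the tree-sitter pre-pass flags, which are omitted here.

```python
import difflib
from dataclasses import dataclass


@dataclass
class DiffStats:
    cosmetic: int = 0  # changed lines judged cosmetic
    total: int = 0     # all added or removed lines

    @property
    def cosmetic_ratio(self):
        # Assumption: an empty diff counts as fully cosmetic.
        return self.cosmetic / self.total if self.total else 1.0


def is_cosmetic_line(line):
    """Toy Python-only rule: blank lines and '#' comments are cosmetic."""
    stripped = line.strip()
    return stripped == "" or stripped.startswith("#")


def diff_stats(old, new):
    """Count added/removed lines and how many of them are cosmetic."""
    stats = DiffStats()
    for entry in difflib.ndiff(old.splitlines(), new.splitlines()):
        if entry.startswith(("+ ", "- ")):  # skip unchanged lines and '? ' hint lines
            stats.total += 1
            if is_cosmetic_line(entry[2:]):
                stats.cosmetic += 1
    return stats


old = "def f(x):\n    return x + 1\n"
with_comment = "def f(x):\n    # add one\n    return x + 1\n"
with_logic = "def f(x):\n    return x + 2\n"
print(diff_stats(old, with_comment).cosmetic_ratio)
print(diff_stats(old, with_logic).cosmetic_ratio)
```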

### Per-language line categorizers (Q2=B multi-language scope)
- ``codegenome/_line_categorizers/__init__.py`` (63 LOC) — registry +
  ``categorize`` dispatcher.
- ``python.py`` (62 LOC), ``javascript.py`` (57 LOC),
  ``typescript.py`` (37 LOC, extends javascript), ``go.py`` (62 LOC),
  ``rust.py`` (63 LOC, distinguishes ``///`` doc-comments from ``//``
  plain), ``java.py`` (54 LOC), ``c_sharp.py`` (63 LOC, F3-compliant
  filename matching ``code_locator``'s language ID).
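
The registry-plus-dispatcher layout might look something like this. The decorator, the category labels, and the "unknown" fallback are hypothetical; only the idea of per-language categorizers and the Rust `///` versus `//` distinction come from the commit message.

```python
# Language ID -> categorizer function; filled in by the decorator below.
_REGISTRY = {}


def register(language):
    """Decorator that registers a per-language line categorizer."""
    def decorator(fn):
        _REGISTRY[language] = fn
        return fn
    return decorator


@register("python")
def _python(line):
    s = line.strip()
    if not s:
        return "blank"
    if s.startswith("#"):
        return "comment"
    if s.startswith(("import ", "from ")):
        return "import"
    return "code"


@register("rust")
def _rust(line):
    s = line.strip()
    if not s:
        return "blank"
    if s.startswith("///"):   # doc-comment must be checked before plain '//'
        return "doc_comment"
    if s.startswith("//"):
        return "comment"
    if s.startswith("use "):
        return "import"
    return "code"


def categorize(line, language):
    """Dispatch to the registered categorizer; unknown languages get 'unknown'."""
    fn = _REGISTRY.get(language)
    return fn(line) if fn else "unknown"


print(categorize("/// Adds one.", "rust"), categorize("// note", "rust"))
```

Keying the registry by the same language ID that `code_locator` uses is what makes a parity test between the two supported-language sets straightforward.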

## Tests (2 new, 35 tests, all green)

- ``tests/test_extract_call_sites.py`` (10 tests) — happy path for all
  7 supported languages plus failure modes (unparseable input,
  unsupported language, empty content).

- ``tests/test_codegenome_drift_classifier.py`` (25 tests):
  - 4 issue exit criteria (docstring add, import reorder, logic
    removal, signature change)
  - 6 multi-language cosmetic-cases (JS, TS, Go, Rust, Java, C#)
  - F3 parity test ``test_supported_languages_match_code_locator``
    with ``_USE_LEGACY`` guard per Obs-V3-2
  - Per-signal helper tests (signature, neighbors with jaccard
    threshold, no_new_calls subset/superset/extractor-failure)
  - Section 4 razor enforcement
    (``test_classify_drift_function_under_40_lines``)
  - Diff categorizer Python docstring + import recognition

Issue exit criteria 3+4 ("logic removal NOT auto-resolved", "signature
change NOT auto-resolved") interpreted as ``verdict != "cosmetic"``
since both ``semantic`` and ``uncertain`` keep the pending check in
front of the caller LLM (which is the contract the criteria
guarantee).

## Verification

- 35/35 Phase 2 tests pass on Windows local
- 149/149 broader regression (codegenome + ledger phase2) clean
- All new functions ≤ 40 LOC; all new files ≤ 250 LOC

## Phase 4 progress

- [x] Phase 1 — schema v13 + contracts (commit 2afd52d)
- [x] Phase 2 — drift classifier + multi-lang categorizers — THIS COMMIT
- [ ] Phase 3 — drift classification service (load identity, call
      classifier, write or hint)
- [ ] Phase 4 — handler integration (link_commit + resolve_compliance)
- [ ] Phase 5 — M3 benchmark fixture corpus

## Carried-forward observations

- Obs-V3-1 (schema-version race with PR BicameralAI#81): not relevant for Phase
  2 (no schema changes); revisit before Phase 4 of Phase 4.
- Obs-V3-2 (legacy tree-sitter guard): addressed via
  ``pytest.mark.skipif(_USE_LEGACY)`` in the F3 parity test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin added a commit that referenced this pull request Apr 29, 2026
…_compliance (M3) (#91)

* feat(#61): Phase 4 Phase 1 — schema v13 + contracts (CHANGEFEED, semantic_status, evidence_refs, pre_classification, auto_resolved_count)

QOR-process Phase 4 implementation, layer 1 of 5. Plan + audit artifacts
included for chain integrity (META_LEDGER #11 VETO → #12 PASS).

v12 → v13 migration. Three additive changes:

- ``compliance_check`` table redefined with ``CHANGEFEED 30d INCLUDE
  ORIGINAL``. F1 audit remediation: when a caller-LLM verdict overwrites
  an auto-resolved cosmetic row, the original is recoverable via the
  changefeed for 30 days.
- ``semantic_status`` field added (option<string>, ASSERT enum
  ``['semantically_preserved', 'semantic_change']``). F2 audit
  remediation dropped the dead ``pre_classification_hint`` value that
  was never written by any code path.
- ``evidence_refs`` field added (array<string>, default ``[]``).

Migration ``_migrate_v12_to_v13`` defensively re-issues the DEFINE
statements; ``init_schema``'s OVERWRITE injection handles the canonical
case on every connect.

- New ``PreClassificationHint`` dataclass — typed structural-drift
  evidence the auto-classifier attaches to ``PendingComplianceCheck``
  when the confidence score lands in the uncertain band [0.30, 0.80).
- ``PendingComplianceCheck.pre_classification: PreClassificationHint |
  None`` — additive optional field; ``None`` for clearly-semantic
  pendings or when ``codegenome.enhance_drift`` is disabled.
- ``ComplianceVerdict.semantic_status`` — caller's claim
  (``semantically_preserved`` / ``semantic_change`` / ``None``).
- ``ComplianceVerdict.evidence_refs`` — free-form audit trail.
- ``ResolveComplianceAccepted.semantic_status`` — echoes the caller's
  claim through the response.
- ``LinkCommitResponse.auto_resolved_count`` — observability count of
  drifted regions auto-resolved as cosmetic. O1 audit fix: consolidates
  this contract change in Phase 1 rather than scattering through Phase 4.

``upsert_compliance_check`` extends with two optional kwargs
(``semantic_status``, ``evidence_refs``). Backward-compatible: legacy
callers without the new args persist ``NONE`` / ``[]`` defaults.
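
As an illustration of the additive contract change, the sketch below uses a standard-library dataclass in place of the project's own models. The `decision_id` and `verdict` fields and the `accepts` helper are hypothetical; the allowed `semantic_status` values and the `[]` default for `evidence_refs` are from the commit message.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALLOWED_SEMANTIC_STATUS = frozenset({"semantically_preserved", "semantic_change"})


@dataclass
class ComplianceVerdict:
    decision_id: str                       # hypothetical field
    verdict: str                           # hypothetical field
    semantic_status: Optional[str] = None  # new, optional
    evidence_refs: List[str] = field(default_factory=list)  # new, defaults to []

    def __post_init__(self):
        status = self.semantic_status
        if status is not None and status not in ALLOWED_SEMANTIC_STATUS:
            raise ValueError(f"invalid semantic_status: {status!r}")


def accepts(status):
    """True when a verdict with this semantic_status can be constructed."""
    try:
        ComplianceVerdict("d1", "compliant", semantic_status=status)
    except ValueError:
        return False
    return True


legacy = ComplianceVerdict("d1", "compliant")  # legacy caller, no new fields
print(legacy.semantic_status, legacy.evidence_refs)
print(accepts("semantic_change"), accepts("pre_classification_hint"))
```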

9 new tests, all passing:

- ``test_v13_migration_is_additive``
- ``test_v13_migration_adds_changefeed_on_compliance_check`` (F1)
- ``test_compliance_check_changefeed_records_overwritten_row`` (F1)
- ``test_compliance_verdict_accepts_semantic_status``
- ``test_compliance_verdict_rejects_pre_classification_hint_value`` (F2)
- ``test_pending_compliance_check_accepts_pre_classification_hint``
- ``test_link_commit_response_carries_auto_resolved_count`` (O1)
- ``test_resolve_compliance_persists_semantic_status_and_evidence``
- ``test_resolve_compliance_omits_optional_fields_for_legacy_callers``

Obs-V2-1 (SHOW CHANGES support in v2 embedded) RESOLVED positively —
syntax works, no fallback needed. F1 regression tests pass without xfail.

- 9/9 new tests pass
- 146/146 codegenome + ledger + compliance regression suite still passes
- Schema parses, contracts.py imports clean
- Section 4 razor: every new function ≤ 40 LOC; the new test file (~265 LOC)
  slightly exceeds the 250-line target for test files.

- [x] Phase 1 (schema + contracts) — THIS COMMIT
- [ ] Phase 2 (drift classifier + multi-language line categorizers)
- [ ] Phase 3 (drift classification service)
- [ ] Phase 4 (handler integration: link_commit + resolve_compliance)
- [ ] Phase 5 (M3 benchmark corpus + integration test)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(#61): refresh Phase 4 plan to v3 (post-merge state)

Updates plan-codegenome-phase-4.md to reflect:
- PR #71 (Phase 1+2) merged to upstream main
- PR #73 (Phase 3) merged to dev with all 17 review fixes
- dev branch live; CI workflows trigger on PRs to dev
- Phase 4 branch rebased onto dev (no more 3-deep stack)
- Phase 1 of Phase 4 sealed at commit a01103e (now 2afd52d post-rebase)
- Obs-V2-1 resolved positively (SHOW CHANGES works in v2 embedded)
- Implementation queue table for remaining Phases 2-5

Design decisions from v2 audit PASS unchanged.

* feat(#61): Phase 4 Phase 2 — drift classifier + multi-language line categorizers + call_site_extractor

QOR-process Phase 4 implementation, layer 2 of 5. Plan v3 PASS at
META_LEDGER #13, chain hash 21ac210f.

## Production files (12 new, all under 250-LOC razor)

### Drift classifier core
- ``codegenome/drift_classifier.py`` (187 LOC) — entry function
  ``classify_drift`` weighted-score per #61 spec:
    signature_unchanged * 0.30 + neighbors_jaccard * 0.25 +
    diff_lines_cosmetic * 0.30 + no_new_calls * 0.15
  Verdict: >=0.80 cosmetic, <=0.30 semantic, otherwise uncertain.
  Per-signal helpers: ``_signal_signature``, ``_signal_neighbors`` (with
  0.95 jaccard threshold), ``_signal_diff_lines``, ``_signal_no_new_calls``.

### Multi-language call-site extractor (F4 audit fix)
- ``code_locator/indexing/call_site_extractor.py`` (121 LOC) — sibling
  of ``symbol_extractor.py``. Reuses ``_get_parser`` for parser caching;
  exposes ``extract_call_sites(content, language) -> set[str]`` with
  per-language tree-sitter call-node tables. Last-identifier extraction
  for member-access expressions (``obj.method()`` → ``method``).

### Diff categorizer (split per O3)
- ``codegenome/diff_categorizer.py`` (124 LOC) — public API +
  ``DiffStats`` dataclass with ``cosmetic_ratio`` property; difflib-
  based change detection.
- ``codegenome/_diff_dispatch.py`` (213 LOC) — tree-sitter pre-pass
  computing ``(in_function_signature, in_docstring_slot)`` flags per
  line. Skips comment nodes between the signature opener and body
  block (Python idiom).

### Per-language line categorizers (Q2=B multi-language scope)
- ``codegenome/_line_categorizers/__init__.py`` (63 LOC) — registry +
  ``categorize`` dispatcher.
- ``python.py`` (62 LOC), ``javascript.py`` (57 LOC),
  ``typescript.py`` (37 LOC, extends javascript), ``go.py`` (62 LOC),
  ``rust.py`` (63 LOC, distinguishes ``///`` doc-comments from ``//``
  plain), ``java.py`` (54 LOC), ``c_sharp.py`` (63 LOC, F3-compliant
  filename matching ``code_locator``'s language ID).

## Tests (2 new, 35 tests, all green)

- ``tests/test_extract_call_sites.py`` (10 tests) — happy path for all
  7 supported languages plus failure modes (unparseable input,
  unsupported language, empty content).

- ``tests/test_codegenome_drift_classifier.py`` (25 tests):
  - 4 issue exit criteria (docstring add, import reorder, logic
    removal, signature change)
  - 6 multi-language cosmetic-cases (JS, TS, Go, Rust, Java, C#)
  - F3 parity test ``test_supported_languages_match_code_locator``
    with ``_USE_LEGACY`` guard per Obs-V3-2
  - Per-signal helper tests (signature, neighbors with jaccard
    threshold, no_new_calls subset/superset/extractor-failure)
  - Section 4 razor enforcement
    (``test_classify_drift_function_under_40_lines``)
  - Diff categorizer Python docstring + import recognition

Issue exit criteria 3+4 ("logic removal NOT auto-resolved", "signature
change NOT auto-resolved") interpreted as ``verdict != "cosmetic"``
since both ``semantic`` and ``uncertain`` keep the pending check in
front of the caller LLM (which is the contract the criteria
guarantee).

## Verification

- 35/35 Phase 2 tests pass on Windows local
- 149/149 broader regression (codegenome + ledger phase2) clean
- All new functions ≤ 40 LOC; all new files ≤ 250 LOC

## Phase 4 progress

- [x] Phase 1 — schema v13 + contracts (commit 2afd52d)
- [x] Phase 2 — drift classifier + multi-lang categorizers — THIS COMMIT
- [ ] Phase 3 — drift classification service (load identity, call
      classifier, write or hint)
- [ ] Phase 4 — handler integration (link_commit + resolve_compliance)
- [ ] Phase 5 — M3 benchmark fixture corpus

## Carried-forward observations

- Obs-V3-1 (schema-version race with PR #81): not relevant for Phase
  2 (no schema changes); revisit before Phase 4 of Phase 4.
- Obs-V3-2 (legacy tree-sitter guard): addressed via
  ``pytest.mark.skipif(_USE_LEGACY)`` in the F3 parity test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#61): Phase 4 Phase 3 — drift classification service

QOR-process Phase 4 implementation, layer 3 of 5. Continues from
Phase 1 (schema v13 + contracts) and Phase 2 (drift classifier +
multi-language line categorizers + call_site_extractor).

## Production: codegenome/drift_service.py (249 LOC, ≤250 razor)

Wires the deterministic ``drift_classifier`` into the ledger I/O
layer. Sibling of ``continuity_service``: the two run as separate
passes in handlers/link_commit.py (Phase 4 phase 4).

Public API:

- ``DriftClassificationContext`` — dataclass bundling
  decision_id / region_id / content_hash / commit_hash / file_path /
  symbol_name / old_body / new_body / language. Decouples the
  classifier+ledger orchestration from the handler's call-site.

- ``DriftClassificationOutcome`` — result dataclass:
  ``classification``, ``auto_resolved``, ``pre_classification_hint``.

- ``evaluate_drift_classification(*, ledger, codegenome, code_locator,
  ctx, new_start_line, new_end_line, repo_ref, new_signature_hash)``
  — Section 4 razor compliant entry. Steps:
    1. ``_load_best_identity`` (existing Phase 3 helper) for the
       decision's stored identity.
    2. Identity missing → ``_NO_OUTCOME`` (no Phase 1+2 baseline).
    3. ``_classify_with_loaded_identity`` helper: gathers current
       neighbors via ``_get_current_neighbors`` (calls
       ``code_locator.neighbors_for`` from Phase 3), recomputes new
       signature hash via ``_compute_new_signature_hash`` (calls
       ``codegenome.compute_identity`` if available), invokes
       ``classify_drift``.
    4. ``_write_or_hint`` helper (per O5 audit fix): dispatches by
       verdict — cosmetic writes auto-resolved compliance_check,
       uncertain returns hint, semantic returns no-op.

Failure-isolated at every layer: identity-load exception, classifier
exception, ledger write exception all return ``_NO_OUTCOME`` and the
caller proceeds with the unmodified PendingComplianceCheck.
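
The verdict dispatch and the failure isolation described above could be sketched like this. It is a simplified, synchronous stand-in: `Outcome`, `write_or_hint`, `evaluate`, and the callable parameters are hypothetical, and the hint is reduced to a plain dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    classification: Optional[str] = None
    auto_resolved: bool = False
    pre_classification_hint: Optional[dict] = None


NO_OUTCOME = Outcome()  # returned whenever any step fails


def write_or_hint(verdict, score, write_row):
    """Cosmetic writes a row, uncertain returns a hint, semantic does nothing."""
    if verdict == "cosmetic":
        write_row()
        return Outcome("cosmetic", auto_resolved=True)
    if verdict == "uncertain":
        return Outcome("uncertain", pre_classification_hint={"score": score})
    return Outcome("semantic")


def evaluate(classify, write_row):
    """Run classifier and ledger write; any exception degrades to NO_OUTCOME."""
    try:
        verdict, score = classify()
        return write_or_hint(verdict, score, write_row)
    except Exception:
        return NO_OUTCOME


def failing_classifier():
    raise RuntimeError("classifier blew up")


rows = []
cosmetic = evaluate(lambda: ("cosmetic", 0.9), lambda: rows.append("row"))
uncertain = evaluate(lambda: ("uncertain", 0.5), lambda: rows.append("row"))
failed = evaluate(failing_classifier, lambda: rows.append("row"))
print(cosmetic.auto_resolved, uncertain.pre_classification_hint, failed is NO_OUTCOME)
```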

## Production: codegenome/drift_classifier.py (signal heuristic fix)

``_signal_no_new_calls`` simplified per Phase 3 review of test
behaviour: empty-old-AND-empty-new is now treated as ``set() ⊆
set() → 1.0`` (cosmetic) rather than 0.5. Unsupported language
remains 0.5 (extractor returns empty regardless of content). The
prior heuristic conflated "no-calls function" with "extractor
failed" and pushed legitimately-cosmetic changes into the uncertain
band.
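
A sketch of the revised heuristic, assuming the caller knows whether the language is supported by the extractor. The `language_supported` flag and the 0.0 value for a change that introduces new calls are assumptions; the 1.0 and 0.5 cases are as described above.

```python
def signal_no_new_calls(old_calls, new_calls, language_supported):
    """Score whether a change introduced call sites that were not there before."""
    if not language_supported:
        return 0.5  # extractor cannot say anything, so stay neutral
    # set() <= set() is True, so a function with no calls before or after scores 1.0
    return 1.0 if new_calls <= old_calls else 0.0


print(signal_no_new_calls(set(), set(), True))        # 1.0
print(signal_no_new_calls({"a"}, {"a", "b"}, True))   # 0.0
print(signal_no_new_calls(set(), set(), False))       # 0.5
```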

## Tests: tests/test_codegenome_drift_service.py (8 tests, all green)

- ``test_cosmetic_drift_writes_compliance_check_and_returns_auto_resolved``
- ``test_cosmetic_drift_writes_evidence_refs``
- ``test_semantic_drift_returns_no_hint_no_auto_resolve``
- ``test_uncertain_drift_returns_pre_classification_hint``
- ``test_no_subject_identity_falls_through_cleanly``
- ``test_failure_isolated_returns_no_auto_resolve_on_exception``
  (classifier raises)
- ``test_ledger_load_exception_falls_through`` (find_subject_identities
  raises)
- ``test_evaluate_function_under_40_lines`` (Section 4 razor)

## Verification

- 8/8 Phase 3 tests pass on Windows local
- 157/157 broader regression (codegenome + extract_call_sites +
  ledger phase2) clean
- All new functions ≤ 40 LOC; ``drift_service.py`` 249 LOC ≤ 250 cap

## Phase 4 progress

- [x] Phase 1 — schema v13 + contracts (commit 2afd52d)
- [x] Phase 2 — drift classifier + multi-lang categorizers (commit 007d8f0)
- [x] Phase 3 — drift classification service — THIS COMMIT
- [ ] Phase 4 — handler integration (link_commit + resolve_compliance)
- [ ] Phase 5 — M3 benchmark fixture corpus

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#61): Phase 4 Phase 4 — handler integration (link_commit + resolve_compliance)

QOR-process Phase 4 implementation, layer 4 of 5.

## handlers/link_commit.py

New ``_run_drift_classification_pass(ctx, pending, *, commit_hash)``
runs the cosmetic-vs-semantic classification AFTER
``_run_continuity_pass`` (continuity strips moved/renamed first).

Wired via:

    pending, auto_resolved_count = await _run_drift_classification_pass(
        ctx, pending, commit_hash=result["commit_hash"],
    )

Same ``cg_config.enhance_drift`` flag as Phase 3's continuity pass
(O2 audit fix: one feature, one toggle).

For each surviving pending check:

1. Loads region metadata (file_path / span / identity_type) via
   ``ledger.get_region_metadata`` (Phase 3 #60 helper).
2. Reads old + new code bodies via ``ledger.status.get_git_content``.
3. Derives language from file extension via
   ``code_locator.indexing.symbol_extractor.EXTENSION_LANGUAGE``.
4. Calls ``codegenome.drift_service.evaluate_drift_classification``.
5. Dispatches by outcome:
   - ``auto_resolved=True`` → strip from pending, ``compliance_check``
     row already written by drift_service.
   - hint populated → attach via ``p.model_copy(update={...})``,
     keep in pending.
   - neither → keep unchanged.

Failure-isolated at every step. ``_classify_one`` helper extracts
the per-region work to keep ``_run_drift_classification_pass`` body
under the Section 4 razor.

``LinkCommitResponse.auto_resolved_count`` (Phase 1 contract field)
populated with the strip count.
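
The per-pending dispatch can be sketched as follows. This is a synchronous simplification with hypothetical types: `dataclasses.replace` stands in for the `model_copy` call mentioned above, and `classify_one` is passed in as a plain callable.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Pending:
    region_id: str
    pre_classification: Optional[dict] = None


@dataclass(frozen=True)
class Outcome:
    auto_resolved: bool = False
    hint: Optional[dict] = None


def run_drift_classification_pass(pending, classify_one):
    """Return (surviving pending checks, auto_resolved_count)."""
    kept, auto_resolved_count = [], 0
    for p in pending:
        try:
            outcome = classify_one(p)
        except Exception:
            kept.append(p)  # failure-isolated: keep the check unchanged
            continue
        if outcome.auto_resolved:
            auto_resolved_count += 1  # stripped from the pending list
        elif outcome.hint is not None:
            kept.append(replace(p, pre_classification=outcome.hint))
        else:
            kept.append(p)
    return kept, auto_resolved_count


def fake_classify(p):
    # Hypothetical outcomes keyed by region id, for demonstration only.
    if p.region_id == "d":
        raise RuntimeError("metadata missing")
    return {"a": Outcome(auto_resolved=True),
            "b": Outcome(hint={"score": 0.5}),
            "c": Outcome()}[p.region_id]


kept, count = run_drift_classification_pass(
    [Pending("a"), Pending("b"), Pending("c"), Pending("d")], fake_classify)
print([p.region_id for p in kept], count)
```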

## handlers/resolve_compliance.py

``upsert_compliance_check`` call extended with two optional kwargs
plumbed from the caller's ``ComplianceVerdict``:

- ``semantic_status``: caller's claim
  (``"semantically_preserved" | "semantic_change" | None``).
- ``evidence_refs``: free-form audit trail strings.

``ResolveComplianceAccepted`` echoed entries now carry the caller's
``semantic_status`` so the response reflects the persisted state.

Backward-compatible: legacy callers that don't supply the fields
get NULL / [] persisted (Phase 1 schema defaults).

## Tests

### tests/test_codegenome_phase4_link_commit.py (9 tests, all green)

- Off-mode tests: flag disabled / config missing / pending empty.
- Cosmetic strip + auto_resolved_count increment.
- Semantic pendings unchanged (no hint, no strip).
- Uncertain pendings get ``pre_classification`` hint attached.
- Failure isolation: classifier exception → unchanged pending list.
- Missing region metadata → unchanged pending.
- ``LinkCommitResponse.auto_resolved_count`` exists with default 0.

### tests/test_codegenome_phase4_resolve_compliance.py (5 tests, all green)

- Caller verdict with ``semantic_status`` persists to row.
- Legacy caller (no ``semantic_status``) persists NULL / [] defaults.
- ``evidence_refs`` round-trip end-to-end.
- F2 regression: Pydantic rejects dropped ``pre_classification_hint``
  enum value at the contract layer.
- Response ``ResolveComplianceAccepted.semantic_status`` echoes the
  caller's claim.

## Verification

- 14/14 Phase 4 handler tests pass on Windows local
- 182/182 broader regression (codegenome + extract_call_sites +
  ledger phase2 + resolve_compliance) clean
- All new functions ≤ 40 LOC; ``_run_drift_classification_pass`` 50
  lines (within docstring slack), ``_classify_one`` ≤ 50 lines.

## Phase 4 progress

- [x] Phase 1 — schema v13 + contracts (commit 2afd52d)
- [x] Phase 2 — drift classifier + multi-lang categorizers (commit 007d8f0)
- [x] Phase 3 — drift classification service (commit ac2b380)
- [x] Phase 4 — handler integration — THIS COMMIT
- [ ] Phase 5 — M3 benchmark fixture corpus (30 fixtures across 7
      languages + integration test)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#61): Phase 4 Phase 5 — M3 benchmark corpus + integration test

QOR-process Phase 4 implementation, layer 5 of 5. **Phase 4 COMPLETE.**

## Plan deviation (documented)

Plan v3 called for 30 paired old/new files on disk. After
implementation we collapsed the corpus to a single ``cases.py``
module containing all 30 cases as a list of dicts. Same fixture
coverage, one file instead of 60, easier to maintain. Identical
contract for ``test_m3_benchmark.py`` to consume. Documented in
``tests/fixtures/m3_benchmark/__init__.py``.

## Corpus: tests/fixtures/m3_benchmark/cases.py (30 cases)

Each case: ``{id, language, old, new, expected}`` where
``expected`` is one of ``cosmetic | semantic | uncertain``.

Coverage per audit v2 §F5:
  Python (12): 4 cosmetic + 4 semantic + 4 uncertain
  JavaScript (3): cosmetic + semantic + uncertain
  TypeScript (3): cosmetic + semantic + uncertain
  Go (3): cosmetic + semantic + uncertain
  Rust (3): cosmetic + semantic + uncertain
  Java (3): cosmetic + semantic + uncertain
  C# (3): cosmetic + semantic + uncertain
  TOTAL = 30

## Tests: tests/test_m3_benchmark.py (7 tests, all green)

- 4 issue exit criteria (Python: docstring add, import reorder,
  logic removal, signature change).
- ``test_m3_precision_at_least_90_percent`` — false-positive rate
  on auto-resolved cosmetic cases must be < 5%. Currently passes
  with 0 false positives.
- ``test_corpus_has_30_cases``, ``test_corpus_ids_are_unique`` —
  sanity bounds.
- Language-coverage assertion: every supported language present.
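
One way to read the false-positive check is sketched below. The three sample cases, the toy classifier, and the exact definition of the rate (share of cases predicted cosmetic whose expected label is not cosmetic) are assumptions; the case shape `{id, language, old, new, expected}` follows the corpus description.

```python
CASES = [  # tiny hypothetical corpus in the documented case shape
    {"id": "py-comment-add", "language": "python",
     "old": "def f(x):\n    return x\n",
     "new": "def f(x):\n    # identity\n    return x\n",
     "expected": "cosmetic"},
    {"id": "py-logic-change", "language": "python",
     "old": "def f(x):\n    return x\n",
     "new": "def f(x):\n    return x + 1\n",
     "expected": "semantic"},
    {"id": "py-signature-change", "language": "python",
     "old": "def f(x):\n    return x\n",
     "new": "def f(x, y):\n    return x\n",
     "expected": "semantic"},
]


def code_lines(src):
    """Non-blank, non-comment lines with surrounding whitespace removed."""
    stripped = (ln.strip() for ln in src.splitlines())
    return [ln for ln in stripped if ln and not ln.startswith("#")]


def toy_predict(case):
    """Toy stand-in for the classifier: cosmetic if the code lines are identical."""
    same = code_lines(case["old"]) == code_lines(case["new"])
    return "cosmetic" if same else "semantic"


def false_positive_rate(cases, predict):
    """Fraction of auto-resolved (predicted cosmetic) cases that were not cosmetic."""
    auto = [c for c in cases if predict(c) == "cosmetic"]
    if not auto:
        return 0.0
    return sum(c["expected"] != "cosmetic" for c in auto) / len(auto)


ids = [c["id"] for c in CASES]
print(false_positive_rate(CASES, toy_predict), len(ids) == len(set(ids)))
```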

## Verification

- 7/7 M3 benchmark tests pass on Windows local
- 189/189 broader regression (codegenome + extract_call_sites +
  m3_benchmark + ledger phase2 + resolve_compliance) clean
- All new functions ≤ 40 LOC

## Phase 4 — DONE

- [x] Phase 1 — schema v13 + contracts (commit 2afd52d)
- [x] Phase 2 — drift classifier + multi-lang categorizers (commit 007d8f0)
- [x] Phase 3 — drift classification service (commit ac2b380)
- [x] Phase 4 — handler integration (commit 6ce6320)
- [x] Phase 5 — M3 benchmark corpus — THIS COMMIT

Issue #61 acceptance criteria satisfied:

✅ M3 fixture: docstring addition → cosmetic (auto-resolved)
✅ M3 fixture: import reordering → not-semantic
✅ M3 fixture: logic removal → not-cosmetic
✅ M3 fixture: function signature change → not-cosmetic
✅ compliance_check rows for auto-resolved cases include
   semantic_status + evidence_refs (Phase 1+3 plumbing,
   Phase 4 wiring)
✅ M3 false-positive rate on benchmark corpus: 0% (< 5% target)
✅ Integration test ``test_m3_benchmark.py`` against fixture
   corpus passes

Next: ``/qor-substantiate`` (full regression seal) → ``/qor-document``
→ open PR ``claude/codegenome-phase-4-qor → BicameralAI/dev``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* seal(#61): Phase 4 substantiation — Reality = Promise

QOR-process Phase 4 SESSION SEAL. META_LEDGER Entry #14.

Verdict: REALITY = PROMISE.

5 phases sealed in sequence (66a209 → 7a79dc53a0fc8c6bbc68709f30a8). All issue #61 acceptance criteria met:

- M3 fixture: docstring add → cosmetic ✓
- M3 fixture: import reorder → not-semantic ✓
- M3 fixture: logic removal → not-cosmetic ✓
- M3 fixture: signature change → not-cosmetic ✓
- compliance_check rows include semantic_status + evidence_refs ✓
- M3 false-positive rate: 0% (< 5% target) ✓
- test_m3_benchmark.py integration test passes ✓

189/189 regression clean. All 13 new production files ≤ 250 LOC.

## Plan deviations (documented in Entry #14)

1. Schema renumbered v13 → v14 mid-substantiation per Obs-V3-1 (PR
   #81 merged first claiming v13 = provenance FLEXIBLE; Phase 4
   migration shifted to v14 = compliance_check CHANGEFEED +
   semantic_status + evidence_refs).
2. §Phase 5 fixture collapse — 30 paired files → single cases.py
   data module. Same coverage; identical test runner contract.
3. Test files exceed 250-LOC razor cap (consistent with prior
   phases; razor primarily protects production code).

## Chain integrity

Genesis 29dfd085 → ... → Phase 4 Audit v3 PASS 21ac210f → SEAL 0ebcf69b

## Next

`/qor-document` (update SKILL.md files for the new
LinkCommitResponse + ComplianceVerdict shapes per
"Tool Changes Require Skill Changes" rule), then open PR
claude/codegenome-phase-4-qor → BicameralAI/dev.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(#61): /qor-document — CHANGELOG v0.13.0 + bicameral-sync SKILL.md update

Phase 4 (#61) documentation pass per CLAUDE.md "Tool Changes Require
Skill Changes" rule. The Phase 4 commits changed two MCP tool
contracts that callers see directly:

- LinkCommitResponse:
  + auto_resolved_count (new field, default 0)
  + pending_compliance_checks[].pre_classification (new optional hint)

- ComplianceVerdict (input to resolve_compliance):
  + semantic_status (optional)
  + evidence_refs (optional)

- ResolveComplianceAccepted:
  + semantic_status (echoes caller claim)

## skills/bicameral-sync/SKILL.md

- Replaced the existing Phase 3 enhance_drift callout (continuity
  matcher only) with a Phase 3+4 callout covering BOTH passes:
  (1) continuity matcher — strips moved/renamed regions; (2) NEW
  cosmetic-vs-semantic classifier — strips cosmetic-only regions
  and reports auto_resolved_count.
- Documented the typed pre_classification hint on surviving
  pendings (advisory; caller verdict still wins).
- Extended the resolve_compliance verdict-call shape with the
  optional semantic_status + evidence_refs fields.

## CHANGELOG.md

- Prepended v0.13.0 entry above v0.12.0. Covers all Phase 4
  additions (drift classifier, multi-language line categorizers,
  call_site_extractor, schema v14, contract extensions, M3
  benchmark with 0% false-positive rate).

## Verification

- 163/163 codegenome + extract_call_sites + m3_benchmark regression
  still green (skill/CHANGELOG changes don't touch behavior).
- Version markers consistent: CHANGELOG v0.13.0,
  SCHEMA_COMPATIBILITY[14] = "0.13.0".

Files NOT touched (deliberately):
- README.md — no end-user install/usage surface changed
- skills/bicameral-resolve-collision/SKILL.md — collision skill,
  unaffected by Phase 4
- skills/bicameral-drift/SKILL.md — Phase 3 work didn't update it
  either; consistency favors a future doc sweep

Next: open PR claude/codegenome-phase-4-qor → BicameralAI/dev.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin added a commit that referenced this pull request May 6, 2026
…+Gemini+Codex-2 (#205)

Authors a single Reviewer Disposition Pass table at the top of the
brief reconciling all 32 review points across four review layers
(Codex first-pass, Kilo, Gemini CLI, Codex second-pass) into one
post-review consensus before downstream P1 issue-filing — per the
explicit Codex-2 #1 directive.

Decisions: 21 applied this commit, 6 already applied in 1d82658,
3 deferred to follow-up, 2 note-only. Net new gap IDs added per
disposition: GDPR-08 (ephemeral data), GDPR-09 (consent versioning
+ revocation), LLM-11 (cross-tool config-file modification surface),
MCP-01 (host UX as external dependency), CFG-01 (config precedence
+ fail-closed model). Reclassification: LLM-06 P0/M → P1/M with
scope narrowed to future remote-skill-loading channel (per Kilo #2).

Major content additions to the brief:

- § 1.1: MCP host UX is external dependency, not security gate (new
  gap MCP-01) — host that auto-approves tool calls bypasses any
  "operator will see this" assumption.
- § 1.2: SurrealDB version pinning supply-chain callout (Kilo #11).
- § 1.7: cross-tool config-file modification surface (new gap LLM-11)
  distinct from skill-content surface — `setup_wizard` writes shell
  commands into `.claude/settings.json` that run host-side at hook fire.
- § 1.11 (new): Configuration precedence + fail-closed model — single
  uniform precedence rule across all knobs (env > config.yaml >
  hardcoded defaults), fail-closed semantics on missing/malformed/
  contradictory config (Codex-2 #5).
- § 2.4 (a): LLM02 mapping note clarifying it folds into LLM-07 +
  OWASP-04 (Kilo #13).
- § 2.4 (b): explicit `confirm=True` is agent-supplied not HITL
  (Kilo #3) — security context cannot rely on agent-filled params.
- § 2.4 (c) LLM-01 + LLM-04: extensible classifier (Gemini #2) +
  guardrail-not-classifier framing (Codex-1 #6) + control-acceptance
  template (Codex-2 #4) — quarantine, override, test fixtures,
  measurement counters.
- § 2.4 (c) LLM-03: timeouts as `.bicameral/config.yaml` knobs (Gemini #3).
- § 2.4 (c) LLM-05 + LLM-09: out-of-band operator confirmation, not
  agent-supplied confirm parameters (Kilo #3).
- § 2.4 (c) LLM-06: scope-narrowed to future remote-skill-loading; in
  current install model the wheel-trust covers it (Kilo #2).
- § 2.4 (c) LLM-11 (new): cross-tool config-file gate (signed
  hooks-manifest.json) distinct from skill manifest.
- § 2.1 (c) GDPR-01: three remediation candidates — tombstone-and-
  rebuild with signed manifest (Kilo #12), crypto-shredding (Gemini
  #1), or scope-out via PII detect-and-refuse.
- § 2.1 (c) GDPR-02: data-subject-access search must cover full
  identifier surface (description, source_ref, topic, file paths) not
  just signer email (Codex-1 #5).
- § 2.1 (c) GDPR-08 (new): ephemeral data surfaces (tempfiles, swap,
  WAL, crash dumps) (Kilo #7).
- § 2.1 (c) GDPR-09 (new): consent versioning + revocation semantics
  (Kilo #8 + Codex-2 #3).
- § 5: gap table updated with new rows + LLM-06 reclassification;
  gap counts post-disposition (5 P0 / 19 P1 / 16 P2 / 5 P3 = 45 total,
  up from 41).
- § 6.1 (new): epic grouping for deferred P1 batch (Codex-1 #10) —
  ingest boundary guardrails, per-tool authority gradation, supply-
  chain signing, telemetry & consent.
- § 6.2 (new): six-section control-acceptance template for every DG
  gap (Codex-2 #4) — positive / negative / bypass / fail-closed /
  telemetry / docs.

Filed-issue updates:
- Issue #214 (LLM-06): relabeled P0 → P1, retitled to reflect scope
  narrowing, full disposition comment added.
- Issue #212 (LLM-01) + #213 (LLM-04): disposition comments added
  capturing the guardrail framing, classifier extensibility, and
  control-acceptance template applicable to both.

Deferred for follow-up: Codex-1 #4 (controller/processor
restructure of standards table), Codex-1 #9 (full evidence appendix
beyond the methodology softening), Codex-2 #2 (full 3-column
deployment-profile matrix beyond the single-column trigger).

Brief now 706 lines (up from 606); +124 line diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>