feat(#44): LLM drift judge — uncertain-band sub-protocol by Knapp-Kevin · Pull Request #103 · BicameralAI/bicameral-mcp

Knapp-Kevin · 2026-04-29T16:59:34Z

Summary

Extends Phase 4's deterministic drift classifier with the LLM-judge layer. When a region scores in the uncertain band [0.30, 0.80), the bicameral-sync skill now applies a two-axis rubric: Axis 1 (compliance — does the new code meet the decision?) decided FIRST, then Axis 2 (cosmetic vs semantic — was the change behaviour-preserving?). No new tools, no new contracts — leverages Phase 4's semantic_status + evidence_refs fields already on ComplianceVerdict.
Adds 10 hand-authored expected_judge ground-truth labels to the M3 benchmark uncertain-band cases. These are the targets the operator QC pass measures the LLM judge against post-merge.
New training doc docs/training/cosmetic-vs-semantic.md walks the concept end-to-end with a worked example from py_12_constant_value_tuned. Pairs with the SKILL.md rubric.
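
The routing described in the summary can be sketched as below. This is a minimal illustration, not the repository's code: only the band edges 0.30 and 0.80 and the half-open interval come from the PR text, while the function name `route_region` and the labels `"llm_judge"` / `"deterministic"` are hypothetical.

```python
# Hypothetical sketch: only the band edges come from the PR summary.
UNCERTAIN_LOW = 0.30   # inclusive lower edge of the uncertain band
UNCERTAIN_HIGH = 0.80  # exclusive upper edge of the uncertain band


def route_region(score: float) -> str:
    """Return which layer decides a region with the given score."""
    if UNCERTAIN_LOW <= score < UNCERTAIN_HIGH:
        # Uncertain band: the caller LLM applies the two-axis rubric.
        return "llm_judge"
    # Outside the band, the deterministic classifier's result is kept.
    return "deterministic"


if __name__ == "__main__":
    for s in (0.10, 0.30, 0.79, 0.80):
        print(f"{s:.2f} -> {route_region(s)}")
```

Because the interval is half-open, a score of exactly 0.30 goes to the judge and a score of exactly 0.80 does not.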

Linked issues

Closes #44

Refs:

CodeGenome Phase 4: Semantic Drift Evaluation in resolve_compliance (M3) #61 (CodeGenome Phase 4 — upstream deterministic classifier)
docs: development cycle reference + demos/guides/training scaffolding #93 (DEV_CYCLE.md — soft dependency; see "Soft dependencies" below)

Plan / Audit / Seal

Plan: plan-codegenome-llm-drift-judge.md (commit d846a4a, content hash sha256:7094b9b6…)
Audit: META_LEDGER Entry #15, verdict PASS post-remediation. Chain hash 536dd15f…. Audit history: v1 VETO at b15c9ef (F-1 grounding error: pilot/mcp/skills/ referenced by CLAUDE.md but does not exist on dev; SG-PLAN-GROUNDING-DRIFT instance #2, catalogued at SHADOW_GENOME #5); v2 PASS at d846a4a after mechanical remediation.
Seal: META_LEDGER Entry #16, substantiation seal sha256:567170e0…. Reality matches Promise; one documented plan deviation (training README created vs modified; see "Soft dependencies").

Test plan

pytest tests/test_m3_benchmark_judge_corpus.py -q (4/4 — Phase 1 corpus contract)
pytest tests/test_skill_uncertain_protocol.py -q (4/4 — Phase 2 SKILL.md conformance)
pytest tests/test_m3_benchmark.py -q (regression on existing M3 — no behaviour change expected)
pytest tests/test_codegenome_drift_classifier.py tests/test_codegenome_drift_service.py -q (regression on Phase 4 — data-only plan should not affect classifier)
Targeted sweep total: 48/48 (8 new + 40 regression)
Operator QC pass (post-merge): run bicameral-sync skill against the 10 uncertain-band cases; compare LLM verdicts to expected_judge ground truth. Pass thresholds per plan §D6 #5: 0% FP on cosmetic-claimed verdicts, ≤ 20% FN. Record outcome as PR comment or META_LEDGER addendum. Qualitative gate, not CI-blocking.
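
The two QC thresholds can be checked mechanically once the judge's outputs are recorded. The sketch below is one reading of them, not the plan's definition. It assumes a false positive is a case the judge labels cosmetic while the ground truth does not, a false negative is a ground-truth cosmetic case the judge fails to label cosmetic, and the FN rate is taken over ground-truth cosmetic cases. The function name `qc_pass` and the status strings `"cosmetic"` / `"semantic"` are hypothetical.

```python
from typing import Iterable, Tuple

# Each pair is (judge_semantic_status, expected_semantic_status).
Pair = Tuple[str, str]


def qc_pass(pairs: Iterable[Pair], max_fn_rate: float = 0.20) -> bool:
    """Apply the 0% FP / <= 20% FN thresholds under the stated assumptions."""
    pairs = list(pairs)
    # FP: the judge claimed "cosmetic" but the ground truth disagrees.
    false_pos = sum(1 for judged, expected in pairs
                    if judged == "cosmetic" and expected != "cosmetic")
    # FN: the ground truth is "cosmetic" but the judge said otherwise.
    truly_cosmetic = [p for p in pairs if p[1] == "cosmetic"]
    false_neg = sum(1 for judged, _ in truly_cosmetic if judged != "cosmetic")
    fn_rate = false_neg / len(truly_cosmetic) if truly_cosmetic else 0.0
    return false_pos == 0 and fn_rate <= max_fn_rate


if __name__ == "__main__":
    # Made-up outcomes for illustration only, not benchmark results.
    sample = [("cosmetic", "cosmetic")] * 4 + [("semantic", "cosmetic")]
    print(qc_pass(sample))
```

The asymmetry is deliberate in this reading: a wrongly suppressed drift flag (FP) is never tolerated, while a cosmetic change that still gets flagged (FN) is tolerated up to the stated rate.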

What this PR is NOT

Not a new tool, schema migration, or contract change.
Not server-side — judge runs in the caller LLM via the skill rubric. Preserves the docs/CONCEPT.md anti-goal "Not an LLM-powered ledger".
Not a v0.14.0 release. Source version stays at v0.13.x; release PR is Jin's call.
Not the telemetry surface (acceptance criterion #3 of #44, bicameral.usage_summary cosmetic_drift_pct). That gates on PR #95 (feat: local telemetry counters + usage_summary + first-boot consent (v0.14.0)) landing first; deferred to a follow-up issue.

Soft dependencies (read before merging)

PR #93 (docs: development cycle reference + demos/guides/training scaffolding; DEV_CYCLE.md): this PR creates docs/training/README.md as a minimal index because PR #93, which scaffolds the same file with a richer template, hasn't merged to dev yet. Standard merge resolution will reconcile when one or both PRs land. No reviewer action needed; flagged here so the file isn't read as an unauthorised addition.

Files changed (7)

| File | Change | LOC |
| --- | --- | --- |
| `tests/test_m3_benchmark_judge_corpus.py` | NEW | 83 |
| `tests/test_skill_uncertain_protocol.py` | NEW | 96 |
| `docs/training/cosmetic-vs-semantic.md` | NEW | 198 |
| `docs/training/README.md` | NEW (mirrors PR #93 scaffolding) | 49 |
| `tests/fixtures/m3_benchmark/cases.py` | MODIFIED (+40 LOC; expected_judge on 10 uncertain cases) | 391 → 431 |
| `skills/bicameral-sync/SKILL.md` | MODIFIED (+61 LOC; §2.bis Uncertain-band sub-protocol) | 150 → 211 |
| `CHANGELOG.md` | MODIFIED (`[Unreleased]` entry added) | n/a |

Plus governance updates (META_LEDGER #15-16, SYSTEM_STATE.md #44 snapshot, SHADOW_GENOME #5).

Reviewer ask

@jinhongkuan — the SKILL.md rubric is the contract the bicameral-sync caller LLM follows. Worth one read for tone/clarity. Risk grade L1 (single reviewer per DEV_CYCLE.md §4.4); no security-pass note needed.

Specifically: §2.bis as written calls Axis 1 (compliance) the first judgment and treats not_relevant as a short-circuit that skips Axis 2. Verify this matches your mental model — if you'd rather see the axes evaluated in parallel and reconciled at emit time, this is the place to say so.
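
The ordering in question can be sketched as control flow. In the PR the rubric is text that the caller LLM follows, so the code below is only an illustration of the described order, with the two axes passed in as callables. `JudgeOutput` and `judge_uncertain` are hypothetical names, and every verdict or status string other than `not_relevant` is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class JudgeOutput:
    verdict: str                    # Axis 1 result (compliance)
    semantic_status: Optional[str]  # Axis 2 result, None when Axis 2 is skipped


def judge_uncertain(
    axis1: Callable[[], str],
    axis2: Callable[[], str],
) -> JudgeOutput:
    """Evaluate Axis 1 first; short-circuit on not_relevant."""
    verdict = axis1()
    if verdict == "not_relevant":
        # Short-circuit: Axis 2 is never evaluated for this region.
        return JudgeOutput(verdict, None)
    return JudgeOutput(verdict, axis2())


if __name__ == "__main__":
    calls = []

    def axis2() -> str:
        calls.append("axis2")
        return "cosmetic"  # placeholder status

    print(judge_uncertain(lambda: "not_relevant", axis2), calls)
    print(judge_uncertain(lambda: "compliant", axis2), calls)  # placeholder verdict
```

A parallel variant would call both axes unconditionally and reconcile the two results when emitting, at the cost of running Axis 2 on regions that turn out to be irrelevant.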

coderabbitai · 2026-04-29T16:59:41Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.


Issue BicameralAI#44 (P2) plan, targeting v0.14.0.

Three architectural calls:
- D1. Skill-side, not server-side: preserves the local-first / LLM-free-server anti-goal in docs/CONCEPT.md.
- D2. Caching is already free via Phase 4's compliance_check writes (semantic_status + evidence_refs persisted).
- D3-D4. Reuses existing typed contracts: no new fields, no new tools, no schema changes. The judge maps to the existing two-axis output (verdict + semantic_status).
- D5. The rubric is data (text in SKILL.md), not Python code. The LLM follows it during the uncertain-band sub-protocol.
- D6. Five exit criteria: four CI-checkable + one operator QC pass.

Two phases:
- Phase 1: M3 benchmark corpus extension. Every uncertain case gains expected_judge: {verdict, semantic_status}. Pure data.
- Phase 2: Skill rubric. bicameral-sync SKILL.md gains an Uncertain-band sub-protocol section + paired training doc in docs/training/cosmetic-vs-semantic.md.

Risk grade L1 (skill rubric + docs + test data; no production code paths).

Telemetry surface (acceptance criterion 3 of BicameralAI#44) deferred to a follow-up gated on PR BicameralAI#95 landing; flagged in plan Open Question 3.

Branched off BicameralAI/dev post-Phase-4 seal (200dbd5).
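
The Phase 1 contract, that every uncertain case carries an `expected_judge` with both `verdict` and `semantic_status`, lends itself to a simple check. The sketch below assumes each case is a dict with hypothetical `id` and `band` keys; the real layout of `cases.py` is not shown in this PR, and the sample cases are invented for illustration.

```python
from typing import Any, Dict, List

# The two keys named in the plan for expected_judge.
REQUIRED_KEYS = {"verdict", "semantic_status"}


def uncertain_cases_missing_labels(cases: List[Dict[str, Any]]) -> List[str]:
    """Return ids of uncertain-band cases without a complete expected_judge."""
    missing = []
    for case in cases:
        if case.get("band") != "uncertain":  # hypothetical field name
            continue
        label = case.get("expected_judge")
        if not isinstance(label, dict) or not REQUIRED_KEYS.issubset(label):
            missing.append(case["id"])
    return missing


if __name__ == "__main__":
    # Invented cases; ids and label values are placeholders.
    sample = [
        {"id": "case_a", "band": "uncertain",
         "expected_judge": {"verdict": "v", "semantic_status": "s"}},
        {"id": "case_b", "band": "uncertain",
         "expected_judge": {"verdict": "v"}},
        {"id": "case_c", "band": "certain"},
    ]
    print(uncertain_cases_missing_labels(sample))
```

A pytest test could then assert that this list is empty for the benchmark corpus.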

Audit found:
- F-1 (BLOCKING): pilot/mcp/skills/bicameral-sync/SKILL.md does not exist on dev; plan instruction to modify it was unimplementable. Trusted CLAUDE.md's stale 'canonical' claim over filesystem empirics. SG-PLAN-GROUNDING-DRIFT instance BicameralAI#2 in this session.
- F-2: SKILL.md baseline 138 LOC was wrong (actual 150).
- F-3: ruff exclusion is `tests` directory-wide, not just `tests/fixtures/**`.

Remediation:
- Drop the pilot/mcp affected-file bullet from Phase 2.
- Drop test_pilot_skill_md_matches_skills_skill_md from Phase 2 test count (5 → 4 tests, ~70 → ~60 LOC).
- Add explanatory note: CLAUDE.md's pilot/mcp/skills/ reference is stale; flagged for separate cleanup workstream.
- Update SKILL.md baseline 138 → 150 LOC (target 200 not 190).
- Reword ruff exemption claim from `tests/fixtures/**` to "entire `tests/` directory".

The plan's intent is preserved: skill rubric extension + test corpus expansion + training doc, all on the empirically canonical skills/ path. Re-audit pending.
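
Finding F-1 is the kind of error a filesystem check catches before an audit does. The sketch below is a hypothetical helper, not part of the project's tooling: `ungrounded_paths` takes a repository root and the paths a plan refers to, and returns those that do not exist. The demo builds a throwaway directory that mimics the F-1 situation.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def ungrounded_paths(repo_root: Path, referenced: Iterable[str]) -> List[str]:
    """Return the referenced paths that do not exist under repo_root."""
    return [rel for rel in referenced if not (repo_root / rel).exists()]


if __name__ == "__main__":
    # Build a temporary tree where only the skills/ copy of SKILL.md exists.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        skill = root / "skills" / "bicameral-sync" / "SKILL.md"
        skill.parent.mkdir(parents=True)
        skill.write_text("# placeholder\n", encoding="utf-8")

        plan_refs = [
            "skills/bicameral-sync/SKILL.md",
            "pilot/mcp/skills/bicameral-sync/SKILL.md",
        ]
        print(ungrounded_paths(root, plan_refs))
```

Checking against the working tree rather than against another document's claims is the point: the plan trusted CLAUDE.md, and the filesystem disagreed.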

…ameralAI#5 — audit PASS

- META_LEDGER BicameralAI#15: GATE TRIBUNAL entry covering both audit iterations (v1 VETO at b15c9ef, v2 PASS at d846a4a). Chain hash 536dd15f extends from BicameralAI#14 Phase 4 SEAL.
- SHADOW_GENOME BicameralAI#5: SG-PLAN-GROUNDING-DRIFT instance BicameralAI#2 catalogued. Cross-references PR BicameralAI#93 §9 as instance #1 (same root cause: CLAUDE.md asserts pilot/mcp/skills/ canonicality but dev HEAD has no pilot/ directory).

Followup: docs:claude-md-cleanup workstream to fix CLAUDE.md itself.

Plan PASS at d846a4a; chain to /qor-implement.

Phase 1, M3 benchmark judge-corpus extension:
- tests/test_m3_benchmark_judge_corpus.py (4 tests, 83 LOC)
- tests/fixtures/m3_benchmark/cases.py: expected_judge field added to all 10 uncertain cases (pure data, ground-truth labels for the operator QC pass)

Phase 2, bicameral-sync skill rubric + training doc:
- tests/test_skill_uncertain_protocol.py (4 tests, 96 LOC)
- skills/bicameral-sync/SKILL.md §2.bis: Uncertain-band sub-protocol section: Axis 1 (compliance) FIRST, Axis 2 (cosmetic-vs-semantic) SECOND, signals advisory, evidence_refs echoed back. Maps to existing typed contracts (no new fields).
- docs/training/cosmetic-vs-semantic.md (198 LOC): concept doc with worked example from py_12_constant_value_tuned. Pairs with the rubric.
- docs/training/README.md: index with cosmetic-vs-semantic active row. Soft-depends on PR BicameralAI#93's docs/training/ scaffolding; this branch creates a minimal version that PR BicameralAI#93 will reconcile on merge.

Validation:
- Phase 1 + Phase 2 new tests: 8/8 green.
- M3 + drift_classifier regression: 32/32 green.
- Total: 40/40 green in the targeted sweep.

Razor:
- All test files ≤ 96 LOC (cap 250).
- All test functions ≤ 25 LOC (cap 40).
- cases.py 431 LOC under tests/ ruff exclusion.
- No production code changes; no schema changes; no new contracts; no new tools; no new dependencies.

CHANGELOG: [Unreleased] entry added under Added.

Plan: plan-codegenome-llm-drift-judge.md (audit PASS at META_LEDGER
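
The Razor caps quoted in this commit message (250 LOC per file, 40 LOC per function) could be checked with the standard `ast` module. How the project actually counts LOC is not stated, so the sketch below makes two assumptions: file length is the number of physical lines, and function length runs from the `def` line to the function's last line. Both helper names are hypothetical.

```python
import ast
from typing import Dict


def function_lengths(source: str) -> Dict[str, int]:
    """Map each function name to its length in physical lines."""
    lengths: Dict[str, int] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is available on AST nodes from Python 3.8 onward.
            lengths[node.name] = node.end_lineno - node.lineno + 1
    return lengths


def within_razor(source: str, file_cap: int = 250, func_cap: int = 40) -> bool:
    """True if the file and every function in it stay within the caps."""
    file_loc = len(source.splitlines())
    funcs_ok = all(n <= func_cap for n in function_lengths(source).values())
    return file_loc <= file_cap and funcs_ok


if __name__ == "__main__":
    short = "def f():\n    return 1\n"
    long_func = "def g():\n" + "    x = 1\n" * 40
    print(function_lengths(short), within_razor(short))
    print(function_lengths(long_func), within_razor(long_func))
```

Functions that share a name, such as methods on different classes, overwrite each other in this dictionary, so a production check would need qualified names.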

Knapp-Kevin · 2026-04-29T17:13:25Z

Rebased onto current dev HEAD — 9bb5127 → 86c4d0c (force-pushed with --force-with-lease). Reviewers, please re-fetch.

Conflicts resolved (3 files)

CHANGELOG.md: PR #95 (feat: local telemetry counters + usage_summary + first-boot consent (v0.14.0)) cut the v0.14.0 section while we worked. Folded #44's three bullets into the existing v0.14.0 `### Added` block (renamed to `### Added (continued — Issue #44, LLM drift judge)` for grouping clarity), updated `### Closes` to `#39, #42, #44`.
docs/training/README.md: PR #93 landed the richer training scaffolding template (concept-warranting examples + DEV_CYCLE.md §8 link). Took PR #93's version verbatim; added the `Cosmetic vs semantic drift` row to its index. Soft dependency from the original PR description is now resolved.
tests/fixtures/m3_benchmark/cases.py: PR #102's new ruff/mypy CI reformatted the file (collapsed multiline dict entries to single lines). Took dev's reformatted version; re-added the `expected_judge` field to all 10 uncertain cases.

Validation

48/48 tests in targeted sweep (8 new + 40 regression on test_m3_benchmark + drift_classifier + drift_service).
New CI gates from PR #102 (ruff + mypy + Windows matrix + secret scan) will run on this push.
META_LEDGER chain integrity preserved: my Entries #15/#16 still chain from #14 (Phase 4 SEAL `0ebcf69b`); no upstream entries between them.
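
Why a rebase can threaten chain integrity is easiest to see in a generic hash chain. The sketch below is not META_LEDGER's actual scheme, which this PR does not describe. It assumes each entry's chain hash is SHA-256 over the previous chain hash concatenated with the entry's content hash; `GENESIS`, `chain_hash`, and `verify_chain` are hypothetical names.

```python
import hashlib
from typing import List, Tuple

GENESIS = "0" * 64  # hypothetical starting value for an empty ledger


def chain_hash(prev_hash: str, content_hash: str) -> str:
    """Assumed scheme: SHA-256 of previous chain hash + content hash."""
    data = (prev_hash + content_hash).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def verify_chain(entries: List[Tuple[str, str]], genesis: str = GENESIS) -> bool:
    """entries holds (content_hash, recorded_chain_hash) in ledger order."""
    prev = genesis
    for content_hash, recorded in entries:
        if chain_hash(prev, content_hash) != recorded:
            return False
        prev = recorded
    return True


if __name__ == "__main__":
    h14 = chain_hash(GENESIS, "entry-14")
    h15 = chain_hash(h14, "entry-15")
    h16 = chain_hash(h15, "entry-16")
    intact = [("entry-14", h14), ("entry-15", h15), ("entry-16", h16)]
    print(verify_chain(intact))

    # An upstream entry inserted after 14 invalidates 15 and 16 as recorded.
    h_up = chain_hash(h14, "upstream")
    broken = [("entry-14", h14), ("upstream", h_up),
              ("entry-15", h15), ("entry-16", h16)]
    print(verify_chain(broken))
```

Under a scheme like this, the chain survives a rebase untouched only when no new entries landed upstream between the predecessor and the rebased entries, which is the condition the validation note reports.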

Status

`MERGEABLE`. Awaiting CI on the new HEAD + Jin's L1 review.

Substantiation seal for plan-codegenome-llm-drift-judge.md (Issue BicameralAI#44, audit PASS at META_LEDGER BicameralAI#15, chain hash 536dd15f...).

Verification gates (10 of 12 passed; 2 advisory skipped per capability shortfalls):
- Reality vs Promise: ✓ all 4 new + 3 modified files exist
- Test audit: 48/48 (8 new + 40 regression on M3 + drift_classifier + drift_service)
- Razor final: all files within caps (test ≤96 LOC, no new production functions, cases.py under tests/ exclusion)
- Skill file integrity: SKILL.md §2.bis structure verified
- SYSTEM_STATE.md synced
- Merkle seal computed: 567170e0f1dc008cd5663201d8b1582dbabb5904
- Step 4.6 reliability sweep: skipped (qor/reliability/ absent)
- Step 7.5 version bump: skipped (per user direction; v0.14.0 release PR is Jin's call)

Plan deviation documented:
- docs/training/README.md created (not modified) on this branch because PR BicameralAI#93 scaffolding hasn't merged to dev. Minimal mirror; merges will reconcile.

Operator QC pass (D6 BicameralAI#5) recorded as pending qualitative gate, not a CI blocker.

Chain: 16 entries; integrity VALID. Next: /qor-document.

jinhongkuan · 2026-04-29T20:06:46Z

"the bicameral-sync skill now applies a two-axis rubric: Axis 1 (compliance — does the new code meet the decision?) decided FIRST, then Axis 2 (cosmetic vs semantic — was the change behaviour-preserving?)"

Wouldn't this degrade performance? If we check cosmetic vs semantic first, we can no-op the compliance check.

EDIT: okay saw that it is deterministic

Knapp-Kevin added the labels enhancement (New feature or request) and flow:feature (Standard feature/fix PR targeting BicameralAI/dev, the default flow) Apr 29, 2026

Knapp-Kevin requested a review from jinhongkuan April 29, 2026 16:59

Knapp-Kevin added 4 commits April 29, 2026 13:09

Knapp-Kevin force-pushed the feat/44-llm-drift-judge branch from 9bb5127 to 86c4d0c Compare April 29, 2026 17:13

Knapp-Kevin temporarily deployed to ci-test April 29, 2026 17:13 — with GitHub Actions Inactive

Knapp-Kevin force-pushed the feat/44-llm-drift-judge branch from 86c4d0c to 5e96e47 Compare April 29, 2026 17:19

Knapp-Kevin temporarily deployed to ci-test April 29, 2026 17:20 — with GitHub Actions Inactive

Knapp-Kevin mentioned this pull request Apr 29, 2026

Route semantic drift and compliance verdicts into a consolidated GovernanceFinding output #110

Closed

This was referenced Apr 29, 2026

feat(#49): sticky PR-comment drift report — GitHub Action + renderer + poster #113

Merged

CI lint for unstructured references in plan files and PR bodies #114

Closed

jinhongkuan approved these changes Apr 29, 2026

View reviewed changes

Knapp-Kevin merged commit bb2e245 into BicameralAI:dev Apr 29, 2026
5 checks passed

This was referenced Apr 29, 2026

label-merged-to-dev workflow fails with 'Resource not accessible by integration' on some PRs #115

Closed

feat(#48): pre-push drift hook + branch-scan CLI subcommand #117

Merged

Knapp-Kevin merged 5 commits into BicameralAI:dev from Knapp-Kevin:feat/44-llm-drift-judge
