feat(#44): LLM drift judge — uncertain-band sub-protocol#103

Merged
Knapp-Kevin merged 5 commits into BicameralAI:dev from Knapp-Kevin:feat/44-llm-drift-judge on Apr 29, 2026

@Knapp-Kevin

Collaborator

Summary

  • Extends Phase 4's deterministic drift classifier with the LLM-judge layer. When a region scores in the uncertain band [0.30, 0.80), the bicameral-sync skill now applies a two-axis rubric: Axis 1 (compliance — does the new code meet the decision?) decided FIRST, then Axis 2 (cosmetic vs semantic — was the change behaviour-preserving?). No new tools, no new contracts — leverages Phase 4's semantic_status + evidence_refs fields already on ComplianceVerdict.
  • Adds 10 hand-authored expected_judge ground-truth labels to the M3 benchmark uncertain-band cases. These are the targets the operator QC pass measures the LLM judge against post-merge.
  • New training doc docs/training/cosmetic-vs-semantic.md walks the concept end-to-end with a worked example from py_12_constant_value_tuned. Pairs with the SKILL.md rubric.
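The band check described in the first bullet can be sketched as below. This is a minimal illustration, not the classifier's actual code: the constant and function names are hypothetical, and only the band edges [0.30, 0.80) come from this PR. What happens to scores outside the band is not covered here.

```python
# Band edges are taken from the PR summary; the names are hypothetical.
UNCERTAIN_LOW = 0.30
UNCERTAIN_HIGH = 0.80


def needs_llm_judge(score: float) -> bool:
    """Return True when the score lies in the half-open band [0.30, 0.80)."""
    # The lower edge is inclusive and the upper edge is exclusive.
    return UNCERTAIN_LOW <= score < UNCERTAIN_HIGH
```

The half-open interval matters at the edges: a score of exactly 0.30 is routed to the judge, while a score of exactly 0.80 is not.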

Linked issues

Closes #44

Refs:

Plan / Audit / Seal

Test plan

  • pytest tests/test_m3_benchmark_judge_corpus.py -q (4/4 — Phase 1 corpus contract)
  • pytest tests/test_skill_uncertain_protocol.py -q (4/4 — Phase 2 SKILL.md conformance)
  • pytest tests/test_m3_benchmark.py -q (regression on existing M3 — no behaviour change expected)
  • pytest tests/test_codegenome_drift_classifier.py tests/test_codegenome_drift_service.py -q (regression on Phase 4 — data-only plan should not affect classifier)
  • Targeted sweep total: 48/48 (8 new + 40 regression)
  • Operator QC pass (post-merge): run the bicameral-sync skill against the 10 uncertain-band cases and compare the LLM verdicts to the expected_judge ground truth. Pass thresholds per plan §D6 #5: 0% FP on cosmetic-claimed verdicts, ≤ 20% FN. Record the outcome as a PR comment or META_LEDGER addendum. Qualitative gate, not CI-blocking.
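One way to score the operator QC pass is sketched below. The definitions are assumptions, since the PR does not spell them out: a false positive is a case the judge labelled cosmetic whose ground truth is not cosmetic, and a false negative is a ground-truth cosmetic case the judge did not label cosmetic. The function names and the "cosmetic" / "semantic" label strings are hypothetical.

```python
from typing import Dict, List, Optional

Label = Dict[str, Optional[str]]  # {"verdict": ..., "semantic_status": ...}


def qc_rates(expected: List[Label], actual: List[Label]) -> Dict[str, float]:
    """Compute FP and FN rates for cosmetic claims (assumed definitions)."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must align case by case")
    pairs = list(zip(expected, actual))
    # Cases where the judge claimed the change was cosmetic.
    claimed = [(e, a) for e, a in pairs if a.get("semantic_status") == "cosmetic"]
    # Cases where the ground truth says the change was cosmetic.
    truly = [(e, a) for e, a in pairs if e.get("semantic_status") == "cosmetic"]
    fp = sum(1 for e, _ in claimed if e.get("semantic_status") != "cosmetic")
    fn = sum(1 for _, a in truly if a.get("semantic_status") != "cosmetic")
    return {
        "fp_rate": fp / len(claimed) if claimed else 0.0,
        "fn_rate": fn / len(truly) if truly else 0.0,
    }


def qc_passes(rates: Dict[str, float]) -> bool:
    """Apply the thresholds quoted above: 0% FP, at most 20% FN."""
    return rates["fp_rate"] == 0.0 and rates["fn_rate"] <= 0.20
```

With only 10 cases in the corpus, a single misjudged case moves either rate by a large step, which is consistent with treating this as a qualitative gate rather than a CI check.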

What this PR is NOT

Soft dependencies (read before merging)

Files changed (7)

| File | Change | LOC |
|---|---|---|
| tests/test_m3_benchmark_judge_corpus.py | NEW | 83 |
| tests/test_skill_uncertain_protocol.py | NEW | 96 |
| docs/training/cosmetic-vs-semantic.md | NEW | 198 |
| docs/training/README.md | NEW (mirrors PR #93 scaffolding) | 49 |
| tests/fixtures/m3_benchmark/cases.py | MODIFIED (+40 LOC; expected_judge on 10 uncertain cases) | 391 → 431 |
| skills/bicameral-sync/SKILL.md | MODIFIED (+61 LOC; §2.bis Uncertain-band sub-protocol) | 150 → 211 |
| CHANGELOG.md | MODIFIED ([Unreleased] entry added) | |

Plus governance updates (META_LEDGER #15-16, SYSTEM_STATE.md #44 snapshot, SHADOW_GENOME #5).

Reviewer ask

@jinhongkuan — the SKILL.md rubric is the contract the bicameral-sync caller LLM follows. Worth one read for tone/clarity. Risk grade L1 (single reviewer per DEV_CYCLE.md §4.4); no security-pass note needed.

Specifically: §2.bis as written calls Axis 1 (compliance) the first judgment and treats not_relevant as a short-circuit that skips Axis 2. Verify this matches your mental model — if you'd rather see the axes evaluated in parallel and reconciled at emit time, this is the place to say so.
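The ordering under review can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: the rubric itself is text in SKILL.md that the LLM follows, so `judge_region` and its callable parameters are hypothetical stand-ins, and of the label strings only `not_relevant` appears in this PR.

```python
from typing import Callable, Optional, Tuple

# Hypothetical label types; only "not_relevant" is named in this PR.
Verdict = str
SemanticStatus = str


def judge_region(
    axis1: Callable[[], Verdict],
    axis2: Callable[[], SemanticStatus],
) -> Tuple[Verdict, Optional[SemanticStatus]]:
    """Decide Axis 1 first; skip Axis 2 when the verdict is not_relevant."""
    verdict = axis1()
    if verdict == "not_relevant":
        # Short-circuit: Axis 2 is never evaluated for irrelevant regions.
        return verdict, None
    return verdict, axis2()
```

The parallel alternative mentioned above would call both axes unconditionally and reconcile the two results when emitting the verdict.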

Knapp-Kevin added the enhancement and flow:feature labels on Apr 29, 2026
Knapp-Kevin requested a review from jinhongkuan on April 29, 2026 16:59
@coderabbitai

coderabbitai Bot commented Apr 29, 2026


Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Issue BicameralAI#44 (P2) plan, targeting v0.14.0. Three architectural calls:

D1. Skill-side, not server-side — preserves the local-first /
    LLM-free-server anti-goal in docs/CONCEPT.md.
D2. Caching is already free via Phase 4's compliance_check writes
    (semantic_status + evidence_refs persisted).
D3-D4. Reuses existing typed contracts — no new fields, no new
       tools, no schema changes. The judge maps to the existing
       two-axis output (verdict + semantic_status).
D5. The rubric is data — text in SKILL.md — not Python code. The
    LLM follows it during the uncertain-band sub-protocol.
D6. Five exit criteria, four CI-checkable + one operator QC pass.

Two phases:
- Phase 1: M3 benchmark corpus extension — every uncertain case
  gains expected_judge: {verdict, semantic_status}. Pure data.
- Phase 2: Skill rubric — bicameral-sync SKILL.md gains an
  Uncertain-band sub-protocol section + paired training doc
  in docs/training/cosmetic-vs-semantic.md.
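
The Phase 1 corpus contract could be checked along these lines. This is a sketch only: the actual layout of cases.py is not shown in this PR, so the `id` and `band` fields and the function name are hypothetical. Only the `expected_judge: {verdict, semantic_status}` shape comes from the plan.

```python
from typing import Any, Dict, List

# The two keys named in the plan for every expected_judge label.
REQUIRED_KEYS = {"verdict", "semantic_status"}


def missing_judge_labels(cases: List[Dict[str, Any]]) -> List[str]:
    """Return ids of uncertain-band cases without a well-formed expected_judge."""
    bad = []
    for case in cases:
        # "band" and "id" are hypothetical field names used for illustration.
        if case.get("band") != "uncertain":
            continue
        label = case.get("expected_judge")
        if not isinstance(label, dict) or set(label) != REQUIRED_KEYS:
            bad.append(case["id"])
    return bad
```

A pytest test would then assert that the returned list is empty for the full corpus.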

Risk grade L1 (skill rubric + docs + test data; no production
code paths). Telemetry surface (acceptance criterion 3 of BicameralAI#44)
deferred to a follow-up gated on PR BicameralAI#95 landing — flagged in
plan Open Question 3.

Branched off BicameralAI/dev post-Phase-4 seal (200dbd5).

Audit found:
- F-1 (BLOCKING): pilot/mcp/skills/bicameral-sync/SKILL.md does not
  exist on dev; plan instruction to modify it was unimplementable.
  Trusted CLAUDE.md's stale 'canonical' claim over filesystem
  empirics. SG-PLAN-GROUNDING-DRIFT instance BicameralAI#2 in this session.
- F-2: SKILL.md baseline 138 LOC was wrong (actual 150).
- F-3: ruff exclusion is `tests` directory-wide, not just
  `tests/fixtures/**`.

Remediation:
- Drop the pilot/mcp affected-file bullet from Phase 2.
- Drop test_pilot_skill_md_matches_skills_skill_md from Phase 2
  test count (5 → 4 tests, ~70 → ~60 LOC).
- Add explanatory note: CLAUDE.md's pilot/mcp/skills/ reference is
  stale; flagged for separate cleanup workstream.
- Update SKILL.md baseline 138 → 150 LOC (target 200 not 190).
- Reword ruff exemption claim from `tests/fixtures/**` to "entire
  `tests/` directory".

The plan's intent is preserved: skill rubric extension + test
corpus expansion + training doc, all on the empirically canonical
skills/ path. Re-audit pending.
…ameralAI#5 — audit PASS

- META_LEDGER BicameralAI#15: GATE TRIBUNAL entry covering both audit
  iterations (v1 VETO at b15c9ef, v2 PASS at d846a4a). Chain
  hash 536dd15f extends from BicameralAI#14 Phase 4 SEAL.
- SHADOW_GENOME BicameralAI#5: SG-PLAN-GROUNDING-DRIFT instance BicameralAI#2 catalogued.
  Cross-references PR BicameralAI#93 §9 as instance #1 (same root cause:
  CLAUDE.md asserts pilot/mcp/skills/ canonicality but dev HEAD
  has no pilot/ directory). Followup: docs:claude-md-cleanup
  workstream to fix CLAUDE.md itself.

Plan PASS at d846a4a; chain to /qor-implement.
Phase 1 — M3 benchmark judge-corpus extension:
- tests/test_m3_benchmark_judge_corpus.py (4 tests, 83 LOC)
- tests/fixtures/m3_benchmark/cases.py — expected_judge field added
  to all 10 uncertain cases (pure data, ground-truth labels for
  the operator QC pass)

Phase 2 — bicameral-sync skill rubric + training doc:
- tests/test_skill_uncertain_protocol.py (4 tests, 96 LOC)
- skills/bicameral-sync/SKILL.md §2.bis — Uncertain-band
  sub-protocol section: Axis 1 (compliance) FIRST, Axis 2
  (cosmetic-vs-semantic) SECOND, signals advisory, evidence_refs
  echoed back. Maps to existing typed contracts (no new fields).
- docs/training/cosmetic-vs-semantic.md (198 LOC) — concept doc
  with worked example from py_12_constant_value_tuned. Pairs
  with the rubric.
- docs/training/README.md — index with cosmetic-vs-semantic
  active row. Soft-depends on PR BicameralAI#93's docs/training/ scaffolding;
  this branch creates a minimal version that PR BicameralAI#93 will reconcile
  on merge.

Validation:
- Phase 1 + Phase 2 new tests: 8/8 green.
- M3 + drift_classifier regression: 32/32 green.
- Total: 40/40 green in the targeted sweep.

Razor:
- All test files ≤ 96 LOC (cap 250).
- All test functions ≤ 25 LOC (cap 40).
- cases.py 431 LOC under tests/ ruff exclusion.
- No production code changes; no schema changes; no new contracts;
  no new tools; no new dependencies.

CHANGELOG: [Unreleased] entry added under Added.

Plan: plan-codegenome-llm-drift-judge.md (audit PASS at META_LEDGER
@Knapp-Kevin

Collaborator Author

Rebased onto current dev HEAD 9bb512786c4d0c (force-pushed with --force-with-lease). Reviewers, please re-fetch.

Conflicts resolved (3 files)

Validation

Status

`MERGEABLE`. Awaiting CI on the new HEAD + Jin's L1 review.

Substantiation seal for plan-codegenome-llm-drift-judge.md (Issue
BicameralAI#44, audit PASS at META_LEDGER BicameralAI#15, chain hash 536dd15f...).

Verification gates (10 of 12 passed; 2 advisory skipped per
capability shortfalls):
- Reality vs Promise: ✓ all 4 new + 3 modified files exist
- Test audit: 48/48 (8 new + 40 regression on M3 + drift_classifier
  + drift_service)
- Razor final: all files within caps (test ≤96 LOC, no new
  production functions, cases.py under tests/ exclusion)
- Skill file integrity: SKILL.md §2.bis structure verified
- SYSTEM_STATE.md synced
- Merkle seal computed: 567170e0f1dc008cd5663201d8b1582dbabb5904
- Step 4.6 reliability sweep: skipped (qor/reliability/ absent)
- Step 7.5 version bump: skipped (per user direction; v0.14.0
  release PR is Jin's call)

Plan deviation documented:
- docs/training/README.md created (not modified) on this branch
  because PR BicameralAI#93 scaffolding hasn't merged to dev. Minimal mirror;
  merges will reconcile.

Operator QC pass (D6 BicameralAI#5) recorded as pending qualitative gate, not
a CI blocker.

Chain: 16 entries; integrity VALID. Next: /qor-document.
@jinhongkuan

jinhongkuan commented Apr 29, 2026

Contributor

"the bicameral-sync skill now applies a two-axis rubric: Axis 1 (compliance — does the new code meet the decision?) decided FIRST, then Axis 2 (cosmetic vs semantic — was the change behaviour-preserving?)"

Wouldn't this degrade performance? If we check cosmetic vs semantic first, we can no-op the compliance check.

EDIT: okay saw that it is deterministic

@Knapp-Kevin Knapp-Kevin merged commit bb2e245 into BicameralAI:dev Apr 29, 2026
5 checks passed