feat(#44): LLM drift judge — uncertain-band sub-protocol #103
Issue BicameralAI#44 (P2) plan, targeting v0.14.0.

Three architectural calls:
D1. Skill-side, not server-side — preserves the local-first / LLM-free-server anti-goal in docs/CONCEPT.md.
D2. Caching is already free via Phase 4's compliance_check writes (semantic_status + evidence_refs persisted).
D3-D4. Reuses existing typed contracts — no new fields, no new tools, no schema changes. The judge maps to the existing two-axis output (verdict + semantic_status).
D5. The rubric is data — text in SKILL.md — not Python code. The LLM follows it during the uncertain-band sub-protocol.
D6. Five exit criteria, four CI-checkable + one operator QC pass.

Two phases:
- Phase 1: M3 benchmark corpus extension — every uncertain case gains expected_judge: {verdict, semantic_status}. Pure data.
- Phase 2: Skill rubric — bicameral-sync SKILL.md gains an Uncertain-band sub-protocol section + paired training doc in docs/training/cosmetic-vs-semantic.md.

Risk grade L1 (skill rubric + docs + test data; no production code paths).

Telemetry surface (acceptance criterion 3 of BicameralAI#44) deferred to a follow-up gated on PR BicameralAI#95 landing — flagged in plan Open Question 3.

Branched off BicameralAI/dev post-Phase-4 seal (200dbd5).
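A minimal sketch of what the Phase 1 corpus extension could look like, assuming each benchmark case is a plain dict. Only the `expected_judge` key and its two sub-keys (`verdict`, `semantic_status`) come from the plan; the `band` marker, the case ids, the label values, and the helper name are hypothetical placeholders, not the real contents of `cases.py`.

```python
from typing import Any

# Hypothetical corpus layout. Only `expected_judge` and its two keys come from the plan.
CASES: list[dict[str, Any]] = [
    {
        "id": "example_case_a",  # placeholder id, not a real corpus entry
        "band": "uncertain",     # assumed name for the uncertain-band marker
        "expected_judge": {"verdict": "compliant", "semantic_status": "semantic"},
    },
    {
        "id": "example_case_b",
        "band": "clear",         # cases outside the uncertain band carry no label
    },
]

REQUIRED_KEYS = {"verdict", "semantic_status"}


def missing_judge_labels(cases: list[dict[str, Any]]) -> list[str]:
    """Return ids of uncertain-band cases whose expected_judge is absent or incomplete."""
    bad = []
    for case in cases:
        if case.get("band") != "uncertain":
            continue
        label = case.get("expected_judge")
        if not isinstance(label, dict) or set(label) != REQUIRED_KEYS:
            bad.append(case["id"])
    return bad
```

Because the labels are pure data, a check of this shape can run in CI without invoking any LLM.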
Audit found:
- F-1 (BLOCKING): pilot/mcp/skills/bicameral-sync/SKILL.md does not exist on dev; plan instruction to modify it was unimplementable. Trusted CLAUDE.md's stale 'canonical' claim over filesystem empirics. SG-PLAN-GROUNDING-DRIFT instance BicameralAI#2 in this session.
- F-2: SKILL.md baseline 138 LOC was wrong (actual 150).
- F-3: ruff exclusion is `tests` directory-wide, not just `tests/fixtures/**`.

Remediation:
- Drop the pilot/mcp affected-file bullet from Phase 2.
- Drop test_pilot_skill_md_matches_skills_skill_md from Phase 2 test count (5 → 4 tests, ~70 → ~60 LOC).
- Add explanatory note: CLAUDE.md's pilot/mcp/skills/ reference is stale; flagged for separate cleanup workstream.
- Update SKILL.md baseline 138 → 150 LOC (target 200 not 190).
- Reword ruff exemption claim from `tests/fixtures/**` to "entire `tests/` directory".

The plan's intent is preserved: skill rubric extension + test corpus expansion + training doc, all on the empirically canonical skills/ path. Re-audit pending.
…ameralAI#5 — audit PASS

- META_LEDGER BicameralAI#15: GATE TRIBUNAL entry covering both audit iterations (v1 VETO at b15c9ef, v2 PASS at d846a4a). Chain hash 536dd15f extends from BicameralAI#14 Phase 4 SEAL.
- SHADOW_GENOME BicameralAI#5: SG-PLAN-GROUNDING-DRIFT instance BicameralAI#2 catalogued. Cross-references PR BicameralAI#93 §9 as instance #1 (same root cause: CLAUDE.md asserts pilot/mcp/skills/ canonicality but dev HEAD has no pilot/ directory). Followup: docs:claude-md-cleanup workstream to fix CLAUDE.md itself.

Plan PASS at d846a4a; chain to /qor-implement.
Phase 1 — M3 benchmark judge-corpus extension:
- tests/test_m3_benchmark_judge_corpus.py (4 tests, 83 LOC)
- tests/fixtures/m3_benchmark/cases.py — expected_judge field added to all 10 uncertain cases (pure data, ground-truth labels for the operator QC pass)

Phase 2 — bicameral-sync skill rubric + training doc:
- tests/test_skill_uncertain_protocol.py (4 tests, 96 LOC)
- skills/bicameral-sync/SKILL.md §2.bis — Uncertain-band sub-protocol section: Axis 1 (compliance) FIRST, Axis 2 (cosmetic-vs-semantic) SECOND, signals advisory, evidence_refs echoed back. Maps to existing typed contracts (no new fields).
- docs/training/cosmetic-vs-semantic.md (198 LOC) — concept doc with worked example from py_12_constant_value_tuned. Pairs with the rubric.
- docs/training/README.md — index with cosmetic-vs-semantic active row. Soft-depends on PR BicameralAI#93's docs/training/ scaffolding; this branch creates a minimal version that PR BicameralAI#93 will reconcile on merge.

Validation:
- Phase 1 + Phase 2 new tests: 8/8 green.
- M3 + drift_classifier regression: 32/32 green.
- Total: 40/40 green in the targeted sweep.

Razor:
- All test files ≤ 96 LOC (cap 250).
- All test functions ≤ 25 LOC (cap 40).
- cases.py 431 LOC under tests/ ruff exclusion.
- No production code changes; no schema changes; no new contracts; no new tools; no new dependencies.

CHANGELOG: [Unreleased] entry added under Added.

Plan: plan-codegenome-llm-drift-judge.md (audit PASS at META_LEDGER
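Since the rubric is text rather than code, a SKILL.md conformance test can only assert on document structure. The sketch below shows one way such a check might work, under stated assumptions: the sample markdown, the heading wording, and both helper names are hypothetical, and the real `tests/test_skill_uncertain_protocol.py` may check different things.

```python
import re

# Hypothetical stand-in for the relevant part of SKILL.md.
SAMPLE = """\
## 2. Sync
Earlier steps.
## 2.bis Uncertain-band sub-protocol
Axis 1 (compliance) is decided first.
Axis 2 (cosmetic vs semantic) is decided second.
Echo evidence_refs back.
## 3. Next
Later steps.
"""


def section_body(markdown: str, heading_fragment: str) -> str:
    """Return the text under the first heading containing heading_fragment,
    stopping at the next heading of the same or a higher level."""
    lines = markdown.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.startswith("#") and heading_fragment in line:
            start = i
            break
    if start is None:
        return ""
    level = len(lines[start]) - len(lines[start].lstrip("#"))
    body = []
    for line in lines[start + 1:]:
        match = re.match(r"(#+)\s", line)
        if match and len(match.group(1)) <= level:
            break
        body.append(line)
    return "\n".join(body)


def axes_in_order(body: str) -> bool:
    """True when 'Axis 1' appears in the text and precedes 'Axis 2'."""
    first = body.find("Axis 1")
    second = body.find("Axis 2")
    return 0 <= first < second
```

A check like this guards the ordering and the `evidence_refs` echo as properties of the document; it says nothing about whether an LLM actually follows the rubric, which is what the operator QC pass is for.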
9bb5127 to 86c4d0c
Rebased onto current
Conflicts resolved (3 files)
Validation
Status: `MERGEABLE`. Awaiting CI on the new HEAD + Jin's L1 review.
Substantiation seal for plan-codegenome-llm-drift-judge.md (Issue BicameralAI#44, audit PASS at META_LEDGER BicameralAI#15, chain hash 536dd15f...).

Verification gates (10 of 12 passed; 2 advisory skipped per capability shortfalls):
- Reality vs Promise: ✓ all 4 new + 3 modified files exist
- Test audit: 48/48 (8 new + 40 regression on M3 + drift_classifier + drift_service)
- Razor final: all files within caps (test ≤96 LOC, no new production functions, cases.py under tests/ exclusion)
- Skill file integrity: SKILL.md §2.bis structure verified
- SYSTEM_STATE.md synced
- Merkle seal computed: 567170e0f1dc008cd5663201d8b1582dbabb5904
- Step 4.6 reliability sweep: skipped (qor/reliability/ absent)
- Step 7.5 version bump: skipped (per user direction; v0.14.0 release PR is Jin's call)

Plan deviation documented:
- docs/training/README.md created (not modified) on this branch because PR BicameralAI#93 scaffolding hasn't merged to dev. Minimal mirror; merges will reconcile.

Operator QC pass (D6 BicameralAI#5) recorded as pending qualitative gate, not a CI blocker.

Chain: 16 entries; integrity VALID.

Next: /qor-document.
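The operator QC pass compares LLM verdicts against the `expected_judge` labels, with thresholds this PR's test plan gives as 0% FP on cosmetic-claimed verdicts and ≤ 20% FN. A rough sketch of that arithmetic follows. The definitions are assumptions, since the PR does not spell them out: a false positive here is a judged `cosmetic` whose ground truth is not cosmetic (as a share of cosmetic-claimed verdicts), and a false negative is a truly cosmetic case the judge did not call cosmetic (as a share of truly cosmetic cases). The function names and the `"cosmetic"` / `"semantic"` strings are hypothetical.

```python
from typing import Iterable


def qc_rates(pairs: Iterable[tuple[str, str]]) -> tuple[float, float]:
    """pairs holds (judged, expected) semantic_status values.
    Returns (fp_rate, fn_rate) under the assumed definitions above."""
    pairs = list(pairs)
    claimed_cosmetic = [(j, e) for j, e in pairs if j == "cosmetic"]
    truly_cosmetic = [(j, e) for j, e in pairs if e == "cosmetic"]
    fp = sum(1 for _, e in claimed_cosmetic if e != "cosmetic")
    fn = sum(1 for j, _ in truly_cosmetic if j != "cosmetic")
    # An empty denominator is treated as a rate of 0.0 (an assumption).
    fp_rate = fp / len(claimed_cosmetic) if claimed_cosmetic else 0.0
    fn_rate = fn / len(truly_cosmetic) if truly_cosmetic else 0.0
    return fp_rate, fn_rate


def qc_passes(pairs: Iterable[tuple[str, str]],
              max_fp: float = 0.0, max_fn: float = 0.20) -> bool:
    """Apply the two thresholds quoted in the test plan."""
    fp_rate, fn_rate = qc_rates(pairs)
    return fp_rate <= max_fp and fn_rate <= max_fn
```

With only 10 uncertain-band cases the rates move in coarse steps, which fits the PR's framing of this as a qualitative gate rather than a CI blocker.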
86c4d0c to 5e96e47
"the bicameral-sync skill now applies a two-axis rubric: Axis 1 (compliance — does the new code meet the decision?) decided FIRST, then Axis 2 (cosmetic vs semantic — was the change behaviour-preserving?)"

Wouldn't this degrade performance? If we check cosmetic vs semantic first, we can no-op the compliance check.

EDIT: okay, saw that it is deterministic.
Summary
- The `bicameral-sync` skill now applies a two-axis rubric: Axis 1 (compliance — does the new code meet the decision?) decided FIRST, then Axis 2 (cosmetic vs semantic — was the change behaviour-preserving?). No new tools, no new contracts — leverages Phase 4's `semantic_status` + `evidence_refs` fields already on `ComplianceVerdict`.
- Adds `expected_judge` ground-truth labels to the M3 benchmark uncertain-band cases. These are the targets the operator QC pass measures the LLM judge against post-merge.
- `docs/training/cosmetic-vs-semantic.md` walks the concept end-to-end with a worked example from `py_12_constant_value_tuned`. Pairs with the SKILL.md rubric.

Linked issues
Closes #44
Refs:
Plan / Audit / Seal
- Plan: `plan-codegenome-llm-drift-judge.md` (commit `d846a4a`, content hash `sha256:7094b9b6…`)
- Audit: chain hash `536dd15f…`. Audit history: v1 VETO at `b15c9ef` (F-1 grounding error: `pilot/mcp/skills/` referenced by CLAUDE.md but does not exist on dev — SG-PLAN-GROUNDING-DRIFT instance #2, catalogued at SHADOW_GENOME #5); v2 PASS at `d846a4a` after mechanical remediation.
- Seal: `sha256:567170e0…`. Reality matches Promise; one documented plan deviation (training README created vs modified — see "Soft dependencies").

Test plan
- `pytest tests/test_m3_benchmark_judge_corpus.py -q` (4/4 — Phase 1 corpus contract)
- `pytest tests/test_skill_uncertain_protocol.py -q` (4/4 — Phase 2 SKILL.md conformance)
- `pytest tests/test_m3_benchmark.py -q` (regression on existing M3 — no behaviour change expected)
- `pytest tests/test_codegenome_drift_classifier.py tests/test_codegenome_drift_service.py -q` (regression on Phase 4 — data-only plan should not affect classifier)
- Operator QC pass: run the `bicameral-sync` skill against the 10 uncertain-band cases; compare LLM verdicts to `expected_judge` ground truth. Pass thresholds per plan §D6 #5: 0% FP on cosmetic-claimed verdicts, ≤ 20% FN. Record outcome as PR comment or META_LEDGER addendum. Qualitative gate, not CI-blocking.

What this PR is NOT
- Not server-side: preserves the `docs/CONCEPT.md` anti-goal "Not an LLM-powered ledger".
- Not the telemetry surface (`bicameral.usage_summary` `cosmetic_drift_pct`). That gates on PR #95 landing first; deferred to a follow-up issue.

Soft dependencies (read before merging)
- This branch creates `docs/training/README.md` as a minimal index because PR #93 — which scaffolds the same file with a richer template — hasn't merged to dev yet. Standard merge resolution will reconcile when one or both PRs land. No reviewer action needed; flagged here so the file isn't read as an unauthorised addition.

Files changed (7)
- `tests/test_m3_benchmark_judge_corpus.py`
- `tests/test_skill_uncertain_protocol.py`
- `docs/training/cosmetic-vs-semantic.md`
- `docs/training/README.md`
- `tests/fixtures/m3_benchmark/cases.py`
- `skills/bicameral-sync/SKILL.md`
- `CHANGELOG.md` (`[Unreleased]` entry added)

Plus governance updates (META_LEDGER #15-16, SYSTEM_STATE.md #44 snapshot, SHADOW_GENOME #5).
Reviewer ask
@jinhongkuan — the SKILL.md rubric is the contract the bicameral-sync caller LLM follows. Worth one read for tone/clarity. Risk grade L1 (single reviewer per `DEV_CYCLE.md` §4.4); no security-pass note needed.

Specifically: §2.bis as written calls Axis 1 (compliance) the first judgment and treats `not_relevant` as a short-circuit that skips Axis 2. Verify this matches your mental model — if you'd rather see the axes evaluated in parallel and reconciled at emit time, this is the place to say so.
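To make the ordering question concrete, here is a small sketch of the sequential control flow described above, written as code purely for illustration. In this PR the rubric is text the LLM follows, not Python, so nothing below exists in the repo. `JudgeOutput`, `judge`, the `"compliant"` verdict string, and the choice to leave `semantic_status` as `None` on the short-circuit are all assumptions; only `not_relevant`, `verdict`, `semantic_status`, and `evidence_refs` are names taken from the PR.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class JudgeOutput:
    verdict: str
    semantic_status: Optional[str]  # None when Axis 2 was skipped (assumed)
    evidence_refs: list[str] = field(default_factory=list)


def judge(axis1: Callable[[], str],
          axis2: Callable[[], str],
          evidence_refs: list[str]) -> JudgeOutput:
    """Decide Axis 1 first; evaluate Axis 2 only if the decision is relevant."""
    verdict = axis1()
    if verdict == "not_relevant":
        # Short-circuit: Axis 2 is never evaluated for irrelevant decisions.
        return JudgeOutput(verdict, None, list(evidence_refs))
    # Evidence refs are echoed back unchanged alongside both axes.
    return JudgeOutput(verdict, axis2(), list(evidence_refs))
```

A parallel variant would call both `axis1` and `axis2` unconditionally and reconcile at emit time; the sequential form saves the second judgment whenever Axis 1 returns `not_relevant`.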