22 changes: 21 additions & 1 deletion CHANGELOG.md
@@ -65,9 +65,29 @@ data flows.
reroutes `~/.bicameral/` to a per-session tmp dir and sets the skip
env var. Stdlib only — no third-party fixture plugin.

### Added (continued — Issue #44, LLM drift judge)

- **`bicameral-sync` skill — uncertain-band sub-protocol (#44).** The
skill rubric now teaches the caller LLM how to consume Phase 4's
`pre_classification: uncertain` hint with a two-axis judgment:
Axis 1 (compliance) decided first; Axis 2 (cosmetic-vs-semantic)
second; signals advisory only; `evidence_refs` echoed back to the
audit trail. No new tools, no new contracts — leverages Phase 4's
existing `semantic_status` + `evidence_refs` fields on
`ComplianceVerdict` (#61). Issue #44.
- **M3 benchmark — `expected_judge` ground-truth labels.** All 10
uncertain-band cases in `tests/fixtures/m3_benchmark/cases.py`
now carry a `{verdict, semantic_status}` ground-truth pair the
operator QC pass measures LLM output against. Pure data; no
classifier or LLM behaviour change. Issue #44.
- **Training doc — `docs/training/cosmetic-vs-semantic.md`.** New
long-form walkthrough of the two-axis judgment with a worked
example from `py_12_constant_value_tuned`. Pairs with the
`bicameral-sync` skill rubric. Issue #44.
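
The ordering in the first bullet (Axis 1 before Axis 2, signals advisory, `evidence_refs` echoed back) can be sketched in Python. This is only an illustration: the actual rubric is markdown consumed by the caller LLM, not code. The `Hint` and `Judgment` classes, the `judge_uncertain` function, and the string values `compliant` and `cosmetic` are hypothetical; only `pre_classification`, `semantic_status`, and `evidence_refs` are named in the changelog.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Hint:
    """Hypothetical stand-in for the Phase 4 pre-classification hint."""
    pre_classification: str
    signals: dict[str, float] = field(default_factory=dict)  # advisory only
    evidence_refs: tuple[str, ...] = ()


@dataclass(frozen=True)
class Judgment:
    """Hypothetical stand-in for the fields written back to the audit trail."""
    verdict: str
    semantic_status: str
    evidence_refs: tuple[str, ...]


def judge_uncertain(
    hint: Hint,
    decide_compliance: Callable[[Hint], str],
    decide_semantics: Callable[[Hint, str], str],
) -> Judgment:
    """Apply the two axes in the documented order."""
    if hint.pre_classification != "uncertain":
        raise ValueError("sub-protocol only applies to the uncertain band")
    # Axis 1: compliance is decided first.
    verdict = decide_compliance(hint)
    # Axis 2: cosmetic-vs-semantic is decided second, knowing the verdict.
    semantic_status = decide_semantics(hint, verdict)
    # The signals never decide anything here; evidence_refs are echoed unchanged.
    return Judgment(verdict, semantic_status, hint.evidence_refs)
```

Passing the two decisions in as callables keeps the ordering explicit while leaving the judgment itself to the caller, which matches the "no new tools, no new contracts" framing above.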

### Closes

#39, #42.
#39, #42, #44.

## v0.13.0 — CodeGenome Phase 4 (#61) — semantic drift evaluation in `resolve_compliance` (M3) — built via [QorLogic SDLC](https://github.com/MythologIQ-Labs-LLC/qor-logic)

115 changes: 113 additions & 2 deletions docs/META_LEDGER.md
@@ -550,6 +550,117 @@ SHA256(content_hash + previous_hash) = **`0ebcf69bf25e11d9d85cb9856ccc9757ad39b7
Reality matches Promise. Phase 4 (#61) implementation conforms to the v3-audited specification with two documented plan deviations (schema renumbering and §Phase 5 fixture collapse). All 5 phases sealed in sequence; M3 benchmark exit criterion (false-positive rate < 5%) met with 0 false positives. Chain integrity intact. Next phase: `/qor-document` then open PR `claude/codegenome-phase-4-qor → BicameralAI/dev`.

---
*Chain integrity: VALID (14 entries)*
*Genesis: `29dfd085` → Phase 1+2 Seal: `509b411d` → Phase 3 Seal: `89cac7ff` → Phase 4 Audit v1 (VETO): `231fe5f1` → Phase 4 Audit v2 (PASS): `332c72b2` → Phase 4 Audit v3 (PASS, post-rebase): `21ac210f` → Phase 4 SEAL: `0ebcf69b`*

## Entry #15 — GATE TRIBUNAL: `plan-codegenome-llm-drift-judge.md` (Issue #44)

**Phase**: GATE / qor-audit
**Date**: 2026-04-29
**Branch**: `feat/44-llm-drift-judge` (off `BicameralAI/dev` post-Phase-4 seal)
**Subject**: Issue #44 — *[P2] LLM semantic drift judge: suppress false-positive drift flags on cosmetic code changes*
**Risk Grade**: L1 (docs + skill rubric + test data; zero production code paths)
**Change Class**: minor

### Audit history (this entry covers both iterations)

| v | Plan commit | Verdict | Findings |
|---|---|---|---|
| v1 | `b15c9ef` | **VETO** | F-1 (BLOCKING): `pilot/mcp/skills/bicameral-sync/SKILL.md` does not exist on dev — plan inherited stale `CLAUDE.md` claim without filesystem verification. SG-PLAN-GROUNDING-DRIFT instance #2. F-2/F-3: minor grounding numerics. F-4/F-5: non-blocking. |
| v2 | `d846a4a` | **PASS** | All blocking findings remediated. Pilot path directive removed; test count 5→4; SKILL.md baseline 138→150; ruff exemption claim corrected. |

### Plan content hash (v2)

`sha256:7094b9b64339e1bf2d96055fac1bd46dec066966fbf690687c129d02fb5ae74d`

### Audit report content hash

`sha256:bc74936e79eff03bdae0dda2d7ab419044328978814643b99ecfa5ee8fa2b6a1`

### Previous chain hash

`0ebcf69bf25e11d9d85cb9856ccc9757ad39b75c2f352bdd063bd2d957f506cf` (Entry #14, Phase 4 SEAL)

### Chain hash

`SHA256(plan_hash + audit_hash + prev_hash) =` **`536dd15f587d749cb600999171e0889fbe20f39818bf3969890f411ff0fe350b`**
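
A minimal Python sketch of the chain-hash formula above. The ledger does not state how `+` is encoded, so this assumes plain concatenation of the lowercase hex digests, UTF-8 encoded; under a different encoding the sketch would not reproduce the recorded value. `chain_hash` and `verify_link` are hypothetical helper names.

```python
import hashlib


def chain_hash(*parts: str) -> str:
    """SHA-256 over the concatenated parts (assumed: hex strings joined as text)."""
    joined = "".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def verify_link(content_hashes: list[str], prev_hash: str, recorded: str) -> bool:
    """Recompute one ledger link and compare it with the recorded chain hash."""
    return chain_hash(*content_hashes, prev_hash) == recorded
```

For this entry the call would be `verify_link([plan_hash, audit_hash], prev_hash, recorded_chain_hash)`. Because each link includes the previous chain hash, editing any earlier entry changes every later hash.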

### Decision

PASS. Chain to `/qor-implement` per delegation table. Plan declares 2 phases (test corpus + skill rubric), 0 production code changes, 0 schema migrations, 0 new dependencies.

### Shadow Genome instance recorded

`SG-PLAN-GROUNDING-DRIFT` instance #2 catalogued in `docs/SHADOW_GENOME.md`. Cross-references PR #93 (instance #1, same root cause: CLAUDE.md asserts a `pilot/mcp/skills/` location that does not exist on dev). Followup: separate `docs:claude-md-cleanup` issue tracked outside this plan.

---

## Entry #16 — SUBSTANTIATION SEAL: `plan-codegenome-llm-drift-judge.md` (Issue #44)

**Phase**: SUBSTANTIATE / qor-substantiate
**Date**: 2026-04-29
**Branch**: `feat/44-llm-drift-judge` (off `BicameralAI/dev` post-Phase-4 seal `200dbd5`)
**Implementation commit**: `f230331`
**Risk Grade**: L1 (docs + skill rubric + test data; zero production code paths changed)
**Change Class**: minor

### Verification gates

| Step | Check | Result | Notes |
|---|---|---|---|
| Step 2 | PASS verdict in AUDIT_REPORT.md | ✅ | Entry #15 audit PASS at `536dd15f`. |
| Step 2.5 | Version validation | ✅ | Source remains v0.13.x; no version bump in this PR per user direction (v0.14.0 release PR is Jin's call). |
| Step 3 | Reality vs Promise | ✅ | All 4 new files + 3 modified files exist. One documented deviation: `docs/training/README.md` was created (not modified) because PR #93 scaffolding hasn't merged yet — minimal mirror created on this branch. |
| Step 3.5 | Backlog blockers | ✅ | No new security blockers; pre-existing S1 (SECURITY.md absent) unaffected. |
| Step 4 | Test audit | ✅ | 48/48 in targeted sweep (8 new + 40 regression on test_m3_benchmark + drift_classifier + drift_service). |
| Step 4 (artifacts) | console.log / print() in NEW production code | ✅ | Zero new production code added; pre-existing `print()` in handlers/update.py unchanged (CLI JSON output, by design). |
| Step 4.5 | Skill file integrity | ✅ | `skills/bicameral-sync/SKILL.md` modified — required structure preserved (frontmatter, headings, rules). Section 4 `2.bis` correctly placed between Step 2 and Step 3 ("Report"). |
| Step 4.6 | Reliability sweep | ⚠️ skip | `qor/reliability/` capability shortfall; intent-lock + skill-admission + gate-skill-matrix scripts absent. |
| Step 5 | Section 4 razor final | ✅ | All NEW files: test_m3_benchmark_judge_corpus.py 83 LOC, test_skill_uncertain_protocol.py 96 LOC, training docs 198+49 LOC (markdown). MODIFIED: SKILL.md 211 LOC (markdown), cases.py 431 LOC (under tests/ ruff exclusion). All test functions ≤ 25 LOC. Zero new production functions. |
| Step 6 | SYSTEM_STATE.md sync | ✅ | Updated below; #44 snapshot prepended; Phase 4 inventory preserved. |
| Step 7 | Merkle seal | ✅ | Computed below. |
| Step 7.5 | Annotated tag | ⚠️ skip | Per user direction, no version bump in this PR. v0.14.0 tag deferred to Jin's release PR. |

### Architectural decisions sealed

D1 (skill-side judge), D2 (caching free via existing compliance_check), D3-D4 (reuses existing typed contracts), D5 (rubric is markdown text), D6 (5 exit criteria). No design deviations during implementation.

### Operator QC pass (D6 #5)

Recorded as **pending qualitative gate**, NOT a CI-blocker. The operator will run the `bicameral-sync` skill against the 10 uncertain-band cases and compare LLM verdicts to `expected_judge` ground truth in `tests/fixtures/m3_benchmark/cases.py`. Pass threshold: 0% FP on cosmetic-claimed verdicts; ≤ 20% FN. Threshold met / not met to be recorded by the operator post-merge as a separate META_LEDGER addendum or comment on the PR.
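
The pass threshold can be sketched as a small scoring function. The definitions here are assumptions, since the ledger does not spell them out: a false positive is a case the LLM labels cosmetic whose ground truth is not cosmetic (rate taken over cosmetic-claimed verdicts), and a false negative is a ground-truth cosmetic case the LLM does not label cosmetic (rate taken over ground-truth cosmetic cases). `qc_pass`, `QCResult`, and the `"cosmetic"` label value are hypothetical.

```python
from typing import NamedTuple, Sequence


class QCResult(NamedTuple):
    fp_rate: float
    fn_rate: float
    passed: bool


def qc_pass(
    expected: Sequence[str],
    actual: Sequence[str],
    cosmetic: str = "cosmetic",
    max_fn_rate: float = 0.20,
) -> QCResult:
    """Score LLM semantic_status labels against expected_judge ground truth."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    pairs = list(zip(expected, actual))
    # Ground truth for every case the LLM claimed was cosmetic.
    claimed = [e for e, a in pairs if a == cosmetic]
    # LLM output for every case that is cosmetic in the ground truth.
    truly = [a for e, a in pairs if e == cosmetic]
    fp = sum(1 for e in claimed if e != cosmetic)
    fn = sum(1 for a in truly if a != cosmetic)
    fp_rate = fp / len(claimed) if claimed else 0.0
    fn_rate = fn / len(truly) if truly else 0.0
    # Threshold from the ledger: 0% FP, at most 20% FN.
    return QCResult(fp_rate, fn_rate, fp_rate == 0.0 and fn_rate <= max_fn_rate)
```

The asymmetry is deliberate in the threshold itself: a wrongly suppressed drift flag (FP) is treated as unacceptable, while a missed suppression (FN) only costs a redundant flag.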

### Plan deviations (documented)

1. **`docs/training/README.md` created (not modified)**. PR #93's `docs/training/` scaffolding has not merged to dev yet, so a minimal training README was mirrored on this branch; the two versions will be reconciled when the PRs merge. The soft dependency is disclosed in the PR body.

### Carried-forward observations

- **SG-PLAN-GROUNDING-DRIFT instance #2** (META_LEDGER #15 / SHADOW_GENOME #5): `pilot/mcp/skills/` is referenced by CLAUDE.md but does not exist on dev. The remediated plan correctly drops the reference. Followup `docs:claude-md-cleanup` workstream filed separately (NOT in scope for #44).

### Capability shortfalls (carried)

- `qor/scripts/` runtime helpers absent — gate-chain artifacts at `.qor/gates/<session>/*.json` not written. File-based META_LEDGER chain remains canonical.
- `qor/reliability/` enforcement scripts absent — Step 4.6 reliability sweep skipped.
- `agent-teams` capability not declared on Claude Code host — sequential mode.
- `codex-plugin` capability not declared — solo audit mode.
- `AUDIT_REPORT.md` lives at `.agent/staging/` rather than `.failsafe/governance/`. Path divergence noted; chain integrity preserved.

### Session content hash

SHA256 over 8 sorted-path files (plan + skill + training doc + 2 test files + cases.py + training README + SYSTEM_STATE.md) =
**`a952c0140a142dbd60f9239b4786bc4a498bac98441e157f0b19bc2eb8f4dc1d`**
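
One way to compute a content hash over a sorted-path file set, sketched in Python. The ledger does not say how the files are combined, so this assumes the raw bytes of each file are fed into a single SHA-256 in sorted path order; the recorded value may have been produced differently. `session_content_hash` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def session_content_hash(paths: list[str]) -> str:
    """Hash file contents in sorted-path order (assumed combination scheme)."""
    digest = hashlib.sha256()
    # Sort on the POSIX form so the order is the same on every platform.
    for path in sorted((Path(p) for p in paths), key=lambda p: p.as_posix()):
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Sorting before hashing makes the result independent of the order in which the files are listed, which is what makes the value reproducible by a later auditor.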

### Previous chain hash

`536dd15f587d749cb600999171e0889fbe20f39818bf3969890f411ff0fe350b` (Entry #15, audit PASS post-remediation)

### Merkle seal

SHA256(content_hash + previous_hash) = **`567170e0f1dc008cd5663201d8b1582dbabb5904527acb31ed5ea869b1cd8877`**

### Decision

**Reality matches Promise.** Implementation conforms to the audited specification (`d846a4a`) with one documented plan deviation (training README scaffolding). Phase 1 (test corpus extension) and Phase 2 (skill rubric + training doc) sealed in sequence; 8/8 new tests + 40/40 regression green. Chain integrity intact. Next phase: `/qor-document` then open PR `feat/44-llm-drift-judge → BicameralAI/dev`.

---
*Chain integrity: VALID (16 entries)*
*Genesis: `29dfd085` → Phase 1+2 Seal: `509b411d` → Phase 3 Seal: `89cac7ff` → Phase 4 Audit v1 (VETO): `231fe5f1` → Phase 4 Audit v2 (PASS): `332c72b2` → Phase 4 Audit v3 (PASS, post-rebase): `21ac210f` → Phase 4 SEAL: `0ebcf69b` → #44 Audit (PASS, post-remediation): `536dd15f` → #44 SEAL: `567170e0`*
*Next required action: `/qor-document` then open PR to `BicameralAI/dev`*
52 changes: 52 additions & 0 deletions docs/SHADOW_GENOME.md
@@ -229,3 +229,55 @@ codebase before issuing PASS. The grounding sweep is non-optional
for L2 plans that touch schema or extend an existing module API.

---

## Failure Entry #5

**Date**: 2026-04-29
**Phase**: GATE / qor-audit (v1 of #44 plan, commit `b15c9ef`)
**Pattern**: SG-PLAN-GROUNDING-DRIFT (instance #2 in this session)

### What happened

Plan `plan-codegenome-llm-drift-judge.md` (v1) instructed the implementer to modify `pilot/mcp/skills/bicameral-sync/SKILL.md` and added a unit test (`test_pilot_skill_md_matches_skills_skill_md`) that diffed two copies of SKILL.md across `skills/` and `pilot/mcp/skills/`. The plan author (this session) inherited the claim from `CLAUDE.md` ("`pilot/mcp/skills/` is the **single canonical location**") without empirically verifying it.

Reality on `dev` HEAD (`200dbd5`):

```
$ ls pilot/
ls: cannot access 'pilot/': No such file or directory
```

The directory does not exist. The plan was unimplementable as written.

### Detection

Audit Step 3 (the orphan detection pass) flagged `pilot/mcp/skills/bicameral-sync/SKILL.md` as a build-path orphan. Tracing it back to the plan showed that it was a directive, not a typo; a literal `ls` confirmed the directory's absence.

### Mitigation

1. v2 of the plan (commit `d846a4a`) removed the directive, removed the matching test, and added a rationale note identifying CLAUDE.md's reference as stale.
2. Plan author should `ls` every directory it proposes to modify before issuing the plan, not trust `CLAUDE.md` verbatim for filesystem layout.
3. Auditor's orphan detection should run on every plan, not just code-bearing ones.

### Cross-references

- **Instance #1**: `DEV_CYCLE.md` §9 (PR #93) absorbed the same `pilot/mcp/skills/` reference into a "skill file rule (project-specific, mandatory)" callout. Same root cause; landed undetected because PR #93 was a docs PR with no orphan check.
- **Followup workstream**: `docs:claude-md-cleanup` issue (to be filed) — fixes `CLAUDE.md` itself so future plans don't keep inheriting the stale assertion.

### Pattern signature

```
SG-PLAN-GROUNDING-DRIFT
Trigger: plan author trusts a documented assertion about
filesystem state without empirical verification.
Failure mode: plan instructs work on files that don't exist;
unit test references nonexistent path; orphan
detection catches it at audit (best case) or
implementation runtime (worst case).
Countermeasure: every directory cited in a plan's "affected
files" section must be `ls`-confirmed before
the plan is submitted for audit. Add a Step 2b
Grounding Protocol clause if not already present.
```
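
The countermeasure can be automated with a small existence check, sketched here in Python. `missing_paths` is a hypothetical helper, and the list of cited paths is passed in explicitly; extracting it from a plan's "affected files" section is not shown.

```python
from pathlib import Path


def missing_paths(cited: list[str], repo_root: str = ".") -> list[str]:
    """Return every cited path that does not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in cited if not (root / p).exists()]
```

Run against this plan's v1 paths on `dev`, a check like this would be expected to report `pilot/mcp/skills/bicameral-sync/SKILL.md`, since `ls pilot/` fails there. A non-empty result before audit submission is the signal to stop and re-ground the plan.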

---
60 changes: 55 additions & 5 deletions docs/SYSTEM_STATE.md
@@ -1,11 +1,61 @@
# System State — post-Phase-4-substantiation snapshot
# System State — post-#44-substantiation snapshot

**Generated**: 2026-04-29
**HEAD**: `09f30a8` (Phase 4 / #61 sealed; rebased on `dev` after #71/#73/#79–#84 merged)
**Branch**: `claude/codegenome-phase-4-qor`
**Tracked PR**: targets `BicameralAI/dev` (Phase 4 / Issue #61); aggregate `dev → main` PR is downstream
**HEAD**: `f230331` (#44 implementation sealed)
**Branch**: `feat/44-llm-drift-judge` (off `BicameralAI/dev` post-Phase-4 seal `200dbd5`)
**Tracked PR**: will target `BicameralAI/dev` (Issue #44); aggregate `dev → main` PR is downstream
**Genesis hash**: `29dfd085...`
**Phase 4 seal**: `0ebcf69b...`
**#44 seal**: see Entry #16 (computed during this substantiation)

## #44 (LLM drift judge) implementation — 7 files, ~549 LOC, 8 new tests, 40/40 targeted regression

| Phase | Files | New tests | Commit |
|---|---|---|---|
| 1 — M3 benchmark `expected_judge` ground-truth labels | 1 new + 1 modified | 4 | `f230331` |
| 2 — bicameral-sync §2.bis Uncertain-band sub-protocol + training doc | 1 new test + 1 modified skill + 2 new docs | 4 | `f230331` |

### Files in scope

**New** (5):
- `tests/test_m3_benchmark_judge_corpus.py` (83 LOC, 4 tests)
- `tests/test_skill_uncertain_protocol.py` (96 LOC, 4 tests)
- `docs/training/cosmetic-vs-semantic.md` (198 LOC, training doc)
- `docs/training/README.md` (49 LOC, training index — soft-deps on PR #93)
- `plan-codegenome-llm-drift-judge.md` (417 LOC, plan; committed at `b15c9ef`/`d846a4a`)

**Modified** (3):
- `tests/fixtures/m3_benchmark/cases.py` (391 → 431 LOC, expected_judge added to 10 uncertain cases)
- `skills/bicameral-sync/SKILL.md` (150 → 211 LOC, §2.bis Uncertain-band sub-protocol)
- `CHANGELOG.md` ([Unreleased] entry under Added)
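
A sketch of the kind of corpus check the `expected_judge` labels make possible. The real structure of `cases.py` is not shown in this diff, so the sketch assumes each case is a dict with hypothetical `id` and `pre_classification` keys; only `expected_judge`, `verdict`, and `semantic_status` are names taken from the changelog.

```python
REQUIRED_KEYS = {"verdict", "semantic_status"}


def unlabeled_uncertain_cases(cases: list[dict]) -> list[str]:
    """Return ids of uncertain-band cases lacking a complete expected_judge pair."""
    missing = []
    for case in cases:
        if case.get("pre_classification") != "uncertain":
            continue  # only the uncertain band carries ground-truth labels
        label = case.get("expected_judge")
        if not isinstance(label, dict) or not REQUIRED_KEYS.issubset(label):
            missing.append(case["id"])
    return missing
```

A test in this style would assert that the returned list is empty, so a newly added uncertain-band case without a label fails CI rather than silently shrinking the QC sample.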

### Plan deviations (documented)

1. **`docs/training/README.md` created on this branch** rather than modified, because the PR #93 docs scaffolding has not merged to dev yet and the `training/` directory was therefore empty at the fork point. The file is a minimal version that mirrors PR #93's intended structure; the two versions will be reconciled by a standard merge once the PRs land.

### Architectural decisions retained from plan (D1-D6)

- **D1**: skill-side judge (caller LLM), not server-side. Preserves docs/CONCEPT.md anti-goal "Not an LLM-powered ledger".
- **D2**: caching via existing `compliance_check` writes (Phase 4 added `semantic_status` + `evidence_refs`).
- **D3-D4**: reuses existing typed contracts (`PreClassificationHint`, `ComplianceVerdict`); no new fields.
- **D5**: rubric is data (markdown text in SKILL.md §2.bis), not code.
- **D6**: 5 exit criteria, 4 CI-checkable + 1 operator QC pass (qualitative gate).

### Capability shortfalls (carried across phases)

- `qor/scripts/` runtime helpers absent — gate-chain artifacts not written.
- `qor/reliability/` enforcement scripts absent — Step 4.6 reliability sweep skipped.
- `agent-teams` capability not declared — sequential mode.
- `codex-plugin` capability not declared — solo audit mode.
- The audit found that `pilot/mcp/skills/` is referenced by CLAUDE.md but does not exist on dev (SG-PLAN-GROUNDING-DRIFT instance #2; META_LEDGER #15, SHADOW_GENOME #5). The remediated plan correctly drops the reference; followup workstream `docs:claude-md-cleanup` filed separately.

### Test state (post-implementation)

- Targeted sweep: 48/48 (8 new + 40 regression on test_m3_benchmark.py + test_codegenome_drift_classifier.py + test_codegenome_drift_service.py), matching META_LEDGER Entry #16 Step 4.
- All test functions ≤ 25 LOC.
- All test files ≤ 96 LOC.
- `cases.py` 431 LOC under tests/ ruff exclusion (pyproject.toml `exclude = ["tests", ...]`).
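
The per-function size limit can be checked mechanically. The sketch below uses Python's `ast` module and assumes LOC means physical lines from the `def` line to the function's last line, blank lines and comments included; the counting rule actually used by the razor is not stated here. `oversized_functions` is a hypothetical helper name.

```python
import ast


def oversized_functions(source: str, max_loc: int = 25) -> list[tuple[str, int]]:
    """Return (name, loc) for every function longer than max_loc lines."""
    tree = ast.parse(source)
    offenders = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is available on Python 3.8 and later.
            loc = node.end_lineno - node.lineno + 1
            if loc > max_loc:
                offenders.append((node.name, loc))
    return offenders
```

Calling it on the text of each new test file and asserting an empty result would turn the "≤ 25 LOC" line above into a repeatable check.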

---

## Phase 4 (#61) implementation — 27 files, ~2515 LOC, 73 new tests, 189/189 regression

2 changes: 1 addition & 1 deletion docs/training/README.md
@@ -24,7 +24,7 @@ the feature introduces a concept, not just a tool).

| Topic | Status |
|---|---|
| (none yet) | |
| [Cosmetic vs semantic drift](./cosmetic-vs-semantic.md) | Active |

## Template
