
Preflight eval: §C cost/latency baseline #90

Merged: jinhongkuan merged 3 commits into main from preflight-cost-baseline on Apr 29, 2026

@silongtan silongtan commented Apr 29, 2026

Collaborator

Summary

Implements #88 — measurement infrastructure for the catalog's §C cost/latency baseline. Prerequisite for any optimization PR against #58 to have a regression target.

Scope evolved during review: original C3 latency measurement used mocked ledger queries, which Jin correctly flagged as not capturing real updates. Reworked to real memory:// SurrealDB seeded with synthetic data through the production ingest_payload path — every C2/C3 measurement now exercises the real SurrealDB query plan, handler iteration, and serialization.

Metrics + baselines

| Metric | What | N=10 | N=100 | N=1000 |
|---|---|---|---|---|
| C1 | bicameral.history() payload tokens | 7,574 | 79,025 | 795,982 |
| C2 | bicameral.preflight() response tokens | 566 | 571 | 575 |
| C2 | bicameral.preflight() response bytes | 2,303 | 2,303 | 2,303 |
| C3 | Handler latency p50 | 2.5ms | 14.8ms | 138.8ms |
| C3 | Handler latency p95 | 3.0ms | 15.9ms | 141.7ms |

Two punch lines from the data:

  1. C1 N=1000 = 796K tokens — a single bicameral.history() call fills 80% of Sonnet 4.6's 1M context before the skill reasons about anything. The §C concern is concrete and material.
  2. C3 N=1000 = ~140ms — real-ledger preflight at scale crosses into user-perceptible latency. That's the user-experience surface an optimization PR should reduce.
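
Both figures can be checked directly from the table. The sketch below uses only the recorded numbers and the 1M-token window quoted above; the per-feature rates are derived here for illustration and are not reported in the PR.

```python
# Recorded baselines copied from the metrics table above.
C1_TOKENS = {10: 7_574, 100: 79_025, 1000: 795_982}
C3_P50_MS = {10: 2.5, 100: 14.8, 1000: 138.8}
CONTEXT_WINDOW = 1_000_000  # tokens, the window size quoted in the description


def context_fraction(tokens: int, window: int = CONTEXT_WINDOW) -> float:
    """Share of the context window consumed by a payload of `tokens` tokens."""
    return tokens / window


def per_feature(values: dict) -> dict:
    """Divide each measurement by its N to expose the per-feature cost."""
    return {n: value / n for n, value in values.items()}


fraction = context_fraction(C1_TOKENS[1000])
tokens_per_feature = per_feature(C1_TOKENS)
ms_per_feature = per_feature(C3_P50_MS)

print(f"C1 at N=1000 uses {fraction:.1%} of the window")
print(tokens_per_feature)
print(ms_per_feature)
```

The per-feature token cost stays between roughly 757 and 796 across all three sizes, so C1 grows close to linearly with N, while the per-feature p50 latency falls as N grows because fixed overhead is spread over more features.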

What this test actually catches

  • Future PR adds a verbose field to HistoryResponse → C1 token count grows → flag
  • Future PR changes the SurrealDB query plan to scan instead of index lookup → C3 latency grows → flag
  • Future PR bloats PreflightResponse shape → C2 byte count grows → flag
  • Future optimization (semantic prefilter, lazy/two-pass history from #58) reduces C1/C3 → measurable win
  • Asymmetric rule means improvements never alert; only regressions trip
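
A minimal sketch of what an asymmetric rule of this kind could look like. The 20% threshold and the noise floors (10 tokens, 0.5ms) are the values named in this PR's commit message; the function name, the `unit` argument, and the choice to combine threshold and floor with `max` are assumptions made for illustration, not the contents of `_baseline_io.py`.

```python
REL_THRESHOLD = 0.20                        # relative threshold from the commit message
NOISE_FLOOR = {"tokens": 10.0, "ms": 0.5}   # absolute floors from the commit message


def is_regression(baseline: float, current: float, unit: str) -> bool:
    """Flag only increases that clear both the relative threshold and the noise floor.

    Hypothetical helper: combining the two limits with max() is an assumption.
    """
    increase = current - baseline
    if increase <= 0:
        return False  # improvements and ties never alert
    allowed = max(baseline * REL_THRESHOLD, NOISE_FLOOR[unit])
    return increase > allowed


print(is_regression(795_982, 1_000_000, "tokens"))  # large growth in C1
print(is_regression(795_982, 400_000, "tokens"))    # an optimization win
print(is_regression(0.08, 0.30, "ms"))              # sub-ms jitter under the floor
```

The noise floor matters mostly at small baselines: at 0.08ms a 20% band is only 0.016ms, which timer jitter alone would exceed.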

What this test explicitly does NOT catch

The three together form a measurement strategy: phase 3 is the synthetic baseline that holds optimization claims accountable; #65 + #66 cover what synthetic doesn't.

Architecture

| Component | Lines | Purpose |
|---|---|---|
| tests/eval/_synthetic_ledger.py | ~195 | Deterministic generator producing HistoryResponse-shaped dicts |
| tests/eval/_seed_ledger.py | ~120 | Translates synthetic dict → real SurrealDB writes via adapter.ingest_payload |
| tests/eval/_token_count.py | ~38 | tiktoken cl100k_base wrapper |
| tests/eval/_baseline_io.py | ~180 | Load / write / find / upsert + asymmetric regression rule with noise floors |
| tests/eval/run_preflight_cost_eval.py | ~270 | Pytest runner for C1 / C2 / C3 measurements |
| tests/eval/test_cost_baseline_helpers.py | ~360 | 35 unit tests (helpers + seeder + regression rule + IO) |
| tests/eval/cost_baseline.jsonl | 9 rows | C1×3 + C2×3 platform-agnostic; C3×3 Darwin |
| .github/workflows/preflight-eval.yml | +18 | Phase 3 step |
| docs/preflight-failure-scenarios.md | +6/-7 | Catalog ticks |
| pyproject.toml | +1 | tiktoken in [test] extras |
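
To illustrate why a deterministic generator gives bit-identical token counts, here is a small sketch of a seeded generator. The parameter names `n_features`, `decisions_per_feature`, and `seed` come from the commit message; the function name, the dict fields, and the status values are hypothetical and do not reflect the real HistoryResponse shape.

```python
import json
import random


def make_ledger(n_features: int, decisions_per_feature: int = 3, seed: int = 0) -> dict:
    """Build a synthetic ledger whose content depends only on the arguments."""
    rng = random.Random(seed)  # private RNG, independent of global random state
    features = []
    for f in range(n_features):
        decisions = [
            {
                "id": f"decision-{f}-{d}",  # hypothetical field names throughout
                "status": rng.choice(["pending", "reflected", "drifted"]),
                "summary": f"Synthetic decision {d} for feature {f}",
            }
            for d in range(decisions_per_feature)
        ]
        features.append({"feature": f"feature-{f}", "decisions": decisions})
    return {"features": features}


# Two independent calls with the same arguments serialize to the same bytes.
first = json.dumps(make_ledger(10, seed=42), sort_keys=True)
second = json.dumps(make_ledger(10, seed=42), sort_keys=True)
print(first == second)
```

Because the serialized payload is identical on every run, any change in its token count must come from a code change rather than from the test data.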

Recording flow

BICAMERAL_EVAL_RECORD_BASELINE=1 pytest tests/eval/run_preflight_cost_eval.py
git add tests/eval/cost_baseline.jsonl
git commit -m "test(eval): re-record C* baselines after <intentional change>"

The harness regenerates rows for the current platform; diff is reviewable in the PR. C1/C2 use platform-agnostic baselines (token counts and response bytes are deterministic across OSes). C3 latency is per-platform — Linux baselines need to be recorded separately on a Linux runner before CI Linux validates C3.
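
The platform split described above could be resolved with a lookup like the one sketched here. The environment variable name and the `recorded_on=any` convention come from this PR, and the row keys `metric`, `n_features`, and `recorded_on` appear in the reviewed `write_baselines` code; the function names, the `value` field, and the metric labels are hypothetical.

```python
import os
import platform


def current_platform() -> str:
    """Lower-cased OS name, e.g. 'darwin' or 'linux'."""
    return platform.system().lower()


def record_mode(env=os.environ) -> bool:
    """True when the run should rewrite baselines instead of asserting on them."""
    return env.get("BICAMERAL_EVAL_RECORD_BASELINE") == "1"


def find_baseline(rows, metric, n_features, platform_name):
    """Prefer a row recorded on this platform, then fall back to recorded_on == 'any'."""
    for wanted in (platform_name, "any"):
        for row in rows:
            key = (row["metric"], row["n_features"], row["recorded_on"])
            if key == (metric, n_features, wanted):
                return row
    return None  # caller can skip with re-record instructions


rows = [
    {"metric": "C1", "n_features": 10, "recorded_on": "any", "value": 7574},
    {"metric": "C3_p50", "n_features": 10, "recorded_on": "darwin", "value": 2.5},
]
print(find_baseline(rows, "C1", 10, "linux"))      # found via the 'any' row
print(find_baseline(rows, "C3_p50", 10, "linux"))  # None until a Linux row exists
```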

CI behavior

  • Phase 3 added to preflight-eval.yml, advisory (continue-on-error: true)
  • On Linux CI today: C1 + C2 validate against recorded_on=any rows; C3 skips with re-record instructions until Linux baseline is added
  • Phase 3 takes ~2.5 minutes (driven by N=1000 seeding + measurement). Non-blocking.

Test plan

  • pytest tests/eval/run_preflight_cost_eval.py — 9 passed (3× C1, 3× C2, 3× C3) against new real-ledger baselines
  • pytest tests/eval/test_cost_baseline_helpers.py — 35 passed (helpers + seeder + regression rule + IO)
  • All other eval suites unchanged: phase 1 (6 passed / 4 xfailed), phase 2 (1 passed / 3 skipped), regression suite (25 passed / 1 skipped)
  • Forced regression test: corrupt baseline by -30%, confirm pytest fails with the documented error message ✓
  • Reproducibility: re-record + run normally → green; bit-identical token counts across runs

Closes / unblocks

🤖 Generated with Claude Code

coderabbitai Bot commented Apr 29, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d402957-f29c-4254-a330-4f135e217375

📥 Commits

Reviewing files that changed from the base of the PR and between 531067a and bbc5933.

📒 Files selected for processing (9)
  • .github/workflows/preflight-eval.yml
  • docs/preflight-failure-scenarios.md
  • pyproject.toml
  • tests/eval/_baseline_io.py
  • tests/eval/_synthetic_ledger.py
  • tests/eval/_token_count.py
  • tests/eval/cost_baseline.jsonl
  • tests/eval/run_preflight_cost_eval.py
  • tests/eval/test_cost_baseline_helpers.py
📝 Walkthrough

Walkthrough

Adds an advisory CI step to run preflight cost/latency evaluations against a committed baseline. Introduces helper modules for baselines, token counting, and synthetic ledger generation; a pytest runner covering C1–C3; a committed baseline JSONL; documentation updates; a new test dependency; and a comprehensive helper test suite.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CI workflow (advisory preflight eval): .github/workflows/preflight-eval.yml | Adds a non-blocking phase running tests/eval/run_preflight_cost_eval.py and emitting test-results/preflight-cost-eval.xml with continue-on-error: true. |
| Docs: baseline rules and checklist: docs/preflight-failure-scenarios.md | Updates §C to mark C1–C3 as baselined, defines asymmetric ±20% regression with noise floors, documents baseline recording via BICAMERAL_EVAL_RECORD_BASELINE=1, and links to CI workflow. |
| Project deps (tests): pyproject.toml | Adds tiktoken>=0.7.0,<1.0.0 to the optional test dependency group. |
| Eval helpers (baseline IO, ledger, tokenization): tests/eval/_baseline_io.py, tests/eval/_synthetic_ledger.py, tests/eval/_token_count.py | Introduces baseline storage/upsert and regression checking, a deterministic synthetic ledger generator, and token counting via tiktoken. Pay attention to platform/version keys and noise-floor logic. |
| Committed baselines: tests/eval/cost_baseline.jsonl | Adds initial JSONL rows for C1 token counts, C2 size metrics, and C3 latency (p50/p95), tagged with platform and versions. |
| Preflight eval runner (C1–C3): tests/eval/run_preflight_cost_eval.py | Implements pytest-based evaluation: record-or-assert flow, platform/version matching, handler isolation, C1 token count over synthetic ledgers, C2 response token/byte sizes, C3 latency quantiles, JUnit output. |
| Helper test suite: tests/eval/test_cost_baseline_helpers.py | Tests synthetic ledger determinism/shape, token counting invariants, regression-check behavior, and baseline IO find/upsert semantics. |
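
As a rough illustration of the upsert semantics mentioned in the summary, the sketch below treats `(metric, recorded_on, n_features)` as a row's identity. That tuple is the sort key used by `write_baselines` in this PR; using it as the identity key, and the function names here, are assumptions.

```python
def row_key(row: dict) -> tuple:
    """Identity of a baseline row (assumed to match the writer's sort key)."""
    return (row.get("metric", ""), row.get("recorded_on", ""), row.get("n_features", -1))


def upsert(rows: list, new_row: dict) -> list:
    """Return a new sorted list in which new_row replaces any row sharing its key."""
    kept = [r for r in rows if row_key(r) != row_key(new_row)]
    return sorted(kept + [new_row], key=row_key)


rows = [{"metric": "C1", "recorded_on": "any", "n_features": 10, "value": 7574}]
rows = upsert(rows, {"metric": "C1", "recorded_on": "any", "n_features": 10, "value": 7600})
rows = upsert(rows, {"metric": "C1", "recorded_on": "any", "n_features": 100, "value": 79025})
print(rows)
```

Re-recording an existing metric overwrites its row in place, so the committed JSONL diff shows only the values that actually changed.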

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Dev as Developer
    participant CI as GitHub Actions
    participant Py as Pytest Runner
    participant BL as Baseline IO
    participant LG as Synthetic Ledger
    participant TK as Tokenizer
    participant HD as Preflight Handler
    participant JR as JUnit Report

    Dev->>CI: Push/PR
    CI->>Py: Run preflight cost eval (advisory)
    Py->>BL: Load committed baselines
    alt C1: history payload tokens
        Py->>LG: Generate synthetic ledger (N features)
        LG-->>Py: Deterministic payload
        Py->>TK: Count tokens (cl100k_base)
        TK-->>Py: Token count
        Py->>BL: Regression check (C1)
    else C2: preflight response size
        Py->>HD: handle_preflight(mocked ctx)
        HD-->>Py: Response JSON
        Py->>TK: Count tokens/bytes
        TK-->>Py: Sizes
        Py->>BL: Regression check (C2)
    else C3: handler latency
        Py->>HD: Warm + timed calls
        HD-->>Py: Latency samples
        Py->>BL: Regression checks (p50, p95)
    end
    Py-->>JR: Write test results (JUnit XML)
    CI-->>Dev: Report (non-blocking)
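
The "Warm + timed calls" step in the C3 branch can be sketched as follows. This is a synchronous simplification with a stand-in workload: the real handler is async, and the function name and warm-up count are assumptions (the review below mentions `_C3_SAMPLES=100` as the sample count).

```python
import time


def measure_latency_ms(fn, warmup: int = 5, samples: int = 100) -> list:
    """Call fn a few times untimed, then return sorted per-call timings in ms."""
    for _ in range(warmup):
        fn()  # warm caches and lazy imports so they do not skew the samples
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return timings


# Stand-in workload; a real run would invoke the preflight handler instead.
timings_ms = measure_latency_ms(lambda: sum(range(1000)), warmup=3, samples=50)
print(f"min={timings_ms[0]:.4f}ms max={timings_ms[-1]:.4f}ms")
```

`time.perf_counter()` is used because it is monotonic and has the highest available resolution, which matters for sub-millisecond samples.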

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • jinhongkuan

Poem

I thump my paw—baseline set, hooray!
Tokens tallied, latencies at play.
Synthetic fields in orderly rows,
A ledger garden where the data grows.
If numbers drift, I gently squeak—
Flip the flag, record this week.
Hop-hop! Our preflight’s sleek. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 39.62% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Preflight eval: §C cost/latency baseline' directly and concisely summarizes the main change: introducing cost/latency baseline infrastructure for §C metrics. |
| Linked Issues check | ✅ Passed | The pull request fully implements all coding requirements from issue #88: deterministic synthetic ledger (C1), preflight response size measurement (C2), handler latency benchmarking (C3), baseline file with committed values, regression-check logic with ±20% rule, and CI workflow integration. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to implementing C1–C3 baseline infrastructure per issue #88, with no unrelated modifications. C4 (LLM-in-the-loop), cross-model baselines, and promotion to gating CI are correctly deferred as out of scope. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (5)
tests/eval/run_preflight_cost_eval.py (2)

29-29: Unused import: asyncio.

The asyncio module is imported but not used directly — the async tests rely on pytest-asyncio to run the event loop.

🧹 Remove unused import
-import asyncio
 import sys
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/run_preflight_cost_eval.py` at line 29, Remove the unused asyncio
import from the top of the test file: delete the line importing asyncio in
run_preflight_cost_eval.py since pytest-asyncio provides the event loop and no
symbol from asyncio is referenced (there are no functions/classes in this diff
like run_preflight_cost_eval that require asyncio directly).

322-324: Minor: p95 index calculation is slightly off.

With _C3_SAMPLES=100, int(len(timings_ms) * 0.95) yields index 95, which accesses the 96th element in the sorted list. For a true p95, you'd typically use index 94 (the 95th element out of 100).

That said, the difference is negligible for this use case (benchmarking handler latency), and the value will still be close to p95.

💡 Standard percentile calculation
     timings_ms.sort()
     p50 = timings_ms[len(timings_ms) // 2]
-    p95 = timings_ms[int(len(timings_ms) * 0.95)]
+    p95 = timings_ms[int(len(timings_ms) * 0.95) - 1]  # 0-indexed: 95th of 100 is index 94

Or use statistics.quantiles for standard behavior.
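
To make the off-by-one concrete, the sketch below compares three p95 calculations on the values 1..100. The `nearest_rank` helper is hypothetical and uses integer arithmetic to avoid float rounding; `statistics.quantiles` is the standard-library function mentioned above and interpolates between neighbours by default.

```python
import statistics


def nearest_rank(sorted_values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    n = len(sorted_values)
    rank = (n * pct + 99) // 100  # integer ceil(n * pct / 100)
    return sorted_values[max(rank, 1) - 1]


data = list(range(1, 101))  # 1..100, already sorted

p95_nearest = nearest_rank(data, 95)                 # index 94
p95_current = data[int(len(data) * 0.95)]            # index 95, the expression under review
p95_interp = statistics.quantiles(data, n=100)[94]   # 95th of 99 cut points

print(p95_nearest, p95_current, p95_interp)
```

On this data the reviewed expression returns the 96th element rather than the 95th. Whichever definition is chosen, the committed baselines need to be re-recorded with the same one, since the rule compares like with like.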

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/run_preflight_cost_eval.py` around lines 322 - 324, The p95 index
calculation currently uses int(len(timings_ms) * 0.95) which yields index 95 for
_C3_SAMPLES=100 (the 96th element); change the p95 calculation to use a
zero-based 95th percentile index (e.g., idx = math.ceil(len(timings_ms) * 0.95)
- 1) and set p95 = timings_ms[idx]; ensure you import math if needed. Keep the
p50 logic as-is (timings_ms[len(timings_ms) // 2]) or alternatively replace both
with statistics.quantiles/two-line percentile helper if you prefer standard
behavior.
tests/eval/test_cost_baseline_helpers.py (1)

161-165: Potential test fragility: JSON string literal vs json.dumps output.

The direct string '{"foo": "bar", "n": 42}' assumes a specific key ordering and spacing that json.dumps produces. While CPython 3.7+ preserves dict insertion order, the assertion relies on json.dumps producing no trailing spaces and the exact same key order. This works today but could become fragile.

Consider using the JSON function for both sides to ensure consistency:

💡 More robust comparison
 def test_count_tokens_json_matches_direct_serialize():
     payload = {"foo": "bar", "n": 42}
-    direct = count_tokens('{"foo": "bar", "n": 42}')
+    import json
+    direct = count_tokens(json.dumps(payload, ensure_ascii=False))
     via_json = count_tokens_json(payload)
     assert direct == via_json
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/test_cost_baseline_helpers.py` around lines 161 - 165, The test
uses a hard-coded JSON literal which can be fragile due to spacing/ordering
differences; update test_count_tokens_json_matches_direct_serialize to produce
the direct string via json.dumps(payload) (or json.dumps(payload,
separators=(',', ':'), sort_keys=True) for a canonical form) and then call
count_tokens on that string so both sides use the same JSON serialization;
reference functions: test_count_tokens_json_matches_direct_serialize,
count_tokens, count_tokens_json, and json.dumps.
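
A short demonstration of the serialization details this comment depends on, using only the standard `json` module. With default arguments `json.dumps` emits `", "` and `": "` separators and keeps insertion order, which is why the hard-coded literal matches today; the variable names are illustrative.

```python
import json

payload = {"foo": "bar", "n": 42}
reordered = {"n": 42, "foo": "bar"}  # same data, different insertion order

default = json.dumps(payload)                          # spaced separators
compact = json.dumps(payload, separators=(",", ":"))   # no spaces
canonical_a = json.dumps(payload, separators=(",", ":"), sort_keys=True)
canonical_b = json.dumps(reordered, separators=(",", ":"), sort_keys=True)

print(default)
print(compact)
print(canonical_a == canonical_b)
```

A literal string only matches one of these forms, so if `count_tokens_json` ever changes its separators or key ordering, a test built on `json.dumps` with the same options keeps comparing equivalent strings.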
tests/eval/_baseline_io.py (2)

65-75: Use atomic replace when writing baseline files.

Line [75] writes directly to the target path; an interrupted write can corrupt the JSONL. A temp-file + replace pattern is safer in record mode.

♻️ Proposed improvement
 def write_baselines(rows: list[dict], path: Path = BASELINE_PATH) -> None:
     """Sorted, stable-key JSONL output to keep diffs minimal."""
     def _sort_key(row: dict) -> tuple:
         return (
             row.get("metric", ""),
             row.get("recorded_on", ""),
             row.get("n_features", -1),
         )
     rows_sorted = sorted(rows, key=_sort_key)
     body = "\n".join(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in rows_sorted)
-    path.write_text(body + "\n", encoding="utf-8")
+    tmp_path = path.with_suffix(path.suffix + ".tmp")
+    tmp_path.write_text(body + "\n", encoding="utf-8")
+    tmp_path.replace(path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/_baseline_io.py` around lines 65 - 75, The write_baselines
function currently writes directly to BASELINE_PATH which can corrupt the file
if interrupted; change write_baselines to write the JSONL content to a temporary
file in the same directory (e.g., using tempfile or Path with a unique suffix)
and then atomically replace the target with os.replace (or Path.replace) so the
final write is atomic; ensure the temp file is opened with utf-8 and that you
still write the newline-terminated body and clean up the temp on error.
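
A self-contained sketch of the temp-file-plus-replace pattern with cleanup on error, as one way to satisfy the prompt above. The function name is hypothetical; `tempfile.mkstemp` and `os.replace` are standard-library calls, and creating the temp file in the target's own directory keeps the rename on a single filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_jsonl_atomic(rows: list, path: Path) -> None:
    """Write rows as JSONL to a temp file, then atomically swap it into place."""
    body = "\n".join(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in rows) + "\n"
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(body)
        os.replace(tmp_name, path)  # readers see either the old file or the new one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave a stray temp file behind
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "cost_baseline.jsonl"
    write_jsonl_atomic([{"metric": "C1", "value": 1}], target)
    written = target.read_text(encoding="utf-8")
    leftovers = sorted(p.name for p in Path(workdir).iterdir())

print(written, leftovers)
```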

54-62: Add line-context error reporting for malformed JSONL.

Lines [58]-[61] currently raise raw JSONDecodeError, which makes broken baseline rows harder to diagnose. Consider surfacing file + line number.

♻️ Proposed improvement
 def load_baselines(path: Path = BASELINE_PATH) -> list[dict]:
     if not path.exists():
         return []
     rows = []
-    for line in path.read_text(encoding="utf-8").splitlines():
+    for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
         line = line.strip()
         if line:
-            rows.append(json.loads(line))
+            try:
+                rows.append(json.loads(line))
+            except json.JSONDecodeError as exc:
+                raise ValueError(f"Malformed baseline JSONL at {path} line {lineno}") from exc
     return rows
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/_baseline_io.py` around lines 54 - 62, Wrap the json.loads call
inside load_baselines with a try/except that catches json.JSONDecodeError, track
the current line number using enumerate(path.read_text(...).splitlines(),
start=1) so you can include the file path and line number in the error, and
re-raise a clearer error (e.g., raise ValueError(f"Malformed JSON in {path} at
line {lineno}: {e}") from e) so the original JSONDecodeError is preserved as the
__cause__; update the rows.append(json.loads(line)) call in load_baselines
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 19b3775b-88f2-43c6-bb46-3a151777aab0

📥 Commits

Reviewing files that changed from the base of the PR and between 92369b7 and 531067a.

📒 Files selected for processing (9)
  • .github/workflows/preflight-eval.yml
  • docs/preflight-failure-scenarios.md
  • pyproject.toml
  • tests/eval/_baseline_io.py
  • tests/eval/_synthetic_ledger.py
  • tests/eval/_token_count.py
  • tests/eval/cost_baseline.jsonl
  • tests/eval/run_preflight_cost_eval.py
  • tests/eval/test_cost_baseline_helpers.py

silongtan and others added 3 commits April 28, 2026 23:32
…+ runner

Stage 1-4 of issue #88 — measurement infrastructure for the catalog's
§C cost/latency baseline. Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight

C2/C3 use mocked ledger queries so the metric isolates handler-logic +
serialization cost from SurrealDB I/O variance. The optimization
directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all
mutate handler logic, not the ledger.

Asymmetric regression rule: only flags increases, never improvements.
±20% relative threshold with absolute noise floors (10 tokens / 0.5ms)
to absorb timer jitter at sub-ms latency scale. Re-record via
BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.

The synthetic ledger generator is deterministic given (n_features,
decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows
forces re-record when the corpus changes. Token counter uses tiktoken
cl100k_base — pinned in pyproject [test] extras to prevent silent
count drift.

13 unit tests cover the regression rule + baseline IO directly. 5
runner tests produce the metrics on every PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape — 10 region
  matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)

The N=1000 number lands the §C concern empirically: ~800K tokens for a
single bicameral.history() call fills 80% of Sonnet 4.6's 1M context
before the skill reasons about anything. This is exactly the
optimization target named in #58 (semantic prefilter, lazy/two-pass
history, file-path → feature-group hint).

Linux baselines NOT included — the runner skips cleanly per-platform
when no row exists. Record locally on a Linux host with
BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up.

Token counts are platform-independent (deterministic via tiktoken) but
still tagged recorded_on=darwin for symmetry with C3 latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the phase 3 step to the advisory preflight-eval workflow.
continue-on-error: true so a phase 3 failure never blocks merge — same
contract as phase 1 + 2. The existing test-summary glob (test-results/
*.xml) picks up the new junit file automatically.

Catalog implementation queue ticked: C1/C2/C3 all marked baselined,
with a pointer to tests/eval/cost_baseline.jsonl. Regression rule
description updated to reflect the asymmetric + noise-floor design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jinhongkuan jinhongkuan merged commit dde17e7 into main Apr 29, 2026
5 checks passed
jinhongkuan added a commit that referenced this pull request Apr 29, 2026
* chore: bump to v0.11.0 — CodeGenome Phase 1+2 adapter + identity records

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.12.0 — skill telemetry, extensible relay, reset wipe_mode

- Skill-level telemetry: replace per-tool timing with bicameral.skill_begin /
  bicameral.skill_end bookend tools; record_skill_event replaces record_event
- Extensible relay: remove ALLOWED_TOOLS allowlist and strict EventPayload
  interface; relay now validates only distinct_id + version + diagnostic numeric
  invariant, all other fields pass through — future event types require no relay
  redeploy; deployed to Cloudflare (v a6acec14)
- telemetry.py: add send_event() open primitive; record_skill_event is a thin
  wrapper; setup_wizard consent UI updated to show new skill-level payload shape
- reset wipe_mode: ledger (default, DB rows only, server stays live) vs full
  (deletes entire .bicameral/ dir including config + event files, reinits schema)
- ledger/adapter.py: wipe_all_rows now close-and-delete instead of row-by-row
  traversal — simpler, faster, correct for embedded surrealkv
- events/team_adapter.py: add explicit wipe_all_rows that resets event watermark
- contracts.py: ResetResponse gains wipe_mode + bicameral_dir fields
- skills/bicameral-reset/SKILL.md: updated with two-mode table and confirmation
  phrasing; full mode requires showing bicameral_dir before confirm
- tests: new test_reset_full_wipe_deletes_bicameral_dir (5/5 pass)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: v0.12.1 — rationale, error_class, and bicameral.feedback telemetry

- bicameral.skill_begin now accepts `rationale` (why the skill triggered)
  stored in _skill_sessions dict alongside t0 and forwarded at skill_end
- bicameral.skill_end now accepts `error_class` enum (symbol_not_found,
  collision_unresolved, drift_mislabeled, low_confidence_verdict,
  ledger_empty, grounding_failed, user_abort, other) replacing the
  boolean-only errored signal
- New bicameral.feedback tool: call when stuck — records {trying_to,
  attempted, stuck_on} as agent_feedback events mapping to desync catalog
- All 8 major skills updated with Telemetry bookend sections showing
  the skill_begin/skill_end pattern with rationale + error_class examples
- telemetry.record_skill_event extended with error_class and rationale kwargs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: delete stale bicameral-drift and bicameral-scan-branch skills

Both reference tools (bicameral.drift, bicameral.scan_branch) that no
longer exist in the server. Drift detection is handled by link_commit
+ auto-sync middleware + resolve_compliance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: remove embedded worktree from index, ignore .claude/worktrees

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: pass --no-cache-dir to pip install in update handler

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: use pipx install --force for upgrades, fall back to pip

sys.executable -m pip fails on Homebrew Python (externally-managed-
environment). pipx is the standard install path and handles its own
venv correctly. pipx also doesn't support --no-cache-dir so that flag
is dropped from the pip fallback path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-mcp reset CLI — questionary wizard before wiping

Adds a `bicameral reset` subcommand that:
1. Prompts for wipe mode (ledger vs full) via questionary select
2. Shows a dry-run summary (cursor count, replay plan, bicameral_dir
   for full mode with a ⚠️ warning)
3. Asks for explicit confirmation before calling handle_reset

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-mcp config CLI — questionary wizard for config.yaml

Adds a `bicameral config` subcommand that:
1. Reads current config.yaml values as defaults
2. Prompts for mode, guided, telemetry via questionary selects
   with the current value pre-selected
3. Writes updated config.yaml
4. Reinstalls skills and hooks so changes take effect immediately

Replaces the LLM-in-chat text menu in the bicameral-config skill.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-config skill uses AskUserQuestion for all three settings

Replaces text-based [1/2] menus with a single AskUserQuestion call
covering mode, guided, and telemetry — all in one interactive prompt
within the Claude session.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.12.2 — CLI wizards + telemetry quality loop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: add Dependabot for weekly pip dependency updates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: v0.13.0 — gate telemetry schema, AskUserQuestion ground truth, liberal ingest filter

Telemetry schema (all skills):
- g{N}_ prefix convention across all gate diagnostic fields (G2/G3/G6 in ingest,
  G9/G10/G11 in preflight, G11 in capture-corrections)
- skill_begin/skill_end guarded: only emit if BICAMERAL_TELEMETRY is enabled
- g{N}_user_overrode as universal ground-truth signal at every interactive gate

AskUserQuestion ground truth wiring:
- G2 Step 1.5 (ingest): AskUserQuestion for borderline Gate1/Gate2 drops,
  batched in groups of 4; guarded by guided_mode
- G10 Step 5.5 (preflight): AskUserQuestion after surfaced block to dismiss
  irrelevant findings; guarded by guided_mode; populates g10_user_overrode
- G11 Steps 6-7 (capture-corrections): replaces freeform Y/n with
  AskUserQuestion, batched in groups of 4 for all correction counts

Liberal ingest filter:
- Removed aspirational, hedged conditional, and parked/deferred from hard-exclude;
  these now flow through level classification and gate filters as speculative proposals
- Ratification is the team's judgment layer, not the extraction filter
- Updated Example 1: now extracts 3 speculative proposals instead of 0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: bump RECOMMENDED_VERSION to 0.13.0

Was left at 0.12.2 — update handler checks this file to detect available upgrades.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: surface pending decisions when sync no-ops on same commit

After ingest, `bicameral sync` could return 'already_synced' with zero
compliance checks when HEAD hadn't moved — leaving newly-ingested decisions
stuck at `pending` indefinitely.

Two-part fix:
1. `ledger/adapter.py` `ingest_commit`: in the `already_synced` early-return,
   query `get_pending_decisions_with_regions()` and include any pending
   decisions as `pending_compliance_checks` in the response.
2. `handlers/link_commit.py` `invalidate_sync_cache` + new
   `sync_middleware.invalidate_process_cache()`: after any mutation (ingest,
   update, reset), clear the process-level `_LAST_SYNCED_SHA` so that
   `ensure_ledger_synced` runs a fresh sync on the next tool call even when
   HEAD hasn't moved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.1 — fix sync no-op on same commit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: ratify prompt fires last, after all decisions printed (ingest step 7)

Previously "after ingest" was ambiguous — LLM could fire the ratify
AskUserQuestion immediately after bicameral.ingest returned, before the
report (step 4), brief (step 5), and gap-judge (step 6) were shown.

Now step 7 is explicit:
- Must be the last user-facing output of the ingest flow
- Multi-segment ingests ratify once at the end of the roll-up, not per segment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.2 — ratify prompt ordering fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Preflight eval: §C cost/latency baseline (#90)

* test(eval): cost-baseline harness — synthetic ledger + token counter + runner

Stage 1-4 of issue #88 — measurement infrastructure for the catalog's
§C cost/latency baseline. Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight

C2/C3 use mocked ledger queries so the metric isolates handler-logic +
serialization cost from SurrealDB I/O variance. The optimization
directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all
mutate handler logic, not the ledger.

Asymmetric regression rule: only flags increases, never improvements.
±20% relative threshold with absolute noise floors (10 tokens / 0.5ms)
to absorb timer jitter at sub-ms latency scale. Re-record via
BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.
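
One way to read the rule, as a sketch. The function name, its signature, and the way the relative threshold and the noise floor are combined are assumptions; the commit only states the 20% threshold, the floors, and that improvements never flag:

```python
def is_regression(
    baseline: float,
    current: float,
    rel_threshold: float = 0.20,
    noise_floor: float = 0.0,
) -> bool:
    """Flag only increases that exceed both the noise floor and the threshold."""
    delta = current - baseline
    if delta <= 0:
        return False  # improvements and no-ops never flag
    if delta <= noise_floor:
        return False  # within absolute noise (e.g. 10 tokens or 0.5 ms)
    return delta > baseline * rel_threshold
```

With a 0.5 ms floor, a move from 0.08 ms to 0.5 ms stays quiet even though it is far more than 20% in relative terms, which is the jitter case the floor exists for.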

The synthetic ledger generator is deterministic given (n_features,
decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows
forces re-record when the corpus changes. Token counter uses tiktoken
cl100k_base — pinned in pyproject [test] extras to prevent silent
count drift.
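
A rough illustration of what "deterministic given (n_features, decisions_per_feature, seed)" can look like. The record shape, the status list, and the `GENERATOR_VERSION` value are hypothetical; the real generator produces HistoryResponse-shaped data:

```python
import random

GENERATOR_VERSION = "sketch-1"  # hypothetical value; bump when the corpus changes


def generate_ledger(n_features: int, decisions_per_feature: int, seed: int) -> list[dict]:
    """Build the same synthetic features for the same three inputs."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    statuses = ["reflected", "drifted", "pending"]
    features = []
    for i in range(n_features):
        decisions = [
            {"id": f"f{i}-d{j}", "status": rng.choice(statuses)}
            for j in range(decisions_per_feature)
        ]
        features.append({"feature": f"feature-{i}", "decisions": decisions})
    return features
```

Because every random draw goes through the seeded `rng`, two runs with identical arguments yield identical payloads, so token counts only move when the code or the corpus does.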

13 unit tests cover the regression rule + baseline IO directly. 5
runner tests produce the metrics on every PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(eval): commit initial Darwin cost baselines

Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape — 10 region
  matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)

The N=1000 number lands the §C concern empirically: ~800K tokens for a
single bicameral.history() call fills 80% of Sonnet 4.6's 1M context
before the skill reasons about anything. This is exactly the
optimization target named in #58 (semantic prefilter, lazy/two-pass
history, file-path → feature-group hint).

Linux baselines NOT included — the runner skips cleanly per-platform
when no row exists. Record locally on a Linux host with
BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up.

Token counts are platform-independent (deterministic via tiktoken) but
still tagged recorded_on=darwin for symmetry with C3 latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci+docs(preflight-eval): wire phase 3 cost/latency step + tick §C

Adds the phase 3 step to the advisory preflight-eval workflow.
continue-on-error: true so a phase 3 failure never blocks merge — same
contract as phase 1 + 2. The existing test-summary glob (test-results/
*.xml) picks up the new junit file automatically.

Catalog implementation queue ticked: C1/C2/C3 all marked baselined,
with a pointer to tests/eval/cost_baseline.jsonl. Regression rule
description updated to reflect the asymmetric + noise-floor design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: enforce exact diagnostic field names in ingest + preflight telemetry

LLMs were substituting natural-language names (grounded, ungrounded,
channels_read, compliance_resolved) for the required g2_*/g3_*/g6_* prefixed
names. The events landed in PostHog but fell through every dashboard panel
because the queries filter on the prefixed names.

Added explicit ⚠ warning with inline NOT comments (e.g. "# NOT 'grounded'")
to both bicameral-ingest and bicameral-preflight skill_end sections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: enforce skill diagnostic schema via Pydantic in skill_end handler

Previously diagnostic was an open object — LLMs sent improvised field names
(grounded, ungrounded, channels_read) that fell through every dashboard filter.

Now:
- IngestDiagnostic and PreflightDiagnostic Pydantic models in contracts.py
  with extra="forbid" enumerate all valid g2_*/g3_*/g6_*/g9_*/g10_*/g11_* fields
- skill_end handler validates against the per-skill model; unknown fields are
  stripped from the PostHog payload and echoed back in diagnostic_warning so
  the LLM immediately sees what it sent wrong on the same call
- inputSchema description enumerates all valid field names so the LLM has
  them visible at call time
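
The strip-and-echo behavior can be sketched without Pydantic, using a plain allowlist in place of the `extra="forbid"` models. `g10_user_overrode` is a field name from this log; `g9_example_count`, `PREFLIGHT_FIELDS`, `validate_diagnostic`, and the warning wording are hypothetical:

```python
from typing import Any, Optional

# Hypothetical allowlist; the real models enumerate the full g{N}_* set.
PREFLIGHT_FIELDS = frozenset({"g9_example_count", "g10_user_overrode"})


def validate_diagnostic(
    diagnostic: dict[str, Any], allowed: frozenset[str]
) -> tuple[dict[str, Any], Optional[str]]:
    """Split a diagnostic into accepted fields and a warning about the rest."""
    accepted = {k: v for k, v in diagnostic.items() if k in allowed}
    unknown = sorted(k for k in diagnostic if k not in allowed)
    warning = None
    if unknown:
        # Echoed back to the caller so the mistake is visible on the same call.
        warning = "unknown diagnostic fields dropped: " + ", ".join(unknown)
    return accepted, warning
```

Only the accepted dict would be forwarded to analytics; the warning string plays the role of `diagnostic_warning` in the response.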

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.3 — Pydantic diagnostic enforcement + telemetry field fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: jinhongkuan <kuanjh123@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Silong Tan <silongtan@outlook.com>
Knapp-Kevin added a commit that referenced this pull request Apr 29, 2026
…dback) (#96)

* chore: bump to v0.11.0 — CodeGenome Phase 1+2 adapter + identity records

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.12.0 — skill telemetry, extensible relay, reset wipe_mode

- Skill-level telemetry: replace per-tool timing with bicameral.skill_begin /
  bicameral.skill_end bookend tools; record_skill_event replaces record_event
- Extensible relay: remove ALLOWED_TOOLS allowlist and strict EventPayload
  interface; relay now validates only distinct_id + version + diagnostic numeric
  invariant, all other fields pass through — future event types require no relay
  redeploy; deployed to Cloudflare (v a6acec14)
- telemetry.py: add send_event() open primitive; record_skill_event is a thin
  wrapper; setup_wizard consent UI updated to show new skill-level payload shape
- reset wipe_mode: ledger (default, DB rows only, server stays live) vs full
  (deletes entire .bicameral/ dir including config + event files, reinits schema)
- ledger/adapter.py: wipe_all_rows now close-and-delete instead of row-by-row
  traversal — simpler, faster, correct for embedded surrealkv
- events/team_adapter.py: add explicit wipe_all_rows that resets event watermark
- contracts.py: ResetResponse gains wipe_mode + bicameral_dir fields
- skills/bicameral-reset/SKILL.md: updated with two-mode table and confirmation
  phrasing; full mode requires showing bicameral_dir before confirm
- tests: new test_reset_full_wipe_deletes_bicameral_dir (5/5 pass)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: v0.12.1 — rationale, error_class, and bicameral.feedback telemetry

- bicameral.skill_begin now accepts `rationale` (why the skill triggered)
  stored in _skill_sessions dict alongside t0 and forwarded at skill_end
- bicameral.skill_end now accepts `error_class` enum (symbol_not_found,
  collision_unresolved, drift_mislabeled, low_confidence_verdict,
  ledger_empty, grounding_failed, user_abort, other) replacing the
  boolean-only errored signal
- New bicameral.feedback tool: call when stuck — records {trying_to,
  attempted, stuck_on} as agent_feedback events mapping to desync catalog
- All 8 major skills updated with Telemetry bookend sections showing
  the skill_begin/skill_end pattern with rationale + error_class examples
- telemetry.record_skill_event extended with error_class and rationale kwargs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: delete stale bicameral-drift and bicameral-scan-branch skills

Both reference tools (bicameral.drift, bicameral.scan_branch) that no
longer exist in the server. Drift detection is handled by link_commit
+ auto-sync middleware + resolve_compliance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: remove embedded worktree from index, ignore .claude/worktrees

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: pass --no-cache-dir to pip install in update handler

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: use pipx install --force for upgrades, fall back to pip

sys.executable -m pip fails on Homebrew Python (externally-managed-
environment). pipx is the standard install path and handles its own
venv correctly. pipx also doesn't support --no-cache-dir so that flag
is dropped from the pip fallback path.
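
A sketch of the command selection, assuming the handler decides based on whether `pipx` is on PATH. `build_upgrade_command`, the `--upgrade` flag on the pip path, and the package name in the usage line are assumptions rather than the handler's actual code:

```python
import shutil
import sys
from typing import Optional


def build_upgrade_command(package: str, pipx_path: Optional[str]) -> list[str]:
    """Prefer `pipx install --force`; fall back to pip when pipx is absent."""
    if pipx_path:
        return [pipx_path, "install", "--force", package]
    # Fallback path: no --no-cache-dir, matching the note above.
    return [sys.executable, "-m", "pip", "install", "--upgrade", package]


if __name__ == "__main__":
    # shutil.which returns None when pipx is not on PATH.
    print(build_upgrade_command("bicameral-mcp", shutil.which("pipx")))
```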

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-mcp reset CLI — questionary wizard before wiping

Adds a `bicameral reset` subcommand that:
1. Prompts for wipe mode (ledger vs full) via questionary select
2. Shows a dry-run summary (cursor count, replay plan, bicameral_dir
   for full mode with a ⚠️ warning)
3. Asks for explicit confirmation before calling handle_reset

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-mcp config CLI — questionary wizard for config.yaml

Adds a `bicameral config` subcommand that:
1. Reads current config.yaml values as defaults
2. Prompts for mode, guided, telemetry via questionary selects
   with the current value pre-selected
3. Writes updated config.yaml
4. Reinstalls skills and hooks so changes take effect immediately

Replaces the LLM-in-chat text menu in the bicameral-config skill.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-config skill uses AskUserQuestion for all three settings

Replaces text-based [1/2] menus with a single AskUserQuestion call
covering mode, guided, and telemetry — all in one interactive prompt
within the Claude session.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.12.2 — CLI wizards + telemetry quality loop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: add Dependabot for weekly pip dependency updates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: v0.13.0 — gate telemetry schema, AskUserQuestion ground truth, liberal ingest filter

Telemetry schema (all skills):
- g{N}_ prefix convention across all gate diagnostic fields (G2/G3/G6 in ingest,
  G9/G10/G11 in preflight, G11 in capture-corrections)
- skill_begin/skill_end guarded: only emit if BICAMERAL_TELEMETRY is enabled
- g{N}_user_overrode as universal ground-truth signal at every interactive gate

AskUserQuestion ground truth wiring:
- G2 Step 1.5 (ingest): AskUserQuestion for borderline Gate1/Gate2 drops,
  batched in groups of 4; guarded by guided_mode
- G10 Step 5.5 (preflight): AskUserQuestion after surfaced block to dismiss
  irrelevant findings; guarded by guided_mode; populates g10_user_overrode
- G11 Steps 6-7 (capture-corrections): replaces freeform Y/n with
  AskUserQuestion, batched in groups of 4 for all correction counts

Liberal ingest filter:
- Removed aspirational, hedged conditional, and parked/deferred from hard-exclude;
  these now flow through level classification and gate filters as speculative proposals
- Ratification is the team's judgment layer, not the extraction filter
- Updated Example 1: now extracts 3 speculative proposals instead of 0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: bump RECOMMENDED_VERSION to 0.13.0

Was left at 0.12.2 — update handler checks this file to detect available upgrades.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: surface pending decisions when sync no-ops on same commit

After ingest, `bicameral sync` could return 'already_synced' with zero
compliance checks when HEAD hadn't moved — leaving newly-ingested decisions
stuck at `pending` indefinitely.

Two-part fix:
1. `ledger/adapter.py` `ingest_commit`: in the `already_synced` early-return,
   query `get_pending_decisions_with_regions()` and include any pending
   decisions as `pending_compliance_checks` in the response.
2. `handlers/link_commit.py` `invalidate_sync_cache` + new
   `sync_middleware.invalidate_process_cache()`: after any mutation (ingest,
   update, reset), clear the process-level `_LAST_SYNCED_SHA` so that
   `ensure_ledger_synced` runs a fresh sync on the next tool call even when
   HEAD hasn't moved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.1 — fix sync no-op on same commit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: ratify prompt fires last, after all decisions printed (ingest step 7)

Previously "after ingest" was ambiguous — LLM could fire the ratify
AskUserQuestion immediately after bicameral.ingest returned, before the
report (step 4), brief (step 5), and gap-judge (step 6) were shown.

Now step 7 is explicit:
- Must be the last user-facing output of the ingest flow
- Multi-segment ingests ratify once at the end of the roll-up, not per segment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.2 — ratify prompt ordering fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Preflight eval: §C cost/latency baseline (#90)

* test(eval): cost-baseline harness — synthetic ledger + token counter + runner

Stage 1-4 of issue #88 — measurement infrastructure for the catalog's
§C cost/latency baseline. Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight
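
For C3, a p50/p95 measurement can be sketched as below. `measure_latency_ms`, the run count, and the choice of `statistics.quantiles` are illustrative assumptions, not the harness's actual timing code:

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[], object], runs: int = 50) -> tuple[float, float]:
    """Return (p50, p95) wall-clock latency of fn in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) yields 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[49], cuts[94]
```

Reporting p95 alongside p50 keeps a single slow outlier from hiding behind a healthy median.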

C2/C3 use mocked ledger queries so the metric isolates handler-logic +
serialization cost from SurrealDB I/O variance. The optimization
directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all
mutate handler logic, not the ledger.

Asymmetric regression rule: only flags increases, never improvements.
±20% relative threshold with absolute noise floors (10 tokens / 0.5ms)
to absorb timer jitter at sub-ms latency scale. Re-record via
BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.

The synthetic ledger generator is deterministic given (n_features,
decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows
forces re-record when the corpus changes. Token counter uses tiktoken
cl100k_base — pinned in pyproject [test] extras to prevent silent
count drift.

13 unit tests cover the regression rule + baseline IO directly. 5
runner tests produce the metrics on every PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(eval): commit initial Darwin cost baselines

Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape — 10 region
  matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)

The N=1000 number lands the §C concern empirically: ~800K tokens for a
single bicameral.history() call fills 80% of Sonnet 4.6's 1M context
before the skill reasons about anything. This is exactly the
optimization target named in #58 (semantic prefilter, lazy/two-pass
history, file-path → feature-group hint).

Linux baselines NOT included — the runner skips cleanly per-platform
when no row exists. Record locally on a Linux host with
BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up.

Token counts are platform-independent (deterministic via tiktoken) but
still tagged recorded_on=darwin for symmetry with C3 latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci+docs(preflight-eval): wire phase 3 cost/latency step + tick §C

Adds the phase 3 step to the advisory preflight-eval workflow.
continue-on-error: true so a phase 3 failure never blocks merge — same
contract as phase 1 + 2. The existing test-summary glob (test-results/
*.xml) picks up the new junit file automatically.

Catalog implementation queue ticked: C1/C2/C3 all marked baselined,
with a pointer to tests/eval/cost_baseline.jsonl. Regression rule
description updated to reflect the asymmetric + noise-floor design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: enforce exact diagnostic field names in ingest + preflight telemetry

LLMs were substituting natural-language names (grounded, ungrounded,
channels_read, compliance_resolved) for the required g2_*/g3_*/g6_* prefixed
names. The events landed in PostHog but fell through every dashboard panel
because the queries filter on the prefixed names.

Added explicit ⚠ warning with inline NOT comments (e.g. "# NOT 'grounded'")
to both bicameral-ingest and bicameral-preflight skill_end sections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: enforce skill diagnostic schema via Pydantic in skill_end handler

Previously diagnostic was an open object — LLMs sent improvised field names
(grounded, ungrounded, channels_read) that fell through every dashboard filter.

Now:
- IngestDiagnostic and PreflightDiagnostic Pydantic models in contracts.py
  with extra="forbid" enumerate all valid g2_*/g3_*/g6_*/g9_*/g10_*/g11_* fields
- skill_end handler validates against the per-skill model; unknown fields are
  stripped from the PostHog payload and echoed back in diagnostic_warning so
  the LLM immediately sees what it sent wrong on the same call
- inputSchema description enumerates all valid field names so the LLM has
  them visible at call time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.3 — Pydantic diagnostic enforcement + telemetry field fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: VHS demo — 5 core use case flows (ingest, preflight, sync, history)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: remove demo directory

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.4 — branch-scoped ephemeral bind + stale hash repair

B9: handlers/bind.py used authoritative_sha for all file checks and hash
computation regardless of branch. On feature branches this caused (1) spurious
rejection of branch-local files and (2) phantom "drifted" status after
resolve_compliance because bind stored H_main while link_commit computed
H_branch. Fix: detect _is_ephemeral_commit and use head_sha as effective_ref.

B10: ingest_commit's already_synced early-return left stale "reflected" status
when returning to main after feature-branch bind work. The repair path in the
already_synced branch now uses get_regions_with_ephemeral_verdicts (indexed
lookup via idx_cc_ephemeral) to find only suspect regions, updates their hashes
to the authoritative content, and re-projects decision status. Two-pass approach
deduplicates project_decision_status calls per decision.

Tests: E18-E22 added (22/22 ephemeral/authoritative scenarios pass).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: set RECOMMENDED_VERSION to 0.13.4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(eval): real-ledger seeder for cost/latency baselines

Stage 6 of issue #88 path-3 rework. Adds `tests/eval/_seed_ledger.py` —
translates a synthetic HistoryResponse-shaped dict (from the existing
generator) into real SurrealDB writes via `adapter.ingest_payload`, the
production ingestion path.

Uses the synthetic-repo fallback (repo path not on disk → empty
content_hash) so seeding works without git fixtures. Status overrides
post-ingest via `update_decision_status` to match the synthetic
generator's intended distribution (70% reflected / 20% drifted /
10% other) — bypasses derive_status since there's no real file content.

Three new unit tests:
- N=10 seeds 30 decisions, ledger contains exactly that count
- N=100 status distribution roughly matches synthetic generator's
- Empty input returns 0

Stage 7 will use this seeder to run C2 + C3 against real seeded
ledgers instead of mocked queries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(eval): C2/C3 against real seeded ledger, parametrized by N=10/100/1000

Stage 7 of issue #88 path-3 rework. Addresses Jin's "test not very useful
if it doesnt capture updates" feedback by switching C2 and C3 from mocked
ledger queries to a real `memory://` SurrealDB seeded with N synthetic
features. The handler now executes the real SurrealDB query path on every
measurement — same code the developer hits in production.

Real-I/O baselines (Darwin local, Python 3.12 + SurrealDB 2.x):

| N | C2 tokens / bytes | C3 p50 / p95 |
|---|---|---|
| 10 | 566 / 2,303 | 2.5ms / 3.0ms |
| 100 | 571 / 2,303 | 14.8ms / 15.9ms |
| 1000 | 575 / 2,303 | 138.8ms / 141.7ms |

C3 latency at N=1000 is ~1700× the previous mocked baseline (138ms vs
0.08ms). That's the user-experience-relevant signal — and exactly the
regression target an optimization PR (#58 directions: semantic prefilter,
lazy/two-pass history) should reduce.
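
The ratio and the per-feature cost can be rechecked from the table with a few lines. The numbers are copied from this commit message; the variable names are illustrative:

```python
# C3 p50 latency in ms, keyed by N, copied from the table above.
p50_ms = {10: 2.5, 100: 14.8, 1000: 138.8}
mocked_p50_ms = 0.08  # earlier mocked-query baseline

ratio_vs_mocked = p50_ms[1000] / mocked_p50_ms
per_feature_ms = {n: ms / n for n, ms in p50_ms.items()}

print(f"real vs mocked at N=1000: {ratio_vs_mocked:.0f}x")
for n, ms in per_feature_ms.items():
    print(f"N={n}: {ms:.3f} ms per feature")
```

The per-feature cost is similar at N=100 and N=1000, which on these three data points suggests latency grows roughly linearly with ledger size at scale.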

Platform tagging:
- C1: `recorded_on=any` (token counts are deterministic across OSes)
- C2: `recorded_on=any` (response shape is deterministic given same seed;
  noise floor absorbs sync_metrics timing variance)
- C3: per-platform `darwin` (real I/O latency varies meaningfully by host;
  Linux baselines must be recorded separately on a Linux runner)

Schema additions:
- `_baseline_io.ANY_PLATFORM` sentinel — a row with this value matches
  every host. `find_baseline` now treats `recorded_on=any` rows as
  matches regardless of caller's platform.
- `_record_or_assert(platform_agnostic=True)` records and matches with
  the sentinel.
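
A sketch of the sentinel match, assuming baseline rows are dicts loaded from the JSONL file. `ANY_PLATFORM`, `find_baseline`, and `recorded_on` come from this commit; the `metric` and `value` keys and the function signature are hypothetical:

```python
import sys
from typing import Optional

ANY_PLATFORM = "any"  # sentinel: a row tagged with this matches every host


def find_baseline(
    rows: list[dict], metric: str, platform: str = sys.platform
) -> Optional[dict]:
    """Return the first row for metric recorded on this platform or on any."""
    for row in rows:
        if row["metric"] != metric:
            continue
        if row["recorded_on"] in (ANY_PLATFORM, platform):
            return row
    return None  # no row: the runner can skip instead of failing
```

With this shape, C1 and C2 rows tagged `any` resolve on a Linux runner, while a `darwin`-only C3 row returns `None` there and the test skips.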

Implementation notes:
- C2/C3 each spin up a fresh adapter per parametrized run — no cross-test
  state, no singleton reset needed.
- file_paths chosen from synthetic decisions via `_pick_grounded_paths`
  to guarantee region-anchored matches (response fires non-trivially).
- Seeding cost: ~62s at N=1000 (3000 ingest_payload mappings through
  the real ingest path + status updates). Total cost-eval runtime:
  ~2m30s. Acceptable for advisory CI; non-blocking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(catalog): refresh §C wording for real-ledger C2/C3

Stage 8 of issue #88 path-3 rework. Updates the catalog's §C entries to
reflect that C2 + C3 now measure against a real seeded ledger, not
mocked queries. Adds the real-ledger seeder to the implementation queue
ticked items and clarifies the per-platform vs platform-agnostic split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: jinhongkuan <kuanjh123@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: WulfForge <krknapp@gmail.com>
Development

Successfully merging this pull request may close these issues.

Preflight eval: §C cost/latency baseline — bicameral.history() payload + handler latency

2 participants