Preflight eval: §C cost/latency baseline by silongtan · Pull Request #90 · BicameralAI/bicameral-mcp

silongtan · 2026-04-29T03:26:33Z

Summary

Implements #88 — measurement infrastructure for the catalog's §C cost/latency baseline. Prerequisite for any optimization PR against #58 to have a regression target.

Scope evolved during review: original C3 latency measurement used mocked ledger queries, which Jin correctly flagged as not capturing real updates. Reworked to real memory:// SurrealDB seeded with synthetic data through the production ingest_payload path — every C2/C3 measurement now exercises the real SurrealDB query plan, handler iteration, and serialization.

Metrics + baselines

Metric	What	N=10	N=100	N=1000
C1	`bicameral.history()` payload tokens	7,574	79,025	795,982
C2	`bicameral.preflight()` response tokens	566	571	575
C2	`bicameral.preflight()` response bytes	2,303	2,303	2,303
C3	Handler latency p50	2.5ms	14.8ms	138.8ms
C3	Handler latency p95	3.0ms	15.9ms	141.7ms

Two punch lines from the data:

C1 N=1000 = 796K tokens — a single bicameral.history() call fills 80% of Sonnet 4.6's 1M context before the skill reasons about anything. The §C concern is concrete and material.
C3 N=1000 = ~140ms — real-ledger preflight at scale crosses into user-perceptible latency. That's the user-experience surface an optimization PR should reduce.
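
The 80% figure can be checked directly from the C1 row of the table. The sketch below is illustrative only: the token counts are copied from the table above, the 1M window is the figure quoted in the text, and the helper names are hypothetical rather than part of the harness.

```python
# C1 token counts copied from the baseline table (N features -> tokens).
C1_TOKENS = {10: 7_574, 100: 79_025, 1000: 795_982}

# Context window size as quoted in the PR text (assumption: exactly 1,000,000).
CONTEXT_WINDOW = 1_000_000


def context_fraction(tokens: int, window: int = CONTEXT_WINDOW) -> float:
    """Share of the context window consumed by a payload of `tokens`."""
    return tokens / window


def tokens_per_feature(n_features: int, tokens: int) -> float:
    """Average payload cost of one feature at a given ledger size."""
    return tokens / n_features


if __name__ == "__main__":
    for n, tokens in C1_TOKENS.items():
        print(n, f"{tokens_per_feature(n, tokens):.1f}", f"{context_fraction(tokens):.1%}")
```

The per-feature cost stays in the same range (roughly 760 to 800 tokens) at all three sizes, which suggests the payload grows approximately linearly with N.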

What this test actually catches

Future PR adds a verbose field to HistoryResponse → C1 token count grows → flag
Future PR changes the SurrealDB query plan to scan instead of index lookup → C3 latency grows → flag
Future PR bloats PreflightResponse shape → C2 byte count grows → flag
Future optimization (semantic prefilter, lazy/two-pass history from #58, "M6 preflight handler retrieval: by-design split (handler structural, skill-layer covered by #306)") reduces C1/C3 → measurable win
Asymmetric rule means improvements never alert; only regressions trip
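
A minimal sketch of what an asymmetric rule with a noise floor could look like. The 20% threshold and the 10-token / 0.5ms floors are the values named in the commit message; the function name, signature, and the order in which the checks are combined are assumptions, not the actual contents of `_baseline_io.py`.

```python
def is_regression(baseline: float, current: float,
                  rel_threshold: float = 0.20,
                  noise_floor: float = 0.0) -> bool:
    """Return True only when `current` is meaningfully worse than `baseline`.

    Hypothetical helper: larger values are treated as worse (more tokens,
    more bytes, more milliseconds).
    """
    delta = current - baseline
    if delta <= 0:
        return False  # improvements and no-ops never alert
    if delta <= noise_floor:
        return False  # absolute change too small to distinguish from jitter
    return delta > baseline * rel_threshold


if __name__ == "__main__":
    print(is_regression(575, 700, noise_floor=10))     # large token growth
    print(is_regression(575, 400, noise_floor=10))     # improvement
    print(is_regression(2.5, 2.9, noise_floor=0.5))    # sub-floor latency jitter
```

The noise floor matters most for the small-N latency rows: at a 2.5ms baseline, a 20% threshold alone would trip on a 0.5ms swing.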

What this test explicitly does NOT catch

Real-world decision content distributions (synthetic generator, fixed templates): that's the territory of #66, "Preflight: real-world eval coverage (real-transcript fixtures + CI dual-eval)"
Skill-layer LLM cost: that's phase 2 of the eval harness plus #89, "Preflight eval: phase 2 dataset expansion - M2, M3, FF3 rows"
Production user-perceived latency from real traffic: that's #65, "Preflight: telemetry capture loop for real-world failure feedback" (telemetry)

The three together form a measurement strategy: phase 3 is the synthetic baseline that holds optimization claims accountable; #65 + #66 cover what synthetic doesn't.

Architecture

Component	Lines	Purpose
`tests/eval/_synthetic_ledger.py`	~195	Deterministic generator producing `HistoryResponse`-shaped dicts
`tests/eval/_seed_ledger.py`	~120	Translates synthetic dict → real SurrealDB writes via `adapter.ingest_payload`
`tests/eval/_token_count.py`	~38	tiktoken `cl100k_base` wrapper
`tests/eval/_baseline_io.py`	~180	Load / write / find / upsert + asymmetric regression rule with noise floors
`tests/eval/run_preflight_cost_eval.py`	~270	Pytest runner — C1 / C2 / C3 measurements
`tests/eval/test_cost_baseline_helpers.py`	~360	35 unit tests (helpers + seeder + regression rule + IO)
`tests/eval/cost_baseline.jsonl`	9 rows	C1×3 + C2×3 platform-agnostic; C3×3 Darwin
`.github/workflows/preflight-eval.yml`	+18	Phase 3 step
`docs/preflight-failure-scenarios.md`	+6/-7	Catalog ticks
`pyproject.toml`	+1	tiktoken in `[test]` extras

Recording flow

BICAMERAL_EVAL_RECORD_BASELINE=1 pytest tests/eval/run_preflight_cost_eval.py
git add tests/eval/cost_baseline.jsonl
git commit -m "test(eval): re-record C* baselines after <intentional change>"

The harness regenerates rows for the current platform, and the diff is reviewable in the PR. C1/C2 use platform-agnostic baselines (token counts and response bytes are deterministic across OSes). C3 latency is per-platform: Linux baselines need to be recorded separately on a Linux runner before Linux CI can validate C3.
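
The record-or-assert flow described above can be sketched as follows. Only the environment variable, the `recorded_on=any` convention for C1/C2, and the per-platform keying of C3 come from this PR; the function names, the in-memory dict, and metric labels such as `C3_p50` are hypothetical stand-ins for the real runner and JSONL file.

```python
import os

RECORD_ENV_VAR = "BICAMERAL_EVAL_RECORD_BASELINE"


def record_mode(env=None) -> bool:
    """True when the caller asked for baselines to be re-recorded."""
    env = os.environ if env is None else env
    return env.get(RECORD_ENV_VAR) == "1"


def platform_key(metric: str, system: str) -> str:
    """C1/C2 share a single 'any' row; C3 latency is keyed by platform."""
    return system.lower() if metric.startswith("C3") else "any"


def check_or_record(baselines: dict, metric: str, n_features: int,
                    value: float, system: str, env=None) -> str:
    """Record the value, skip when no baseline exists, or signal a comparison."""
    key = (metric, platform_key(metric, system), n_features)
    if record_mode(env):
        baselines[key] = value  # upsert the row for the current platform
        return "recorded"
    if key not in baselines:
        return "skipped"  # e.g. C3 on Linux before a Linux row is committed
    return "compared"     # the caller would now apply the regression rule
```

In a real run `system` would come from something like `platform.system()`, which is why a Darwin-recorded C3 row does not satisfy a Linux runner.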

CI behavior

Phase 3 added to preflight-eval.yml, advisory (continue-on-error: true)
On Linux CI today: C1 + C2 validate against recorded_on=any rows; C3 skips with re-record instructions until Linux baseline is added
Phase 3 takes ~2.5 minutes (driven by N=1000 seeding + measurement). Non-blocking.

Test plan

pytest tests/eval/run_preflight_cost_eval.py — 9 passed (3× C1, 3× C2, 3× C3) against new real-ledger baselines
pytest tests/eval/test_cost_baseline_helpers.py — 35 passed (helpers + seeder + regression rule + IO)
All other eval suites unchanged: phase 1 (6 passed / 4 xfailed), phase 2 (1 passed / 3 skipped), regression suite (25 passed / 1 skipped)
Forced regression test: corrupt baseline by -30%, confirm pytest fails with the documented error message ✓
Reproducibility: re-record + run normally → green; bit-identical token counts across runs

Closes / unblocks

Closes #88, "Preflight eval: §C cost/latency baseline - bicameral.history() payload + handler latency"
Unblocks #58, "M6 preflight handler retrieval: by-design split (handler structural, skill-layer covered by #306)": every optimization PR can now be evaluated against the committed real-I/O baselines

🤖 Generated with Claude Code

coderabbitai · 2026-04-29T03:26:45Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d402957-f29c-4254-a330-4f135e217375

📥 Commits

Reviewing files that changed from the base of the PR and between 531067a and bbc5933.

📒 Files selected for processing (9)

.github/workflows/preflight-eval.yml
docs/preflight-failure-scenarios.md
pyproject.toml
tests/eval/_baseline_io.py
tests/eval/_synthetic_ledger.py
tests/eval/_token_count.py
tests/eval/cost_baseline.jsonl
tests/eval/run_preflight_cost_eval.py
tests/eval/test_cost_baseline_helpers.py

📝 Walkthrough

Adds an advisory CI step to run preflight cost/latency evaluations against a committed baseline. Introduces helper modules for baselines, token counting, and synthetic ledger generation; a pytest runner covering C1–C3; a committed baseline JSONL; documentation updates; a new test dependency; and a comprehensive helper test suite.

Changes

Cohort / File(s)	Summary
CI workflow (advisory preflight eval) `.github/workflows/preflight-eval.yml`	Adds a non-blocking phase running `tests/eval/run_preflight_cost_eval.py` and emitting `test-results/preflight-cost-eval.xml` with `continue-on-error: true`.
Docs: baseline rules and checklist `docs/preflight-failure-scenarios.md`	Updates §C to mark C1–C3 as baselined, defines asymmetric ±20% regression with noise floors, documents baseline recording via `BICAMERAL_EVAL_RECORD_BASELINE=1`, and links to CI workflow.
Project deps (tests) `pyproject.toml`	Adds `tiktoken>=0.7.0,<1.0.0` to the optional `test` dependency group.
Eval helpers (baseline IO, ledger, tokenization) `tests/eval/_baseline_io.py`, `tests/eval/_synthetic_ledger.py`, `tests/eval/_token_count.py`	Introduces baseline storage/upsert and regression checking, a deterministic synthetic ledger generator, and token counting via `tiktoken`. Pay attention to platform/version keys and noise-floor logic.
Committed baselines `tests/eval/cost_baseline.jsonl`	Adds initial JSONL rows for C1 token counts, C2 size metrics, and C3 latency (p50/p95), tagged with platform and versions.
Preflight eval runner (C1–C3) `tests/eval/run_preflight_cost_eval.py`	Implements pytest-based evaluation: record-or-assert flow, platform/version matching, handler isolation, C1 token count over synthetic ledgers, C2 response token/byte sizes, C3 latency quantiles, JUnit output.
Helper test suite `tests/eval/test_cost_baseline_helpers.py`	Tests synthetic ledger determinism/shape, token counting invariants, regression-check behavior, and baseline IO find/upsert semantics.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Dev as Developer
    participant CI as GitHub Actions
    participant Py as Pytest Runner
    participant BL as Baseline IO
    participant LG as Synthetic Ledger
    participant TK as Tokenizer
    participant HD as Preflight Handler
    participant JR as JUnit Report

    Dev->>CI: Push/PR
    CI->>Py: Run preflight cost eval (advisory)
    Py->>BL: Load committed baselines
    alt C1: history payload tokens
        Py->>LG: Generate synthetic ledger (N features)
        LG-->>Py: Deterministic payload
        Py->>TK: Count tokens (cl100k_base)
        TK-->>Py: Token count
        Py->>BL: Regression check (C1)
    else C2: preflight response size
        Py->>HD: handle_preflight(mocked ctx)
        HD-->>Py: Response JSON
        Py->>TK: Count tokens/bytes
        TK-->>Py: Sizes
        Py->>BL: Regression check (C2)
    else C3: handler latency
        Py->>HD: Warm + timed calls
        HD-->>Py: Latency samples
        Py->>BL: Regression checks (p50, p95)
    end
    Py-->>JR: Write test results (JUnit XML)
    CI-->>Dev: Report (non-blocking)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Preflight failure-scenario catalog + eval harness (v0.10.x) #62 — Introduces the preflight evaluation catalog and groundwork that this PR extends with baselines and CI integration.

Suggested reviewers

jinhongkuan

Poem

I thump my paw—baseline set, hooray!
Tokens tallied, latencies at play.
Synthetic fields in orderly rows,
A ledger garden where the data grows.
If numbers drift, I gently squeak—
Flip the flag, record this week.
Hop-hop! Our preflight’s sleek. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 39.62% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Preflight eval: §C cost/latency baseline' directly and concisely summarizes the main change: introducing cost/latency baseline infrastructure for §C metrics.
Linked Issues check	✅ Passed	The pull request fully implements all coding requirements from issue `#88`: deterministic synthetic ledger (C1), preflight response size measurement (C2), handler latency benchmarking (C3), baseline file with committed values, regression-check logic with ±20% rule, and CI workflow integration.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing C1–C3 baseline infrastructure per issue `#88`, with no unrelated modifications. C4 (LLM-in-the-loop), cross-model baselines, and promotion to gating CI are correctly deferred as out of scope.


coderabbitai

🧹 Nitpick comments (5)

tests/eval/run_preflight_cost_eval.py (2)

29-29: Unused import: asyncio.

The asyncio module is imported but not used directly — the async tests rely on pytest-asyncio to run the event loop.
🧹 Remove unused import
-import asyncio
 import sys
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/run_preflight_cost_eval.py` at line 29, Remove the unused asyncio
import from the top of the test file: delete the line importing asyncio in
run_preflight_cost_eval.py since pytest-asyncio provides the event loop and no
symbol from asyncio is referenced (there are no functions/classes in this diff
like run_preflight_cost_eval that require asyncio directly).
322-324: Minor: p95 index calculation is slightly off.

With _C3_SAMPLES=100, int(len(timings_ms) * 0.95) yields index 95, which accesses the 96th element in the sorted list. For a true p95, you'd typically use index 94 (the 95th element out of 100).

That said, the difference is negligible for this use case (benchmarking handler latency), and the value will still be close to p95.
💡 Standard percentile calculation
     timings_ms.sort()
     p50 = timings_ms[len(timings_ms) // 2]
-    p95 = timings_ms[int(len(timings_ms) * 0.95)]
+    p95 = timings_ms[int(len(timings_ms) * 0.95) - 1]  # 0-indexed: 95th of 100 is index 94
Or use statistics.quantiles for standard behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/run_preflight_cost_eval.py` around lines 322 - 324, The p95 index
calculation currently uses int(len(timings_ms) * 0.95) which yields index 95 for
_C3_SAMPLES=100 (the 96th element); change the p95 calculation to use a
zero-based 95th percentile index (e.g., idx = math.ceil(len(timings_ms) * 0.95)
- 1) and set p95 = timings_ms[idx]; ensure you import math if needed. Keep the
p50 logic as-is (timings_ms[len(timings_ms) // 2]) or alternatively replace both
with statistics.quantiles/two-line percentile helper if you prefer standard
behavior.
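
To make the off-by-one concrete, here is a small self-contained comparison of the two index formulas on 100 sorted samples. The helper names are illustrative and not taken from the runner.

```python
import math


def naive_index(n: int, pct: float) -> int:
    """Index used by the current code: int(n * pct)."""
    return int(n * pct)


def nearest_rank_index(n: int, pct: float) -> int:
    """Zero-based nearest-rank index: ceil(n * pct) - 1, clamped at 0."""
    return max(math.ceil(n * pct) - 1, 0)


samples = list(range(1, 101))  # sorted values 1..100, so value == rank

naive_p95 = samples[naive_index(len(samples), 0.95)]           # rank 96
nearest_p95 = samples[nearest_rank_index(len(samples), 0.95)]  # rank 95

if __name__ == "__main__":
    print(naive_p95, nearest_p95)
```

With 100 samples the two formulas pick adjacent elements, so the recorded baseline shifts by at most one sample's worth of latency.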

tests/eval/test_cost_baseline_helpers.py (1)

161-165: Potential test fragility: JSON string literal vs json.dumps output.

The direct string '{"foo": "bar", "n": 42}' assumes a specific key ordering and spacing that json.dumps produces. While CPython 3.7+ preserves dict insertion order, the assertion relies on json.dumps producing no trailing spaces and the exact same key order. This works today but could become fragile.

Consider using the JSON function for both sides to ensure consistency:
💡 More robust comparison
 def test_count_tokens_json_matches_direct_serialize():
     payload = {"foo": "bar", "n": 42}
-    direct = count_tokens('{"foo": "bar", "n": 42}')
+    import json
+    direct = count_tokens(json.dumps(payload, ensure_ascii=False))
     via_json = count_tokens_json(payload)
     assert direct == via_json
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/test_cost_baseline_helpers.py` around lines 161 - 165, The test
uses a hard-coded JSON literal which can be fragile due to spacing/ordering
differences; update test_count_tokens_json_matches_direct_serialize to produce
the direct string via json.dumps(payload) (or json.dumps(payload,
separators=(',', ':'), sort_keys=True) for a canonical form) and then call
count_tokens on that string so both sides use the same JSON serialization;
reference functions: test_count_tokens_json_matches_direct_serialize,
count_tokens, count_tokens_json, and json.dumps.

tests/eval/_baseline_io.py (2)

65-75: Use atomic replace when writing baseline files.

Line 75 writes directly to the target path; an interrupted write can corrupt the JSONL. A temp-file + replace pattern is safer in record mode.

♻️ Proposed improvement

 def write_baselines(rows: list[dict], path: Path = BASELINE_PATH) -> None:
     """Sorted, stable-key JSONL output to keep diffs minimal."""
     def _sort_key(row: dict) -> tuple:
         return (
             row.get("metric", ""),
             row.get("recorded_on", ""),
             row.get("n_features", -1),
         )
     rows_sorted = sorted(rows, key=_sort_key)
     body = "\n".join(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in rows_sorted)
-    path.write_text(body + "\n", encoding="utf-8")
+    tmp_path = path.with_suffix(path.suffix + ".tmp")
+    tmp_path.write_text(body + "\n", encoding="utf-8")
+    tmp_path.replace(path)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/eval/_baseline_io.py` around lines 65 - 75, The write_baselines
function currently writes directly to BASELINE_PATH which can corrupt the file
if interrupted; change write_baselines to write the JSONL content to a temporary
file in the same directory (e.g., using tempfile or Path with a unique suffix)
and then atomically replace the target with os.replace (or Path.replace) so the
final write is atomic; ensure the temp file is opened with utf-8 and that you
still write the newline-terminated body and clean up the temp on error.
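
A sketch of the fuller pattern this prompt describes, including temp-file cleanup on error. It is written as a standalone function with a hypothetical name (`write_jsonl_atomic`) rather than as a patch to `write_baselines`, and it omits the row sorting shown in the diff above.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_jsonl_atomic(rows: list[dict], path: Path) -> None:
    """Write rows as JSONL to a temp file, then atomically swap it into place."""
    body = "\n".join(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in rows)
    # The temp file lives in the target's directory so the replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(body + "\n")
        os.replace(tmp_name, path)  # readers never observe a half-written file
    except BaseException:
        # Remove the temp file if the write or the replace failed, then re-raise.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        target = Path(tmp_dir) / "cost_baseline.jsonl"
        write_jsonl_atomic([{"metric": "C1", "n_features": 10}], target)
        print(target.read_text(encoding="utf-8"), end="")
```

`tempfile.mkstemp` gives a unique name, which avoids two concurrent record runs clobbering a shared fixed `.tmp` path.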

54-62: Add line-context error reporting for malformed JSONL.

Lines 58-61 currently raise raw JSONDecodeError, which makes broken baseline rows harder to diagnose. Consider surfacing file + line number.

♻️ Proposed improvement

 def load_baselines(path: Path = BASELINE_PATH) -> list[dict]:
     if not path.exists():
         return []
     rows = []
-    for line in path.read_text(encoding="utf-8").splitlines():
+    for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
         line = line.strip()
         if line:
-            rows.append(json.loads(line))
+            try:
+                rows.append(json.loads(line))
+            except json.JSONDecodeError as exc:
+                raise ValueError(f"Malformed baseline JSONL at {path}, line {lineno}: {exc}") from exc
     return rows

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/eval/_baseline_io.py` around lines 54 - 62, Wrap the json.loads call
inside load_baselines with a try/except that catches json.JSONDecodeError, track
the current line number using enumerate(path.read_text(...).splitlines(),
start=1) so you can include the file path and line number in the error, and
re-raise a clearer error (e.g., raise ValueError(f"Malformed JSON in {path} at
line {lineno}: {e}") from e) so the original JSONDecodeError is preserved as the
__cause__; update the rows.append(json.loads(line)) call in load_baselines
accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 19b3775b-88f2-43c6-bb46-3a151777aab0

📥 Commits

Reviewing files that changed from the base of the PR and between 92369b7 and 531067a.

📒 Files selected for processing (9)

.github/workflows/preflight-eval.yml
docs/preflight-failure-scenarios.md
pyproject.toml
tests/eval/_baseline_io.py
tests/eval/_synthetic_ledger.py
tests/eval/_token_count.py
tests/eval/cost_baseline.jsonl
tests/eval/run_preflight_cost_eval.py
tests/eval/test_cost_baseline_helpers.py

…+ runner

Stage 1-4 of issue #88: measurement infrastructure for the catalog's §C cost/latency baseline.

Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight

C2/C3 use mocked ledger queries so the metric isolates handler-logic + serialization cost from SurrealDB I/O variance. The optimization directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all mutate handler logic, not the ledger.

Asymmetric regression rule: only flags increases, never improvements. ±20% relative threshold with absolute noise floors (10 tokens / 0.5ms) to absorb timer jitter at sub-ms latency scale. Re-record via BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.

The synthetic ledger generator is deterministic given (n_features, decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows forces re-record when the corpus changes. Token counter uses tiktoken cl100k_base, pinned in pyproject [test] extras to prevent silent count drift.

13 unit tests cover the regression rule + baseline IO directly. 5 runner tests produce the metrics on every PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape: 10 region matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)

The N=1000 number lands the §C concern empirically: ~800K tokens for a single bicameral.history() call fills 80% of Sonnet 4.6's 1M context before the skill reasons about anything. This is exactly the optimization target named in #58 (semantic prefilter, lazy/two-pass history, file-path → feature-group hint).

Linux baselines NOT included: the runner skips cleanly per-platform when no row exists. Record locally on a Linux host with BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up. Token counts are platform-independent (deterministic via tiktoken) but still tagged recorded_on=darwin for symmetry with C3 latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the phase 3 step to the advisory preflight-eval workflow. continue-on-error: true so a phase 3 failure never blocks merge, the same contract as phase 1 + 2. The existing test-summary glob (test-results/*.xml) picks up the new junit file automatically.

Catalog implementation queue ticked: C1/C2/C3 all marked baselined, with a pointer to tests/eval/cost_baseline.jsonl. Regression rule description updated to reflect the asymmetric + noise-floor design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump to v0.11.0 - CodeGenome Phase 1+2 adapter + identity records

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.12.0 - skill telemetry, extensible relay, reset wipe_mode

  - Skill-level telemetry: replace per-tool timing with bicameral.skill_begin / bicameral.skill_end bookend tools; record_skill_event replaces record_event
  - Extensible relay: remove ALLOWED_TOOLS allowlist and strict EventPayload interface; relay now validates only distinct_id + version + diagnostic numeric invariant, all other fields pass through, so future event types require no relay redeploy; deployed to Cloudflare (v a6acec14)
  - telemetry.py: add send_event() open primitive; record_skill_event is a thin wrapper; setup_wizard consent UI updated to show new skill-level payload shape
  - reset wipe_mode: ledger (default, DB rows only, server stays live) vs full (deletes entire .bicameral/ dir including config + event files, reinits schema)
  - ledger/adapter.py: wipe_all_rows now close-and-delete instead of row-by-row traversal: simpler, faster, correct for embedded surrealkv
  - events/team_adapter.py: add explicit wipe_all_rows that resets event watermark
  - contracts.py: ResetResponse gains wipe_mode + bicameral_dir fields
  - skills/bicameral-reset/SKILL.md: updated with two-mode table and confirmation phrasing; full mode requires showing bicameral_dir before confirm
  - tests: new test_reset_full_wipe_deletes_bicameral_dir (5/5 pass)

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: v0.12.1 - rationale, error_class, and bicameral.feedback telemetry

  - bicameral.skill_begin now accepts `rationale` (why the skill triggered) stored in _skill_sessions dict alongside t0 and forwarded at skill_end
  - bicameral.skill_end now accepts `error_class` enum (symbol_not_found, collision_unresolved, drift_mislabeled, low_confidence_verdict, ledger_empty, grounding_failed, user_abort, other) replacing the boolean-only errored signal
  - New bicameral.feedback tool: call when stuck; records {trying_to, attempted, stuck_on} as agent_feedback events mapping to desync catalog
  - All 8 major skills updated with Telemetry bookend sections showing the skill_begin/skill_end pattern with rationale + error_class examples
  - telemetry.record_skill_event extended with error_class and rationale kwargs

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: delete stale bicameral-drift and bicameral-scan-branch skills

  Both reference tools (bicameral.drift, bicameral.scan_branch) that no longer exist in the server. Drift detection is handled by link_commit + auto-sync middleware + resolve_compliance.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: remove embedded worktree from index, ignore .claude/worktrees

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: pass --no-cache-dir to pip install in update handler

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: use pipx install --force for upgrades, fall back to pip

  sys.executable -m pip fails on Homebrew Python (externally-managed-environment). pipx is the standard install path and handles its own venv correctly. pipx also doesn't support --no-cache-dir so that flag is dropped from the pip fallback path.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-mcp reset CLI - questionary wizard before wiping

  Adds a `bicameral reset` subcommand that:
  1. Prompts for wipe mode (ledger vs full) via questionary select
  2. Shows a dry-run summary (cursor count, replay plan, bicameral_dir for full mode with a ⚠️ warning)
  3. Asks for explicit confirmation before calling handle_reset

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-mcp config CLI - questionary wizard for config.yaml

  Adds a `bicameral config` subcommand that:
  1. Reads current config.yaml values as defaults
  2. Prompts for mode, guided, telemetry via questionary selects with the current value pre-selected
  3. Writes updated config.yaml
  4. Reinstalls skills and hooks so changes take effect immediately

  Replaces the LLM-in-chat text menu in the bicameral-config skill.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-config skill uses AskUserQuestion for all three settings

  Replaces text-based [1/2] menus with a single AskUserQuestion call covering mode, guided, and telemetry, all in one interactive prompt within the Claude session.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.12.2 - CLI wizards + telemetry quality loop

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: add Dependabot for weekly pip dependency updates

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: v0.13.0 - gate telemetry schema, AskUserQuestion ground truth, liberal ingest filter

  Telemetry schema (all skills):
  - g{N}_ prefix convention across all gate diagnostic fields (G2/G3/G6 in ingest, G9/G10/G11 in preflight, G11 in capture-corrections)
  - skill_begin/skill_end guarded: only emit if BICAMERAL_TELEMETRY is enabled
  - g{N}_user_overrode as universal ground-truth signal at every interactive gate

  AskUserQuestion ground truth wiring:
  - G2 Step 1.5 (ingest): AskUserQuestion for borderline Gate1/Gate2 drops, batched in groups of 4; guarded by guided_mode
  - G10 Step 5.5 (preflight): AskUserQuestion after surfaced block to dismiss irrelevant findings; guarded by guided_mode; populates g10_user_overrode
  - G11 Steps 6-7 (capture-corrections): replaces freeform Y/n with AskUserQuestion, batched in groups of 4 for all correction counts

  Liberal ingest filter:
  - Removed aspirational, hedged conditional, and parked/deferred from hard-exclude; these now flow through level classification and gate filters as speculative proposals
  - Ratification is the team's judgment layer, not the extraction filter
  - Updated Example 1: now extracts 3 speculative proposals instead of 0

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: bump RECOMMENDED_VERSION to 0.13.0

  Was left at 0.12.2; the update handler checks this file to detect available upgrades.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: surface pending decisions when sync no-ops on same commit

  After ingest, `bicameral sync` could return 'already_synced' with zero compliance checks when HEAD hadn't moved, leaving newly-ingested decisions stuck at `pending` indefinitely.

  Two-part fix:
  1. `ledger/adapter.py` `ingest_commit`: in the `already_synced` early-return, query `get_pending_decisions_with_regions()` and include any pending decisions as `pending_compliance_checks` in the response.
  2. `handlers/link_commit.py` `invalidate_sync_cache` + new `sync_middleware.invalidate_process_cache()`: after any mutation (ingest, update, reset), clear the process-level `_LAST_SYNCED_SHA` so that `ensure_ledger_synced` runs a fresh sync on the next tool call even when HEAD hasn't moved.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.1 - fix sync no-op on same commit

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: ratify prompt fires last, after all decisions printed (ingest step 7)

  Previously "after ingest" was ambiguous: the LLM could fire the ratify AskUserQuestion immediately after bicameral.ingest returned, before the report (step 4), brief (step 5), and gap-judge (step 6) were shown.

  Now step 7 is explicit:
  - Must be the last user-facing output of the ingest flow
  - Multi-segment ingests ratify once at the end of the roll-up, not per segment

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.2 - ratify prompt ordering fix

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Preflight eval: §C cost/latency baseline (#90)

  * test(eval): cost-baseline harness - synthetic ledger + token counter + runner

    Stage 1-4 of issue #88: measurement infrastructure for the catalog's §C cost/latency baseline.

    Three deterministic metrics:
    - C1: bicameral.history() payload tokens at N=10/100/1000 features
    - C2: bicameral.preflight() response size (tokens + bytes)
    - C3: handler latency p50/p95 on bicameral.preflight

    C2/C3 use mocked ledger queries so the metric isolates handler-logic + serialization cost from SurrealDB I/O variance. The optimization directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all mutate handler logic, not the ledger.

    Asymmetric regression rule: only flags increases, never improvements. ±20% relative threshold with absolute noise floors (10 tokens / 0.5ms) to absorb timer jitter at sub-ms latency scale. Re-record via BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.

    The synthetic ledger generator is deterministic given (n_features, decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows forces re-record when the corpus changes. Token counter uses tiktoken cl100k_base, pinned in pyproject [test] extras to prevent silent count drift.

    13 unit tests cover the regression rule + baseline IO directly. 5 runner tests produce the metrics on every PR.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

  * test(eval): commit initial Darwin cost baselines

    Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
    - C1[N=10]: 7,574 tokens
    - C1[N=100]: 79,025 tokens
    - C1[N=1000]: 795,982 tokens
    - C2: 1,519 tokens / 6,610 bytes (representative shape: 10 region matches + 2 collision-pending + 2 context-pending)
    - C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)

    The N=1000 number lands the §C concern empirically: ~800K tokens for a single bicameral.history() call fills 80% of Sonnet 4.6's 1M context before the skill reasons about anything. This is exactly the optimization target named in #58 (semantic prefilter, lazy/two-pass history, file-path → feature-group hint).

    Linux baselines NOT included: the runner skips cleanly per-platform when no row exists. Record locally on a Linux host with BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up. Token counts are platform-independent (deterministic via tiktoken) but still tagged recorded_on=darwin for symmetry with C3 latency.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

  * ci+docs(preflight-eval): wire phase 3 cost/latency step + tick §C

    Adds the phase 3 step to the advisory preflight-eval workflow. continue-on-error: true so a phase 3 failure never blocks merge, the same contract as phase 1 + 2. The existing test-summary glob (test-results/*.xml) picks up the new junit file automatically.

    Catalog implementation queue ticked: C1/C2/C3 all marked baselined, with a pointer to tests/eval/cost_baseline.jsonl. Regression rule description updated to reflect the asymmetric + noise-floor design.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

  ---------

  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: enforce exact diagnostic field names in ingest + preflight telemetry

  LLMs were substituting natural-language names (grounded, ungrounded, channels_read, compliance_resolved) for the required g2_*/g3_*/g6_* prefixed names. The events landed in PostHog but fell through every dashboard panel because the queries filter on the prefixed names.

  Added explicit ⚠ warning with inline NOT comments (e.g. "# NOT 'grounded'") to both bicameral-ingest and bicameral-preflight skill_end sections.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: enforce skill diagnostic schema via Pydantic in skill_end handler

  Previously diagnostic was an open object: LLMs sent improvised field names (grounded, ungrounded, channels_read) that fell through every dashboard filter.
Now: - IngestDiagnostic and PreflightDiagnostic Pydantic models in contracts.py with extra="forbid" enumerate all valid g2_*/g3_*/g6_*/g9_*/g10_*/g11_* fields - skill_end handler validates against the per-skill model; unknown fields are stripped from the PostHog payload and echoed back in diagnostic_warning so the LLM immediately sees what it sent wrong on the same call - inputSchema description enumerates all valid field names so the LLM has them visible at call time Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: bump to v0.13.3 — Pydantic diagnostic enforcement + telemetry field fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: jinhongkuan <kuanjh123@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Silong Tan <silongtan@outlook.com>
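The asymmetric regression rule described in the cost-baseline harness commit can be sketched roughly as follows. This is an illustration, not the repository's code: the names `Threshold` and `is_regression` are hypothetical, and the sketch assumes an increase must clear both the 20% relative threshold and the absolute noise floor before it is flagged (the commit message does not spell out how the two combine).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical container for the two knobs named in the commit message."""
    relative: float = 0.20      # 20% relative increase
    noise_floor: float = 10.0   # absolute floor: 10 for tokens, 0.5 for ms


def is_regression(baseline: float, current: float,
                  t: Threshold = Threshold()) -> bool:
    """Return True only when `current` is meaningfully worse than `baseline`.

    Asymmetric: a decrease (an improvement) never flags.
    Assumption: an increase must exceed BOTH the relative threshold and the
    absolute noise floor before it counts as a regression.
    """
    increase = current - baseline
    if increase <= 0:
        return False
    return increase > t.noise_floor and increase > baseline * t.relative


if __name__ == "__main__":
    tokens = Threshold(noise_floor=10.0)
    latency_ms = Threshold(noise_floor=0.5)
    print(is_regression(7574, 9500, tokens))       # large token growth
    print(is_regression(7574, 5000, tokens))       # improvement
    print(is_regression(0.08, 0.30, latency_ms))   # sub-ms jitter
```

Under these assumptions, timer jitter on a sub-millisecond baseline stays below the 0.5ms floor and is ignored, while a payload that grows by more than a fifth is flagged.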

…dback) (#96)

* chore: bump to v0.11.0 - CodeGenome Phase 1+2 adapter + identity records

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.12.0 - skill telemetry, extensible relay, reset wipe_mode

- Skill-level telemetry: replace per-tool timing with bicameral.skill_begin / bicameral.skill_end bookend tools; record_skill_event replaces record_event
- Extensible relay: remove ALLOWED_TOOLS allowlist and strict EventPayload interface; relay now validates only distinct_id + version + diagnostic numeric invariant, all other fields pass through, so future event types require no relay redeploy; deployed to Cloudflare (v a6acec14)
- telemetry.py: add send_event() open primitive; record_skill_event is a thin wrapper; setup_wizard consent UI updated to show new skill-level payload shape
- reset wipe_mode: ledger (default, DB rows only, server stays live) vs full (deletes entire .bicameral/ dir including config + event files, reinits schema)
- ledger/adapter.py: wipe_all_rows now close-and-delete instead of row-by-row traversal (simpler, faster, correct for embedded surrealkv)
- events/team_adapter.py: add explicit wipe_all_rows that resets event watermark
- contracts.py: ResetResponse gains wipe_mode + bicameral_dir fields
- skills/bicameral-reset/SKILL.md: updated with two-mode table and confirmation phrasing; full mode requires showing bicameral_dir before confirm
- tests: new test_reset_full_wipe_deletes_bicameral_dir (5/5 pass)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: v0.12.1 - rationale, error_class, and bicameral.feedback telemetry

- bicameral.skill_begin now accepts `rationale` (why the skill triggered) stored in _skill_sessions dict alongside t0 and forwarded at skill_end
- bicameral.skill_end now accepts `error_class` enum (symbol_not_found, collision_unresolved, drift_mislabeled, low_confidence_verdict, ledger_empty, grounding_failed, user_abort, other) replacing the boolean-only errored signal
- New bicameral.feedback tool: call when stuck; records {trying_to, attempted, stuck_on} as agent_feedback events mapping to desync catalog
- All 8 major skills updated with Telemetry bookend sections showing the skill_begin/skill_end pattern with rationale + error_class examples
- telemetry.record_skill_event extended with error_class and rationale kwargs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: delete stale bicameral-drift and bicameral-scan-branch skills

Both reference tools (bicameral.drift, bicameral.scan_branch) that no longer exist in the server. Drift detection is handled by link_commit + auto-sync middleware + resolve_compliance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: remove embedded worktree from index, ignore .claude/worktrees

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: pass --no-cache-dir to pip install in update handler

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: use pipx install --force for upgrades, fall back to pip

sys.executable -m pip fails on Homebrew Python (externally-managed-environment). pipx is the standard install path and handles its own venv correctly. pipx also doesn't support --no-cache-dir so that flag is dropped from the pip fallback path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-mcp reset CLI - questionary wizard before wiping

Adds a `bicameral reset` subcommand that:
1. Prompts for wipe mode (ledger vs full) via questionary select
2. Shows a dry-run summary (cursor count, replay plan, bicameral_dir for full mode with a ⚠️ warning)
3. Asks for explicit confirmation before calling handle_reset

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-mcp config CLI - questionary wizard for config.yaml

Adds a `bicameral config` subcommand that:
1. Reads current config.yaml values as defaults
2. Prompts for mode, guided, telemetry via questionary selects with the current value pre-selected
3. Writes updated config.yaml
4. Reinstalls skills and hooks so changes take effect immediately

Replaces the LLM-in-chat text menu in the bicameral-config skill.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-config skill uses AskUserQuestion for all three settings

Replaces text-based [1/2] menus with a single AskUserQuestion call covering mode, guided, and telemetry, all in one interactive prompt within the Claude session.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.12.2 - CLI wizards + telemetry quality loop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: add Dependabot for weekly pip dependency updates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: v0.13.0 - gate telemetry schema, AskUserQuestion ground truth, liberal ingest filter

Telemetry schema (all skills):
- g{N}_ prefix convention across all gate diagnostic fields (G2/G3/G6 in ingest, G9/G10/G11 in preflight, G11 in capture-corrections)
- skill_begin/skill_end guarded: only emit if BICAMERAL_TELEMETRY is enabled
- g{N}_user_overrode as universal ground-truth signal at every interactive gate

AskUserQuestion ground truth wiring:
- G2 Step 1.5 (ingest): AskUserQuestion for borderline Gate1/Gate2 drops, batched in groups of 4; guarded by guided_mode
- G10 Step 5.5 (preflight): AskUserQuestion after surfaced block to dismiss irrelevant findings; guarded by guided_mode; populates g10_user_overrode
- G11 Steps 6-7 (capture-corrections): replaces freeform Y/n with AskUserQuestion, batched in groups of 4 for all correction counts

Liberal ingest filter:
- Removed aspirational, hedged conditional, and parked/deferred from hard-exclude; these now flow through level classification and gate filters as speculative proposals
- Ratification is the team's judgment layer, not the extraction filter
- Updated Example 1: now extracts 3 speculative proposals instead of 0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: bump RECOMMENDED_VERSION to 0.13.0

Was left at 0.12.2; the update handler checks this file to detect available upgrades.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: surface pending decisions when sync no-ops on same commit

After ingest, `bicameral sync` could return 'already_synced' with zero compliance checks when HEAD hadn't moved, leaving newly-ingested decisions stuck at `pending` indefinitely.

Two-part fix:
1. `ledger/adapter.py` `ingest_commit`: in the `already_synced` early-return, query `get_pending_decisions_with_regions()` and include any pending decisions as `pending_compliance_checks` in the response.
2. `handlers/link_commit.py` `invalidate_sync_cache` + new `sync_middleware.invalidate_process_cache()`: after any mutation (ingest, update, reset), clear the process-level `_LAST_SYNCED_SHA` so that `ensure_ledger_synced` runs a fresh sync on the next tool call even when HEAD hasn't moved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.1 - fix sync no-op on same commit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: ratify prompt fires last, after all decisions printed (ingest step 7)

Previously "after ingest" was ambiguous: the LLM could fire the ratify AskUserQuestion immediately after bicameral.ingest returned, before the report (step 4), brief (step 5), and gap-judge (step 6) were shown.

Now step 7 is explicit:
- Must be the last user-facing output of the ingest flow
- Multi-segment ingests ratify once at the end of the roll-up, not per segment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.2 - ratify prompt ordering fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Preflight eval: §C cost/latency baseline (#90)

* test(eval): cost-baseline harness - synthetic ledger + token counter + runner

Stage 1-4 of issue #88: measurement infrastructure for the catalog's §C cost/latency baseline.

Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight

C2/C3 use mocked ledger queries so the metric isolates handler-logic + serialization cost from SurrealDB I/O variance. The optimization directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all mutate handler logic, not the ledger.

Asymmetric regression rule: only flags increases, never improvements. ±20% relative threshold with absolute noise floors (10 tokens / 0.5ms) to absorb timer jitter at sub-ms latency scale. Re-record via BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.

The synthetic ledger generator is deterministic given (n_features, decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows forces re-record when the corpus changes.

Token counter uses tiktoken cl100k_base, pinned in pyproject [test] extras to prevent silent count drift.

13 unit tests cover the regression rule + baseline IO directly. 5 runner tests produce the metrics on every PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(eval): commit initial Darwin cost baselines

Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape: 10 region matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)

The N=1000 number lands the §C concern empirically: ~800K tokens for a single bicameral.history() call fills 80% of Sonnet 4.6's 1M context before the skill reasons about anything. This is exactly the optimization target named in #58 (semantic prefilter, lazy/two-pass history, file-path → feature-group hint).

Linux baselines NOT included; the runner skips cleanly per-platform when no row exists. Record locally on a Linux host with BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up. Token counts are platform-independent (deterministic via tiktoken) but still tagged recorded_on=darwin for symmetry with C3 latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci+docs(preflight-eval): wire phase 3 cost/latency step + tick §C

Adds the phase 3 step to the advisory preflight-eval workflow. continue-on-error: true so a phase 3 failure never blocks merge (same contract as phase 1 + 2). The existing test-summary glob (test-results/*.xml) picks up the new junit file automatically.

Catalog implementation queue ticked: C1/C2/C3 all marked baselined, with a pointer to tests/eval/cost_baseline.jsonl. Regression rule description updated to reflect the asymmetric + noise-floor design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: enforce exact diagnostic field names in ingest + preflight telemetry

LLMs were substituting natural-language names (grounded, ungrounded, channels_read, compliance_resolved) for the required g2_*/g3_*/g6_* prefixed names. The events landed in PostHog but fell through every dashboard panel because the queries filter on the prefixed names.

Added explicit ⚠ warning with inline NOT comments (e.g. "# NOT 'grounded'") to both bicameral-ingest and bicameral-preflight skill_end sections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: enforce skill diagnostic schema via Pydantic in skill_end handler

Previously diagnostic was an open object: LLMs sent improvised field names (grounded, ungrounded, channels_read) that fell through every dashboard filter.
Now:
- IngestDiagnostic and PreflightDiagnostic Pydantic models in contracts.py with extra="forbid" enumerate all valid g2_*/g3_*/g6_*/g9_*/g10_*/g11_* fields
- skill_end handler validates against the per-skill model; unknown fields are stripped from the PostHog payload and echoed back in diagnostic_warning so the LLM immediately sees what it sent wrong on the same call
- inputSchema description enumerates all valid field names so the LLM has them visible at call time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.3 - Pydantic diagnostic enforcement + telemetry field fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: VHS demo - 5 core use case flows (ingest, preflight, sync, history)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: remove demo directory

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.4 - branch-scoped ephemeral bind + stale hash repair

B9: handlers/bind.py used authoritative_sha for all file checks and hash computation regardless of branch. On feature branches this caused (1) spurious rejection of branch-local files and (2) phantom "drifted" status after resolve_compliance because bind stored H_main while link_commit computed H_branch. Fix: detect _is_ephemeral_commit and use head_sha as effective_ref.

B10: ingest_commit's already_synced early-return left stale "reflected" status when returning to main after feature-branch bind work. The repair path in the already_synced branch now uses get_regions_with_ephemeral_verdicts (indexed lookup via idx_cc_ephemeral) to find only suspect regions, updates their hashes to the authoritative content, and re-projects decision status. Two-pass approach deduplicates project_decision_status calls per decision.

Tests: E18-E22 added (22/22 ephemeral/authoritative scenarios pass).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: set RECOMMENDED_VERSION to 0.13.4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(eval): real-ledger seeder for cost/latency baselines

Stage 6 of issue #88 path-3 rework. Adds `tests/eval/_seed_ledger.py`, which translates a synthetic HistoryResponse-shaped dict (from the existing generator) into real SurrealDB writes via `adapter.ingest_payload`, the production ingestion path.

Uses the synthetic-repo fallback (repo path not on disk → empty content_hash) so seeding works without git fixtures. Status overrides post-ingest via `update_decision_status` to match the synthetic generator's intended distribution (70% reflected / 20% drifted / 10% other); this bypasses derive_status since there's no real file content.

Three new unit tests:
- N=10 seeds 30 decisions, ledger contains exactly that count
- N=100 status distribution roughly matches synthetic generator's
- Empty input returns 0

Stage 7 will use this seeder to run C2 + C3 against real seeded ledgers instead of mocked queries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(eval): C2/C3 against real seeded ledger, parametrized by N=10/100/1000

Stage 7 of issue #88 path-3 rework. Addresses Jin's "test not very useful if it doesnt capture updates" feedback by switching C2 and C3 from mocked ledger queries to a real `memory://` SurrealDB seeded with N synthetic features. The handler now executes the real SurrealDB query path on every measurement, the same code the developer hits in production.

Real-I/O baselines (Darwin local, Python 3.12 + SurrealDB 2.x):

| N | C2 tokens / bytes | C3 p50 / p95 |
|---|---|---|
| 10 | 566 / 2,303 | 2.5ms / 3.0ms |
| 100 | 571 / 2,303 | 14.8ms / 15.9ms |
| 1000 | 575 / 2,303 | 138.8ms / 141.7ms |

C3 latency at N=1000 is ~1700× the previous mocked baseline (138ms vs 0.08ms). That's the user-experience-relevant signal, and exactly the regression target an optimization PR (#58 directions: semantic prefilter, lazy/two-pass history) should reduce.

Platform tagging:
- C1: `recorded_on=any` (token counts are deterministic across OSes)
- C2: `recorded_on=any` (response shape is deterministic given same seed; noise floor absorbs sync_metrics timing variance)
- C3: per-platform `darwin` (real I/O latency varies meaningfully by host; Linux baselines must be recorded separately on a Linux runner)

Schema additions:
- `_baseline_io.ANY_PLATFORM` sentinel: a row with this value matches every host. `find_baseline` now treats `recorded_on=any` rows as matches regardless of caller's platform.
- `_record_or_assert(platform_agnostic=True)` records and matches with the sentinel.

Implementation notes:
- C2/C3 each spin up a fresh adapter per parametrized run, so there is no cross-test state and no singleton reset is needed.
- file_paths chosen from synthetic decisions via `_pick_grounded_paths` to guarantee region-anchored matches (response fires non-trivially).
- Seeding cost: ~62s at N=1000 (3000 ingest_payload mappings through the real ingest path + status updates). Total cost-eval runtime: ~2m30s. Acceptable for advisory CI; non-blocking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(catalog): refresh §C wording for real-ledger C2/C3

Stage 8 of issue #88 path-3 rework. Updates the catalog's §C entries to reflect that C2 + C3 now measure against a real seeded ledger, not mocked queries. Adds the real-ledger seeder to the implementation queue ticked items and clarifies the per-platform vs platform-agnostic split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: jinhongkuan <kuanjh123@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: WulfForge <krknapp@gmail.com>
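The platform-tagging behavior described in the Stage 7 commit (rows tagged `recorded_on=any` match every host, while C3 rows match only the platform they were recorded on) might look something like the sketch below. The real `find_baseline` signature is not shown in this PR page, so the parameters and the `metric` / `value` row fields here are assumptions for illustration; only `ANY_PLATFORM`, `find_baseline`, and `recorded_on` are names taken from the commit message.

```python
import sys
from typing import Optional

ANY_PLATFORM = "any"  # sentinel named in the commit message


def find_baseline(rows: list[dict], metric: str,
                  platform: str = sys.platform) -> Optional[dict]:
    """Return the first baseline row for `metric` usable on `platform`.

    A row is usable when it was recorded on the caller's platform or when
    it carries the ANY_PLATFORM sentinel. Row layout is hypothetical.
    """
    for row in rows:
        if row.get("metric") != metric:
            continue
        if row.get("recorded_on") in (platform, ANY_PLATFORM):
            return row
    return None


if __name__ == "__main__":
    rows = [
        {"metric": "C2_tokens[N=10]", "value": 566, "recorded_on": "any"},
        {"metric": "C3_p50_ms[N=10]", "value": 2.5, "recorded_on": "darwin"},
    ]
    # Token counts resolve on any host; latency only where it was recorded.
    print(find_baseline(rows, "C2_tokens[N=10]", "linux"))
    print(find_baseline(rows, "C3_p50_ms[N=10]", "linux"))
```

In this sketch, a `None` result on Linux for the latency metric corresponds to the "runner skips cleanly per-platform when no row exists" behavior the earlier commit describes.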

silongtan temporarily deployed to ci-test April 29, 2026 03:26 — with GitHub Actions Inactive

coderabbitai Bot reviewed Apr 29, 2026

silongtan and others added 3 commits April 28, 2026 23:32

silongtan force-pushed the preflight-cost-baseline branch from 531067a to bbc5933 Compare April 29, 2026 03:33

silongtan temporarily deployed to ci-test April 29, 2026 03:33 — with GitHub Actions Inactive

jinhongkuan self-requested a review April 29, 2026 03:42

jinhongkuan approved these changes Apr 29, 2026

jinhongkuan merged commit dde17e7 into main Apr 29, 2026
5 checks passed

silongtan mentioned this pull request Apr 29, 2026

Preflight eval phase 3 — real-I/O C2/C3 measurement (Jin's review feedback) #96

Merged

4 tasks

silongtan mentioned this pull request May 4, 2026

M6 preflight handler retrieval: by-design split (handler structural, skill-layer covered by #306) #58

Closed

jinhongkuan merged 3 commits into main from preflight-cost-baseline
