Preflight eval: §C cost/latency baseline #90
📝 Walkthrough
Adds an advisory CI step to run preflight cost/latency evaluations against a committed baseline. Introduces helper modules for baselines, token counting, and synthetic ledger generation; a pytest runner covering C1–C3; a committed baseline JSONL; documentation updates; a new test dependency; and a comprehensive helper test suite.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Dev as Developer
participant CI as GitHub Actions
participant Py as Pytest Runner
participant BL as Baseline IO
participant LG as Synthetic Ledger
participant TK as Tokenizer
participant HD as Preflight Handler
participant JR as JUnit Report
Dev->>CI: Push/PR
CI->>Py: Run preflight cost eval (advisory)
Py->>BL: Load committed baselines
alt C1: history payload tokens
Py->>LG: Generate synthetic ledger (N features)
LG-->>Py: Deterministic payload
Py->>TK: Count tokens (cl100k_base)
TK-->>Py: Token count
Py->>BL: Regression check (C1)
else C2: preflight response size
Py->>HD: handle_preflight(mocked ctx)
HD-->>Py: Response JSON
Py->>TK: Count tokens/bytes
TK-->>Py: Sizes
Py->>BL: Regression check (C2)
else C3: handler latency
Py->>HD: Warm + timed calls
HD-->>Py: Latency samples
Py->>BL: Regression checks (p50, p95)
end
Py-->>JR: Write test results (JUnit XML)
CI-->>Dev: Report (non-blocking)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🧹 Nitpick comments (5)
tests/eval/run_preflight_cost_eval.py (2)
29-29: Unused import: asyncio.
The asyncio module is imported but not used directly; the async tests rely on pytest-asyncio to run the event loop.
🧹 Remove unused import
-import asyncio
 import sys
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/run_preflight_cost_eval.py` at line 29, Remove the unused asyncio import from the top of the test file: delete the line importing asyncio in run_preflight_cost_eval.py since pytest-asyncio provides the event loop and no symbol from asyncio is referenced (there are no functions/classes in this diff like run_preflight_cost_eval that require asyncio directly).
322-324: Minor: p95 index calculation is slightly off.
With _C3_SAMPLES=100, int(len(timings_ms) * 0.95) yields index 95, which accesses the 96th element in the sorted list. For a true p95, you'd typically use index 94 (the 95th element out of 100).
That said, the difference is negligible for this use case (benchmarking handler latency), and the value will still be close to p95.
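To make the off-by-one concrete, here is a minimal sketch that compares the flagged index formula with a nearest-rank index on 100 synthetic samples. The helper name `nearest_rank_index` and the sample values are illustrative assumptions, not code from this PR. Integer arithmetic is used so the result does not depend on floating-point rounding.

```python
def nearest_rank_index(n: int, pct: int) -> int:
    """Zero-based index of the pct-th percentile under the nearest-rank rule.

    Computes ceil(pct * n / 100) - 1 using integers only.
    """
    if n <= 0:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    rank = (pct * n + 99) // 100  # integer ceiling of pct * n / 100
    return rank - 1


# Hypothetical, already-sorted latency samples: 1.0 ms, 2.0 ms, ..., 100.0 ms.
timings_ms = [float(i) for i in range(1, 101)]

current_idx = int(len(timings_ms) * 0.95)  # the formula flagged in the review
nearest_idx = nearest_rank_index(len(timings_ms), 95)

print("flagged formula:", current_idx, timings_ms[current_idx])
print("nearest rank:   ", nearest_idx, timings_ms[nearest_idx])
```

With 100 samples the two formulas differ by exactly one position, which matches the reviewer's point that the gap is small for a latency benchmark.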
💡 Standard percentile calculation
 timings_ms.sort()
 p50 = timings_ms[len(timings_ms) // 2]
-p95 = timings_ms[int(len(timings_ms) * 0.95)]
+p95 = timings_ms[int(len(timings_ms) * 0.95) - 1]  # 0-indexed: 95th of 100 is index 94
Or use statistics.quantiles for standard behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/run_preflight_cost_eval.py` around lines 322 - 324, The p95 index calculation currently uses int(len(timings_ms) * 0.95) which yields index 95 for _C3_SAMPLES=100 (the 96th element); change the p95 calculation to use a zero-based 95th percentile index (e.g., idx = math.ceil(len(timings_ms) * 0.95) - 1) and set p95 = timings_ms[idx]; ensure you import math if needed. Keep the p50 logic as-is (timings_ms[len(timings_ms) // 2]) or alternatively replace both with statistics.quantiles/two-line percentile helper if you prefer standard behavior.
tests/eval/test_cost_baseline_helpers.py (1)
161-165: Potential test fragility: JSON string literal vs json.dumps output.
The direct string '{"foo": "bar", "n": 42}' assumes a specific key ordering and spacing that json.dumps produces. While CPython 3.7+ preserves dict insertion order, the assertion relies on json.dumps producing no trailing spaces and the exact same key order. This works today but could become fragile.
Consider using the JSON function for both sides to ensure consistency:
💡 More robust comparison
 def test_count_tokens_json_matches_direct_serialize():
     payload = {"foo": "bar", "n": 42}
-    direct = count_tokens('{"foo": "bar", "n": 42}')
+    import json
+    direct = count_tokens(json.dumps(payload, ensure_ascii=False))
     via_json = count_tokens_json(payload)
     assert direct == via_json
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/test_cost_baseline_helpers.py` around lines 161 - 165, The test uses a hard-coded JSON literal which can be fragile due to spacing/ordering differences; update test_count_tokens_json_matches_direct_serialize to produce the direct string via json.dumps(payload) (or json.dumps(payload, separators=(',', ':'), sort_keys=True) for a canonical form) and then call count_tokens on that string so both sides use the same JSON serialization; reference functions: test_count_tokens_json_matches_direct_serialize, count_tokens, count_tokens_json, and json.dumps.
tests/eval/_baseline_io.py (2)
65-75: Use atomic replace when writing baseline files.
Line [75] writes directly to the target path; an interrupted write can corrupt the JSONL. A temp-file + replace pattern is safer in record mode.
♻️ Proposed improvement
 def write_baselines(rows: list[dict], path: Path = BASELINE_PATH) -> None:
     """Sorted, stable-key JSONL output to keep diffs minimal."""
     def _sort_key(row: dict) -> tuple:
         return (
             row.get("metric", ""),
             row.get("recorded_on", ""),
             row.get("n_features", -1),
         )
     rows_sorted = sorted(rows, key=_sort_key)
     body = "\n".join(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in rows_sorted)
-    path.write_text(body + "\n", encoding="utf-8")
+    tmp_path = path.with_suffix(path.suffix + ".tmp")
+    tmp_path.write_text(body + "\n", encoding="utf-8")
+    tmp_path.replace(path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/_baseline_io.py` around lines 65 - 75, The write_baselines function currently writes directly to BASELINE_PATH which can corrupt the file if interrupted; change write_baselines to write the JSONL content to a temporary file in the same directory (e.g., using tempfile or Path with a unique suffix) and then atomically replace the target with os.replace (or Path.replace) so the final write is atomic; ensure the temp file is opened with utf-8 and that you still write the newline-terminated body and clean up the temp on error.
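The prompt above asks for a temp file in the same directory plus cleanup on error, which the proposed diff does not show. A minimal sketch of that variant follows. The function name `write_jsonl_atomic` is hypothetical and the sort step from write_baselines is omitted for brevity; only standard-library calls are used.

```python
import json
import os
import tempfile
from pathlib import Path


def write_jsonl_atomic(rows: list[dict], path: Path) -> None:
    """Write rows as JSONL to a temp file, then atomically swap it into place."""
    body = "\n".join(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in rows)
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(body + "\n")
        os.replace(tmp_name, path)  # readers see either the old or the new file
    except BaseException:
        # Best-effort cleanup so a failed write leaves no stray temp file behind.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        target = Path(tmp_dir) / "cost_baseline.jsonl"
        write_jsonl_atomic([{"metric": "C2", "value": 1519}], target)
        print(target.read_text(encoding="utf-8"), end="")
```

Using a unique temp name (rather than a fixed ".tmp" suffix) also avoids two concurrent record runs clobbering each other's partial file, at the cost of a slightly longer helper.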
54-62: Add line-context error reporting for malformed JSONL.
Lines [58]-[61] currently raise raw JSONDecodeError, which makes broken baseline rows harder to diagnose. Consider surfacing file + line number.
♻️ Proposed improvement
 def load_baselines(path: Path = BASELINE_PATH) -> list[dict]:
     if not path.exists():
         return []
     rows = []
-    for line in path.read_text(encoding="utf-8").splitlines():
+    for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
         line = line.strip()
         if line:
-            rows.append(json.loads(line))
+            try:
+                rows.append(json.loads(line))
+            except json.JSONDecodeError as exc:
+                raise ValueError(f"Malformed baseline JSONL at {path} Line [{lineno}]") from exc
     return rows
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/_baseline_io.py` around lines 54 - 62, Wrap the json.loads call inside load_baselines with a try/except that catches json.JSONDecodeError, track the current line number using enumerate(path.read_text(...).splitlines(), start=1) so you can include the file path and line number in the error, and re-raise a clearer error (e.g., raise ValueError(f"Malformed JSON in {path} at line {lineno}: {e}") from e) so the original JSONDecodeError is preserved as the __cause__; update the rows.append(json.loads(line)) call in load_baselines accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/eval/_baseline_io.py`:
- Around line 65-75: The write_baselines function currently writes directly to
BASELINE_PATH which can corrupt the file if interrupted; change write_baselines
to write the JSONL content to a temporary file in the same directory (e.g.,
using tempfile or Path with a unique suffix) and then atomically replace the
target with os.replace (or Path.replace) so the final write is atomic; ensure
the temp file is opened with utf-8 and that you still write the
newline-terminated body and clean up the temp on error.
- Around line 54-62: Wrap the json.loads call inside load_baselines with a
try/except that catches json.JSONDecodeError, track the current line number
using enumerate(path.read_text(...).splitlines(), start=1) so you can include
the file path and line number in the error, and re-raise a clearer error (e.g.,
raise ValueError(f"Malformed JSON in {path} at line {lineno}: {e}") from e) so
the original JSONDecodeError is preserved as the __cause__; update the
rows.append(json.loads(line)) call in load_baselines accordingly.
In `@tests/eval/run_preflight_cost_eval.py`:
- Line 29: Remove the unused asyncio import from the top of the test file:
delete the line importing asyncio in run_preflight_cost_eval.py since
pytest-asyncio provides the event loop and no symbol from asyncio is referenced
(there are no functions/classes in this diff like run_preflight_cost_eval that
require asyncio directly).
- Around line 322-324: The p95 index calculation currently uses
int(len(timings_ms) * 0.95) which yields index 95 for _C3_SAMPLES=100 (the 96th
element); change the p95 calculation to use a zero-based 95th percentile index
(e.g., idx = math.ceil(len(timings_ms) * 0.95) - 1) and set p95 =
timings_ms[idx]; ensure you import math if needed. Keep the p50 logic as-is
(timings_ms[len(timings_ms) // 2]) or alternatively replace both with
statistics.quantiles/two-line percentile helper if you prefer standard behavior.
In `@tests/eval/test_cost_baseline_helpers.py`:
- Around line 161-165: The test uses a hard-coded JSON literal which can be
fragile due to spacing/ordering differences; update
test_count_tokens_json_matches_direct_serialize to produce the direct string via
json.dumps(payload) (or json.dumps(payload, separators=(',', ':'),
sort_keys=True) for a canonical form) and then call count_tokens on that string
so both sides use the same JSON serialization; reference functions:
test_count_tokens_json_matches_direct_serialize, count_tokens,
count_tokens_json, and json.dumps.
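For the test_cost_baseline_helpers.py item, a short sketch of why the hard-coded literal passes today and where it would stop passing. How count_tokens_json serializes its payload is not shown on this page, so the comparison below only illustrates json.dumps behavior under its default and compact settings.

```python
import json

payload = {"foo": "bar", "n": 42}

# Default separators are ", " and ": ", and dict insertion order is preserved.
default_form = json.dumps(payload)

# A canonical compact form drops the spaces and sorts keys.
compact_form = json.dumps(payload, separators=(",", ":"), sort_keys=True)

print(default_form)
print(compact_form)
print(default_form == compact_form)
```

If the helper ever switched to the compact form, the literal in the test would no longer match its output, which is the fragility the review describes. Building both sides with the same json.dumps arguments removes that coupling.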
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 19b3775b-88f2-43c6-bb46-3a151777aab0
📒 Files selected for processing (9)
.github/workflows/preflight-eval.yml
docs/preflight-failure-scenarios.md
pyproject.toml
tests/eval/_baseline_io.py
tests/eval/_synthetic_ledger.py
tests/eval/_token_count.py
tests/eval/cost_baseline.jsonl
tests/eval/run_preflight_cost_eval.py
tests/eval/test_cost_baseline_helpers.py
…+ runner
Stage 1-4 of issue #88: measurement infrastructure for the catalog's §C cost/latency baseline. Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight
C2/C3 use mocked ledger queries so the metric isolates handler-logic + serialization cost from SurrealDB I/O variance. The optimization directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all mutate handler logic, not the ledger.
Asymmetric regression rule: only flags increases, never improvements. ±20% relative threshold with absolute noise floors (10 tokens / 0.5ms) to absorb timer jitter at sub-ms latency scale. Re-record via BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.
The synthetic ledger generator is deterministic given (n_features, decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows forces re-record when the corpus changes. Token counter uses tiktoken cl100k_base, pinned in pyproject [test] extras to prevent silent count drift.
13 unit tests cover the regression rule + baseline IO directly. 5 runner tests produce the metrics on every PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
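The asymmetric rule in this commit message can be sketched as follows. This is one interpretation, not the PR's code: it assumes the 20% threshold applies only to increases and that an increase must exceed both the relative threshold and the absolute noise floor before it is flagged. The function name `is_regression` and its parameters are hypothetical.

```python
def is_regression(
    baseline: float,
    current: float,
    *,
    rel_threshold: float = 0.20,
    noise_floor: float = 0.0,
) -> bool:
    """Flag only increases that clear both the relative and absolute bars."""
    delta = current - baseline
    if delta <= 0:
        return False  # improvements and unchanged values never fail
    return delta > max(rel_threshold * baseline, noise_floor)


# Token metric with the 10-token floor from the commit message.
print(is_regression(1519, 1600, noise_floor=10))   # small increase, under 20%
print(is_regression(1519, 1900, noise_floor=10))   # increase above 20%
# Latency metric with the 0.5 ms floor: 0.08 ms -> 0.10 ms is +25% but within jitter.
print(is_regression(0.08, 0.10, noise_floor=0.5))
```

The noise floor matters most for C3: at sub-millisecond scale a relative threshold alone would flag ordinary timer jitter.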
Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape: 10 region matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)
The N=1000 number lands the §C concern empirically: ~800K tokens for a single bicameral.history() call fills 80% of Sonnet 4.6's 1M context before the skill reasons about anything. This is exactly the optimization target named in #58 (semantic prefilter, lazy/two-pass history, file-path → feature-group hint).
Linux baselines NOT included: the runner skips cleanly per-platform when no row exists. Record locally on a Linux host with BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up. Token counts are platform-independent (deterministic via tiktoken) but still tagged recorded_on=darwin for symmetry with C3 latency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
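The "80% of a 1M context" figure can be checked directly from the three C1 rows. The snippet below is a small worked calculation using only the numbers quoted in this commit message; the variable names are illustrative.

```python
CONTEXT_WINDOW = 1_000_000  # the 1M-token context cited in the commit message

# C1 baseline rows: number of features -> history payload tokens.
c1_tokens = {10: 7_574, 100: 79_025, 1000: 795_982}

for n_features, tokens in c1_tokens.items():
    per_feature = tokens / n_features
    share = tokens / CONTEXT_WINDOW
    print(f"N={n_features:>4}: {per_feature:7.1f} tokens/feature, {share:6.1%} of context")
```

The per-feature cost stays between roughly 757 and 796 tokens, so the payload grows close to linearly with N, and the N=1000 row works out to about 79.6% of the window.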
Adds the phase 3 step to the advisory preflight-eval workflow. continue-on-error: true so a phase 3 failure never blocks merge, the same contract as phase 1 + 2. The existing test-summary glob (test-results/*.xml) picks up the new junit file automatically.
Catalog implementation queue ticked: C1/C2/C3 all marked baselined, with a pointer to tests/eval/cost_baseline.jsonl. Regression rule description updated to reflect the asymmetric + noise-floor design.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
531067a to bbc5933
* chore: bump to v0.11.0 — CodeGenome Phase 1+2 adapter + identity records
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump to v0.12.0 — skill telemetry, extensible relay, reset wipe_mode
- Skill-level telemetry: replace per-tool timing with bicameral.skill_begin /
bicameral.skill_end bookend tools; record_skill_event replaces record_event
- Extensible relay: remove ALLOWED_TOOLS allowlist and strict EventPayload
interface; relay now validates only distinct_id + version + diagnostic numeric
invariant, all other fields pass through — future event types require no relay
redeploy; deployed to Cloudflare (v a6acec14)
- telemetry.py: add send_event() open primitive; record_skill_event is a thin
wrapper; setup_wizard consent UI updated to show new skill-level payload shape
- reset wipe_mode: ledger (default, DB rows only, server stays live) vs full
(deletes entire .bicameral/ dir including config + event files, reinits schema)
- ledger/adapter.py: wipe_all_rows now close-and-delete instead of row-by-row
traversal — simpler, faster, correct for embedded surrealkv
- events/team_adapter.py: add explicit wipe_all_rows that resets event watermark
- contracts.py: ResetResponse gains wipe_mode + bicameral_dir fields
- skills/bicameral-reset/SKILL.md: updated with two-mode table and confirmation
phrasing; full mode requires showing bicameral_dir before confirm
- tests: new test_reset_full_wipe_deletes_bicameral_dir (5/5 pass)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: v0.12.1 — rationale, error_class, and bicameral.feedback telemetry
- bicameral.skill_begin now accepts `rationale` (why the skill triggered)
stored in _skill_sessions dict alongside t0 and forwarded at skill_end
- bicameral.skill_end now accepts `error_class` enum (symbol_not_found,
collision_unresolved, drift_mislabeled, low_confidence_verdict,
ledger_empty, grounding_failed, user_abort, other) replacing the
boolean-only errored signal
- New bicameral.feedback tool: call when stuck — records {trying_to,
attempted, stuck_on} as agent_feedback events mapping to desync catalog
- All 8 major skills updated with Telemetry bookend sections showing
the skill_begin/skill_end pattern with rationale + error_class examples
- telemetry.record_skill_event extended with error_class and rationale kwargs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: delete stale bicameral-drift and bicameral-scan-branch skills
Both reference tools (bicameral.drift, bicameral.scan_branch) that no
longer exist in the server. Drift detection is handled by link_commit
+ auto-sync middleware + resolve_compliance.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: remove embedded worktree from index, ignore .claude/worktrees
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: pass --no-cache-dir to pip install in update handler
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: use pipx install --force for upgrades, fall back to pip
sys.executable -m pip fails on Homebrew Python (externally-managed-
environment). pipx is the standard install path and handles its own
venv correctly. pipx also doesn't support --no-cache-dir so that flag
is dropped from the pip fallback path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: bicameral-mcp reset CLI — questionary wizard before wiping
Adds a `bicameral reset` subcommand that:
1. Prompts for wipe mode (ledger vs full) via questionary select
2. Shows a dry-run summary (cursor count, replay plan, bicameral_dir
for full mode with a ⚠️ warning)
3. Asks for explicit confirmation before calling handle_reset
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: bicameral-mcp config CLI — questionary wizard for config.yaml
Adds a `bicameral config` subcommand that:
1. Reads current config.yaml values as defaults
2. Prompts for mode, guided, telemetry via questionary selects
with the current value pre-selected
3. Writes updated config.yaml
4. Reinstalls skills and hooks so changes take effect immediately
Replaces the LLM-in-chat text menu in the bicameral-config skill.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: bicameral-config skill uses AskUserQuestion for all three settings
Replaces text-based [1/2] menus with a single AskUserQuestion call
covering mode, guided, and telemetry — all in one interactive prompt
within the Claude session.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump to v0.12.2 — CLI wizards + telemetry quality loop
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: add Dependabot for weekly pip dependency updates
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: v0.13.0 — gate telemetry schema, AskUserQuestion ground truth, liberal ingest filter
Telemetry schema (all skills):
- g{N}_ prefix convention across all gate diagnostic fields (G2/G3/G6 in ingest,
G9/G10/G11 in preflight, G11 in capture-corrections)
- skill_begin/skill_end guarded: only emit if BICAMERAL_TELEMETRY is enabled
- g{N}_user_overrode as universal ground-truth signal at every interactive gate
AskUserQuestion ground truth wiring:
- G2 Step 1.5 (ingest): AskUserQuestion for borderline Gate1/Gate2 drops,
batched in groups of 4; guarded by guided_mode
- G10 Step 5.5 (preflight): AskUserQuestion after surfaced block to dismiss
irrelevant findings; guarded by guided_mode; populates g10_user_overrode
- G11 Steps 6-7 (capture-corrections): replaces freeform Y/n with
AskUserQuestion, batched in groups of 4 for all correction counts
Liberal ingest filter:
- Removed aspirational, hedged conditional, and parked/deferred from hard-exclude;
these now flow through level classification and gate filters as speculative proposals
- Ratification is the team's judgment layer, not the extraction filter
- Updated Example 1: now extracts 3 speculative proposals instead of 0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: bump RECOMMENDED_VERSION to 0.13.0
Was left at 0.12.2 — update handler checks this file to detect available upgrades.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: surface pending decisions when sync no-ops on same commit
After ingest, `bicameral sync` could return 'already_synced' with zero
compliance checks when HEAD hadn't moved — leaving newly-ingested decisions
stuck at `pending` indefinitely.
Two-part fix:
1. `ledger/adapter.py` `ingest_commit`: in the `already_synced` early-return,
query `get_pending_decisions_with_regions()` and include any pending
decisions as `pending_compliance_checks` in the response.
2. `handlers/link_commit.py` `invalidate_sync_cache` + new
`sync_middleware.invalidate_process_cache()`: after any mutation (ingest,
update, reset), clear the process-level `_LAST_SYNCED_SHA` so that
`ensure_ledger_synced` runs a fresh sync on the next tool call even when
HEAD hasn't moved.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump to v0.13.1 — fix sync no-op on same commit
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: ratify prompt fires last, after all decisions printed (ingest step 7)
Previously "after ingest" was ambiguous — LLM could fire the ratify
AskUserQuestion immediately after bicameral.ingest returned, before the
report (step 4), brief (step 5), and gap-judge (step 6) were shown.
Now step 7 is explicit:
- Must be the last user-facing output of the ingest flow
- Multi-segment ingests ratify once at the end of the roll-up, not per segment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump to v0.13.2 — ratify prompt ordering fix
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Preflight eval: §C cost/latency baseline (#90)
* test(eval): cost-baseline harness — synthetic ledger + token counter + runner
Stage 1-4 of issue #88 — measurement infrastructure for the catalog's
§C cost/latency baseline. Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight
C2/C3 use mocked ledger queries so the metric isolates handler-logic +
serialization cost from SurrealDB I/O variance. The optimization
directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all
mutate handler logic, not the ledger.
Asymmetric regression rule: only flags increases, never improvements.
±20% relative threshold with absolute noise floors (10 tokens / 0.5ms)
to absorb timer jitter at sub-ms latency scale. Re-record via
BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.
The synthetic ledger generator is deterministic given (n_features,
decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows
forces re-record when the corpus changes. Token counter uses tiktoken
cl100k_base — pinned in pyproject [test] extras to prevent silent
count drift.
13 unit tests cover the regression rule + baseline IO directly. 5
runner tests produce the metrics on every PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
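The determinism contract described in the commit message above (same n_features, decisions_per_feature, and seed give the same payload) can be sketched as below. The real tests/eval/_synthetic_ledger.py is not shown on this page, so the field names, the "sketch-1" version tag, and the function names are all hypothetical.

```python
import json
import random

GENERATOR_VERSION = "sketch-1"  # hypothetical tag; the real value is not shown here


def generate_ledger(n_features: int, decisions_per_feature: int, seed: int) -> dict:
    """Build a synthetic ledger whose content depends only on the three inputs."""
    rng = random.Random(seed)  # local RNG, so global random state is never touched
    features = []
    for f in range(n_features):
        decisions = [
            {"id": f"f{f}-d{d}", "weight": rng.randint(0, 999)}
            for d in range(decisions_per_feature)
        ]
        features.append({"feature": f"feature-{f}", "decisions": decisions})
    return {"generator_version": GENERATOR_VERSION, "features": features}


def serialize(ledger: dict) -> str:
    """Stable serialization so equal ledgers produce byte-identical payloads."""
    return json.dumps(ledger, sort_keys=True, ensure_ascii=False)


first = serialize(generate_ledger(3, 2, seed=7))
second = serialize(generate_ledger(3, 2, seed=7))
print(first == second)
```

Embedding the version tag in the output is one way a baseline row could detect that the corpus changed and needs re-recording.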
* test(eval): commit initial Darwin cost baselines
Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape — 10 region
matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)
The N=1000 number lands the §C concern empirically: ~800K tokens for a
single bicameral.history() call fills 80% of Sonnet 4.6's 1M context
before the skill reasons about anything. This is exactly the
optimization target named in #58 (semantic prefilter, lazy/two-pass
history, file-path → feature-group hint).
Linux baselines NOT included — the runner skips cleanly per-platform
when no row exists. Record locally on a Linux host with
BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up.
Token counts are platform-independent (deterministic via tiktoken) but
still tagged recorded_on=darwin for symmetry with C3 latency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci+docs(preflight-eval): wire phase 3 cost/latency step + tick §C
Adds the phase 3 step to the advisory preflight-eval workflow.
continue-on-error: true so a phase 3 failure never blocks merge — same
contract as phase 1 + 2. The existing test-summary glob (test-results/
*.xml) picks up the new junit file automatically.
Catalog implementation queue ticked: C1/C2/C3 all marked baselined,
with a pointer to tests/eval/cost_baseline.jsonl. Regression rule
description updated to reflect the asymmetric + noise-floor design.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: enforce exact diagnostic field names in ingest + preflight telemetry
LLMs were substituting natural-language names (grounded, ungrounded,
channels_read, compliance_resolved) for the required g2_*/g3_*/g6_* prefixed
names. The events landed in PostHog but fell through every dashboard panel
because the queries filter on the prefixed names.
Added explicit ⚠ warning with inline NOT comments (e.g. "# NOT 'grounded'")
to both bicameral-ingest and bicameral-preflight skill_end sections.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: enforce skill diagnostic schema via Pydantic in skill_end handler
Previously diagnostic was an open object — LLMs sent improvised field names
(grounded, ungrounded, channels_read) that fell through every dashboard filter.
Now:
- IngestDiagnostic and PreflightDiagnostic Pydantic models in contracts.py
with extra="forbid" enumerate all valid g2_*/g3_*/g6_*/g9_*/g10_*/g11_* fields
- skill_end handler validates against the per-skill model; unknown fields are
stripped from the PostHog payload and echoed back in diagnostic_warning so
the LLM immediately sees what it sent wrong on the same call
- inputSchema description enumerates all valid field names so the LLM has
them visible at call time
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump to v0.13.3 — Pydantic diagnostic enforcement + telemetry field fix
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: jinhongkuan <kuanjh123@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Silong Tan <silongtan@outlook.com>
…dback) (#96) * chore: bump to v0.11.0 — CodeGenome Phase 1+2 adapter + identity records Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: bump to v0.12.0 — skill telemetry, extensible relay, reset wipe_mode - Skill-level telemetry: replace per-tool timing with bicameral.skill_begin / bicameral.skill_end bookend tools; record_skill_event replaces record_event - Extensible relay: remove ALLOWED_TOOLS allowlist and strict EventPayload interface; relay now validates only distinct_id + version + diagnostic numeric invariant, all other fields pass through — future event types require no relay redeploy; deployed to Cloudflare (v a6acec14) - telemetry.py: add send_event() open primitive; record_skill_event is a thin wrapper; setup_wizard consent UI updated to show new skill-level payload shape - reset wipe_mode: ledger (default, DB rows only, server stays live) vs full (deletes entire .bicameral/ dir including config + event files, reinits schema) - ledger/adapter.py: wipe_all_rows now close-and-delete instead of row-by-row traversal — simpler, faster, correct for embedded surrealkv - events/team_adapter.py: add explicit wipe_all_rows that resets event watermark - contracts.py: ResetResponse gains wipe_mode + bicameral_dir fields - skills/bicameral-reset/SKILL.md: updated with two-mode table and confirmation phrasing; full mode requires showing bicameral_dir before confirm - tests: new test_reset_full_wipe_deletes_bicameral_dir (5/5 pass) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: v0.12.1 — rationale, error_class, and bicameral.feedback telemetry - bicameral.skill_begin now accepts `rationale` (why the skill triggered) stored in _skill_sessions dict alongside t0 and forwarded at skill_end - bicameral.skill_end now accepts `error_class` enum (symbol_not_found, collision_unresolved, drift_mislabeled, low_confidence_verdict, ledger_empty, grounding_failed, user_abort, other) replacing the boolean-only errored signal - New 
bicameral.feedback tool: call when stuck — records {trying_to, attempted, stuck_on} as agent_feedback events mapping to desync catalog - All 8 major skills updated with Telemetry bookend sections showing the skill_begin/skill_end pattern with rationale + error_class examples - telemetry.record_skill_event extended with error_class and rationale kwargs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: delete stale bicameral-drift and bicameral-scan-branch skills Both reference tools (bicameral.drift, bicameral.scan_branch) that no longer exist in the server. Drift detection is handled by link_commit + auto-sync middleware + resolve_compliance. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: remove embedded worktree from index, ignore .claude/worktrees Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: pass --no-cache-dir to pip install in update handler Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: use pipx install --force for upgrades, fall back to pip sys.executable -m pip fails on Homebrew Python (externally-managed- environment). pipx is the standard install path and handles its own venv correctly. pipx also doesn't support --no-cache-dir so that flag is dropped from the pip fallback path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: bicameral-mcp reset CLI — questionary wizard before wiping Adds a `bicameral reset` subcommand that: 1. Prompts for wipe mode (ledger vs full) via questionary select 2. Shows a dry-run summary (cursor count, replay plan, bicameral_dir for full mode with a⚠️ warning) 3. Asks for explicit confirmation before calling handle_reset Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: bicameral-mcp config CLI — questionary wizard for config.yaml Adds a `bicameral config` subcommand that: 1. Reads current config.yaml values as defaults 2. Prompts for mode, guided, telemetry via questionary selects with the current value pre-selected 3. 
Writes updated config.yaml
4. Reinstalls skills and hooks so changes take effect immediately

Replaces the LLM-in-chat text menu in the bicameral-config skill.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: bicameral-config skill uses AskUserQuestion for all three settings

Replaces text-based [1/2] menus with a single AskUserQuestion call covering mode, guided, and telemetry, all in one interactive prompt within the Claude session.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.12.2 - CLI wizards + telemetry quality loop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: add Dependabot for weekly pip dependency updates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: v0.13.0 - gate telemetry schema, AskUserQuestion ground truth, liberal ingest filter

Telemetry schema (all skills):
- g{N}_ prefix convention across all gate diagnostic fields (G2/G3/G6 in ingest, G9/G10/G11 in preflight, G11 in capture-corrections)
- skill_begin/skill_end guarded: only emit if BICAMERAL_TELEMETRY is enabled
- g{N}_user_overrode as universal ground-truth signal at every interactive gate

AskUserQuestion ground truth wiring:
- G2 Step 1.5 (ingest): AskUserQuestion for borderline Gate1/Gate2 drops, batched in groups of 4; guarded by guided_mode
- G10 Step 5.5 (preflight): AskUserQuestion after surfaced block to dismiss irrelevant findings; guarded by guided_mode; populates g10_user_overrode
- G11 Steps 6-7 (capture-corrections): replaces freeform Y/n with AskUserQuestion, batched in groups of 4 for all correction counts

Liberal ingest filter:
- Removed aspirational, hedged conditional, and parked/deferred from hard-exclude; these now flow through level classification and gate filters as speculative proposals
- Ratification is the team's judgment layer, not the extraction filter
- Updated Example 1: now extracts 3 speculative proposals instead of 0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: bump RECOMMENDED_VERSION to 0.13.0

Was left at 0.12.2; the update handler checks this file to detect available upgrades.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: surface pending decisions when sync no-ops on same commit

After ingest, `bicameral sync` could return 'already_synced' with zero compliance checks when HEAD hadn't moved, leaving newly-ingested decisions stuck at `pending` indefinitely.

Two-part fix:
1. `ledger/adapter.py` `ingest_commit`: in the `already_synced` early-return, query `get_pending_decisions_with_regions()` and include any pending decisions as `pending_compliance_checks` in the response.
2. `handlers/link_commit.py` `invalidate_sync_cache` + new `sync_middleware.invalidate_process_cache()`: after any mutation (ingest, update, reset), clear the process-level `_LAST_SYNCED_SHA` so that `ensure_ledger_synced` runs a fresh sync on the next tool call even when HEAD hasn't moved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.1 - fix sync no-op on same commit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: ratify prompt fires last, after all decisions printed (ingest step 7)

Previously "after ingest" was ambiguous: the LLM could fire the ratify AskUserQuestion immediately after bicameral.ingest returned, before the report (step 4), brief (step 5), and gap-judge (step 6) were shown.

Now step 7 is explicit:
- Must be the last user-facing output of the ingest flow
- Multi-segment ingests ratify once at the end of the roll-up, not per segment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.2 - ratify prompt ordering fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Preflight eval: §C cost/latency baseline (#90)

* test(eval): cost-baseline harness - synthetic ledger + token counter + runner

Stage 1-4 of issue #88: measurement infrastructure for the catalog's §C cost/latency baseline.
Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight

C2/C3 use mocked ledger queries so the metric isolates handler-logic + serialization cost from SurrealDB I/O variance. The optimization directions in #58 (semantic prefilter, lazy/two-pass history, etc.) all mutate handler logic, not the ledger.

Asymmetric regression rule: only flags increases, never improvements. ±20% relative threshold with absolute noise floors (10 tokens / 0.5ms) to absorb timer jitter at sub-ms latency scale. Re-record via BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.

The synthetic ledger generator is deterministic given (n_features, decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows forces re-record when the corpus changes. Token counter uses tiktoken cl100k_base, pinned in pyproject [test] extras to prevent silent count drift.

13 unit tests cover the regression rule + baseline IO directly. 5 runner tests produce the metrics on every PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(eval): commit initial Darwin cost baselines

Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape: 10 region matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)

The N=1000 number lands the §C concern empirically: ~800K tokens for a single bicameral.history() call fills 80% of Sonnet 4.6's 1M context before the skill reasons about anything. This is exactly the optimization target named in #58 (semantic prefilter, lazy/two-pass history, file-path → feature-group hint).

Linux baselines NOT included: the runner skips cleanly per-platform when no row exists. Record locally on a Linux host with BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up. Token counts are platform-independent (deterministic via tiktoken) but still tagged recorded_on=darwin for symmetry with C3 latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci+docs(preflight-eval): wire phase 3 cost/latency step + tick §C

Adds the phase 3 step to the advisory preflight-eval workflow. continue-on-error: true so a phase 3 failure never blocks merge, the same contract as phase 1 + 2. The existing test-summary glob (test-results/*.xml) picks up the new junit file automatically.

Catalog implementation queue ticked: C1/C2/C3 all marked baselined, with a pointer to tests/eval/cost_baseline.jsonl. Regression rule description updated to reflect the asymmetric + noise-floor design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: enforce exact diagnostic field names in ingest + preflight telemetry

LLMs were substituting natural-language names (grounded, ungrounded, channels_read, compliance_resolved) for the required g2_*/g3_*/g6_* prefixed names. The events landed in PostHog but fell through every dashboard panel because the queries filter on the prefixed names.

Added explicit ⚠ warning with inline NOT comments (e.g. "# NOT 'grounded'") to both bicameral-ingest and bicameral-preflight skill_end sections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: enforce skill diagnostic schema via Pydantic in skill_end handler

Previously diagnostic was an open object: LLMs sent improvised field names (grounded, ungrounded, channels_read) that fell through every dashboard filter.
Now:
- IngestDiagnostic and PreflightDiagnostic Pydantic models in contracts.py with extra="forbid" enumerate all valid g2_*/g3_*/g6_*/g9_*/g10_*/g11_* fields
- skill_end handler validates against the per-skill model; unknown fields are stripped from the PostHog payload and echoed back in diagnostic_warning so the LLM immediately sees what it sent wrong on the same call
- inputSchema description enumerates all valid field names so the LLM has them visible at call time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.3 - Pydantic diagnostic enforcement + telemetry field fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: VHS demo - 5 core use case flows (ingest, preflight, sync, history)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: remove demo directory

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump to v0.13.4 - branch-scoped ephemeral bind + stale hash repair

B9: handlers/bind.py used authoritative_sha for all file checks and hash computation regardless of branch. On feature branches this caused (1) spurious rejection of branch-local files and (2) phantom "drifted" status after resolve_compliance because bind stored H_main while link_commit computed H_branch. Fix: detect _is_ephemeral_commit and use head_sha as effective_ref.

B10: ingest_commit's already_synced early-return left stale "reflected" status when returning to main after feature-branch bind work. The repair path in the already_synced branch now uses get_regions_with_ephemeral_verdicts (indexed lookup via idx_cc_ephemeral) to find only suspect regions, updates their hashes to the authoritative content, and re-projects decision status. Two-pass approach deduplicates project_decision_status calls per decision.

Tests: E18-E22 added (22/22 ephemeral/authoritative scenarios pass).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: set RECOMMENDED_VERSION to 0.13.4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(eval): real-ledger seeder for cost/latency baselines

Stage 6 of issue #88 path-3 rework. Adds `tests/eval/_seed_ledger.py`, which translates a synthetic HistoryResponse-shaped dict (from the existing generator) into real SurrealDB writes via `adapter.ingest_payload`, the production ingestion path. Uses the synthetic-repo fallback (repo path not on disk → empty content_hash) so seeding works without git fixtures.

Status overrides post-ingest via `update_decision_status` to match the synthetic generator's intended distribution (70% reflected / 20% drifted / 10% other); this bypasses derive_status since there's no real file content.

Three new unit tests:
- N=10 seeds 30 decisions, ledger contains exactly that count
- N=100 status distribution roughly matches synthetic generator's
- Empty input returns 0

Stage 7 will use this seeder to run C2 + C3 against real seeded ledgers instead of mocked queries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(eval): C2/C3 against real seeded ledger, parametrized by N=10/100/1000

Stage 7 of issue #88 path-3 rework. Addresses Jin's "test not very useful if it doesnt capture updates" feedback by switching C2 and C3 from mocked ledger queries to a real `memory://` SurrealDB seeded with N synthetic features. The handler now executes the real SurrealDB query path on every measurement, the same code the developer hits in production.

Real-I/O baselines (Darwin local, Python 3.12 + SurrealDB 2.x):

| N | C2 tokens / bytes | C3 p50 / p95 |
|---|---|---|
| 10 | 566 / 2,303 | 2.5ms / 3.0ms |
| 100 | 571 / 2,303 | 14.8ms / 15.9ms |
| 1000 | 575 / 2,303 | 138.8ms / 141.7ms |

C3 latency at N=1000 is ~1700× the previous mocked baseline (138ms vs 0.08ms). That's the user-experience-relevant signal, and exactly the regression target an optimization PR (#58 directions: semantic prefilter, lazy/two-pass history) should reduce.

Platform tagging:
- C1: `recorded_on=any` (token counts are deterministic across OSes)
- C2: `recorded_on=any` (response shape is deterministic given same seed; noise floor absorbs sync_metrics timing variance)
- C3: per-platform `darwin` (real I/O latency varies meaningfully by host; Linux baselines must be recorded separately on a Linux runner)

Schema additions:
- `_baseline_io.ANY_PLATFORM` sentinel: a row with this value matches every host. `find_baseline` now treats `recorded_on=any` rows as matches regardless of caller's platform.
- `_record_or_assert(platform_agnostic=True)` records and matches with the sentinel.

Implementation notes:
- C2/C3 each spin up a fresh adapter per parametrized run, so there is no cross-test state and no singleton reset is needed.
- file_paths chosen from synthetic decisions via `_pick_grounded_paths` to guarantee region-anchored matches (response fires non-trivially).
- Seeding cost: ~62s at N=1000 (3000 ingest_payload mappings through the real ingest path + status updates). Total cost-eval runtime: ~2m30s. Acceptable for advisory CI; non-blocking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(catalog): refresh §C wording for real-ledger C2/C3

Stage 8 of issue #88 path-3 rework. Updates the catalog's §C entries to reflect that C2 + C3 now measure against a real seeded ledger, not mocked queries. Adds the real-ledger seeder to the implementation queue ticked items and clarifies the per-platform vs platform-agnostic split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: jinhongkuan <kuanjh123@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: WulfForge <krknapp@gmail.com>
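The C3 rows in the commit notes above report p50/p95 handler latency. A minimal sketch of one way to collect such percentiles with the standard library only. `measure_latency_ms`, the warmup and run counts, and the stand-in workload are hypothetical; the real runner times the preflight handler against a seeded ledger, and its sampling details may differ.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(fn: Callable[[], object], runs: int = 50, warmup: int = 5) -> Dict[str, float]:
    """Time fn() repeatedly and return p50/p95 wall-clock latency in milliseconds."""
    # Warm-up calls are discarded so one-time setup cost does not skew the samples.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


# Stand-in workload (assumption): any zero-argument callable works here.
result = measure_latency_ms(lambda: sum(range(1000)), runs=20)
print(f"p50={result['p50']:.3f}ms p95={result['p95']:.3f}ms")
```

Reporting p95 alongside p50 keeps an occasional slow call visible without letting a single outlier define the baseline.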
Summary
Implements #88 — measurement infrastructure for the catalog's §C cost/latency baseline. Prerequisite for any optimization PR against #58 to have a regression target.
Scope evolved during review: the original C3 latency measurement used mocked ledger queries, which Jin correctly flagged as not capturing real updates. Reworked to a real `memory://` SurrealDB seeded with synthetic data through the production `ingest_payload` path, so every C2/C3 measurement now exercises the real SurrealDB query plan, handler iteration, and serialization.

Metrics + baselines
- `bicameral.history()` payload tokens
- `bicameral.preflight()` response tokens
- `bicameral.preflight()` response bytes

Two punch lines from the data:
- At N=1000, a single `bicameral.history()` call (~800K tokens) fills 80% of Sonnet 4.6's 1M context before the skill reasons about anything. The §C concern is concrete and material.

What this test actually catches
- `HistoryResponse` → C1 token count grows → flag
- `PreflightResponse` shape → C2 byte count grows → flag

What this test explicitly does NOT catch
The three together form a measurement strategy: phase 3 is the synthetic baseline that holds optimization claims accountable; #65 + #66 cover what synthetic doesn't.
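For reference, the "→ flag" decision follows the asymmetric rule from the harness commit: only increases flag, with a ±20% relative threshold and an absolute noise floor (10 tokens / 0.5ms). A minimal sketch of that rule, not the code in the harness itself. The function name, its signature, and the way the two thresholds combine (an increase must clear both) are assumptions for illustration.

```python
def is_regression(
    baseline: float,
    current: float,
    *,
    rel_threshold: float = 0.20,  # 20% relative threshold, per the commit notes
    noise_floor: float = 10.0,    # 10 tokens; the notes use 0.5 for millisecond metrics
) -> bool:
    """Return True only when current is meaningfully worse (larger) than baseline."""
    increase = current - baseline
    if increase <= 0:
        return False  # asymmetric: improvements and no-ops never flag
    if increase <= noise_floor:
        return False  # too small in absolute terms to distinguish from jitter
    # Assumption: the increase must also exceed the relative threshold.
    return increase > rel_threshold * baseline


# Token metric: +30% on a 1,000-token payload flags, +10% does not.
print(is_regression(1000, 1300))
print(is_regression(1000, 1100))
# Latency metric: a 0.22ms rise on a sub-ms baseline stays under the 0.5ms floor.
print(is_regression(0.08, 0.30, noise_floor=0.5))
```

The noise floor matters most at small scales: without it, a 0.08ms baseline would flag on ordinary timer jitter.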
Architecture
- `tests/eval/_synthetic_ledger.py`: `HistoryResponse`-shaped dicts
- `tests/eval/_seed_ledger.py`: `adapter.ingest_payload`
- `tests/eval/_token_count.py`: `cl100k_base` wrapper
- `tests/eval/_baseline_io.py`
- `tests/eval/run_preflight_cost_eval.py`
- `tests/eval/test_cost_baseline_helpers.py`
- `tests/eval/cost_baseline.jsonl`
- `.github/workflows/preflight-eval.yml`
- `docs/preflight-failure-scenarios.md`
- `pyproject.toml`: `[test]` extras

Recording flow
    BICAMERAL_EVAL_RECORD_BASELINE=1 pytest tests/eval/run_preflight_cost_eval.py
    git add tests/eval/cost_baseline.jsonl
    git commit -m "test(eval): re-record C* baselines after <intentional change>"

The harness regenerates rows for the current platform; the diff is reviewable in the PR. C1/C2 use platform-agnostic baselines (token counts and response bytes are deterministic across OSes). C3 latency is per-platform: Linux baselines need to be recorded separately on a Linux runner before Linux CI can validate C3.
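A rough sketch of how the platform-agnostic vs per-platform lookup could work. `ANY_PLATFORM`, `find_baseline`, and `recorded_on` are named in the commit notes; the `metric` and `value` fields, the metric labels, and the function signature here are hypothetical and may not match `tests/eval/_baseline_io.py`.

```python
import json
from typing import List, Optional

ANY_PLATFORM = "any"  # sentinel: a row tagged with this matches every host

# Two illustrative rows; the field names other than recorded_on are assumptions.
SAMPLE_JSONL = """\
{"metric": "C1[N=10]", "value": 7574, "recorded_on": "any"}
{"metric": "C3[N=10].p50_ms", "value": 2.5, "recorded_on": "darwin"}
"""


def load_rows(text: str) -> List[dict]:
    """Parse JSONL text into a list of baseline rows, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_baseline(rows: List[dict], metric: str, platform: str) -> Optional[dict]:
    """Return the first row for metric recorded on this platform or on 'any'."""
    for row in rows:
        if row["metric"] == metric and row["recorded_on"] in (ANY_PLATFORM, platform):
            return row
    return None  # caller can skip the check with re-record instructions


rows = load_rows(SAMPLE_JSONL)
c1_on_linux = find_baseline(rows, "C1[N=10]", "linux")
c3_on_linux = find_baseline(rows, "C3[N=10].p50_ms", "linux")
c3_on_darwin = find_baseline(rows, "C3[N=10].p50_ms", "darwin")
print(c1_on_linux, c3_on_linux, c3_on_darwin)
```

Under these assumptions a Linux host still validates C1 against the `any` row, while the C3 lookup returns `None` until a Linux row is committed, which is the skip behavior described for CI.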
CI behavior
- `preflight-eval.yml`, advisory (`continue-on-error: true`)
- C1/C2 match against `recorded_on=any` rows; C3 skips with re-record instructions until a Linux baseline is added

Test plan
- `pytest tests/eval/run_preflight_cost_eval.py`: 9 passed (3× C1, 3× C2, 3× C3) against new real-ledger baselines
- `pytest tests/eval/test_cost_baseline_helpers.py`: 35 passed (helpers + seeder + regression rule + IO)

Closes / unblocks
🤖 Generated with Claude Code