
chore: merge main into dev (v0.13.3 telemetry refactor → dev)#94

Merged
jinhongkuan merged 23 commits into BicameralAI:dev from Knapp-Kevin:sync/dev-merge-main
Apr 29, 2026

Conversation

@Knapp-Kevin

Collaborator

Summary

Brings 22 commits from `main` (up to v0.13.3) onto `dev`. Per maintainer policy, dev should always be ahead of main; this merge restores that invariant after parallel work landed on both branches.

What dev is missing from main (this PR brings in)

| Version | Highlights |
| --- | --- |
| v0.11.0 | CodeGenome Phase 1+2 release artifacts |
| v0.12.0 | Skill-level telemetry (record_skill_event), extensible relay, reset wipe_mode |
| v0.12.1 | bicameral.feedback tool, error_class enum, rationale field |
| v0.12.2 | questionary CLI wizards (config, reset) |
| v0.13.0 | Gate telemetry schema (g{N}_ prefix), AskUserQuestion ground truth, liberal ingest filter (speculative proposals) |
| v0.13.1 | Pending decisions surfaced when sync no-ops |
| v0.13.2 | Ratify prompt ordering fix |
| v0.13.3 | Pydantic diagnostic enforcement + telemetry field fix |

What dev already has (preserved through the merge)

Conflict resolution

One conflict in `contracts.py` — both branches modified the pydantic import line. Resolved by combining: `from pydantic import BaseModel, ConfigDict, Field` (dev needed `Field` for line 278; main added `ConfigDict` for the new `IngestDiagnostic` / `PreflightDiagnostic` models). All new diagnostic model classes from main are preserved.

Test results

Full suite post-merge on Windows: 535 pass, 9 fail, 3 skip, 1 xfail.

The 9 failures are pre-existing issues unrelated to the merge:

  • 2× UnicodeDecodeError (cp1252 → UTF-8 markdown read) — separate Windows encoding issue
  • 4× AssertionError (test_bind, test_desync_scenarios, test_sync_middleware, test_v0420_history); these fall under #70's AssertionError cluster
  • 3× attribute errors in test_v0420_history — same cluster

Improvement vs pre-merge dev: the Windows subprocess fix from #84 (already on dev) eliminated the WinError 267 cluster, bringing pre-merge dev from 26 failures to 9.

Why this is a merge (not a rebase)

Both branches have merge commits from prior PRs (#73, #80#84, #91 on dev; many on main). Rebasing dev onto main would flatten those merge commits and rewrite the commit hashes that those PRs point to. A merge preserves history without breaking external references.

🤖 Generated with Claude Code

jinhongkuan and others added 23 commits April 28, 2026 15:04
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e_mode

- Skill-level telemetry: replace per-tool timing with bicameral.skill_begin /
  bicameral.skill_end bookend tools; record_skill_event replaces record_event
- Extensible relay: remove ALLOWED_TOOLS allowlist and strict EventPayload
  interface; relay now validates only distinct_id + version + diagnostic numeric
  invariant, all other fields pass through — future event types require no relay
  redeploy; deployed to Cloudflare (v a6acec14)
- telemetry.py: add send_event() open primitive; record_skill_event is a thin
  wrapper; setup_wizard consent UI updated to show new skill-level payload shape
- reset wipe_mode: ledger (default, DB rows only, server stays live) vs full
  (deletes entire .bicameral/ dir including config + event files, reinits schema)
- ledger/adapter.py: wipe_all_rows now close-and-delete instead of row-by-row
  traversal — simpler, faster, correct for embedded surrealkv
- events/team_adapter.py: add explicit wipe_all_rows that resets event watermark
- contracts.py: ResetResponse gains wipe_mode + bicameral_dir fields
- skills/bicameral-reset/SKILL.md: updated with two-mode table and confirmation
  phrasing; full mode requires showing bicameral_dir before confirm
- tests: new test_reset_full_wipe_deletes_bicameral_dir (5/5 pass)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
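
The two reset modes described in the commit above can be sketched roughly as follows. This is an illustration only: the `reset` function, the `ledger.db` file name, and the empty-file stand-in for "reinit schema" are hypothetical and are not the project's actual `handle_reset` implementation.

```python
import shutil
from pathlib import Path


def reset(bicameral_dir: Path, wipe_mode: str = "ledger") -> dict:
    """Illustrative reset: 'ledger' clears ledger data only, 'full' removes the whole dir."""
    if wipe_mode not in ("ledger", "full"):
        raise ValueError(f"unknown wipe_mode: {wipe_mode!r}")

    ledger_file = bicameral_dir / "ledger.db"  # hypothetical file name

    if wipe_mode == "full":
        # Full wipe: config and event files go too, then the directory is recreated.
        shutil.rmtree(bicameral_dir, ignore_errors=True)
        bicameral_dir.mkdir(parents=True)

    # Stand-in for "delete rows" / "reinit schema": start from an empty ledger file.
    ledger_file.write_text("")

    # Mirrors the two fields the commit adds to ResetResponse.
    return {"wipe_mode": wipe_mode, "bicameral_dir": str(bicameral_dir)}
```

In this sketch, `ledger` mode leaves everything else in the directory untouched, which is why the full mode is the one that should show `bicameral_dir` before asking for confirmation.
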
- bicameral.skill_begin now accepts `rationale` (why the skill triggered)
  stored in _skill_sessions dict alongside t0 and forwarded at skill_end
- bicameral.skill_end now accepts `error_class` enum (symbol_not_found,
  collision_unresolved, drift_mislabeled, low_confidence_verdict,
  ledger_empty, grounding_failed, user_abort, other) replacing the
  boolean-only errored signal
- New bicameral.feedback tool: call when stuck — records {trying_to,
  attempted, stuck_on} as agent_feedback events mapping to desync catalog
- All 8 major skills updated with Telemetry bookend sections showing
  the skill_begin/skill_end pattern with rationale + error_class examples
- telemetry.record_skill_event extended with error_class and rationale kwargs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both reference tools (bicameral.drift, bicameral.scan_branch) that no
longer exist in the server. Drift detection is handled by link_commit
+ auto-sync middleware + resolve_compliance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sys.executable -m pip fails on Homebrew Python (externally-managed-
environment). pipx is the standard install path and handles its own
venv correctly. pipx also doesn't support --no-cache-dir so that flag
is dropped from the pip fallback path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a `bicameral reset` subcommand that:
1. Prompts for wipe mode (ledger vs full) via questionary select
2. Shows a dry-run summary (cursor count, replay plan, bicameral_dir
   for full mode with a ⚠️ warning)
3. Asks for explicit confirmation before calling handle_reset

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a `bicameral config` subcommand that:
1. Reads current config.yaml values as defaults
2. Prompts for mode, guided, telemetry via questionary selects
   with the current value pre-selected
3. Writes updated config.yaml
4. Reinstalls skills and hooks so changes take effect immediately

Replaces the LLM-in-chat text menu in the bicameral-config skill.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces text-based [1/2] menus with a single AskUserQuestion call
covering mode, guided, and telemetry — all in one interactive prompt
within the Claude session.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…liberal ingest filter

Telemetry schema (all skills):
- g{N}_ prefix convention across all gate diagnostic fields (G2/G3/G6 in ingest,
  G9/G10/G11 in preflight, G11 in capture-corrections)
- skill_begin/skill_end guarded: only emit if BICAMERAL_TELEMETRY is enabled
- g{N}_user_overrode as universal ground-truth signal at every interactive gate

AskUserQuestion ground truth wiring:
- G2 Step 1.5 (ingest): AskUserQuestion for borderline Gate1/Gate2 drops,
  batched in groups of 4; guarded by guided_mode
- G10 Step 5.5 (preflight): AskUserQuestion after surfaced block to dismiss
  irrelevant findings; guarded by guided_mode; populates g10_user_overrode
- G11 Steps 6-7 (capture-corrections): replaces freeform Y/n with
  AskUserQuestion, batched in groups of 4 for all correction counts

Liberal ingest filter:
- Removed aspirational, hedged conditional, and parked/deferred from hard-exclude;
  these now flow through level classification and gate filters as speculative proposals
- Ratification is the team's judgment layer, not the extraction filter
- Updated Example 1: now extracts 3 speculative proposals instead of 0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
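
The "batched in groups of 4" behaviour for AskUserQuestion prompts amounts to simple chunking. A sketch, with `batch_questions` as a hypothetical helper name:

```python
from typing import List, Sequence


def batch_questions(questions: Sequence[str], size: int = 4) -> List[List[str]]:
    """Split questions into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(questions[i:i + size]) for i in range(0, len(questions), size)]
```

With ten borderline items this yields three prompts of 4, 4, and 2 questions.
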
Was left at 0.12.2 — update handler checks this file to detect available upgrades.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After ingest, `bicameral sync` could return 'already_synced' with zero
compliance checks when HEAD hadn't moved — leaving newly-ingested decisions
stuck at `pending` indefinitely.

Two-part fix:
1. `ledger/adapter.py` `ingest_commit`: in the `already_synced` early-return,
   query `get_pending_decisions_with_regions()` and include any pending
   decisions as `pending_compliance_checks` in the response.
2. `handlers/link_commit.py` `invalidate_sync_cache` + new
   `sync_middleware.invalidate_process_cache()`: after any mutation (ingest,
   update, reset), clear the process-level `_LAST_SYNCED_SHA` so that
   `ensure_ledger_synced` runs a fresh sync on the next tool call even when
   HEAD hasn't moved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
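
The second part of the fix can be pictured with a simplified process-level cache. The names `_LAST_SYNCED_SHA`, `invalidate_process_cache`, and `ensure_ledger_synced` come from the commit message; the signatures and the `run_sync` callback are assumptions for this sketch.

```python
from typing import Callable, Optional

# Process-level memo of the last HEAD that was synced.
_LAST_SYNCED_SHA: Optional[str] = None


def invalidate_process_cache() -> None:
    """Call after any mutation (ingest, update, reset) to force a fresh sync."""
    global _LAST_SYNCED_SHA
    _LAST_SYNCED_SHA = None


def ensure_ledger_synced(head_sha: str, run_sync: Callable[[str], None]) -> bool:
    """Run a sync unless this HEAD was already synced; return True if a sync ran."""
    global _LAST_SYNCED_SHA
    if _LAST_SYNCED_SHA == head_sha:
        return False
    run_sync(head_sha)
    _LAST_SYNCED_SHA = head_sha
    return True
```

Without the invalidation step, a second call with the same HEAD would always short-circuit, which is the situation that left newly ingested decisions at `pending`.
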
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ep 7)

Previously "after ingest" was ambiguous — LLM could fire the ratify
AskUserQuestion immediately after bicameral.ingest returned, before the
report (step 4), brief (step 5), and gap-judge (step 6) were shown.

Now step 7 is explicit:
- Must be the last user-facing output of the ingest flow
- Multi-segment ingests ratify once at the end of the roll-up, not per segment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(eval): cost-baseline harness — synthetic ledger + token counter + runner

Stage 1-4 of issue BicameralAI#88 — measurement infrastructure for the catalog's
§C cost/latency baseline. Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight

C2/C3 use mocked ledger queries so the metric isolates handler-logic +
serialization cost from SurrealDB I/O variance. The optimization
directions in BicameralAI#58 (semantic prefilter, lazy/two-pass history, etc.) all
mutate handler logic, not the ledger.
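
For C3, a p50/p95 latency measurement over a handler could look roughly like this. It is a sketch under assumptions: `measure_latency_ms` and the `runs` default are hypothetical, and the real harness may sample and aggregate differently.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(handler: Callable[[], object], runs: int = 200) -> Dict[str, float]:
    """Time repeated handler calls and report p50/p95 in milliseconds."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        handler()
        samples.append((time.perf_counter() - t0) * 1000.0)
    # n=100 yields 99 cut points; index 49 is the median, index 94 the 95th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```
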

Asymmetric regression rule: only flags increases, never improvements.
±20% relative threshold with absolute noise floors (10 tokens / 0.5ms)
to absorb timer jitter at sub-ms latency scale. Re-record via
BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.
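
One way to read the rule above, as a sketch: a value is flagged only when it increases, and only when the increase exceeds both the relative threshold and the absolute noise floor. How the two limits combine is an assumption here, and `is_regression` is a hypothetical name.

```python
def is_regression(
    baseline: float,
    current: float,
    rel_threshold: float = 0.20,
    noise_floor: float = 10.0,
) -> bool:
    """Asymmetric check: improvements never flag, small increases are absorbed."""
    increase = current - baseline
    if increase <= 0:
        return False  # equal or better than baseline
    # Assumed combination: the increase must clear both limits.
    return increase > max(rel_threshold * baseline, noise_floor)
```

For example, with the token floor of 10 a move from 1,519 to 1,900 tokens would flag, while with the 0.5 ms floor a move from 0.08 ms to 0.10 ms would not, even though it is a 25% relative increase.
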

The synthetic ledger generator is deterministic given (n_features,
decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows
forces re-record when the corpus changes. Token counter uses tiktoken
cl100k_base — pinned in pyproject [test] extras to prevent silent
count drift.
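
Determinism given `(n_features, decisions_per_feature, seed)` can be obtained with a seeded, private random generator. The sketch below is illustrative only: the record shape, the `generate_ledger` name, and the version string are hypothetical.

```python
import random
from typing import Dict, List

GENERATOR_VERSION = "sketch-1"  # hypothetical tag; bump whenever the corpus shape changes


def generate_ledger(n_features: int, decisions_per_feature: int, seed: int) -> List[Dict]:
    """Build a synthetic ledger that is identical for identical arguments."""
    rng = random.Random(seed)  # private RNG, unaffected by global random state
    return [
        {
            "feature": f"feature-{i}",
            "decisions": [
                f"decision-{rng.randrange(10**6):06d}"
                for _ in range(decisions_per_feature)
            ],
        }
        for i in range(n_features)
    ]
```
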

13 unit tests cover the regression rule + baseline IO directly. 5
runner tests produce the metrics on every PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(eval): commit initial Darwin cost baselines

Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape — 10 region
  matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)

The N=1000 number lands the §C concern empirically: ~800K tokens for a
single bicameral.history() call fills 80% of Sonnet 4.6's 1M context
before the skill reasons about anything. This is exactly the
optimization target named in BicameralAI#58 (semantic prefilter, lazy/two-pass
history, file-path → feature-group hint).
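
The arithmetic behind the 80% figure, using only the C1 numbers recorded above and a 1,000,000-token window (helper names are hypothetical):

```python
C1_BASELINES = {10: 7_574, 100: 79_025, 1000: 795_982}  # features -> tokens
CONTEXT_WINDOW = 1_000_000


def tokens_per_feature(tokens: int, n_features: int) -> float:
    return tokens / n_features


def context_share_pct(tokens: int, window: int = CONTEXT_WINDOW) -> float:
    return 100.0 * tokens / window


for n, tokens in C1_BASELINES.items():
    print(
        f"N={n}: {tokens_per_feature(tokens, n):.1f} tokens/feature, "
        f"{context_share_pct(tokens):.1f}% of the window"
    )
```

The per-feature cost stays in the range of roughly 760 to 800 tokens across all three sizes, so the payload grows about linearly with N, and at N=1000 it takes 79.6% of the window.
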

Linux baselines NOT included — the runner skips cleanly per-platform
when no row exists. Record locally on a Linux host with
BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up.

Token counts are platform-independent (deterministic via tiktoken) but
still tagged recorded_on=darwin for symmetry with C3 latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci+docs(preflight-eval): wire phase 3 cost/latency step + tick §C

Adds the phase 3 step to the advisory preflight-eval workflow.
continue-on-error: true so a phase 3 failure never blocks merge — same
contract as phase 1 + 2. The existing test-summary glob (test-results/
*.xml) picks up the new junit file automatically.

Catalog implementation queue ticked: C1/C2/C3 all marked baselined,
with a pointer to tests/eval/cost_baseline.jsonl. Regression rule
description updated to reflect the asymmetric + noise-floor design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…etry

LLMs were substituting natural-language names (grounded, ungrounded,
channels_read, compliance_resolved) for the required g2_*/g3_*/g6_* prefixed
names. The events landed in PostHog but fell through every dashboard panel
because the queries filter on the prefixed names.

Added explicit ⚠ warning with inline NOT comments (e.g. "# NOT 'grounded'")
to both bicameral-ingest and bicameral-preflight skill_end sections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously diagnostic was an open object — LLMs sent improvised field names
(grounded, ungrounded, channels_read) that fell through every dashboard filter.

Now:
- IngestDiagnostic and PreflightDiagnostic Pydantic models in contracts.py
  with extra="forbid" enumerate all valid g2_*/g3_*/g6_*/g9_*/g10_*/g11_* fields
- skill_end handler validates against the per-skill model; unknown fields are
  stripped from the PostHog payload and echoed back in diagnostic_warning so
  the LLM immediately sees what it sent wrong on the same call
- inputSchema description enumerates all valid field names so the LLM has
  them visible at call time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
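
The "strip unknown fields and echo them back" behaviour can be illustrated without Pydantic. This sketch swaps the `extra="forbid"` models for a plain allow-list, and the field names in it are hypothetical examples that follow the `g{N}_` convention; the real set lives in `IngestDiagnostic` / `PreflightDiagnostic`.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Hypothetical allow-list for the ingest skill (gates G2/G3/G6).
INGEST_DIAGNOSTIC_FIELDS: FrozenSet[str] = frozenset(
    {"g2_user_overrode", "g3_user_overrode", "g6_user_overrode"}
)


def validate_diagnostic(
    diagnostic: Dict[str, object], allowed: FrozenSet[str]
) -> Tuple[Dict[str, object], Optional[str]]:
    """Return (fields safe to send, warning naming any dropped fields)."""
    accepted = {k: v for k, v in diagnostic.items() if k in allowed}
    unknown = sorted(set(diagnostic) - allowed)
    warning = None
    if unknown:
        warning = "unknown diagnostic fields dropped: " + ", ".join(unknown)
    return accepted, warning
```

Returning the warning in the same response is what lets the caller see, on the same call, that a name like `grounded` was not accepted.
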
…field fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…3.3) onto dev

Brings 22 commits from main into dev. Dev was ahead of main by 9 commits
(Windows fixes BicameralAI#80BicameralAI#84, doc BicameralAI#79, CodeGenome Phase 3+4 BicameralAI#73/BicameralAI#91) but missing
main's parallel work:

- v0.11.0 — CodeGenome Phase 1+2 release artifacts
- v0.12.0 — skill-level telemetry (record_skill_event), extensible relay,
  reset wipe_mode
- v0.12.1 — bicameral.feedback, error_class enum, rationale field
- v0.12.2 — questionary CLI wizards (config, reset)
- v0.13.0 — gate telemetry schema (g{N}_ prefix), AskUserQuestion ground
  truth wiring, liberal ingest filter (speculative proposals)
- v0.13.1 — pending decisions surfaced when sync no-ops on same commit
- v0.13.2 — ratify prompt ordering fix
- v0.13.3 — Pydantic diagnostic enforcement + telemetry field fix

Per maintainer policy, dev should always be ahead of main. After this
merge, dev contains both telemetry work (from main) and Windows fixes +
CodeGenome Phase 3+4 (already on dev). Future forward-merges to main
include the dev-only work.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts:
#	contracts.py
@coderabbitai

coderabbitai Bot commented Apr 29, 2026


Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f910fae6-41ab-4e8d-811b-1370646a1861

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@jinhongkuan jinhongkuan merged commit e3d066d into BicameralAI:dev Apr 29, 2026
4 checks passed

3 participants