chore: merge main into dev (v0.13.3 telemetry refactor → dev) #94
Merged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e_mode
- Skill-level telemetry: replace per-tool timing with bicameral.skill_begin / bicameral.skill_end bookend tools; record_skill_event replaces record_event
- Extensible relay: remove ALLOWED_TOOLS allowlist and strict EventPayload interface; relay now validates only distinct_id + version + diagnostic numeric invariant, and all other fields pass through, so future event types require no relay redeploy; deployed to Cloudflare (v a6acec14)
- telemetry.py: add send_event() open primitive; record_skill_event is a thin wrapper; setup_wizard consent UI updated to show new skill-level payload shape
- reset wipe_mode: ledger (default, DB rows only, server stays live) vs full (deletes entire .bicameral/ dir including config + event files, reinits schema)
- ledger/adapter.py: wipe_all_rows now close-and-delete instead of row-by-row traversal (simpler, faster, correct for embedded surrealkv)
- events/team_adapter.py: add explicit wipe_all_rows that resets event watermark
- contracts.py: ResetResponse gains wipe_mode + bicameral_dir fields
- skills/bicameral-reset/SKILL.md: updated with two-mode table and confirmation phrasing; full mode requires showing bicameral_dir before confirm
- tests: new test_reset_full_wipe_deletes_bicameral_dir (5/5 pass)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bicameral.skill_begin now accepts `rationale` (why the skill triggered)
stored in _skill_sessions dict alongside t0 and forwarded at skill_end
- bicameral.skill_end now accepts `error_class` enum (symbol_not_found,
collision_unresolved, drift_mislabeled, low_confidence_verdict,
ledger_empty, grounding_failed, user_abort, other) replacing the
boolean-only errored signal
- New bicameral.feedback tool: call when stuck — records {trying_to,
attempted, stuck_on} as agent_feedback events mapping to desync catalog
- All 8 major skills updated with Telemetry bookend sections showing
the skill_begin/skill_end pattern with rationale + error_class examples
- telemetry.record_skill_event extended with error_class and rationale kwargs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both reference tools (bicameral.drift, bicameral.scan_branch) that no longer exist in the server. Drift detection is handled by link_commit + auto-sync middleware + resolve_compliance. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sys.executable -m pip fails on Homebrew Python (externally-managed-environment). pipx is the standard install path and handles its own venv correctly. pipx also doesn't support --no-cache-dir so that flag is dropped from the pip fallback path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a `bicameral reset` subcommand that:
1. Prompts for wipe mode (ledger vs full) via questionary select
2. Shows a dry-run summary (cursor count, replay plan, bicameral_dir for full mode with a ⚠️ warning)
3. Asks for explicit confirmation before calling handle_reset
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a `bicameral config` subcommand that:
1. Reads current config.yaml values as defaults
2. Prompts for mode, guided, telemetry via questionary selects with the current value pre-selected
3. Writes updated config.yaml
4. Reinstalls skills and hooks so changes take effect immediately
Replaces the LLM-in-chat text menu in the bicameral-config skill.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces text-based [1/2] menus with a single AskUserQuestion call covering mode, guided, and telemetry — all in one interactive prompt within the Claude session. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…liberal ingest filter
Telemetry schema (all skills):
- g{N}_ prefix convention across all gate diagnostic fields (G2/G3/G6 in ingest,
G9/G10/G11 in preflight, G11 in capture-corrections)
- skill_begin/skill_end guarded: only emit if BICAMERAL_TELEMETRY is enabled
- g{N}_user_overrode as universal ground-truth signal at every interactive gate
AskUserQuestion ground truth wiring:
- G2 Step 1.5 (ingest): AskUserQuestion for borderline Gate1/Gate2 drops,
batched in groups of 4; guarded by guided_mode
- G10 Step 5.5 (preflight): AskUserQuestion after surfaced block to dismiss
irrelevant findings; guarded by guided_mode; populates g10_user_overrode
- G11 Steps 6-7 (capture-corrections): replaces freeform Y/n with
AskUserQuestion, batched in groups of 4 for all correction counts
Liberal ingest filter:
- Removed aspirational, hedged conditional, and parked/deferred from hard-exclude;
these now flow through level classification and gate filters as speculative proposals
- Ratification is the team's judgment layer, not the extraction filter
- Updated Example 1: now extracts 3 speculative proposals instead of 0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was left at 0.12.2 — update handler checks this file to detect available upgrades. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After ingest, `bicameral sync` could return 'already_synced' with zero compliance checks when HEAD hadn't moved, leaving newly-ingested decisions stuck at `pending` indefinitely. Two-part fix:
1. `ledger/adapter.py` `ingest_commit`: in the `already_synced` early-return, query `get_pending_decisions_with_regions()` and include any pending decisions as `pending_compliance_checks` in the response.
2. `handlers/link_commit.py` `invalidate_sync_cache` + new `sync_middleware.invalidate_process_cache()`: after any mutation (ingest, update, reset), clear the process-level `_LAST_SYNCED_SHA` so that `ensure_ledger_synced` runs a fresh sync on the next tool call even when HEAD hasn't moved.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
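The second part of that fix can be sketched as follows. The names `_LAST_SYNCED_SHA`, `ensure_ledger_synced`, and `invalidate_process_cache` come from the commit message; the signatures, the `run_sync` callback, and the boolean return value are assumptions made so the example is self-contained.

```python
from typing import Callable, Optional

# Process-level cache of the last HEAD this process synced against.
_LAST_SYNCED_SHA: Optional[str] = None


def ensure_ledger_synced(head_sha: str, run_sync: Callable[[str], None]) -> bool:
    """Run a sync unless this process already synced at head_sha.

    Returns True if a sync actually ran (return value is an assumption).
    """
    global _LAST_SYNCED_SHA
    if _LAST_SYNCED_SHA == head_sha:
        return False  # cache hit: HEAD unchanged since the last sync
    run_sync(head_sha)
    _LAST_SYNCED_SHA = head_sha
    return True


def invalidate_process_cache() -> None:
    """Forget the cached SHA so the next call syncs even if HEAD is unchanged."""
    global _LAST_SYNCED_SHA
    _LAST_SYNCED_SHA = None
```

Under these assumptions, a mutation handler (ingest, update, reset) would call `invalidate_process_cache()` after writing, so the next tool call takes the slow path once and picks up the newly pending decisions.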
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ep 7)
Previously "after ingest" was ambiguous: the LLM could fire the ratify AskUserQuestion immediately after bicameral.ingest returned, before the report (step 4), brief (step 5), and gap-judge (step 6) were shown. Now step 7 is explicit:
- Must be the last user-facing output of the ingest flow
- Multi-segment ingests ratify once at the end of the roll-up, not per segment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(eval): cost-baseline harness - synthetic ledger + token counter + runner

Stage 1-4 of issue BicameralAI#88: measurement infrastructure for the catalog's §C cost/latency baseline. Three deterministic metrics:
- C1: bicameral.history() payload tokens at N=10/100/1000 features
- C2: bicameral.preflight() response size (tokens + bytes)
- C3: handler latency p50/p95 on bicameral.preflight

C2/C3 use mocked ledger queries so the metric isolates handler-logic + serialization cost from SurrealDB I/O variance. The optimization directions in BicameralAI#58 (semantic prefilter, lazy/two-pass history, etc.) all mutate handler logic, not the ledger.

Asymmetric regression rule: only flags increases, never improvements. ±20% relative threshold with absolute noise floors (10 tokens / 0.5ms) to absorb timer jitter at sub-ms latency scale. Re-record via BICAMERAL_EVAL_RECORD_BASELINE=1 when the new value is intentional.

The synthetic ledger generator is deterministic given (n_features, decisions_per_feature, seed); GENERATOR_VERSION tag in baseline rows forces re-record when the corpus changes. Token counter uses tiktoken cl100k_base, pinned in pyproject [test] extras to prevent silent count drift.

13 unit tests cover the regression rule + baseline IO directly. 5 runner tests produce the metrics on every PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(eval): commit initial Darwin cost baselines

Five rows recorded on darwin/arm64 with Python 3.12.13 + tiktoken 0.12.0:
- C1[N=10]: 7,574 tokens
- C1[N=100]: 79,025 tokens
- C1[N=1000]: 795,982 tokens
- C2: 1,519 tokens / 6,610 bytes (representative shape: 10 region matches + 2 collision-pending + 2 context-pending)
- C3: p50 ≈ 0.08ms, p95 ≈ 0.10ms (representative shape)

The N=1000 number lands the §C concern empirically: ~800K tokens for a single bicameral.history() call fills 80% of Sonnet 4.6's 1M context before the skill reasons about anything. This is exactly the optimization target named in BicameralAI#58 (semantic prefilter, lazy/two-pass history, file-path → feature-group hint).

Linux baselines NOT included; the runner skips cleanly per-platform when no row exists. Record locally on a Linux host with BICAMERAL_EVAL_RECORD_BASELINE=1 and commit the new rows in a follow-up. Token counts are platform-independent (deterministic via tiktoken) but still tagged recorded_on=darwin for symmetry with C3 latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci+docs(preflight-eval): wire phase 3 cost/latency step + tick §C

Adds the phase 3 step to the advisory preflight-eval workflow. continue-on-error: true so a phase 3 failure never blocks merge, the same contract as phase 1 + 2. The existing test-summary glob (test-results/*.xml) picks up the new junit file automatically.

Catalog implementation queue ticked: C1/C2/C3 all marked baselined, with a pointer to tests/eval/cost_baseline.jsonl. Regression rule description updated to reflect the asymmetric + noise-floor design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
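The asymmetric regression rule from the first commit in this squash can be illustrated with a short sketch. The 20% threshold and the noise floors (10 tokens, 0.5 ms) are from the commit message; the function name and the way the two conditions combine (an increase must exceed both the noise floor and the relative threshold) are assumptions, since the message does not spell out the exact formula.

```python
def is_regression(
    baseline: float,
    current: float,
    *,
    rel_threshold: float = 0.20,
    noise_floor: float = 0.0,
) -> bool:
    """Flag only increases that clear both the noise floor and the relative threshold."""
    delta = current - baseline
    if delta <= 0:
        return False  # improvements and no-change are never flagged
    if delta <= noise_floor:
        return False  # too small to distinguish from jitter
    return delta > rel_threshold * baseline


# C2 baseline from the recorded rows: 1,519 tokens, noise floor 10 tokens.
print(is_regression(1519, 1900, noise_floor=10))    # +381 tokens, above 20%
print(is_regression(1519, 1700, noise_floor=10))    # +181 tokens, below 20%
print(is_regression(1519, 1000, noise_floor=10))    # improvement
# C3 p50 baseline of roughly 0.08 ms, noise floor 0.5 ms.
print(is_regression(0.08, 0.30, noise_floor=0.5))   # large ratio, but under the floor
```

The last call shows why an absolute floor matters at sub-millisecond scale: 0.08 ms to 0.30 ms is nearly a 4x increase in relative terms, yet the 0.22 ms difference is within timer jitter, so under this reading it is not flagged.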
…etry
LLMs were substituting natural-language names (grounded, ungrounded, channels_read, compliance_resolved) for the required g2_*/g3_*/g6_* prefixed names. The events landed in PostHog but fell through every dashboard panel because the queries filter on the prefixed names. Added explicit ⚠ warning with inline NOT comments (e.g. "# NOT 'grounded'") to both bicameral-ingest and bicameral-preflight skill_end sections.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously diagnostic was an open object, so LLMs sent improvised field names (grounded, ungrounded, channels_read) that fell through every dashboard filter. Now:
- IngestDiagnostic and PreflightDiagnostic Pydantic models in contracts.py with extra="forbid" enumerate all valid g2_*/g3_*/g6_*/g9_*/g10_*/g11_* fields
- skill_end handler validates against the per-skill model; unknown fields are stripped from the PostHog payload and echoed back in diagnostic_warning so the LLM immediately sees what it sent wrong on the same call
- inputSchema description enumerates all valid field names so the LLM has them visible at call time
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
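The strip-and-echo behaviour described above can be sketched in plain Python. This is a stand-in, not the Pydantic `extra="forbid"` models from contracts.py: it uses a simple allowlist so the example runs without third-party packages. The field names in `INGEST_FIELDS` are hypothetical placeholders built from the `g{N}_user_overrode` convention; the real models enumerate the actual fields. The function name and the warning wording are also assumptions.

```python
from typing import Any, Dict, FrozenSet, Optional, Tuple

# Hypothetical allowlist; the real IngestDiagnostic model defines the valid names.
INGEST_FIELDS: FrozenSet[str] = frozenset(
    {"g2_user_overrode", "g3_user_overrode", "g6_user_overrode"}
)


def split_diagnostic(
    diagnostic: Dict[str, Any], allowed: FrozenSet[str]
) -> Tuple[Dict[str, Any], Optional[str]]:
    """Keep only known fields; describe the dropped ones in a warning string."""
    clean = {key: value for key, value in diagnostic.items() if key in allowed}
    unknown = sorted(key for key in diagnostic if key not in allowed)
    warning = None
    if unknown:
        warning = "unknown diagnostic fields dropped: " + ", ".join(unknown)
    return clean, warning


clean, warning = split_diagnostic(
    {"g2_user_overrode": 1, "grounded": 4, "channels_read": 2},
    INGEST_FIELDS,
)
print(clean)    # only the prefixed field would be forwarded
print(warning)  # names the improvised fields so the caller can correct them
```

Returning the warning on the same call, instead of silently dropping fields, is the design point of the commit: the caller learns which names were wrong while it still has the context to fix them.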
…field fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…3.3) onto dev
Brings 22 commits from main into dev. Dev was ahead of main by 9 commits (Windows fixes BicameralAI#80–BicameralAI#84, doc BicameralAI#79, CodeGenome Phase 3+4 BicameralAI#73/BicameralAI#91) but missing main's parallel work:
- v0.11.0: CodeGenome Phase 1+2 release artifacts
- v0.12.0: skill-level telemetry (record_skill_event), extensible relay, reset wipe_mode
- v0.12.1: bicameral.feedback, error_class enum, rationale field
- v0.12.2: questionary CLI wizards (config, reset)
- v0.13.0: gate telemetry schema (g{N}_ prefix), AskUserQuestion ground truth wiring, liberal ingest filter (speculative proposals)
- v0.13.1: pending decisions surfaced when sync no-ops on same commit
- v0.13.2: ratify prompt ordering fix
- v0.13.3: Pydantic diagnostic enforcement + telemetry field fix
Per maintainer policy, dev should always be ahead of main. After this merge, dev contains both telemetry work (from main) and Windows fixes + CodeGenome Phase 3+4 (already on dev). Future forward-merges to main include the dev-only work.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	contracts.py
Summary
Brings 22 commits from `main` (up to v0.13.3) onto `dev`. Per maintainer policy, dev should always be ahead of main; this merge restores that invariant after parallel work landed on both branches.
What dev is missing from main (this PR brings in)
- …`record_skill_event`), extensible relay, reset wipe_mode
- `bicameral.feedback` tool, `error_class` enum, `rationale` field
What dev already has (preserved through the merge)
Conflict resolution
One conflict in `contracts.py` — both branches modified the pydantic import line. Resolved by combining: `from pydantic import BaseModel, ConfigDict, Field` (dev needed `Field` for line 278; main added `ConfigDict` for the new `IngestDiagnostic` / `PreflightDiagnostic` models). All new diagnostic model classes from main are preserved.
Test results
Full suite post-merge on Windows: 535 pass, 9 fail, 3 skip, 1 xfail.
The 9 failures are pre-existing issues unrelated to the merge:
Improvement vs pre-merge dev: the Windows subprocess fix from #84 (already on dev) eliminated the WinError 267 cluster, bringing pre-merge dev from 26 failures to 9.
Why this is a merge (not a rebase)
Both branches have merge commits from prior PRs (#73, #80–#84, #91 on dev; many on main). Rebasing dev onto main would flatten those merge commits and rewrite the hashes that referenced PRs reference. Merge preserves history without breaking external references.
🤖 Generated with Claude Code