triage: dev → main · v0.14.3 (#277 team-mode + #280 grounding precision) by jinhongkuan · Pull Request #290 · BicameralAI/bicameral-mcp

jinhongkuan · 2026-05-09T06:09:32Z

Summary

Triages all 25 commits from `dev` onto `main` — the gap since the v0.14.2 release. Bumps version v0.14.2 → v0.14.3.

Headline shipping

#277 ([v0-productization §2] remote event-log adapter, shifting team mode entirely off git): shipped in PR #289. Pluggable `BackendAdapter` with two ship-day backends: `LocalFolderAdapter` (NFS / Dropbox / syncthing) and `GoogleDriveAdapter` (per-team Drive folder, bundled OAuth client). Setup wizard splits into Create / Join / LocalFolder branches. Sync is pull-only, with no Bicameral server in the loop. Drive access uses the `drive.file` scope only. Companion privacy page on bicameral-ai.com (BicameralAI/bicameral#111).
#280 (caller-LLM grounding produced incorrect decision bindings, an M2 regression): precision fix (PR #283) + eval harness (PR #284) + telemetry (PR #285). `handlers/bind.py` now rejects caller-supplied line ranges when the symbol does not resolve or when the supplied span does not overlap the resolved symbol's span (it previously accepted hallucinated symbol names on real files without any error). The M2 grounding-recall eval harness gates precision ≥ 0.85, recall ≥ 0.80, abort_rate ≤ 0.30 (warn-only initially, ratcheting to hard once the baseline is stable). PostHog telemetry surfaces ratification correctness for grounded binds.

Versions

| | Before | After |
| --- | --- | --- |
| `pyproject.toml` | 0.14.2 | 0.14.3 |
| `RECOMMENDED_VERSION` | 0.14.1 | 0.14.3 |

Conflict resolution notes

CHANGELOG.md — kept dev's `[Unreleased]` block (rename → `v0.14.3`), inserted main's `v0.14.2` release entry below it.
README.md: kept dev's Solo-vs-Team mode section and extended setup-writes table (both added by PR #289). Main was missing both because PR #289 hadn't back-flowed yet.

Auto-resolved (no manual intervention needed)

`.github/workflows/test-mcp-regression.yml`, `pyproject.toml` (then manually bumped), `skills/bicameral-ingest/SKILL.md`, `tests/e2e/run_e2e_flows.py`, `tests/eval_decision_relevance.py`.

Test plan

CI: `ruff format --check .` green (was the failure mode on PR #289)
CI: `mypy .` green (was the second failure on PR #289)
CI: MCP Regression Suite green (ubuntu + windows)
CI: TruffleHog clean
Verify PyPI publish workflow recognizes v0.14.3 bump

What this does NOT include

Anything still WIP on dev that hasn't merged. As of this triage's creation, dev's HEAD is `e285c45` (the PR #289 merge commit). Anything landing on dev after that needs a follow-up triage.

🤖 Generated with Claude Code

Visible changes when a user lands on the README:

1. Hero image (`assets/bicameral-hero.png`) at the very top: a visual without/with comparison from the landing-page asset bundle.
2. Quickstart immediately after the one-line value prop (was buried below "The Problem" and "How It Feels"). The user goes from "what is this" to "type these three lines" without scrolling.
3. Compliance section trimmed from a 12-line policy-file enumeration near the top to a 5-line "we take privacy seriously" paragraph at the bottom. The full posture stays linkable via docs/policies/.
4. pipx dropped from the install path. The two paths are now uv (recommended) and plain pip (fallback). uv was already preferred per #199's 3-path resolve order; pipx was middle ground that doesn't pull its weight in a top-level README.

Section order: Hero → title → 1-liner → Quickstart → How It Feels → The Problem → Core Concepts → What setup installs → Slash Commands → MCP Tools Reference → Configuration → Local Development → Telemetry → Contributing → Privacy & Compliance → License

312 lines (was 376).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

27 in-progress planning artifacts (`plan-114-*.md`, `plan-codegenome-*.md`, `plan-A-*.md`, etc.) were tracked in the repo. They're working memory between author and reviewers during a feature; once the feature merges, the PR description + CHANGELOG carry the durable record.

Keeping these in the public-ish repo:

- adds 27 markdown files to the wheel's source-distribution surface for no end-user benefit
- couples planning vocabulary to release artifacts (e.g. plan-codegenome files describe v1 work that #246 reverted; the plan stays useful as reference but doesn't belong on `main`)
- creates churn pressure to mark plans as "done" or "superseded" instead of just letting them rest as the author's working notes

This commit:

1. Adds `plan-*.md` pattern to `.gitignore` with a one-paragraph comment on the policy.
2. `git rm --cached` on all 27 currently-tracked `plan-*.md` files. They remain on local disk for the author's reference, just no longer tracked.

After merge, anyone with a checkout will keep their local `plan-*.md` files; new plans drafted in-tree will be untracked by default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a top-level SECURITY.md so GitHub auto-creates a "Security" tab in the repo nav bar, the closest GitHub-native surface to a "button that pulls up our SBOM/privacy statement."

Contents:

- **Privacy posture**: explicit "all data on your laptop" + telemetry opt-out + pointer to docs/policies/ for the full posture
- **Software supply chain**: table of signed artifacts on each release (CycloneDX SBOM, Rekor attestation, hooks-manifest sigs, skills-manifest sigs, release-tag-commit sig); pointer to the release-evidence verification procedure; mention of GitHub's auto-SPDX SBOM under Insights → Dependency graph
- **Supported versions**: only latest minor, ~30-day backport window for critical fixes
- **Reporting a vulnerability**: GitHub Security Advisories preferred; jin@bicameral-ai.com fallback with `[security]` subject prefix; 3-day ack target, 30-day patch target for critical
- **Scope**: what's in (server, skills, hooks, release pipeline) and what's out (third-party deps, host vulns, local-attack scenarios already covered by host-trust model)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sections removed (a PM/dev evaluating Bicameral wants the test-out path, not the why/how-internals):

- "The Problem" (long context narrative)
- "Core Concepts" (two-axis model, link_commit, collab modes)
- "Removing Bicameral" (post-test concern)
- "Configuration" env-var table
- "Local Development"
- "Contributing"
- "Telemetry" subsection (folded into Privacy & Compliance one-liner)

What stays (the path from "land on the page" to "type three commands and see something work"):

- Hero image
- Star-on-GitHub CTA (animated SVG, adapted from cocoindex's design with metadata strings updated; visual mechanism is generic)
- Logo (small, inline-right of title)
- Title + badges + 1-liner
- Quickstart (uv | pip | Windows)
- How It Feels (preflight render + dashboard screenshot)
- Slash Commands
- What `setup` writes (trust signal: what hits disk)
- MCP Tools Reference (collapsed by default)
- Privacy & Compliance (concise)
- License

312 → 152 lines.

Visuals borrowed from landing-page: assets/logo.png (was landing-page/logo.png), assets/bicameral-hero.png from landing-page/output/imagegen/. Star CTA SVG saved as assets/star-on-github.svg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(README): restructure for getting-started clarity + add hero image

… attr (#272)

Closes 2 of 3 dev-baseline regressions tracked in #272. The third (Flow 1 e2e expectation post-#263 auto-bind) is deferred for product-level discussion since it touches the ingest skill choreography.

## test-summary/action SHA-pin

The `test-summary/action@v2` mutable tag was repointed on 2026-05-07 22:09 UTC from a `dist`-targeted release (with bundled `index.js`) to the new v2.5 release that targets `main` (no bundled output). Every PR opened after that point fails the Test Summary step with `File not found: index.js`, marking MCP Regression Suite jobs red on both ubuntu and windows.

Pin to v2.4 commit `31493c76ec9e7aa675f1585d3ed6f1da69269a86` (the last `dist`-targeted release with the bundled artifact). Aligns with OWASP-03 / `docs/policies/install-trust-model.md` discipline (do not trust mutable action tags for security-load-bearing CI).

## tests/eval_decision_relevance.py:166 attribute rename

The `IngestResponse` schema field is `pending_grounding_decisions` per `contracts.py:571` and `handlers/ingest.py:695`. The eval script was still reading the legacy internal name `ungrounded_decisions` directly off the response, raising `AttributeError` and failing the M1 adversarial corpus eval step (technically `continue-on-error: true`, but worth fixing independently since the eval result was lost on every run).

Keeps the JSON-output key `ungrounded_decisions` unchanged so downstream consumers (M1 trend reports, etc.) see no change.

Both fixes verified locally: yaml.safe_load on the workflow + IngestResponse field introspection confirms the schema.

Refs #272.

…in-and-eval-attribute fix(ci): SHA-pin test-summary action + rename eval_decision_relevance attr (#272)

…Fix 3)

Closes the third regression deferred from PR #273. Post-#263 (sync auto-bind step 1.5), Flow 1's e2e exhausted budget on ~41 non-bicameral Read/Grep/validate_symbols calls before reaching the ingest skill's auto-fired ratify AskUserQuestion gate, so the agent never invoked ratify.

Per #108 Flow 1 spec discussion: this is both a regression AND a canonical flow change. Auto-bind stays (deterministic, useful), but ratify drops out of the auto-prompt path and becomes advisory text. Ratification belongs to Flow 5 (PM Friday review) and to direct user requests like "sign these off" / "ratify all".

Skill changes (skills/bicameral-ingest/SKILL.md Step 7):

- Replace AskUserQuestion ratify gate with a one-block advisory: "○ N decisions captured as proposals — drift tracking activates after ratification. Run `bicameral.ratify` when ready, or revisit them in your next history review (Flow 5)."
- Add explicit "Direct user request shortcut": if the prompt asked for sign-off in the same turn, ratify directly without the round-trip.
- Update the example at the bottom to match.

Test changes:

- tests/e2e/prompts/flow-1-ingest.md: drop "sign them off on our end" so the prompt exercises the new advisory path. Direct sign-off requests are covered by Flow 5 (history + ratify).
- tests/e2e/run_e2e_flows.py:assert_flow_1: drop the ratify requirement; accept [ingest, bind?] as the canonical Flow 1 signature. Update Flow 5's stale comment about Flow 1 pre-ratifying.
- tests/test_e2e_asserters.py: invert test_flow1_fails_without_ratify → test_flow1_passes_without_ratify (advisory-only is now the expected behavior).

Doesn't touch sync skill #263. That auto-bind change is preserved intentionally; this PR fixes the test/spec contract that conflicted with it.

Refs #272 #108 #263.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ngest The previous fix dropped "sign them off on our end" to defeat the ratify auto-call; in doing so, it also removed the bicameral disambiguator. The remaining "please log these on our end" landed inside the auto-memory trigger surface (the runner's ~/.claude/projects/.../memory/ directory), so the agent wrote four memory files instead of invoking bicameral.ingest — Flow 1 went 0/0 on bicameral calls and the cascading flows (3, 5) then had nothing to assert against. Replace "log these on our end" with "log these to our decision ledger" — same intent, but "decision ledger" is an unambiguous bicameral signal the auto-memory skill does not match. Still no "sign them off" / "ratify" phrasing, so the new advisory-only Step 7 contract is preserved. Refs #272 #108. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The no-ratify success branch in assert_flow_5 still claimed "Flow 1 ratified its 3 seeds" — but per #272 Fix 3, Flow 1 now leaves seeds as `proposed` (advisory-only ingest). The docstring was updated in 2252c82; this catches the trailing f-string that was missed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(skill): ingest advises on ratify instead of auto-prompting (#272 Fix 3)

… M2 grounding precision)

The silent-corruption surface for M2 grounding precision was one branch in handlers/bind.py: when a caller supplied start_line/end_line alongside symbol_name, the handler verified only that the file existed at the SHA and accepted any symbol_name, letting agents write binds_to edges to plausible-looking but wrong symbols whenever they hallucinated a real file with a fake symbol. Branch A (no lines) already ran tree-sitter and rejected on miss; Branch B was the asymmetric escape hatch.

Branch B now also calls resolve_symbol_lines and rejects two cases:

1. symbol_name doesn't resolve at all → "symbol '<name>' not found in <file> at <sha> — caller-supplied line range cannot bypass symbol verification (#280)"
2. symbol resolves but caller-supplied span doesn't overlap the resolved span → "symbol '<name>' resolves at lines <a>-<b> but caller supplied <x>-<y> — span mismatch (#280)"

Overlap (not exact equality) is the matching rule via the new _spans_overlap helper, so legitimate sub-region binds (e.g. pinning a specific clause inside a larger function body) stay accepted; only hallucinated ranges with no shared lines are rejected.

Skill catalog reorganization (per CLAUDE.md skill-mandate):

- New skills/bicameral-bind/SKILL.md extracts the bind contract out of skills/bicameral-ingest/SKILL.md §2 and tightens advisory rules to mandatory: Read at least one candidate file end-to-end, confirm symbol via validate_symbols, abort on weak evidence. Documents the handler-side rejection contract for agent visibility.
- skills/bicameral-ingest/SKILL.md §2 reduced from ~38 inline lines to a 16-line pointer at the new bind skill. This keeps ingest focused on extraction + filtering and matches the rest of the catalog (one tool ↔ one skill).

Stale ground_mappings() refs cleaned up:

- code_locator/tools/validate_symbols.py: dropped self._db field + L40-41 retention comment (referenced a v0.6.0-deleted path; field had zero readers).
- tests/eval_decision_relevance.py:73 docstring updated to describe post-v0.6.0 caller-LLM grounding pipeline.

Tests:

- 3 existing tests (test_bind_success_with_explicit_lines, test_bind_idempotent, test_bind_status_transition) gain a resolve_symbol_lines mock since Branch B now exercises it.
- 2 new tests (test_bind_branch_b_rejects_nonexistent_symbol, test_bind_branch_b_rejects_span_mismatch) cover the rejection paths.
- _spans_overlap helper smoke-tested locally across 8 boundary cases.

PR-1 of 3. Synthetic-recall eval (PR-2) and m2_grounding_* telemetry + dashboard (PR-3) follow per plan-280-grounding-precision-fix.md.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
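The overlap rule described in this commit can be sketched as follows. This is a minimal illustration rather than the code in `handlers/bind.py`: it assumes inclusive line ranges, and `spans_overlap`, `check_caller_span`, and their signatures are hypothetical stand-ins for `_spans_overlap` and the Branch B check. The rejection strings are abbreviated.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start_line, end_line); assumed inclusive on both ends


def spans_overlap(a: Span, b: Span) -> bool:
    """True when two inclusive line ranges share at least one line."""
    return a[0] <= b[1] and b[0] <= a[1]


def check_caller_span(
    symbol_name: str, resolved: Optional[Span], supplied: Span
) -> Optional[str]:
    """Return a rejection reason, or None when the bind may proceed.

    `resolved` stands in for whatever a symbol resolver returns
    (None when the symbol is not found in the file).
    """
    if resolved is None:
        return f"symbol {symbol_name!r} not found"
    if not spans_overlap(resolved, supplied):
        return (
            f"symbol {symbol_name!r} resolves at lines {resolved[0]}-{resolved[1]} "
            f"but caller supplied {supplied[0]}-{supplied[1]}"
        )
    return None  # overlap is enough, so sub-region binds stay accepted
```

Under these assumptions a bind to lines 12-14 inside a function resolved at 10-20 is accepted, while a bind to lines 40-50 against the same symbol is rejected.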

fix(bind): reject caller-supplied lines that hallucinate symbols (#280 PR-1)

…#280 PR-2)

Synthetic-fixture benchmark that drives the bicameral-bind skill end-to-end against 23 cases across three failure modes: same-name-different-module, similar-intent-different-symbol, and cross-language. Measures three axes deliberately split for diagnosis:

- precision = correct / (correct + wrong_symbol + wrong_file)
- recall = correct / total_rows
- abort_rate = aborted / total_rows

The split matters: high-precision-low-recall = agent over-cautious; low-precision-high-recall = hallucinations the #280 PR-1 handler would now reject (handler_rejected outcome would surface as precision drag).

Files

tests/fixtures/grounding_recall/dataset.py (230 LOC)
  23 GroundingCase rows: 5 case-A (process_order × 3 modules, cancel_order × 2 modules), 10 case-B (rate-limit/throttle/retry/auth/metrics intent disambiguation), 8 case-C (Python ↔ TS pairs). GENERATOR_VERSION constant invalidates the cache when bumped. Import-time _validate_dataset() fails loud on duplicate ids, invalid case_type, distractor === intended, etc.

tests/fixtures/grounding_recall/repo/ (15 files / ~625 LOC)
  Hand-crafted fixture repo with intended + distractor symbols. Each function/class body is short but real enough that the agent can actually distinguish behavior from keyword overlap (e.g. checkout/orders.py:process_order = customer flow w/ retry cap; admin/orders.py:process_order = manual replay of finance-flagged orders; billing/refunds.py:process_order = bulk-refund pipeline).

tests/eval/_bind_judge.py (466 LOC)
  Headless caller-LLM driver, modeled on tests/eval/_skill_judge.py. Multi-turn tool-use loop with 3 tools exposed: read_file, validate_symbols, submit_binding. Cap at 8 turns. Cache at tests/eval/fixtures/bind_judge/ keyed on SHA(model | bind_skill | repo | decision). Cache hits keep CI cost ~$0 unless dataset, fixture repo, or skill change.

tests/eval_grounding_recall.py (256 LOC)
  Argparse runner, modeled on tests/eval_decision_relevance.py. Loads dataset, drives _bind_judge per case, classifies outcome (correct / wrong_symbol / wrong_file / aborted), aggregates, emits JSON report, optional gate enforcement (--gate-mode warn|hard).

.github/workflows/test-mcp-regression.yml (+19 LOC)
  New "M2 grounding-recall eval (warn-only)" step. Ubuntu-only, continue-on-error: true, mirrors the M1 step shape. ANTHROPIC_API_KEY from secrets, model env var, output to test-results/m2-grounding-recall.json.

CHANGELOG.md (+2 lines)

Default gates per #280 acceptance: recall ≥ 0.80, precision ≥ 0.85, abort_rate ≤ 0.30. Ship warn-only first to record a post-PR-1 baseline, then ratchet to --gate-mode hard once the signal is stable. Same path the M1 eval has been on.

Out of scope for PR-2 (per plan-280-grounding-precision-fix.md):

- PR-3 ships PostHog m2_grounding_* events + dashboard panel
- Friction capture (≥ 5 design-partner cases) is not engineering scope

Local verification

- dataset.py imports clean (23 cases, _validate_dataset() passes)
- _bind_judge symbol indexer resolves all 11 spot-checked intended symbols including Class.method form
- eval_grounding_recall.py CLI runs offline with --skip-missing-fixtures (0 cases, gate breaches reported, exit 0 in warn mode as designed)

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
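The three metrics and the default gates can be worked through with a short sketch. It is illustrative only and is not taken from `tests/eval_grounding_recall.py`: the function names are hypothetical, and treating an empty denominator as 0.0 is an assumption (it happens to agree with the note that a 0-case run reports gate breaches).

```python
from collections import Counter
from typing import Dict, Iterable, List


def score_outcomes(outcomes: Iterable[str]) -> Dict[str, float]:
    """Aggregate per-case outcomes into precision, recall, and abort_rate."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    # Precision only looks at cases where the agent actually submitted a binding.
    attempted = counts["correct"] + counts["wrong_symbol"] + counts["wrong_file"]
    return {
        "precision": counts["correct"] / attempted if attempted else 0.0,
        "recall": counts["correct"] / total if total else 0.0,
        "abort_rate": counts["aborted"] / total if total else 0.0,
    }


def gate_breaches(
    metrics: Dict[str, float],
    min_precision: float = 0.85,
    min_recall: float = 0.80,
    max_abort_rate: float = 0.30,
) -> List[str]:
    """List the gates a report misses; an empty list means all gates pass."""
    breaches = []
    if metrics["precision"] < min_precision:
        breaches.append("precision")
    if metrics["recall"] < min_recall:
        breaches.append("recall")
    if metrics["abort_rate"] > max_abort_rate:
        breaches.append("abort_rate")
    return breaches
```

As a made-up example, 18 correct, 2 wrong_symbol, 1 wrong_file, and 2 aborted out of 23 rows gives precision 18/21 ≈ 0.857 and recall 18/23 ≈ 0.783, so only the recall gate is breached. In warn mode a runner would report that breach and still exit 0.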

Five lint-side findings on the initial PR-2 commit, none of them runtime. Fixing in place rather than amending the prior commit:

- tests/eval/_bind_judge.py B007: add `# noqa: B007` to the `for turn in range(...)` loop. The loop variable IS used after the loop for telemetry (judgment_payload["turns"] = turn); suppression is more honest than renaming to `_turn` and losing the post-loop reference.
- tests/eval/_bind_judge.py mypy: type-annotate `chosen_model: str` and tighten the `os.getenv` fallback chain so mypy can resolve `str | None` → `str`. Construct BindJudgment field-by-field instead of `**judgment_payload` so the dataclass field types are enforced (3× errors in the cached + write paths).
- tests/eval_grounding_recall.py I001 + E402: per-line `# noqa: E402, I001` on the two local imports that must follow the sys.path inserts. Same shape `eval_decision_relevance.py` uses for its single post-path import.
- tests/eval_grounding_recall.py F541: drop the f-prefix on `print(f" ✓ all gates pass")` (no placeholders).
- tests/fixtures/grounding_recall/repo/src/checkout/orders.py B007: rename `for attempt in range(3):` → `for _attempt in range(3):` (loop body doesn't reference the counter).

Plus `ruff format` reflowed 4 files (line wrapping, parens, exponent spacing), with no semantic changes.

Local verification: ruff check + ruff format --check + mypy all green on the PR-2 surface (15 fixture files + 2 eval files).

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(eval): M2 grounding-recall harness for caller-LLM bind precision (#280 PR-2)

…PR-3)

Three PostHog events now emit from the bind / ratification surfaces, plus a local mirror + dashboard panel that reads from it. Closes the last engineering piece of #280 (PR-3 of 3).

Events
------

m2_grounding_attempt
  Fires per `handle_bind` per binding. Carries:
  - decision_source (controlled enum: transcript/spec/chat/manual/document)
  - diagnostic.success: bool (bound a region cleanly)
  - diagnostic.handler_rejected: bool (true when #280 PR-1's reject path fired, i.e. the caller hallucinated a wrong/non-existent symbol on a real file). The split between {success=False, handler_rejected=True} and {success=False, handler_rejected=False} tells operators whether the failure was the failsafe doing its job vs a ledger / IO bug.

m2_grounding_ratified_correct (verdict == "compliant")
m2_grounding_ratified_incorrect (verdict ∈ {"drifted", "not_relevant"})
  Fire per accepted verdict in `handle_resolve_compliance`. Carry:
  - decision_source (same controlled enum)
  - diagnostic.confidence: int (low=0, medium=1, high=2)

Privacy
-------

The relay contract from telemetry.py:14-37 is non-negotiable: numeric/bool diagnostics only, no decision_id / file_path / symbol_name. The new m2_grounding_log.py owns the split:

- JSONL local mirror at ~/.bicameral/m2_grounding.jsonl (10 MB rotation, 3 backups) carries decision_id for the dashboard panel's drill-down. Always written, regardless of relay consent.
- PostHog relay sees only decision_source + numeric diagnostics; decision_id never crosses that boundary. A unit test (test_decision_id_never_relayed_to_posthog) pins this invariant.

Files
-----

m2_grounding_log.py (new, 241 LOC)
  Owner of the M2 event contract. record_attempt(), record_ratification(), read_recent_events(). Lazy-imports server + telemetry to break the handlers→server circular dependency at server-boot time. Test hook via BICAMERAL_M2_LOG_PATH env override (matches preflight_telemetry pattern).

handlers/bind.py (+73)
  _emit_m2_attempt() helper at module scope. Wired to all five terminal paths in the per-binding loop where a decision_id is valid: Branch A symbol-not-found, Branch B file-not-found, the two #280 PR-1 reject paths, the bind_decision exception path, and the success path. API-misuse paths (empty/unknown decision_id) skip emission to keep the metric meaningful.

handlers/resolve_compliance.py (+40)
  _emit_m2_ratification() helper, called per accepted verdict. Wraps record_ratification() in try/except so a telemetry failure never breaks the verdict write.

ledger/queries.py (+19)
  New get_decision_source(): single-field SELECT, returns the decision's source_type (controlled enum from the ingest contract).

ledger/adapter.py (+10)
  Adapter delegation method.

dashboard/server.py (+59)
  New GET /m2_grounding endpoint. Aggregates the local mirror into rolling-7d per-source counts (attempts / rejects / ratified ✓ / ratified ✕) and computes precision. Read-only, no ledger I/O.

assets/dashboard.html (+60)
  New "M2 grounding precision" panel below the main ledger view. Color-codes precision per source: green ≥ 85%, amber ≥ 70%, red below. Refreshes every 30s.

CHANGELOG.md (+2)
  Unreleased entry covering all three events + the local mirror contract.

Tests
-----

tests/test_m2_grounding_log.py (9 tests, all green)
  Pure unit tests, no ledger dep. Cover JSONL row shape, verdict classification, time-window filtering, and the privacy invariant (decision_id never reaches the relay).

tests/test_bind_m2_telemetry.py (4 tests + 3 skip-on-no-surrealdb)
  Helper-level: emit forwards args correctly, skips on empty decision_id, swallows telemetry failures fire-and-forget. Resolve-compliance verdict classification covered behind `pytest.importorskip("surrealdb")` since the handler module imports ledger.queries at top level; runs in CI, skipped locally.

Local verification
------------------

- 12 passed, 3 skipped on tests/test_m2_grounding_log.py + tests/test_bind_m2_telemetry.py
- ruff check + ruff format --check + mypy all green on touched files (m2_grounding_log.py, handlers/bind.py, handlers/resolve_compliance.py, ledger/queries.py, ledger/adapter.py, dashboard/server.py, both new test files)

What's NOT in this PR
---------------------

Per plan-280-grounding-precision-fix.md:

- Friction capture (≥ 5 design-partner cases): design-partner work, not engineering scope.
- PR-2 gate-flip (warn → hard): separate small follow-up after PR-3 lands and we have a baseline reading. Aligns with Jin's "deliberate not drift" framing.
- attempt_to_ratify_seconds field: deferred. Would need a `created_at` field on the binds_to edge (schema currently has only `confidence` + `provenance`); not worth a schema bump in this PR.

Closes #280. Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
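The privacy split between the local mirror and the relay can be illustrated with a small sketch. It is not the contents of `m2_grounding_log.py`: the function name `record_attempt_sketch`, its parameters, the row layout, and the `relay` callable are all assumptions made for illustration, and log rotation is left out.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Optional

# Hypothetical relay signature: (event_name, payload) -> None
Relay = Callable[[str, Dict[str, Any]], None]


def record_attempt_sketch(
    log_path: Path,
    decision_id: str,
    decision_source: str,
    success: bool,
    handler_rejected: bool,
    relay: Optional[Relay] = None,
) -> None:
    """Write the full row locally; forward only non-identifying fields."""
    diagnostic = {"success": success, "handler_rejected": handler_rejected}

    # Local mirror: keeps decision_id so a local tool can drill down.
    local_row = {
        "event": "m2_grounding_attempt",
        "decision_id": decision_id,
        "decision_source": decision_source,
        "diagnostic": diagnostic,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(local_row) + "\n")

    if relay is None:  # no relay configured or no consent
        return
    try:
        # Relay payload is built separately, so decision_id cannot leak into it.
        relay(
            "m2_grounding_attempt",
            {"decision_source": decision_source, "diagnostic": dict(diagnostic)},
        )
    except Exception:
        pass  # telemetry failures must never break the bind itself
```

Building the relay payload from scratch, rather than copying the local row and deleting keys, is one way to make the "decision_id never crosses the boundary" invariant hard to break by accident.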

…CI instead (per Jin)

Jin clarified the operator-dashboard scope: it's for users. M2 grounding precision is an engineering quality metric, not user-facing. Reverting the dashboard pieces; adding GitHub Actions step-summary surfacing, which is where engineers actually look for these numbers.

Reverted from PR-3's initial shape
----------------------------------

assets/dashboard.html
  - Drop the <section id="m2-panel"> block + the renderM2 / loadM2 / setInterval JS. Dashboard returns to pre-#280 user view.

dashboard/server.py
  - Drop the GET /m2_grounding route + _serve_m2_grounding handler.

m2_grounding_log.py
  - Drop read_recent_events() (only consumer was _serve_m2_grounding; now dead code per Jin's "avoid bloat unless product-justified").
  - Drop now-unused `time` import.

tests/test_m2_grounding_log.py
  - Drop test_read_recent_events_respects_window (function gone) and now-unused `os` import.

Added (the new piece)
---------------------

tests/eval_grounding_recall_summary.py (new)
  Renders the PR-2 eval JSON (test-results/m2-grounding-recall.json) as a markdown block: precision / recall / abort-rate scoreboard, outcome breakdown, per-case-type recall table, gate-breach line, expandable miss-list capped at 25 rows. Fail-quiet: missing/malformed JSON degrades to a one-line note rather than failing CI.

.github/workflows/test-mcp-regression.yml (+10)
  New "M2 metrics summary" step after the M2 eval. Pipes the renderer's stdout to $GITHUB_STEP_SUMMARY so the metrics show on the GitHub Actions run page without needing the artifact download. always() guard so the summary appears even when the eval step above warns. continue-on-error keeps it advisory.

Kept from PR-3's initial shape
------------------------------

- The three PostHog events from handle_bind / handle_resolve_compliance.
- The privacy-preserving local mirror at ~/.bicameral/m2_grounding.jsonl (operator support + diagnose CLI surface; never relayed).
- The m2_grounding_log.py module's record_attempt / record_ratification public API.
- All telemetry tests (privacy invariant pin still holds).

Net Δ on PR-3: -119 LOC dashboard pieces, +210 LOC summary renderer + workflow step.

Tests: 11 passed, 3 skipped (resolve_compliance import-or-skip). Ruff + ruff format + mypy all green.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
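The fail-quiet behaviour of the summary renderer might look roughly like the sketch below. It is a reduced illustration, not `tests/eval_grounding_recall_summary.py`: the report layout (a top-level `"metrics"` object with three numeric keys) and the function name are assumptions, and only the scoreboard table is rendered.

```python
import json
from pathlib import Path


def render_summary(report_path: str) -> str:
    """Render an eval report as markdown; never raise on bad input."""
    try:
        report = json.loads(Path(report_path).read_text(encoding="utf-8"))
        metrics = report["metrics"]  # assumed report layout
        rows = [
            f"| {name} | {float(metrics[name]):.2f} |"
            for name in ("precision", "recall", "abort_rate")
        ]
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, malformed JSON, or unexpected shape: degrade to one line.
        return "_M2 grounding-recall report missing or malformed; no metrics to show._"
    header = ["### M2 grounding recall", "", "| metric | value |", "| --- | --- |"]
    return "\n".join(header + rows)


if __name__ == "__main__":
    # A workflow step can redirect this stdout into the step-summary file.
    print(render_summary("test-results/m2-grounding-recall.json"))
```

Catching the error inside the renderer keeps the step advisory even without `continue-on-error`, since a missing report still produces output and a zero exit code.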

The previous revert left an extra blank line where the `<section id="m2-panel">` block lived. Removes it so assets/dashboard.html is byte-identical to origin/dev — confirming Jin's "don't change the user dashboard" intent verbatim. Refs #280. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(telemetry): M2 grounding-precision events + dashboard panel (#280 PR-3)

…nds (#277)

Closes #277. Implements v0 Productization §2: shifts team mode entirely off git as the inter-machine replication substrate, onto a pluggable backend with two ship-day implementations (LocalFolder, GoogleDrive). Pull-only sync; no daemons, no webhooks, no Bicameral server in the loop.

What changes for users

- Setup wizard team-mode branch now offers Create vs Join vs LocalFolder. Create: provisions a Drive folder under the operator's Google account, prints the literal share-text-to-teammates message. Join: paste folder ID/URL, OAuth, verify access (404 / read-only both block), confirm the resolved signer (default-No) before persisting. LocalFolder: single prompt for the path.
- Drive integration uses Bicameral's bundled OAuth client (the same pattern gh / gcloud / cursor use). Scope: drive.file only, so Bicameral's CLI can only see files it creates inside the team folder. Token cache at ~/.bicameral/google-drive-token.json mode 0600.
- Colored security disclosure renders before the browser opens, walking the operator through what flows where, what we do and don't see, and the trust dependency. Mirrored on bicameral-ai.com/privacy (BicameralAI/bicameral PR #111).

Architecture

- events/backends/__init__.py: BackendAdapter ABC + get_backend factory.
- events/backends/local_folder.py: sha256-idempotent LocalFolderAdapter.
- events/backends/google_drive.py: Drive Files API adapter; bundled client_id + client_secret (RFC 8252 native-app pattern, no env override per Option A); FolderNotFoundError / ReadOnlyAccessError surface for Join verify_access; create_folder helper for Create branch.
- events/team_adapter.py: TeamWriteAdapter accepts backend=, marks _dirty on every write, exposes flush_to_backend().
- adapters/ledger.py: _read_collaboration_mode refactored to _read_team_config(repo_path) -> dict; constructs backend and injects into TeamWriteAdapter.
- handlers/sync_middleware.py: ensure_team_synced (30 s TTL pull) + flush_team_writes (post-handler push); errors swallowed at DEBUG.
- server.py: wires both into the dispatch site (pull at top, flush in finally).
- setup_wizard.py: Create/Join/LocalFolder dispatch + colored security disclosure + identity-confirmation prompt at Join time.

Testing

- 53 new tests, 1 platform-skip (Windows-only path):
  - LocalFolderAdapter: 6 tests (push idempotency, pull peer-files-only, list_peers, lock serialization)
  - TeamWriteAdapter ↔ backend: 3 tests (connect-pulls-then-replays, write-marks-dirty-then-flush-pushes, no-backend-noop)
  - Two-author round-trip: 2 tests
  - Sync middleware: 5 tests (TTL cache, no-backend-noop, error swallowing)
  - GoogleDriveAdapter: 11 tests (push idempotency on md5, pull own-file-skip + max-modifiedTime token, lock create-then-delete + cleanup on exception, verify_access 404 / read-only / can-edit, create_folder, placeholder-detection auto-skip when bundled client is published)
  - Setup wizard Create/Join: 11 tests including identity decline, OAuth-disclosure decline, folder-id URL extraction, unwritable-path rejection
- All adjacent regression tests still pass (test_team_event_replay, test_event_writer).
- Lint clean across events/ adapters/ handlers/sync_middleware.py setup_wizard.py + new test files.

Security model (also documented at docs/team-mode-setup.md and on bicameral-ai.com/privacy)

- Decision data flows your-CLI ↔ Google directly. Bicameral the company does NOT receive copies. No Bicameral server in the loop.
- drive.file scope limits the CLI on the user's machine to files it creates in the team folder. The rest of the user's Drive is invisible to the CLI; Google enforces this server-side.
- As OAuth app publisher, Bicameral receives aggregate API request counts and per-user OAuth consent records (which Google accounts authenticated, when). Not contents.
- Trust dependency: same as any OAuth tool (gh, gcloud, Notion, Slack desktop). The open-source CLI behaves as advertised, mitigated by source visibility.

OAuth verification submission text + GCP setup checklist: docs/google-oauth-verification-submission.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
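The "pull at top, flush in finally" wiring with a 30 s TTL can be sketched as below. This is a hypothetical reduction, not `handlers/sync_middleware.py` or `server.py`: the class, the `pull()` / `push()` method names on the backend object, and the choice to retry on the next call after a failed pull are all assumptions.

```python
import logging
import time
from typing import Any, Callable, Optional

logger = logging.getLogger("team_sync_sketch")


class TeamSyncSketch:
    """TTL-gated pull plus best-effort push around a handler call."""

    def __init__(self, backend: Any, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._backend = backend  # assumed to expose pull() and push()
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_pull: Optional[float] = None

    def ensure_synced(self) -> bool:
        """Pull from the backend unless a pull succeeded within the TTL."""
        if self._backend is None:  # solo mode: nothing to do
            return False
        now = self._clock()
        if self._last_pull is not None and now - self._last_pull < self._ttl:
            return False
        try:
            self._backend.pull()
        except Exception:
            logger.debug("team pull failed", exc_info=True)
            return False  # assumption: a failed pull is retried on the next call
        self._last_pull = now
        return True

    def flush(self) -> None:
        """Push local writes; errors are logged at DEBUG and swallowed."""
        if self._backend is None:
            return
        try:
            self._backend.push()
        except Exception:
            logger.debug("team flush failed", exc_info=True)


def dispatch(sync: TeamSyncSketch, handler: Callable[[], Any]) -> Any:
    sync.ensure_synced()  # pull at top
    try:
        return handler()
    finally:
        sync.flush()  # flush even when the handler raises
```

Injecting the clock keeps the TTL behaviour testable without sleeping, which is one plausible way to cover the "TTL cache" case listed under Testing.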

Pure formatting — `ruff format` against the 10 files touched in #277. No semantic changes. CI's `ruff format --check .` now passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…endAdapter

Mypy was failing on `events/backends/__init__.py:62,66`: the factory's return type is `BackendAdapter | None`, but the two concrete adapters were structurally compatible without declaring inheritance. Added explicit `BackendAdapter` base. Both classes already implemented all four abstract methods (push_events, pull_events, lock, list_peers); the runtime check (issubclass + concrete instantiation) passes. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
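To illustrate why the explicit base matters: an `ABC` is a nominal type, so mypy rejects returning a class from a factory typed `BackendAdapter | None` unless that class actually inherits from `BackendAdapter`, even when its methods line up. The sketch below is a simplified stand-in, not the repository's code. The four method names come from the commit message; the synchronous signatures, `InMemoryAdapter`, and `make_backend` are assumptions made for brevity.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BackendAdapter(ABC):
    """Stand-in base class; real signatures are not shown in the source."""

    @abstractmethod
    def push_events(self, events: list[str]) -> None: ...

    @abstractmethod
    def pull_events(self) -> list[str]: ...

    @abstractmethod
    def lock(self) -> None: ...

    @abstractmethod
    def list_peers(self) -> list[str]: ...


class InMemoryAdapter(BackendAdapter):  # explicit base makes it a nominal subtype
    def __init__(self) -> None:
        self._events: list[str] = []

    def push_events(self, events: list[str]) -> None:
        self._events.extend(events)

    def pull_events(self) -> list[str]:
        return list(self._events)

    def lock(self) -> None:
        pass  # nothing to serialize in memory

    def list_peers(self) -> list[str]:
        return []


def make_backend(kind: str | None) -> BackendAdapter | None:
    """Hypothetical factory: returns None when no backend is configured."""
    if kind == "memory":
        return InMemoryAdapter()
    return None
```

If structural compatibility were the goal instead, `typing.Protocol` would be the usual alternative; the commit chose explicit inheritance.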

feat(team-mode): remote event-log adapter — Drive + LocalFolder backends (#277)

Triages 25 dev commits onto main (already on dev as of merge time):

• #289: team-mode remote event-log adapter (#277)
• #285, #284, #283: M2 grounding telemetry, eval harness, precision fix (#280)
• #275: README/SECURITY surface
• plus assorted fixes flowing through dev

Resolved conflicts in CHANGELOG.md (kept dev's [Unreleased] block, inserted v0.14.2's release entry from main below it, then renamed [Unreleased] → v0.14.3) and README.md (kept dev's Solo-vs-Team mode section + extended setup-writes table from #289; main was missing both because PR #289 hadn't backflowed yet).

pyproject.toml: 0.14.2 → 0.14.3
RECOMMENDED_VERSION: 0.14.1 → 0.14.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-09T06:09:45Z

Warning

Rate limit exceeded

@jinhongkuan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 59 seconds before requesting another review.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3a569177-8f43-4260-a6ba-cb844a4eac0a

📥 Commits

Reviewing files that changed from the base of the PR and between 233d463 and af2873e.

📒 Files selected for processing (54)

.github/workflows/test-mcp-regression.yml
CHANGELOG.md
README.md
RECOMMENDED_VERSION
adapters/ledger.py
code_locator/tools/validate_symbols.py
docs/google-oauth-verification-submission.md
docs/team-mode-setup.md
events/backends/__init__.py
events/backends/google_drive.py
events/backends/local_folder.py
events/team_adapter.py
handlers/bind.py
handlers/resolve_compliance.py
handlers/sync_middleware.py
ledger/adapter.py
ledger/queries.py
m2_grounding_log.py
pyproject.toml
requirements.txt
server.py
setup_wizard.py
skills/bicameral-bind/SKILL.md
skills/bicameral-ingest/SKILL.md
tests/eval/_bind_judge.py
tests/eval_decision_relevance.py
tests/eval_grounding_recall.py
tests/eval_grounding_recall_summary.py
tests/fixtures/grounding_recall/dataset.py
tests/fixtures/grounding_recall/repo/src/admin/orders.py
tests/fixtures/grounding_recall/repo/src/auth/session.py
tests/fixtures/grounding_recall/repo/src/auth/tokens.py
tests/fixtures/grounding_recall/repo/src/billing/refunds.py
tests/fixtures/grounding_recall/repo/src/checkout/orders.py
tests/fixtures/grounding_recall/repo/src/checkout/retry.py
tests/fixtures/grounding_recall/repo/src/checkout/throttle.py
tests/fixtures/grounding_recall/repo/src/metrics/collect.py
tests/fixtures/grounding_recall/repo/src/metrics/collect.ts
tests/fixtures/grounding_recall/repo/src/middleware/global_rate_limit.py
tests/fixtures/grounding_recall/repo/src/middleware/tenant_rate_limit.py
tests/fixtures/grounding_recall/repo/src/webhooks/dispatch.py
tests/fixtures/grounding_recall/repo/src/webhooks/dispatch.ts
tests/fixtures/grounding_recall/repo/src/webhooks/verify.py
tests/fixtures/grounding_recall/repo/src/webhooks/verify.ts
tests/test_backends_google_drive_unit.py
tests/test_backends_local_folder.py
tests/test_bind.py
tests/test_bind_m2_telemetry.py
tests/test_m2_grounding_log.py
tests/test_setup_wizard_team_backend.py
tests/test_sync_middleware_team.py
tests/test_team_adapter_with_backend.py
tests/test_team_round_trip_local_folder.py
thoughts/shared/plans/2026-05-08-remote-event-log-adapter.md

…tor subclass shape

Mypy was failing on triage PR #290:

    events/backends/local_folder.py:72: error: Return type "AsyncIterator[str]" of "list_peers" incompatible with return type "Coroutine[Any, Any, AsyncIterator[str]]" in supertype "BackendAdapter"

Both concrete adapters implement list_peers as async generators (`async def ... yield`), which return AsyncIterator[str] directly. The ABC's `async def` declaration typed it as Coroutine[..., AsyncIterator[str]], a different shape. Per mypy docs (more_types.html#asynchronous-iterators), async-iterator methods should be declared `def -> AsyncIterator[T]` in the supertype. Concrete implementations unchanged; tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
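The shape mismatch described in this commit message can be reproduced in a few lines. In the sketch below, the abstract method is declared with plain `def` returning `AsyncIterator[str]`, and the concrete class overrides it with an async generator, so calling the method yields the iterator directly with no intermediate coroutine. `PeerSource`, `StaticPeers`, and `collect` are hypothetical names for illustration; only `list_peers` and the `def -> AsyncIterator[T]` convention come from the message.

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator


class PeerSource(ABC):
    @abstractmethod
    def list_peers(self) -> AsyncIterator[str]:
        """Plain `def`: the call itself returns the async iterator."""


class StaticPeers(PeerSource):
    def __init__(self, peers: list[str]) -> None:
        self._peers = peers

    async def list_peers(self) -> AsyncIterator[str]:
        # `async def` with `yield` is an async generator, so calling it
        # returns an async iterator immediately (nothing to await first).
        for peer in self._peers:
            yield peer


async def collect(source: PeerSource) -> list[str]:
    # No `await` before source.list_peers(); iterate with `async for`.
    return [peer async for peer in source.list_peers()]


def demo() -> list[str]:
    return asyncio.run(collect(StaticPeers(["alice", "bob"])))
```

Had the base declared `async def list_peers(...) -> AsyncIterator[str]` with no `yield` in its body, callers would be typed as needing `await source.list_peers()` first, which is the `Coroutine[Any, Any, AsyncIterator[str]]` shape in the mypy error.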

jinhongkuan and others added 26 commits May 7, 2026 16:35

Merge pull request #270 from BicameralAI/feat/readme-restructure

c664e86

docs(README): restructure for getting-started clarity + add hero image

Merge pull request #273 from BicameralAI/fix/272-ci-baseline-action-p…

d7e7357

…in-and-eval-attribute fix(ci): SHA-pin test-summary action + rename eval_decision_relevance attr (#272)

style: ruff format assert_flow_1 ratify_note line

b453da8

Merge pull request #276 from BicameralAI/fix/272-flow-1-advisory-ratify

9b2b128

fix(skill): ingest advises on ratify instead of auto-prompting (#272 Fix 3)

Merge pull request #283 from BicameralAI/280-grounding-precision-fix

6c4a1c5

fix(bind): reject caller-supplied lines that hallucinate symbols (#280 PR-1)

Merge pull request #284 from BicameralAI/280-grounding-eval-harness

f8cd9ee

feat(eval): M2 grounding-recall harness for caller-LLM bind precision (#280 PR-2)

Merge pull request #285 from BicameralAI/280-grounding-telemetry

58f0efa

feat(telemetry): M2 grounding-precision events + dashboard panel (#280 PR-3)

style: ruff format the #277 surface

a366cb5

Pure formatting — `ruff format` against the 10 files touched in #277. No semantic changes. CI's `ruff format --check .` now passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge pull request #289 from BicameralAI/277-remote-event-log-adapter

e285c45

feat(team-mode): remote event-log adapter — Drive + LocalFolder backends (#277)

jinhongkuan temporarily deployed to ci-test May 9, 2026 06:09 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to production May 9, 2026 06:09 — with GitHub Actions Inactive

jinhongkuan had a problem deploying to recording-approval May 9, 2026 06:09 — with GitHub Actions Failure

jinhongkuan temporarily deployed to ci-test May 9, 2026 06:09 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to ci-test May 9, 2026 06:12 — with GitHub Actions Inactive

jinhongkuan deployed to recording-approval May 9, 2026 06:12 — with GitHub Actions Active

jinhongkuan temporarily deployed to production May 9, 2026 06:12 — with GitHub Actions Inactive

jinhongkuan merged commit e42c4b6 into main May 9, 2026
8 checks passed

jinhongkuan mentioned this pull request May 10, 2026

triage→main: resilient ledger migration + bicameral_diagnose + reset --replay-from-events (#296) + README demo videos & opener rewrite (#299) #298

Merged

5 tasks

jinhongkuan merged 27 commits into main from triage-from-dev
