
triage: dev → main · v0.14.3 (#277 team-mode + #280 grounding precision) #290

Merged: jinhongkuan merged 27 commits into `main` from `triage-from-dev` on May 9, 2026
Conversation

@jinhongkuan

Contributor

Summary

Triages all 25 commits from `dev` onto `main` — the gap since the v0.14.2 release. Bumps version v0.14.2 → v0.14.3.

Headline shipping

Versions

| File | Before | After |
| --- | --- | --- |
| `pyproject.toml` | 0.14.2 | 0.14.3 |
| `RECOMMENDED_VERSION` | 0.14.1 | 0.14.3 |

Conflict resolution notes

Auto-resolved (no manual intervention needed)

`.github/workflows/test-mcp-regression.yml`, `pyproject.toml` (then manually bumped), `skills/bicameral-ingest/SKILL.md`, `tests/e2e/run_e2e_flows.py`, `tests/eval_decision_relevance.py`.

Test plan

What this does NOT include

🤖 Generated with Claude Code

jinhongkuan and others added 26 commits May 7, 2026 16:35
Visible changes when a user lands on the README:

1. Hero image (`assets/bicameral-hero.png`) at the very top — visual
   without/with comparison from the landing-page asset bundle.
2. Quickstart immediately after the one-line value prop (was buried
   below "The Problem" and "How It Feels"). User goes from "what is
   this" to "type these three lines" without scrolling.
3. Compliance section trimmed from a 12-line policy-file enumeration
   near the top to a 5-line "we take privacy seriously" paragraph at
   the bottom. The full posture stays linkable via docs/policies/.
4. pipx dropped from the install path. The two paths are now uv
   (recommended) and plain pip (fallback). uv was already preferred
   per #199's 3-path resolve order; pipx was middle ground that
   doesn't pull its weight in a top-level README.

Section order:
  Hero → title → 1-liner →
  Quickstart → How It Feels → The Problem → Core Concepts →
  What setup installs → Slash Commands → MCP Tools Reference →
  Configuration → Local Development → Telemetry →
  Contributing → Privacy & Compliance → License

312 lines (was 376).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
27 in-progress planning artifacts (`plan-114-*.md`, `plan-codegenome-*.md`,
`plan-A-*.md`, etc.) were tracked in the repo. They're working memory
between author and reviewers during a feature; once the feature merges,
the PR description + CHANGELOG carry the durable record. Keeping these
in the public-ish repo:

- adds 27 markdown files to the wheel's source-distribution surface for
  no end-user benefit
- couples planning vocabulary to release artifacts (e.g. plan-codegenome
  files describe v1 work that #246 reverted; the plan stays useful as
  reference but doesn't belong on `main`)
- creates churn pressure to mark plans as "done" or "superseded" instead
  of just letting them rest as the author's working notes

This commit:
1. Adds `plan-*.md` pattern to `.gitignore` with a one-paragraph comment
   on the policy.
2. `git rm --cached` on all 27 currently-tracked `plan-*.md` files —
   they remain on local disk for the author's reference, just no longer
   tracked.

After merge, anyone with a checkout will keep their local `plan-*.md`
files; new plans drafted in-tree will be untracked by default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a top-level SECURITY.md so GitHub auto-creates a "Security" tab
in the repo nav bar — the closest GitHub-native surface to a "button
that pulls up our SBOM/privacy statement."

Contents:

- **Privacy posture** — explicit "all data on your laptop" + telemetry
  opt-out + pointer to docs/policies/ for the full posture
- **Software supply chain** — table of signed artifacts on each
  release (CycloneDX SBOM, Rekor attestation, hooks-manifest sigs,
  skills-manifest sigs, release-tag-commit sig); pointer to the
  release-evidence verification procedure; mention of GitHub's
  auto-SPDX SBOM under Insights → Dependency graph
- **Supported versions** — only latest minor, ~30-day backport window
  for critical fixes
- **Reporting a vulnerability** — GitHub Security Advisories preferred;
  jin@bicameral-ai.com fallback with `[security]` subject prefix; 3-day
  ack target, 30-day patch target for critical
- **Scope** — what's in (server, skills, hooks, release pipeline) and
  what's out (third-party deps, host vulns, local-attack scenarios
  already covered by host-trust model)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sections removed (PM/dev evaluating Bicameral wants the test-out path,
not the why/how-internals):
- "The Problem" (long context narrative)
- "Core Concepts" (two-axis model, link_commit, collab modes)
- "Removing Bicameral" (post-test concern)
- "Configuration" env-var table
- "Local Development"
- "Contributing"
- "Telemetry" subsection (folded into Privacy & Compliance one-liner)

What stays — the path from "land on the page" to "type three commands
and see something work":
- Hero image
- Star-on-GitHub CTA (animated SVG, adapted from cocoindex's design
  with metadata strings updated; visual mechanism is generic)
- Logo (small, inline-right of title)
- Title + badges + 1-liner
- Quickstart (uv | pip | Windows)
- How It Feels (preflight render + dashboard screenshot)
- Slash Commands
- What `setup` writes (trust signal — what hits disk)
- MCP Tools Reference (collapsed by default)
- Privacy & Compliance (concise)
- License

312 → 152 lines.

Visuals borrowed from landing-page: assets/logo.png (was
landing-page/logo.png) and assets/bicameral-hero.png (from
landing-page/output/imagegen/). The Star CTA SVG is saved as
assets/star-on-github.svg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs(README): restructure for getting-started clarity + add hero image
… attr (#272)

Closes 2 of 3 dev-baseline regressions tracked in #272 (the third — Flow 1
e2e expectation post-#263 auto-bind — is deferred for product-level
discussion since it touches the ingest skill choreography).

## test-summary/action SHA-pin

The `test-summary/action@v2` mutable tag was repointed on 2026-05-07
22:09 UTC from a `dist`-targeted release (with bundled `index.js`) to
the new v2.5 release that targets `main` (no bundled output). Every PR
opened after that point fails the Test Summary step with
`File not found: index.js`, marking MCP Regression Suite jobs red on
both ubuntu and windows.

Pin to v2.4 commit `31493c76ec9e7aa675f1585d3ed6f1da69269a86` (the
last `dist`-targeted release with the bundled artifact). Aligns with
OWASP-03 / `docs/policies/install-trust-model.md` discipline (do not
trust mutable action tags for security-load-bearing CI).

## tests/eval_decision_relevance.py:166 attribute rename

`IngestResponse` schema field is `pending_grounding_decisions` per
`contracts.py:571` and `handlers/ingest.py:695`. The eval script was
still reading the legacy internal name `ungrounded_decisions` directly
off the response, raising `AttributeError` and failing the M1
adversarial corpus eval step (technically `continue-on-error: true`,
but worth fixing independently since the eval result was lost on every
run).

Keeps the JSON-output key `ungrounded_decisions` unchanged so downstream
consumers (M1 trend reports, etc.) see no change.

Both fixes verified locally: yaml.safe_load on the workflow + IngestResponse
field introspection confirms the schema.
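
The rename described above can be pictured with a minimal sketch. Only the field name `pending_grounding_decisions` and the JSON key `ungrounded_decisions` come from the commit message; the `IngestResponse` stand-in, the `build_eval_report` helper, and the shape of the emitted value are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IngestResponse:
    """Hypothetical stand-in for the real schema; only the field name is sourced."""

    pending_grounding_decisions: list = field(default_factory=list)


def build_eval_report(response: IngestResponse) -> dict:
    """Read the current schema field, but keep the legacy JSON key on output."""
    pending = response.pending_grounding_decisions
    # Downstream consumers still look for "ungrounded_decisions",
    # so the output key is left unchanged.
    return {"ungrounded_decisions": list(pending)}
```

Reading the new attribute while leaving the output key alone keeps the fix local to the eval script.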

Refs #272.
…in-and-eval-attribute

fix(ci): SHA-pin test-summary action + rename eval_decision_relevance attr (#272)
…Fix 3)

Closes the third regression deferred from PR #273. Post-#263 (sync
auto-bind step 1.5), Flow 1's e2e exhausted budget on ~41 non-bicameral
Read/Grep/validate_symbols calls before reaching the ingest skill's
auto-fired ratify AskUserQuestion gate, so the agent never invoked
ratify.

Per #108 Flow 1 spec discussion: this is both a regression AND a
canonical flow change. Auto-bind stays (deterministic, useful), but
ratify drops out of the auto-prompt path and becomes advisory text.
Ratification belongs to Flow 5 (PM Friday review) and to direct user
requests like "sign these off" / "ratify all".

Skill changes (skills/bicameral-ingest/SKILL.md Step 7):

- Replace AskUserQuestion ratify gate with a one-block advisory:
  "○ N decisions captured as proposals — drift tracking activates
   after ratification. Run `bicameral.ratify` when ready, or revisit
   them in your next history review (Flow 5)."
- Add explicit "Direct user request shortcut": if the prompt asked
  for sign-off in the same turn, ratify directly without the round-trip.
- Update the example at the bottom to match.

Test changes:

- tests/e2e/prompts/flow-1-ingest.md: drop "sign them off on our end"
  so the prompt exercises the new advisory path. Direct sign-off
  requests are covered by Flow 5 (history + ratify).
- tests/e2e/run_e2e_flows.py:assert_flow_1: drop the ratify
  requirement; accept [ingest, bind?] as the canonical Flow 1
  signature. Update Flow 5's stale comment about Flow 1 pre-ratifying.
- tests/test_e2e_asserters.py: invert test_flow1_fails_without_ratify
  → test_flow1_passes_without_ratify (advisory-only is now the
  expected behavior).

Doesn't touch sync skill #263 — that auto-bind change is preserved
intentionally; this PR fixes the test/spec contract that conflicted
with it.

Refs #272 #108 #263.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ngest

The previous fix dropped "sign them off on our end" to defeat the ratify
auto-call; in doing so, it also removed the bicameral disambiguator. The
remaining "please log these on our end" landed inside the auto-memory
trigger surface (the runner's ~/.claude/projects/.../memory/ directory),
so the agent wrote four memory files instead of invoking bicameral.ingest
— Flow 1 went 0/0 on bicameral calls and the cascading flows (3, 5)
then had nothing to assert against.

Replace "log these on our end" with "log these to our decision ledger"
— same intent, but "decision ledger" is an unambiguous bicameral signal
the auto-memory skill does not match. Still no "sign them off" / "ratify"
phrasing, so the new advisory-only Step 7 contract is preserved.

Refs #272 #108.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The no-ratify success branch in assert_flow_5 still claimed "Flow 1
ratified its 3 seeds" — but per #272 Fix 3, Flow 1 now leaves seeds
as `proposed` (advisory-only ingest). The docstring was updated in
2252c82; this catches the trailing f-string that was missed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(skill): ingest advises on ratify instead of auto-prompting (#272 Fix 3)
… M2 grounding precision)

The silent-corruption surface for M2 grounding precision was one branch
in handlers/bind.py: when a caller supplied start_line/end_line alongside
symbol_name, the handler verified only that the file existed at the SHA
and accepted any symbol_name — letting agents write binds_to edges to
plausible-looking but wrong symbols whenever they hallucinated a real
file with a fake symbol. Branch A (no lines) already ran tree-sitter and
rejected on miss; Branch B was the asymmetric escape hatch.

Branch B now also calls resolve_symbol_lines and rejects two cases:

  1. symbol_name doesn't resolve at all
     → "symbol '<name>' not found in <file> at <sha> — caller-supplied
        line range cannot bypass symbol verification (#280)"

  2. symbol resolves but caller-supplied span doesn't overlap the
     resolved span
     → "symbol '<name>' resolves at lines <a>-<b> but caller supplied
        <x>-<y> — span mismatch (#280)"

Overlap (not exact equality) is the matching rule via the new
_spans_overlap helper, so legitimate sub-region binds (e.g. pinning a
specific clause inside a larger function body) stay accepted; only
hallucinated ranges with no shared lines are rejected.
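
A minimal sketch of the overlap rule and the two Branch B rejections described above. It assumes inclusive line ranges and uses hypothetical names (`spans_overlap`, `check_branch_b`); the real `_spans_overlap` helper and handler signatures may differ, and the error strings are abbreviated.

```python
from __future__ import annotations


def spans_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two inclusive line ranges share at least one line."""
    return a_start <= b_end and b_start <= a_end


def check_branch_b(
    symbol: str,
    file: str,
    sha: str,
    resolved: tuple[int, int] | None,
    supplied: tuple[int, int],
) -> str | None:
    """Return an error string, or None when the bind may proceed.

    `resolved` is the span reported by the symbol resolver, or None on a miss.
    """
    if resolved is None:
        # Case 1: the symbol does not resolve at all.
        return f"symbol '{symbol}' not found in {file} at {sha}"
    if not spans_overlap(*resolved, *supplied):
        # Case 2: the symbol exists, but the caller's range shares no lines with it.
        return (
            f"symbol '{symbol}' resolves at lines {resolved[0]}-{resolved[1]} "
            f"but caller supplied {supplied[0]}-{supplied[1]}"
        )
    return None
```

With this rule, a caller range of 12-14 inside a function resolved at 10-20 is accepted, while 30-40 is rejected.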

Skill catalog reorganization (per CLAUDE.md skill-mandate):

  - New skills/bicameral-bind/SKILL.md extracts the bind contract out
    of skills/bicameral-ingest/SKILL.md §2 and tightens advisory rules
    to mandatory: Read at least one candidate file end-to-end, confirm
    symbol via validate_symbols, abort on weak evidence. Documents the
    handler-side rejection contract for agent visibility.
  - skills/bicameral-ingest/SKILL.md §2 reduced from ~38 inline lines
    to a 16-line pointer at the new bind skill — keeps ingest focused
    on extraction + filtering and matches the rest of the catalog
    (one tool ↔ one skill).

Stale ground_mappings() refs cleaned up:

  - code_locator/tools/validate_symbols.py: dropped self._db field +
    L40-41 retention comment (referenced a v0.6.0-deleted path; field
    had zero readers).
  - tests/eval_decision_relevance.py:73 docstring updated to describe
    post-v0.6.0 caller-LLM grounding pipeline.

Tests:

  - 3 existing tests (test_bind_success_with_explicit_lines,
    test_bind_idempotent, test_bind_status_transition) gain a
    resolve_symbol_lines mock since Branch B now exercises it.
  - 2 new tests (test_bind_branch_b_rejects_nonexistent_symbol,
    test_bind_branch_b_rejects_span_mismatch) cover the rejection paths.
  - _spans_overlap helper smoke-tested locally across 8 boundary cases.

PR-1 of 3. Synthetic-recall eval (PR-2) and m2_grounding_* telemetry +
dashboard (PR-3) follow per plan-280-grounding-precision-fix.md.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(bind): reject caller-supplied lines that hallucinate symbols (#280 PR-1)
…#280 PR-2)

Synthetic-fixture benchmark that drives the bicameral-bind skill end-to-end
against 23 cases across three failure modes — same-name-different-module,
similar-intent-different-symbol, and cross-language. Measures three axes
deliberately split for diagnosis:

  - precision  = correct / (correct + wrong_symbol + wrong_file)
  - recall     = correct / total_rows
  - abort_rate = aborted / total_rows

The split matters: high-precision-low-recall = agent over-cautious; low-
precision-high-recall = hallucinations the #280 PR-1 handler would now
reject (handler_rejected outcome would surface as precision drag).
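
The three formulas above can be computed from a flat list of per-case outcomes, as in this sketch. The outcome labels match those the runner classifies (correct / wrong_symbol / wrong_file / aborted); the function name and the zero-denominator behavior are assumptions.

```python
from collections import Counter


def grounding_metrics(outcomes: list) -> dict:
    """Compute precision, recall, and abort rate from outcome labels."""
    counts = Counter(outcomes)
    total = len(outcomes)
    correct = counts["correct"]
    # Precision only considers cases where the agent actually submitted a binding.
    attempted = correct + counts["wrong_symbol"] + counts["wrong_file"]
    return {
        "precision": correct / attempted if attempted else 0.0,
        "recall": correct / total if total else 0.0,
        "abort_rate": counts["aborted"] / total if total else 0.0,
    }
```

For ten cases with six correct, one wrong symbol, one wrong file, and two aborts, this yields precision 0.75, recall 0.6, and abort rate 0.2: the over-cautious profile, where aborts drag recall more than precision.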

Files

  tests/fixtures/grounding_recall/dataset.py          230 LOC
    23 GroundingCase rows: 5 case-A (process_order × 3 modules,
    cancel_order × 2 modules), 10 case-B (rate-limit/throttle/retry/
    auth/metrics intent disambiguation), 8 case-C (Python ↔ TS pairs).
    GENERATOR_VERSION constant invalidates the cache when bumped.
    Import-time _validate_dataset() fails loud on duplicate ids,
    invalid case_type, distractor === intended, etc.

  tests/fixtures/grounding_recall/repo/               15 files / ~625 LOC
    Hand-crafted fixture repo with intended + distractor symbols.
    Each function/class body is short but real enough that the agent
    can actually distinguish behavior from keyword overlap (e.g.
    checkout/orders.py:process_order = customer flow w/ retry cap;
    admin/orders.py:process_order = manual replay of finance-flagged
    orders; billing/refunds.py:process_order = bulk-refund pipeline).

  tests/eval/_bind_judge.py                           466 LOC
    Headless caller-LLM driver — modeled on tests/eval/_skill_judge.py.
    Multi-turn tool-use loop with 3 tools exposed: read_file,
    validate_symbols, submit_binding. Cap at 8 turns. Cache at
    tests/eval/fixtures/bind_judge/ keyed on
    SHA(model | bind_skill | repo | decision). Cache hits keep CI
    cost ~$0 unless dataset, fixture repo, or skill change.

  tests/eval_grounding_recall.py                      256 LOC
    Argparse runner — modeled on tests/eval_decision_relevance.py.
    Loads dataset, drives _bind_judge per case, classifies outcome
    (correct / wrong_symbol / wrong_file / aborted), aggregates,
    emits JSON report, optional gate enforcement (--gate-mode warn|hard).

  .github/workflows/test-mcp-regression.yml           +19 LOC
    New "M2 grounding-recall eval (warn-only)" step. Ubuntu-only,
    continue-on-error: true, mirrors the M1 step shape. ANTHROPIC_API_KEY
    from secrets, model env var, output to test-results/m2-grounding-recall.json.

  CHANGELOG.md                                        +2 lines
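
As an illustration of the judge cache keyed on SHA(model | bind_skill | repo | decision), here is a minimal sketch. The use of SHA-256, the `|` separator, and the `cache_key` name are assumptions; the commit message does not state which SHA variant or encoding the driver uses.

```python
import hashlib


def cache_key(model: str, bind_skill_text: str, repo_digest: str, decision_text: str) -> str:
    """Derive a cache key that changes whenever any of the four inputs changes."""
    parts = (model, bind_skill_text, repo_digest, decision_text)
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```

Because the skill text and repo digest are part of the key, editing the bind skill or the fixture repo naturally invalidates old entries, which is what lets unchanged CI runs hit the cache.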

Default gates per #280 acceptance: recall ≥ 0.80, precision ≥ 0.85,
abort_rate ≤ 0.30. Ship warn-only first to record a post-PR-1 baseline,
then ratchet to --gate-mode hard once the signal is stable. Same path
the M1 eval has been on.
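
The warn-then-hard ratchet can be sketched as follows. The thresholds are the defaults quoted above; `gate_breaches`, `exit_code`, and the non-zero exit in hard mode are assumptions about how the runner behaves, not its actual code.

```python
# Default gates from the commit message: (comparison, threshold).
GATES = {
    "recall": (">=", 0.80),
    "precision": (">=", 0.85),
    "abort_rate": ("<=", 0.30),
}


def gate_breaches(metrics: dict) -> list:
    """List a human-readable line for every gate the metrics violate."""
    breaches = []
    for name, (op, threshold) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            breaches.append(f"{name}={value:.2f} violates {op} {threshold:.2f}")
    return breaches


def exit_code(metrics: dict, gate_mode: str = "warn") -> int:
    """Warn mode always exits 0; hard mode fails the run on any breach."""
    return 1 if gate_breaches(metrics) and gate_mode == "hard" else 0
```

Shipping in warn mode first records breaches without blocking merges, so the baseline can be observed before the gate becomes enforcing.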

Out of scope for PR-2 (per plan-280-grounding-precision-fix.md):

  - PR-3 ships PostHog m2_grounding_* events + dashboard panel
  - Friction capture (≥ 5 design-partner cases) is not engineering scope

Local verification

  - dataset.py imports clean (23 cases, _validate_dataset() passes)
  - _bind_judge symbol indexer resolves all 11 spot-checked intended
    symbols including Class.method form
  - eval_grounding_recall.py CLI runs offline with --skip-missing-fixtures
    (0 cases, gate breaches reported, exit 0 in warn mode as designed)

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five lint-side findings on the initial PR-2 commit, none of them
runtime — fixing in place rather than amending the prior commit:

  - tests/eval/_bind_judge.py B007: add `# noqa: B007` to the
    `for turn in range(...)` loop. The loop variable IS used after
    the loop for telemetry (judgment_payload["turns"] = turn);
    suppression is more honest than renaming to `_turn` and losing
    the post-loop reference.

  - tests/eval/_bind_judge.py mypy: type-annotate `chosen_model: str`
    and tighten the `os.getenv` fallback chain so mypy can resolve
    `str | None` → `str`. Construct BindJudgment field-by-field
    instead of `**judgment_payload` so the dataclass field types
    are enforced (3× errors in the cached + write paths).

  - tests/eval_grounding_recall.py I001 + E402: per-line
    `# noqa: E402, I001` on the two local imports that must follow
    the sys.path inserts. Same shape `eval_decision_relevance.py`
    uses for its single post-path import.

  - tests/eval_grounding_recall.py F541: drop the f-prefix on
    `print(f"  ✓ all gates pass")` (no placeholders).

  - tests/fixtures/grounding_recall/repo/src/checkout/orders.py B007:
    rename `for attempt in range(3):` → `for _attempt in range(3):`
    (loop body doesn't reference the counter).

Plus `ruff format` reflowed 4 files (line wrapping, parens, exponent
spacing) — no semantic changes.

Local verification: ruff check + ruff format --check + mypy all
green on the PR-2 surface (15 fixture files + 2 eval files).

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(eval): M2 grounding-recall harness for caller-LLM bind precision (#280 PR-2)
…PR-3)

Three PostHog events now emit from the bind / ratification surfaces,
plus a local mirror + dashboard panel that reads from it. Closes the
last engineering piece of #280 (PR-3 of 3).

Events
------

  m2_grounding_attempt
    Fires per `handle_bind` per binding. Carries:
      - decision_source (controlled enum: transcript/spec/chat/manual/document)
      - diagnostic.success: bool — bound a region cleanly
      - diagnostic.handler_rejected: bool — true when #280 PR-1's
        reject path fired (caller hallucinated a wrong/non-existent
        symbol on a real file). The split between {success=False,
        handler_rejected=True} and {success=False, handler_rejected=False}
        tells operators whether the failure was the failsafe doing its
        job vs a ledger / IO bug.

  m2_grounding_ratified_correct  (verdict == "compliant")
  m2_grounding_ratified_incorrect (verdict ∈ {"drifted", "not_relevant"})
    Fire per accepted verdict in `handle_resolve_compliance`. Carry:
      - decision_source (same controlled enum)
      - diagnostic.confidence: int (low=0, medium=1, high=2)
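
A minimal sketch of the classification rules listed above. The event names, verdict values, and confidence encoding come from the commit message; the function names and the returned labels for attempt outcomes are hypothetical.

```python
from __future__ import annotations

# Confidence is relayed as an int so the payload stays numeric-only.
CONFIDENCE_LEVELS = {"low": 0, "medium": 1, "high": 2}


def ratification_event(verdict: str) -> str | None:
    """Map an accepted verdict to its event name, or None for other verdicts."""
    if verdict == "compliant":
        return "m2_grounding_ratified_correct"
    if verdict in {"drifted", "not_relevant"}:
        return "m2_grounding_ratified_incorrect"
    return None


def attempt_outcome(success: bool, handler_rejected: bool) -> str:
    """Distinguish the failsafe firing from a ledger or IO failure."""
    if success:
        return "bound"
    return "failsafe_reject" if handler_rejected else "ledger_or_io_failure"
```

Keeping `handler_rejected` separate from `success` is what lets operators tell an expected rejection apart from a genuine bug.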

Privacy
-------

The relay contract from telemetry.py:14-37 is non-negotiable: numeric/
bool diagnostics only, no decision_id / file_path / symbol_name. The
new m2_grounding_log.py owns the split:

  - JSONL local mirror at ~/.bicameral/m2_grounding.jsonl (10 MB
    rotation, 3 backups) carries decision_id for the dashboard panel's
    drill-down. Always written, regardless of relay consent.
  - PostHog relay sees only decision_source + numeric diagnostics —
    decision_id never crosses that boundary. A unit test
    (test_decision_id_never_relayed_to_posthog) pins this invariant.
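
The local-mirror versus relay split might look like the sketch below. The forbidden keys and the numeric/bool-only rule come from the commit message; `split_event`, `append_local`, and the row layout are assumptions, and the 10 MB rotation is not shown.

```python
import json
from pathlib import Path

# Identifiers that must never cross the relay boundary.
FORBIDDEN_RELAY_KEYS = {"decision_id", "file_path", "symbol_name"}


def split_event(event: str, decision_id: str, decision_source: str, diagnostic: dict):
    """Build the full local row and the stripped relay payload for one event."""
    local_row = {
        "event": event,
        "decision_id": decision_id,  # local drill-down only
        "decision_source": decision_source,
        "diagnostic": dict(diagnostic),
    }
    relay_payload = {
        "event": event,
        "decision_source": decision_source,
        # Only numeric/bool diagnostics are forwarded.
        "diagnostic": {
            k: v for k, v in diagnostic.items() if isinstance(v, (bool, int, float))
        },
    }
    return local_row, relay_payload


def append_local(path: Path, row: dict) -> None:
    """Append one JSON object per line to the local mirror."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
```

Building the relay payload from an allow-list, rather than deleting keys from the local row, makes it harder for a newly added identifier to leak by accident.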

Files
-----

  m2_grounding_log.py (new, 241 LOC)
    Owner of the M2 event contract. record_attempt(), record_ratification(),
    read_recent_events(). Lazy-imports server + telemetry to break the
    handlers→server circular dependency at server-boot time. Test hook
    via BICAMERAL_M2_LOG_PATH env override (matches preflight_telemetry
    pattern).

  handlers/bind.py (+73)
    _emit_m2_attempt() helper at module scope. Wired to all five
    terminal paths in the per-binding loop where a decision_id is
    valid: Branch A symbol-not-found, Branch B file-not-found, the
    two #280 PR-1 reject paths, the bind_decision exception path, and
    the success path. API-misuse paths (empty/unknown decision_id)
    skip emission to keep the metric meaningful.

  handlers/resolve_compliance.py (+40)
    _emit_m2_ratification() helper, called per accepted verdict.
    Wraps record_ratification() in try/except so a telemetry failure
    never breaks the verdict write.

  ledger/queries.py (+19)
    New get_decision_source() — single-field SELECT, returns the
    decision's source_type (controlled enum from the ingest contract).

  ledger/adapter.py (+10)
    Adapter delegation method.

  dashboard/server.py (+59)
    New GET /m2_grounding endpoint — aggregates the local mirror into
    rolling-7d per-source counts (attempts / rejects / ratified ✓ /
    ratified ✕) and computes precision. Read-only, no ledger I/O.

  assets/dashboard.html (+60)
    New "M2 grounding precision" panel below the main ledger view.
    Color-codes precision per source: green ≥ 85%, amber ≥ 70%, red
    below. Refreshes every 30s.

  CHANGELOG.md (+2)
    Unreleased entry covering all three events + the local mirror
    contract.

Tests
-----

  tests/test_m2_grounding_log.py (9 tests, all green)
    Pure unit tests — no ledger dep. Cover JSONL row shape, verdict
    classification, time-window filtering, and the privacy invariant
    (decision_id never reaches the relay).

  tests/test_bind_m2_telemetry.py (4 tests + 3 skip-on-no-surrealdb)
    Helper-level: emit forwards args correctly, skips on empty
    decision_id, swallows telemetry failures fire-and-forget.
    Resolve-compliance verdict classification covered behind
    `pytest.importorskip("surrealdb")` since the handler module
    imports ledger.queries at top level — runs in CI, skipped local.

Local verification
------------------

  - 12 passed, 3 skipped on tests/test_m2_grounding_log.py +
    tests/test_bind_m2_telemetry.py
  - ruff check + ruff format --check + mypy all green on touched
    files (m2_grounding_log.py, handlers/bind.py,
    handlers/resolve_compliance.py, ledger/queries.py,
    ledger/adapter.py, dashboard/server.py, both new test files)

What's NOT in this PR
---------------------

Per plan-280-grounding-precision-fix.md:
  - Friction capture (≥ 5 design-partner cases) — design-partner
    work, not engineering scope.
  - PR-2 gate-flip (warn → hard) — separate small follow-up after
    PR-3 lands and we have a baseline reading. Aligns with Jin's
    "deliberate not drift" framing.
  - attempt_to_ratify_seconds field — deferred. Would need a
    `created_at` field on the binds_to edge (schema currently has
    only `confidence` + `provenance`); not worth a schema bump in
    this PR.

Closes #280.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…CI instead (per Jin)

Jin clarified the operator-dashboard scope: it's for users. M2 grounding
precision is an engineering quality metric, not user-facing. Reverting
the dashboard pieces; adding GitHub Actions step-summary surfacing
which is where engineers actually look for these numbers.

Reverted from PR-3's initial shape
----------------------------------

  assets/dashboard.html
    - Drop the <section id="m2-panel"> block + the renderM2 / loadM2 /
      setInterval JS. Dashboard returns to pre-#280 user view.

  dashboard/server.py
    - Drop the GET /m2_grounding route + _serve_m2_grounding handler.

  m2_grounding_log.py
    - Drop read_recent_events() (only consumer was _serve_m2_grounding;
      now dead code per Jin's "avoid bloat unless product-justified").
    - Drop now-unused `time` import.

  tests/test_m2_grounding_log.py
    - Drop test_read_recent_events_respects_window (function gone) and
      now-unused `os` import.

Added (the new piece)
---------------------

  tests/eval_grounding_recall_summary.py (new)
    Renders the PR-2 eval JSON (test-results/m2-grounding-recall.json)
    as a markdown block — precision / recall / abort-rate scoreboard,
    outcome breakdown, per-case-type recall table, gate-breach line,
    expandable miss-list capped at 25 rows. Fail-quiet: missing/malformed
    JSON degrades to a one-line note rather than failing CI.

  .github/workflows/test-mcp-regression.yml (+10)
    New "M2 metrics summary" step after the M2 eval. Pipes the
    renderer's stdout to $GITHUB_STEP_SUMMARY so the metrics show on
    the GitHub Actions run page without needing the artifact download.
    always() guard so the summary appears even when the eval step
    above warns. continue-on-error keeps it advisory.
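
The fail-quiet rendering described above can be sketched as follows. The JSON keys (`precision`, `recall`, `abort_rate`) and the `render_summary` name are assumptions about the report shape; the real renderer also emits outcome breakdowns and a miss-list that are omitted here.

```python
import json
from pathlib import Path


def render_summary(report_path: Path) -> str:
    """Render the eval report as markdown, degrading to a note on any problem."""
    try:
        report = json.loads(report_path.read_text(encoding="utf-8"))
        rows = [
            (name, float(report[name]))
            for name in ("precision", "recall", "abort_rate")
        ]
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, invalid JSON, or unexpected shape: never fail CI.
        return "_M2 grounding-recall report missing or malformed; no metrics to show._"
    lines = ["### M2 grounding recall", "", "| Metric | Value |", "| --- | --- |"]
    lines += [f"| {name} | {value:.2f} |" for name, value in rows]
    return "\n".join(lines)
```

In a workflow step, printing this string and appending stdout to the file named by `$GITHUB_STEP_SUMMARY` is enough for it to appear on the run page.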

Kept from PR-3's initial shape
------------------------------

  - The three PostHog events from handle_bind / handle_resolve_compliance.
  - The privacy-preserving local mirror at ~/.bicameral/m2_grounding.jsonl
    (operator support + diagnose CLI surface; never relayed).
  - The m2_grounding_log.py module's record_attempt / record_ratification
    public API.
  - All telemetry tests (privacy invariant pin still holds).

Net Δ on PR-3: -119 LOC dashboard pieces, +210 LOC summary renderer
+ workflow step. Tests: 11 passed, 3 skipped (resolve_compliance
import-or-skip). Ruff + ruff format + mypy all green.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous revert left an extra blank line where the `<section
id="m2-panel">` block lived. Removes it so assets/dashboard.html is
byte-identical to origin/dev — confirming Jin's "don't change the
user dashboard" intent verbatim.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(telemetry): M2 grounding-precision events + dashboard panel (#280 PR-3)
…nds (#277)

Closes #277. Implements v0 Productization §2: shifts team mode entirely
off git as the inter-machine replication substrate, onto a pluggable
backend with two ship-day implementations (LocalFolder, GoogleDrive).
Pull-only sync; no daemons, no webhooks, no Bicameral server in the loop.

What changes for users
- Setup wizard team-mode branch now offers Create vs Join vs LocalFolder.
  Create: provisions a Drive folder under the operator's Google account,
  prints the literal share-text-to-teammates message. Join: paste folder
  ID/URL, OAuth, verify access (404 / read-only both block), confirm the
  resolved signer (default-No) before persisting. LocalFolder: single
  prompt for the path.
- Drive integration uses Bicameral's bundled OAuth client (the same
  pattern gh / gcloud / cursor use). Scope: drive.file only — Bicameral's
  CLI can only see files it creates inside the team folder. Token cache
  at ~/.bicameral/google-drive-token.json mode 0600.
- Colored security disclosure renders before the browser opens, walking
  the operator through what flows where, what we do and don't see, and
  the trust dependency. Mirrored on bicameral-ai.com/privacy
  (BicameralAI/bicameral PR #111).

Architecture
- events/backends/__init__.py — BackendAdapter ABC + get_backend factory.
- events/backends/local_folder.py — sha256-idempotent LocalFolderAdapter.
- events/backends/google_drive.py — Drive Files API adapter; bundled
  client_id + client_secret (RFC 8252 native-app pattern, no env override
  per Option A); FolderNotFoundError / ReadOnlyAccessError surface for
  Join verify_access; create_folder helper for Create branch.
- events/team_adapter.py — TeamWriteAdapter accepts backend=, marks
  _dirty on every write, exposes flush_to_backend().
- adapters/ledger.py — _read_collaboration_mode refactored to
  _read_team_config(repo_path) -> dict; constructs backend and injects
  into TeamWriteAdapter.
- handlers/sync_middleware.py — ensure_team_synced (30 s TTL pull) +
  flush_team_writes (post-handler push); errors swallowed at DEBUG.
- server.py — wires both into the dispatch site (pull at top, flush in
  finally).
- setup_wizard.py — Create/Join/LocalFolder dispatch + colored security
  disclosure + identity-confirmation prompt at Join time.

Testing
- 53 new tests, 1 platform-skip (Windows-only path):
  - LocalFolderAdapter: 6 tests (push idempotency, pull peer-files-only,
    list_peers, lock serialization)
  - TeamWriteAdapter ↔ backend: 3 tests (connect-pulls-then-replays,
    write-marks-dirty-then-flush-pushes, no-backend-noop)
  - Two-author round-trip: 2 tests
  - Sync middleware: 5 tests (TTL cache, no-backend-noop, error swallowing)
  - GoogleDriveAdapter: 11 tests (push idempotency on md5, pull
    own-file-skip + max-modifiedTime token, lock create-then-delete +
    cleanup on exception, verify_access 404 / read-only / can-edit,
    create_folder, placeholder-detection auto-skip when bundled client
    is published)
  - Setup wizard Create/Join: 11 tests including identity decline,
    OAuth-disclosure decline, folder-id URL extraction, unwritable-path
    rejection
- All adjacent regression tests still pass (test_team_event_replay,
  test_event_writer).
- Lint clean across events/ adapters/ handlers/sync_middleware.py
  setup_wizard.py + new test files.

Security model (also documented at docs/team-mode-setup.md and on
bicameral-ai.com/privacy)
- Decision data flows your-CLI ↔ Google directly. Bicameral the company
  does NOT receive copies. No Bicameral server in the loop.
- drive.file scope limits the CLI on the user's machine to files it
  creates in the team folder. The rest of the user's Drive is invisible
  to the CLI; Google enforces this server-side.
- As OAuth app publisher, Bicameral receives aggregate API request
  counts and per-user OAuth consent records (which Google accounts
  authenticated, when). Not contents.
- Trust dependency: same as any OAuth tool (gh, gcloud, Notion, Slack
  desktop) — open-source CLI behaves as advertised, mitigated by source
  visibility.

OAuth verification submission text + GCP setup checklist:
docs/google-oauth-verification-submission.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure formatting — `ruff format` against the 10 files touched in #277.
No semantic changes. CI's `ruff format --check .` now passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…endAdapter

Mypy was failing on `events/backends/__init__.py:62,66` — the factory's
return type is `BackendAdapter | None`, but the two concrete adapters
were structurally compatible without declaring inheritance. Added
explicit `BackendAdapter` base.

Both classes already implemented all four abstract methods (push_events,
pull_events, lock, list_peers) — runtime check (issubclass + concrete
instantiation) passes. No behavior change.
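
To make the inheritance fix concrete, here is a minimal sketch of an ABC with the four abstract methods named above and a folder-backed subclass. The method signatures, the one-file-per-author layout, and the `.lock` file are all assumptions; only the class names, method names, and the sha256-idempotent push come from the commit messages.

```python
from __future__ import annotations

import hashlib
import os
from abc import ABC, abstractmethod
from contextlib import contextmanager
from pathlib import Path


class BackendAdapter(ABC):
    @abstractmethod
    def push_events(self, author: str, payload: bytes) -> bool: ...

    @abstractmethod
    def pull_events(self, author: str) -> dict[str, bytes]: ...

    @abstractmethod
    def lock(self): ...

    @abstractmethod
    def list_peers(self) -> list[str]: ...


class LocalFolderAdapter(BackendAdapter):  # explicit base, so mypy accepts the factory
    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def push_events(self, author: str, payload: bytes) -> bool:
        """Write the author's log; skip the write when the content hash matches."""
        target = self.root / f"{author}.jsonl"
        new_hash = hashlib.sha256(payload).hexdigest()
        if target.exists() and hashlib.sha256(target.read_bytes()).hexdigest() == new_hash:
            return False
        target.write_bytes(payload)
        return True

    def pull_events(self, author: str) -> dict[str, bytes]:
        """Return peer logs only, never the caller's own file."""
        return {
            p.stem: p.read_bytes()
            for p in sorted(self.root.glob("*.jsonl"))
            if p.stem != author
        }

    @contextmanager
    def lock(self):
        """Exclusive-create lock file; raises FileExistsError when already held."""
        lock_path = self.root / ".lock"
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        try:
            yield
        finally:
            os.close(fd)
            lock_path.unlink()

    def list_peers(self) -> list[str]:
        return sorted(p.stem for p in self.root.glob("*.jsonl"))
```

Declaring the base class changes nothing at runtime here, but it lets a factory annotated as returning `BackendAdapter | None` type-check without casts.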

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(team-mode): remote event-log adapter — Drive + LocalFolder backends (#277)
Triages 25 dev commits onto main (already on dev as of merge time):
  • #289 — team-mode remote event-log adapter (#277)
  • #285, #284, #283 — M2 grounding telemetry, eval harness, precision fix (#280)
  • #275 — README/SECURITY surface
  • plus assorted fixes flowing through dev

Resolved conflicts in CHANGELOG.md (kept dev's [Unreleased] block,
inserted v0.14.2's release entry from main below it, then renamed
[Unreleased] → v0.14.3) and README.md (kept dev's Solo-vs-Team mode
section + extended setup-writes table from #289 — main was missing
both because PR #289 hadn't backflowed yet).

pyproject.toml: 0.14.2 → 0.14.3
RECOMMENDED_VERSION: 0.14.1 → 0.14.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 9, 2026

Warning

Rate limit exceeded

@jinhongkuan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 59 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3a569177-8f43-4260-a6ba-cb844a4eac0a

📥 Commits

Reviewing files that changed from the base of the PR and between 233d463 and af2873e.

📒 Files selected for processing (54)
  • .github/workflows/test-mcp-regression.yml
  • CHANGELOG.md
  • README.md
  • RECOMMENDED_VERSION
  • adapters/ledger.py
  • code_locator/tools/validate_symbols.py
  • docs/google-oauth-verification-submission.md
  • docs/team-mode-setup.md
  • events/backends/__init__.py
  • events/backends/google_drive.py
  • events/backends/local_folder.py
  • events/team_adapter.py
  • handlers/bind.py
  • handlers/resolve_compliance.py
  • handlers/sync_middleware.py
  • ledger/adapter.py
  • ledger/queries.py
  • m2_grounding_log.py
  • pyproject.toml
  • requirements.txt
  • server.py
  • setup_wizard.py
  • skills/bicameral-bind/SKILL.md
  • skills/bicameral-ingest/SKILL.md
  • tests/eval/_bind_judge.py
  • tests/eval_decision_relevance.py
  • tests/eval_grounding_recall.py
  • tests/eval_grounding_recall_summary.py
  • tests/fixtures/grounding_recall/dataset.py
  • tests/fixtures/grounding_recall/repo/src/admin/orders.py
  • tests/fixtures/grounding_recall/repo/src/auth/session.py
  • tests/fixtures/grounding_recall/repo/src/auth/tokens.py
  • tests/fixtures/grounding_recall/repo/src/billing/refunds.py
  • tests/fixtures/grounding_recall/repo/src/checkout/orders.py
  • tests/fixtures/grounding_recall/repo/src/checkout/retry.py
  • tests/fixtures/grounding_recall/repo/src/checkout/throttle.py
  • tests/fixtures/grounding_recall/repo/src/metrics/collect.py
  • tests/fixtures/grounding_recall/repo/src/metrics/collect.ts
  • tests/fixtures/grounding_recall/repo/src/middleware/global_rate_limit.py
  • tests/fixtures/grounding_recall/repo/src/middleware/tenant_rate_limit.py
  • tests/fixtures/grounding_recall/repo/src/webhooks/dispatch.py
  • tests/fixtures/grounding_recall/repo/src/webhooks/dispatch.ts
  • tests/fixtures/grounding_recall/repo/src/webhooks/verify.py
  • tests/fixtures/grounding_recall/repo/src/webhooks/verify.ts
  • tests/test_backends_google_drive_unit.py
  • tests/test_backends_local_folder.py
  • tests/test_bind.py
  • tests/test_bind_m2_telemetry.py
  • tests/test_m2_grounding_log.py
  • tests/test_setup_wizard_team_backend.py
  • tests/test_sync_middleware_team.py
  • tests/test_team_adapter_with_backend.py
  • tests/test_team_round_trip_local_folder.py
  • thoughts/shared/plans/2026-05-08-remote-event-log-adapter.md

…tor subclass shape

Mypy was failing on triage PR #290:
  events/backends/local_folder.py:72: error: Return type "AsyncIterator[str]"
  of "list_peers" incompatible with return type "Coroutine[Any, Any,
  AsyncIterator[str]]" in supertype "BackendAdapter"

Both concrete adapters implement list_peers as async generators (`async def
... yield`), which return AsyncIterator[str] directly. The ABC's `async def`
declaration typed it as Coroutine[..., AsyncIterator[str]] — a different
shape. Per mypy docs (more_types.html#asynchronous-iterators), async-iterator
methods should be declared `def -> AsyncIterator[T]` in the supertype.

Concrete implementations unchanged; tests still pass.
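
The typing mismatch above can be reproduced in a short sketch. The class names `PeerSource` and `StaticPeers` and the helper `collect` are hypothetical; only `list_peers` and the `AsyncIterator[str]` return type come from the commit message. Calling an `async def` function that contains `yield` returns an async generator immediately rather than a coroutine, so the abstract declaration is written as a plain `def` returning `AsyncIterator[str]`.

```python
from __future__ import annotations

import asyncio
import inspect
from abc import ABC, abstractmethod
from typing import AsyncIterator


class PeerSource(ABC):
    """Hypothetical supertype; stands in for the abstract adapter."""

    @abstractmethod
    def list_peers(self) -> AsyncIterator[str]:
        # Plain `def`: the call itself returns the async iterator.
        ...


class StaticPeers(PeerSource):
    async def list_peers(self) -> AsyncIterator[str]:
        # `async def` + `yield` makes this an async generator function.
        for name in ("alice", "bob"):
            yield name


async def collect(source: PeerSource) -> list[str]:
    # No `await` on the call; iterate it directly with `async for`.
    return [peer async for peer in source.list_peers()]


peers = StaticPeers().list_peers()
is_async_gen = inspect.isasyncgen(peers)   # the call yields an async generator
is_coroutine = inspect.iscoroutine(peers)  # and not a coroutine
names = asyncio.run(collect(StaticPeers()))
```

Had the supertype used `async def` with no `yield` in its body, mypy would read its return type as a coroutine wrapping the iterator, which is the incompatibility reported in the error.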

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>