feat(telemetry): M2 grounding-precision events + dashboard panel (#280 PR-3) #285
…PR-3)

Three PostHog events now emit from the bind / ratification surfaces,
plus a local mirror + dashboard panel that reads from it. Closes the
last engineering piece of #280 (PR-3 of 3).

Events
------
m2_grounding_attempt
  Fires per `handle_bind` per binding. Carries:
  - decision_source (controlled enum:
    transcript/spec/chat/manual/document)
  - diagnostic.success: bool, bound a region cleanly
  - diagnostic.handler_rejected: bool, true when #280 PR-1's reject
    path fired (caller hallucinated a wrong/non-existent symbol on a
    real file).
  The split between {success=False, handler_rejected=True} and
  {success=False, handler_rejected=False} tells operators whether the
  failure was the failsafe doing its job vs a ledger / IO bug.

m2_grounding_ratified_correct   (verdict == "compliant")
m2_grounding_ratified_incorrect (verdict ∈ {"drifted", "not_relevant"})
  Fire per accepted verdict in `handle_resolve_compliance`. Carry:
  - decision_source (same controlled enum)
  - diagnostic.confidence: int (low=0, medium=1, high=2)

Privacy
-------
The relay contract from telemetry.py:14-37 is non-negotiable:
numeric/bool diagnostics only, no decision_id / file_path /
symbol_name. The new m2_grounding_log.py owns the split:
- JSONL local mirror at ~/.bicameral/m2_grounding.jsonl (10 MB
  rotation, 3 backups) carries decision_id for the dashboard panel's
  drill-down. Always written, regardless of relay consent.
- PostHog relay sees only decision_source + numeric diagnostics;
  decision_id never crosses that boundary. A unit test
  (test_decision_id_never_relayed_to_posthog) pins this invariant.

Files
-----
m2_grounding_log.py (new, 241 LOC)
  Owner of the M2 event contract. record_attempt(),
  record_ratification(), read_recent_events(). Lazy-imports server +
  telemetry to break the handlers→server circular dependency at
  server-boot time. Test hook via BICAMERAL_M2_LOG_PATH env override
  (matches preflight_telemetry pattern).
handlers/bind.py (+73)
  _emit_m2_attempt() helper at module scope.
  Wired to all five terminal paths in the per-binding loop where a
  decision_id is valid: Branch A symbol-not-found, Branch B
  file-not-found, the two #280 PR-1 reject paths, the bind_decision
  exception path, and the success path. API-misuse paths
  (empty/unknown decision_id) skip emission to keep the metric
  meaningful.
handlers/resolve_compliance.py (+40)
  _emit_m2_ratification() helper, called per accepted verdict. Wraps
  record_ratification() in try/except so a telemetry failure never
  breaks the verdict write.
ledger/queries.py (+19)
  New get_decision_source(): single-field SELECT, returns the
  decision's source_type (controlled enum from the ingest contract).
ledger/adapter.py (+10)
  Adapter delegation method.
dashboard/server.py (+59)
  New GET /m2_grounding endpoint. Aggregates the local mirror into
  rolling-7d per-source counts (attempts / rejects / ratified ✓ /
  ratified ✕) and computes precision. Read-only, no ledger I/O.
assets/dashboard.html (+60)
  New "M2 grounding precision" panel below the main ledger view.
  Color-codes precision per source: green ≥ 85%, amber ≥ 70%, red
  below. Refreshes every 30s.
CHANGELOG.md (+2)
  Unreleased entry covering all three events + the local mirror
  contract.

Tests
-----
tests/test_m2_grounding_log.py (9 tests, all green)
  Pure unit tests, no ledger dep. Cover JSONL row shape, verdict
  classification, time-window filtering, and the privacy invariant
  (decision_id never reaches the relay).
tests/test_bind_m2_telemetry.py (4 tests + 3 skip-on-no-surrealdb)
  Helper-level: emit forwards args correctly, skips on empty
  decision_id, swallows telemetry failures fire-and-forget.
  Resolve-compliance verdict classification covered behind
  `pytest.importorskip("surrealdb")` since the handler module imports
  ledger.queries at top level; runs in CI, skipped locally.
Local verification
------------------
- 12 passed, 3 skipped on tests/test_m2_grounding_log.py +
  tests/test_bind_m2_telemetry.py
- ruff check + ruff format --check + mypy all green on touched files
  (m2_grounding_log.py, handlers/bind.py,
  handlers/resolve_compliance.py, ledger/queries.py,
  ledger/adapter.py, dashboard/server.py, both new test files)

What's NOT in this PR
---------------------
Per plan-280-grounding-precision-fix.md:
- Friction capture (≥ 5 design-partner cases): design-partner work,
  not engineering scope.
- PR-2 gate-flip (warn → hard): separate small follow-up after PR-3
  lands and we have a baseline reading. Aligns with Jin's "deliberate
  not drift" framing.
- attempt_to_ratify_seconds field: deferred. Would need a
  `created_at` field on the binds_to edge (schema currently has only
  `confidence` + `provenance`); not worth a schema bump in this PR.

Closes #280. Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
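
To make the privacy split above concrete, here is a minimal sketch of how a local mirror row and a relay payload could diverge. It is an illustration only: `build_relay_payload`, `record_attempt_sketch`, and the `send_event` callable are hypothetical names, and the row layout is an assumption. Only the field names (`decision_id`, `decision_source`, `diagnostic.success`, `diagnostic.handler_rejected`) and the event name come from the commit message.

```python
import json

# Assumption: the relay accepts only numeric/bool diagnostic values.
RELAY_ALLOWED_TYPES = (bool, int, float)


def build_relay_payload(decision_source: str, diagnostic: dict) -> dict:
    """Build the relay payload: controlled enum plus numeric/bool diagnostics."""
    safe = {k: v for k, v in diagnostic.items() if isinstance(v, RELAY_ALLOWED_TYPES)}
    return {"decision_source": decision_source, "diagnostic": safe}


def record_attempt_sketch(log_path, decision_id, decision_source,
                          success, handler_rejected, send_event):
    """Hypothetical stand-in for record_attempt(); not the real implementation."""
    diagnostic = {"success": success, "handler_rejected": handler_rejected}

    # The local mirror row keeps decision_id so a drill-down stays possible.
    row = {
        "event": "m2_grounding_attempt",
        "decision_id": decision_id,
        "decision_source": decision_source,
        "diagnostic": diagnostic,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")

    # The relay payload is built separately and never receives decision_id.
    payload = build_relay_payload(decision_source, diagnostic)
    send_event("m2_grounding_attempt", payload)
    return payload
```

Building the relay payload from an allow-list, rather than deleting keys from the local row, means a new local-only field cannot leak by default.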
…CI instead (per Jin)
Jin clarified the operator-dashboard scope: it's for users. M2 grounding
precision is an engineering quality metric, not user-facing. Reverting
the dashboard pieces; adding GitHub Actions step-summary surfacing
which is where engineers actually look for these numbers.
Reverted from PR-3's initial shape
----------------------------------
assets/dashboard.html
- Drop the <section id="m2-panel"> block + the renderM2 / loadM2 /
setInterval JS. Dashboard returns to pre-#280 user view.
dashboard/server.py
- Drop the GET /m2_grounding route + _serve_m2_grounding handler.
m2_grounding_log.py
- Drop read_recent_events() (only consumer was _serve_m2_grounding;
now dead code per Jin's "avoid bloat unless product-justified").
- Drop now-unused `time` import.
tests/test_m2_grounding_log.py
- Drop test_read_recent_events_respects_window (function gone) and
now-unused `os` import.
Added (the new piece)
---------------------
tests/eval_grounding_recall_summary.py (new)
Renders the PR-2 eval JSON (test-results/m2-grounding-recall.json)
as a markdown block — precision / recall / abort-rate scoreboard,
outcome breakdown, per-case-type recall table, gate-breach line,
expandable miss-list capped at 25 rows. Fail-quiet: missing/malformed
JSON degrades to a one-line note rather than failing CI.
.github/workflows/test-mcp-regression.yml (+10)
New "M2 metrics summary" step after the M2 eval. Pipes the
renderer's stdout to $GITHUB_STEP_SUMMARY so the metrics show on
the GitHub Actions run page without needing the artifact download.
always() guard so the summary appears even when the eval step
above warns. continue-on-error keeps it advisory.
Kept from PR-3's initial shape
------------------------------
- The three PostHog events from handle_bind / handle_resolve_compliance.
- The privacy-preserving local mirror at ~/.bicameral/m2_grounding.jsonl
(operator support + diagnose CLI surface; never relayed).
- The m2_grounding_log.py module's record_attempt / record_ratification
public API.
- All telemetry tests (privacy invariant pin still holds).
Net Δ on PR-3: -119 LOC dashboard pieces, +210 LOC summary renderer
+ workflow step. Tests: 11 passed, 3 skipped (resolve_compliance
import-or-skip). Ruff + ruff format + mypy all green.
Refs #280.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
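
The fail-quiet behaviour described for the summary renderer can be sketched as follows. This is a hedged illustration, not the contents of `tests/eval_grounding_recall_summary.py`: the function name `render_summary` and the JSON keys (`precision`, `recall`, `abort_rate`) are assumptions made for the example.

```python
import json
from pathlib import Path


def render_summary(path: str) -> str:
    """Render a small markdown scoreboard, or a one-line note on any read error."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        rows = [(name, float(data[name])) for name in ("precision", "recall", "abort_rate")]
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, malformed JSON, or unexpected shape: degrade quietly
        # instead of raising, so the CI step stays advisory.
        return f"_M2 eval summary unavailable: could not read `{path}`._"

    lines = ["### M2 grounding eval", "", "| Metric | Value |", "|---|---|"]
    lines += [f"| {name} | {value:.3f} |" for name, value in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    # In a workflow, stdout would be appended to $GITHUB_STEP_SUMMARY.
    print(render_summary("test-results/m2-grounding-recall.json"))
```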
Updated per Jin's note: the operator dashboard is for users; M2 grounding precision is an engineering quality metric and belongs on GitHub instead.
Reverted
Added
Kept (no change to telemetry contract)
Net Δ on this PR: −119 LOC dashboard pieces, +210 LOC summary renderer + workflow step. Tests still 11 passed / 3 skipped (resolve_compliance import-or-skip), ruff + format + mypy clean.
The previous revert left an extra blank line where the `<section id="m2-panel">` block lived. Removes it so assets/dashboard.html is byte-identical to origin/dev — confirming Jin's "don't change the user dashboard" intent verbatim. Refs #280. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dashboard.html is now byte-identical to origin/dev. Net change to the user dashboard from this PR: zero. Confirms Jin's "the dashboard is for users" intent verbatim. Pushed as …
Triages 25 dev commits onto main (already on dev as of merge time):
• #289: team-mode remote event-log adapter (#277)
• #285, #284, #283: M2 grounding telemetry, eval harness, precision fix (#280)
• #275: README/SECURITY surface
• plus assorted fixes flowing through dev

Resolved conflicts in CHANGELOG.md (kept dev's [Unreleased] block, inserted v0.14.2's release entry from main below it, then renamed [Unreleased] → v0.14.3) and README.md (kept dev's Solo-vs-Team mode section + extended setup-writes table from #289; main was missing both because PR #289 hadn't backflowed yet).

pyproject.toml: 0.14.2 → 0.14.3
RECOMMENDED_VERSION: 0.14.1 → 0.14.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ne (BicameralAI#280)

PR BicameralAI#285's first CI run produced a clean baseline:

  23 cases / precision 0.913 / recall 0.913 / abort_rate 0.000
  ✓ all gates pass

Against the gates (≥ 0.85 / ≥ 0.80 / ≤ 0.30) that is roughly 6 pp of headroom on precision, 11 pp on recall, and 30 pp on abort rate. Locking the baseline in before drift sets in.

Two changes to .github/workflows/test-mcp-regression.yml:

1. `--gate-mode warn` → `--gate-mode hard`. Runner exits non-zero on breach instead of warning to step output.
2. Removed `continue-on-error: true` from the eval step. The step now fails CI when the gate breaches.

The metrics-summary step keeps `continue-on-error: true` so a renderer bug never masks the eval result, and the `always()` guard means the breach summary is still rendered inline when the eval fails.

After this lands, PRs that touch the bind handler / bind skill / fixture / dataset must EITHER keep recall ≥ 0.80 / precision ≥ 0.85 / abort_rate ≤ 0.30, OR deliberately re-record the cache by setting BICAMERAL_GROUNDING_EVAL_RECORD=1 after a skill-prompt change.

Aligns with Jin's "deliberate not drift" framing: the same path the M1 eval *should* have taken (M1 has been warn-only forever; M2 is being flipped while the baseline is fresh, days after the eval shipped).

Refs BicameralAI#280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
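
The difference between warn and hard gate modes can be sketched as below. The thresholds and the two mode names come from the commit message; `find_breaches`, `exit_code`, and the metrics dictionary shape are hypothetical and are not taken from the actual eval runner.

```python
import sys

# Gate thresholds quoted in the commit message.
GATES = {
    "precision": (">=", 0.85),
    "recall": (">=", 0.80),
    "abort_rate": ("<=", 0.30),
}


def find_breaches(metrics: dict) -> list:
    """Return one human-readable line per breached gate."""
    breaches = []
    for name, (op, threshold) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            breaches.append(f"{name}={value:.3f} violates {op} {threshold:.2f}")
    return breaches


def exit_code(metrics: dict, gate_mode: str) -> int:
    """warn: always 0 (breaches are only reported); hard: 1 on any breach."""
    breaches = find_breaches(metrics)
    for line in breaches:
        print(f"gate breach: {line}", file=sys.stderr)
    return 1 if breaches and gate_mode == "hard" else 0


if __name__ == "__main__":
    baseline = {"precision": 0.913, "recall": 0.913, "abort_rate": 0.0}
    sys.exit(exit_code(baseline, "hard"))
```

With the baseline figures every gate passes, so the hard mode only changes behaviour once a later change pushes a metric past its threshold.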
…M_skill_preflight CI surfacing

Closes Parts B and C of BicameralAI#306. Part A (dataset 3→25 + Step-1 baseline) shipped in PR BicameralAI#396; this PR adds the upstream measurement Part A's relevance eval can't see and wires both into CI's GITHUB_STEP_SUMMARY.

## Part B: Step-0 invocation harness

The "implicit tool invocation" failure pattern (OpenAI's eval-skills guidance): does the agent elect to call ``bicameral.history()`` when the preflight handler returns empty? Part A's 100% Step-1 recall is moot if the agent never reaches Step-1.

New files:

- ``tests/eval/_skill_invocation_judge.py``: multi-turn tool-use harness modeled on ``_bind_judge.py``. Exposes ``bicameral_history`` + ``submit_decision_to_proceed`` as tool defs. Same x-api-key auth, same retry envelope (3 attempts / 2-8-32s backoff), same fixture cache discipline (SHA(model | skill_sha | input_sha)). MAX_TURNS=4. Pure outcome classifier ``classify_outcome(should_invoke, invoked)`` lives at module scope so the summary renderer + sociable tests share one truth table.
- ``tests/eval/preflight_skill_invocation_dataset.jsonl``: 15 hand-curated rows balanced 8 should_invoke / 7 should_skip. Should-invoke cases seed vocab-mismatch / ungrounded / cross-cutting policy decisions that only ``bicameral.history()`` can surface. Should-skip cases are negative controls (dark mode, dep bumps, README typo, etc.) per OpenAI's implicit-invocation testing pattern.
- ``tests/eval/run_preflight_skill_invocation_eval.py``: pytest runner with skip-clean-without-cache-or-key. Schema sanity test enforces the 8/7 balance.
- ``tests/test_skill_invocation_judge.py``: sociable unit tests for the 2x2 outcome classifier per CLAUDE.md (no MagicMock, table-driven).
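
As an illustration of the 2x2 outcome classifier mentioned above, a table-driven version might look like the sketch below. The function name and parameters match the PR text and the four outcome labels are the ones used in this PR's baseline table. The string return type and the lookup-table layout are assumptions, not the shipped code.

```python
# Truth table keyed on (should_invoke, invoked).
OUTCOMES = {
    (True, True): "invoked_history_correctly",       # TP
    (True, False): "skipped_history_should_have",    # FN
    (False, True): "invoked_history_unnecessarily",  # FP
    (False, False): "proceeded_without_fetch",       # TN
}


def classify_outcome(should_invoke: bool, invoked: bool) -> str:
    """Map ground truth and observed behaviour onto one of four outcome labels."""
    return OUTCOMES[(bool(should_invoke), bool(invoked))]
```

Keeping the mapping as data rather than nested conditionals lets a renderer and a table-driven test iterate over the same four cells.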
Step-0 baseline (Sonnet 4.5, 15 fixtures committed):

| Outcome | Count | Cell |
|---|---|---|
| invoked_history_correctly | 8 | TP |
| skipped_history_should_have | 0 | FN (load-bearing failure mode) |
| invoked_history_unnecessarily | 1 (S0_dark_mode) | FP (over-fetch) |
| proceeded_without_fetch | 6 | TN |

| Metric | Value | Gate |
|---|---|---|
| Recall (should-invoke axis) | 100.0% (8/8) | ≥ 50% ✅ |
| Precision (TP/(TP+FP)) | 88.9% (8/9) | n/a |
| FP rate (over-fetch) | 14.3% (1/7) | ≤ 30% ✅ |

The single FP, S0_dark_mode, fetched history "to check for cross-cutting decisions on theming, styling, or UI state management". A reasonable instinct against the strict ground truth, but it counts as wasted tokens. Worth surfacing as a soft signal; not severe enough to file the SKILL.md strengthening followup BicameralAI#306 calls out (the FN floor stays clean at 0/8).

The 100% should-invoke recall reframes the BicameralAI#58 architectural question on path C: the v0.10.0 split (handler structural, skill LLM-over-history) works at the Step-0 layer too, not just at Step-1. Combined with Part A's 100% Step-1 recall, the skill-layer architecture is empirically sound on synthetic cases. Confidence interval is the next surface: 15 rows defend ~15pp differences with 80% power per Anthropic's statistical approach to evals.

## Part C: CI surfacing

Two new CLI runners + one summary renderer mirror the M2/M6 pattern:

- ``tests/eval_preflight_skill_step1.py``: CLI runner that drives ``_skill_judge.judge_relevance`` over the 25-row Part A dataset and emits aggregate JSON (per-axis recall + breakdown).
- ``tests/eval_preflight_skill_invocation.py``: CLI runner that drives ``_skill_invocation_judge.run_invocation_judgment`` over the 15-row Step-0 dataset and emits aggregate JSON (2x2 confusion matrix + recall/precision/fp_rate).
- ``tests/eval_preflight_skill_summary.py``: reads both JSONs and renders one combined markdown block (per-axis recall table + invocation 2x2 + FN miss list) to stdout. Fail-quiet on missing JSON. Mirrors ``eval_grounding_recall_summary.py`` (BicameralAI#285) and ``eval_preflight_m6_summary.py`` (BicameralAI#304).

Workflow wiring:

- ``.github/workflows/preflight-eval.yml``: new Phase 2b step running the Step-0 pytest runner with cached fixtures + ANTHROPIC_API_KEY secret fallback. continue-on-error: true.
- ``.github/workflows/test-mcp-regression.yml``: new M_skill_preflight block alongside M2 (line ~231) and M6 (line ~262). Runs both CLI runners with --gate-mode warn, then renders the combined summary to $GITHUB_STEP_SUMMARY. Promote to --gate-mode hard in a followup PR after one stable run (matches BicameralAI#288 M2 warn→hard pattern).

## Cache discipline

- 15 Step-0 fixtures committed under ``tests/eval/fixtures/skill_invocation_judge/``.
- CI runs cache-hits-only after merge.
- Re-record locally with ``BICAMERAL_PREFLIGHT_INVOCATION_EVAL_RECORD=1`` when the bicameral-preflight SKILL.md prompt changes (cache key includes SKILL.md SHA).

## Sample-size note

15 rows defend ~15pp differences with 80% power. Tighter claims (5pp) need ~50 rows. Expansion gated on whether the warn-only signal reveals real drift worth investing in.

## Acceptance touched

- [x] ``tests/eval/_skill_invocation_judge.py`` exists + invoked by ``tests/eval/run_preflight_skill_invocation_eval.py``. Follows the ``_bind_judge.py`` pattern (multi-turn loop + httpx retry + fixture caching keyed on SKILL.md SHA).
- [x] ``tests/eval/preflight_skill_invocation_dataset.jsonl`` contains 15 rows balanced 8/7 across should_invoke / should_skip.
- [x] ``.github/workflows/preflight-eval.yml`` Phase 2b surfaces the new step alongside the existing skill eval. ``continue-on-error: true``.
- [x] Sociable tests per CLAUDE.md (SimpleNamespace-equivalent: real dataclasses + table-driven, no MagicMock for shipped collaborators).
- [x] Baseline numbers recorded (this PR body + BicameralAI#306 first reply).
- [ ] Step-0 invocation rate < 50% on should-invoke axis → file SKILL.md strengthening followup. **NOT triggered**: baseline shows 100% recall, no FNs.

## Verification

```
$ pytest tests/test_skill_invocation_judge.py tests/eval/run_preflight_skill_invocation_eval.py
8 passed, 15 skipped (no API key; fixtures will hit cache on CI)

$ ANTHROPIC_API_KEY=... pytest tests/eval/run_preflight_skill_invocation_eval.py
15/16 pass (1 expected FP on S0_dark_mode, warn-only)

$ python tests/eval_preflight_skill_step1.py -o /tmp/s1.json
$ python tests/eval_preflight_skill_invocation.py -o /tmp/s0.json
$ python tests/eval_preflight_skill_summary.py --step1 /tmp/s1.json --step0 /tmp/s0.json
[renders the full M_skill_preflight markdown block]

$ ruff check + format + mypy → clean
```

## Test plan

- [ ] CI green (cache-hits-only) on both preflight-eval.yml and test-mcp-regression.yml
- [ ] M_skill_preflight block visible on the run summary page after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
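
The Step-0 baseline percentages quoted in this PR body follow directly from the 2x2 counts (TP=8, FN=0, FP=1, TN=6). The sketch below reproduces that arithmetic. The `Confusion` dataclass is a hypothetical helper written for this example and is not part of the eval harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from the 2x2 invocation matrix."""
    tp: int
    fn: int
    fp: int
    tn: int

    @property
    def recall(self) -> float:
        # Share of should-invoke cases where history was actually fetched.
        total = self.tp + self.fn
        return self.tp / total if total else 0.0

    @property
    def precision(self) -> float:
        # Share of history fetches that were warranted.
        total = self.tp + self.fp
        return self.tp / total if total else 0.0

    @property
    def fp_rate(self) -> float:
        # Share of should-skip cases where history was fetched anyway.
        total = self.fp + self.tn
        return self.fp / total if total else 0.0


if __name__ == "__main__":
    baseline = Confusion(tp=8, fn=0, fp=1, tn=6)
    print(f"recall={baseline.recall:.1%} "
          f"precision={baseline.precision:.1%} "
          f"fp_rate={baseline.fp_rate:.1%}")
```

Note that the FP rate uses the 7 should-skip rows as its denominator, while precision uses the 9 rows where history was fetched, which is why one over-fetch yields 14.3% on one axis and 88.9% on the other.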
Summary
PR-3 of 3 for #280 — the last engineering piece. Wires three PostHog events from the bind / ratification surfaces, adds a privacy-preserving local mirror, and surfaces M2 metrics on the GitHub Actions run summary.
PR-1 (#283) and PR-2 (#284) are already in `dev`; this branch is forked from `f8cd9ee`.

Events

| Event | Fires from | Carries |
|---|---|---|
| `m2_grounding_attempt` | `handle_bind` (per binding) | `success` (bool), `handler_rejected` (bool, true when #280 PR-1's reject path fires) |
| `m2_grounding_ratified_correct` | `handle_resolve_compliance` (verdict == `compliant`) | `confidence` (int 0/1/2) |
| `m2_grounding_ratified_incorrect` | `handle_resolve_compliance` (verdict ∈ `{drifted, not_relevant}`) | `confidence` |

The `success` / `handler_rejected` split on the attempt event is deliberate: it tells operators whether a failure was the failsafe doing its job (caller hallucinated a wrong/non-existent symbol, PR-1's reject path) vs. an unrelated ledger / IO bug.

Privacy contract (per `telemetry.py:14-37`)

- `decision_source` (controlled enum: `transcript` / `spec` / `chat` / `manual` / `document`)
- `diagnostic.success`, `diagnostic.handler_rejected`, `diagnostic.confidence`
- `decision_id` (opaque ledger UUID)
- `verdict` (string)

A unit test (`test_decision_id_never_relayed_to_posthog`) pins this invariant: it stubs `telemetry.send_event` and asserts the relay payload contains `decision_source` but NOT `decision_id`.

Files

| File | Change |
|---|---|
| `m2_grounding_log.py` (new) | `record_attempt()`, `record_ratification()`. JSONL local mirror at `~/.bicameral/m2_grounding.jsonl` (10 MB rotation, 3 backups) + lazy-imported PostHog relay. Test hook via `BICAMERAL_M2_LOG_PATH` env override. |
| `handlers/bind.py` | `_emit_m2_attempt()` helper wired to all 5 terminal paths in the per-binding loop where `decision_id` is valid. API-misuse paths (empty/unknown decision_id) skip emission to keep the metric meaningful. |
| `handlers/resolve_compliance.py` | `_emit_m2_ratification()` helper, called per accepted verdict. |
| `ledger/queries.py` | `get_decision_source()`: single-field SELECT, returns the controlled enum. |
| `ledger/adapter.py` | |
| `tests/eval_grounding_recall_summary.py` (new) | Renders the eval JSON as markdown for `$GITHUB_STEP_SUMMARY`: precision / recall / abort-rate scoreboard, outcome breakdown, per-case-type recall table, gate-breach line, expandable miss-list. Fail-quiet on missing/malformed JSON. |
| `.github/workflows/test-mcp-regression.yml` | `M2 metrics summary` step after the eval pipes the renderer's stdout to `$GITHUB_STEP_SUMMARY`. `always()` + `continue-on-error: true` so the summary appears even when the warn-only eval flags breaches. |
| `tests/test_m2_grounding_log.py` (new) | |
| `tests/test_bind_m2_telemetry.py` (new) | |
| `CHANGELOG.md` | |

Net Δ on the user dashboard surface (`assets/dashboard.html`, `dashboard/server.py`): zero.

CI step output preview
Rendered from a synthetic eval JSON to confirm the markdown shape:
Local verification

- … `surrealdb`; CI runs them) on `tests/test_m2_grounding_log.py` + `tests/test_bind_m2_telemetry.py`
- `ruff check` + `ruff format --check` + `mypy` all green on touched files
- `bicameral.link_commit` clean: 0 drift, 0 pending checks
- `git diff origin/dev -- assets/dashboard.html` → 0 lines

What's NOT in this PR

Per `plan-280-grounding-precision-fix.md`:

- `attempt_to_ratify_seconds` field on the ratification events: deferred. Would need a `created_at` field on the `binds_to` edge (schema currently has only `confidence` + `provenance`); not worth a schema bump in this PR.

Refs

Closes the engineering work for #280. Friction capture and gate-flip can run in parallel as separate follow-ups.
🤖 Generated with Claude Code
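
For readers who want to see how the ratification events described in this PR could be derived from a verdict, here is a minimal fire-and-forget sketch. The event names, verdict values, and the low/medium/high to 0/1/2 mapping come from the PR text. The helper names, the `record` callable, and the assumption that confidence arrives as a string label are hypothetical.

```python
CONFIDENCE_LEVELS = {"low": 0, "medium": 1, "high": 2}
INCORRECT_VERDICTS = {"drifted", "not_relevant"}


def ratification_event_name(verdict: str):
    """Return the event name for a verdict, or None when no event applies."""
    if verdict == "compliant":
        return "m2_grounding_ratified_correct"
    if verdict in INCORRECT_VERDICTS:
        return "m2_grounding_ratified_incorrect"
    return None


def emit_ratification_sketch(verdict, confidence, decision_source, record) -> bool:
    """Emit one event; swallow any failure so the verdict write is never blocked."""
    name = ratification_event_name(verdict)
    if name is None:
        return False
    try:
        payload = {
            "decision_source": decision_source,
            "diagnostic": {"confidence": CONFIDENCE_LEVELS[confidence]},
        }
        record(name, payload)
    except Exception:
        # Telemetry is advisory: report failure to the caller, never raise.
        return False
    return True
```

Returning a boolean instead of raising keeps the caller's control flow identical whether or not telemetry is healthy.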