feat(telemetry): M2 grounding-precision events + dashboard panel (#280 PR-3) by silongtan · Pull Request #285 · BicameralAI/bicameral-mcp

silongtan · 2026-05-09T01:28:41Z

Summary

PR-3 of 3 for #280 — the last engineering piece. Wires three PostHog events from the bind / ratification surfaces, adds a privacy-preserving local mirror, and surfaces M2 metrics on the GitHub Actions run summary.

PR-1 (#283) and PR-2 (#284) are already in dev; this branch is forked from f8cd9ee.

Note — the initial PR-3 also added an M2 panel to the operator dashboard (assets/dashboard.html) and a /m2_grounding endpoint. Per Jin's clarification ("the dashboard is for users; CI for it that shows up on github would be good"), those were reverted — assets/dashboard.html is now byte-identical to origin/dev. M2 surfaces on the GitHub Actions run page instead.

Events

| Event | Fires from | Diagnostics |
|---|---|---|
| `m2_grounding_attempt` | `handle_bind` (per binding) | `success` (bool), `handler_rejected` (bool; true when #280 PR-1's reject path fires) |
| `m2_grounding_ratified_correct` | `handle_resolve_compliance` (verdict == `compliant`) | `confidence` (int 0/1/2) |
| `m2_grounding_ratified_incorrect` | same (verdict ∈ `{drifted, not_relevant}`) | `confidence` |
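A minimal sketch of the verdict-to-event mapping in the table above. The event names, verdict values, and the low/medium/high confidence encoding come from this PR; the function name `ratification_event_name` and the choice to return `None` for any other verdict are assumptions made for illustration, not the handler's actual code.

```python
from typing import Optional

# Verdict sets and event names as listed in the Events table.
CORRECT_VERDICTS = frozenset({"compliant"})
INCORRECT_VERDICTS = frozenset({"drifted", "not_relevant"})

# Confidence encoding carried in the diagnostic: low=0, medium=1, high=2.
CONFIDENCE_LEVELS = {"low": 0, "medium": 1, "high": 2}


def ratification_event_name(verdict: str) -> Optional[str]:
    """Return the M2 event name for a verdict, or None when no event applies.

    Returning None for unrecognized verdicts is an assumption of this sketch.
    """
    if verdict in CORRECT_VERDICTS:
        return "m2_grounding_ratified_correct"
    if verdict in INCORRECT_VERDICTS:
        return "m2_grounding_ratified_incorrect"
    return None
```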

The success/handler_rejected split on the attempt event is deliberate: it tells operators whether a failure was the failsafe doing its job (caller hallucinated a wrong/non-existent symbol — PR-1's reject path) vs. an unrelated ledger / IO bug.
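To make the split concrete, here is a small sketch of how an operator-side script might bucket attempt events. The two boolean fields are the ones defined in this PR; the bucket labels (`bound`, `failsafe_reject`, `unexpected_failure`) and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def classify_attempt(success: bool, handler_rejected: bool) -> str:
    """Bucket one m2_grounding_attempt diagnostic (labels are illustrative)."""
    if success:
        # A clean bind; handler_rejected is ignored in this sketch.
        return "bound"
    if handler_rejected:
        # The PR-1 reject path fired: the failsafe did its job.
        return "failsafe_reject"
    # Failure without a reject: points at a ledger / IO problem instead.
    return "unexpected_failure"


def tally_attempts(diagnostics: Iterable[Dict[str, bool]]) -> Counter:
    """Count attempts per bucket from a sequence of diagnostic dicts."""
    return Counter(
        classify_attempt(d["success"], d["handler_rejected"]) for d in diagnostics
    )
```

A rising `unexpected_failure` count is the signal worth paging on; a rising `failsafe_reject` count says the caller-LLM is hallucinating symbols more often, which is a grounding-quality problem rather than an infrastructure one.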

Privacy contract (per `telemetry.py:14-37`)

| Field | Where it lands |
|---|---|
| `decision_source` (controlled enum: `transcript` / `spec` / `chat` / `manual` / `document`) | Local mirror + PostHog ✓ |
| `diagnostic.success`, `diagnostic.handler_rejected`, `diagnostic.confidence` | Local mirror + PostHog ✓ |
| `decision_id` (opaque ledger UUID) | Local mirror ONLY, never relayed to PostHog |
| `verdict` (string) | Local mirror only |

A unit test (test_decision_id_never_relayed_to_posthog) pins this invariant — stubs telemetry.send_event and asserts the relay payload contains decision_source but NOT decision_id.
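One way to enforce this kind of contract is an allowlist at the relay boundary, sketched below. This is an illustration only: `build_relay_payload`, `emit`, and the stubbed callables are hypothetical names, and the real module may structure the split differently. The field names match the table above.

```python
from typing import Any, Callable, Dict

# Only these top-level keys may cross to the relay; everything else
# (decision_id, verdict, ...) stays in the local mirror.
RELAY_ALLOWLIST = ("decision_source", "diagnostic")


def build_relay_payload(row: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted keys, so new local-only fields are safe by default."""
    return {key: row[key] for key in RELAY_ALLOWLIST if key in row}


def emit(
    row: Dict[str, Any],
    write_local: Callable[[Dict[str, Any]], None],
    send_event: Callable[[Dict[str, Any]], None],
) -> None:
    """Write the full row locally, then relay the reduced payload."""
    write_local(row)
    send_event(build_relay_payload(row))


# Stubbing both sinks with list.append mirrors the shape of the privacy test.
local_rows: list = []
relayed: list = []
emit(
    {
        "decision_id": "00000000-0000-0000-0000-000000000000",
        "decision_source": "spec",
        "verdict": "compliant",
        "diagnostic": {"confidence": 2},
    },
    local_rows.append,
    relayed.append,
)
```

An allowlist is used here rather than a denylist so that a field added to the local row later does not leak to the relay by accident.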

Files

| File | Δ | Role |
|---|---|---|
| `m2_grounding_log.py` (new) | +211 | Owner of the M2 event contract. `record_attempt()`, `record_ratification()`. JSONL local mirror at `~/.bicameral/m2_grounding.jsonl` (10 MB rotation, 3 backups) + lazy-imported PostHog relay. Test hook via `BICAMERAL_M2_LOG_PATH` env override. |
| `handlers/bind.py` | +73 | `_emit_m2_attempt()` helper wired to all 5 terminal paths in the per-binding loop where `decision_id` is valid. API-misuse paths (empty/unknown decision_id) skip emission to keep the metric meaningful. |
| `handlers/resolve_compliance.py` | +40 | `_emit_m2_ratification()` helper, called per accepted verdict. |
| `ledger/queries.py` | +19 | New `get_decision_source()`: single-field SELECT, returns the controlled enum. |
| `ledger/adapter.py` | +10 | Adapter delegation method. |
| `tests/eval_grounding_recall_summary.py` (new) | +173 | Renders the PR-2 eval JSON to `$GITHUB_STEP_SUMMARY`: precision / recall / abort-rate scoreboard, outcome breakdown, per-case-type recall table, gate-breach line, expandable miss-list. Fail-quiet on missing/malformed JSON. |
| `.github/workflows/test-mcp-regression.yml` | +10 | New `M2 metrics summary` step after the eval pipes the renderer's stdout to `$GITHUB_STEP_SUMMARY`. `always()` + `continue-on-error: true` so the summary appears even when the warn-only eval flags breaches. |
| `tests/test_m2_grounding_log.py` (new) | +180 | 8 pure-function unit tests: JSONL row shape, verdict classification, privacy invariant. |
| `tests/test_bind_m2_telemetry.py` (new) | +141 | 4 helper-level tests + 3 verdict-classification cases (skip-on-no-surrealdb for the resolve_compliance path). |
| `CHANGELOG.md` | +2 | Unreleased entry. |
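For readers unfamiliar with a rotating JSONL mirror, the sketch below shows one way to get the behavior described for `m2_grounding_log.py` using only the standard library. It assumes `logging.handlers.RotatingFileHandler` as the rotation mechanism, which this PR does not confirm; the default path, the 10 MB / 3-backup limits, and the `BICAMERAL_M2_LOG_PATH` override are from the table above, and the function names are hypothetical.

```python
import json
import logging
import os
from logging.handlers import RotatingFileHandler
from pathlib import Path
from typing import Any, Dict

DEFAULT_PATH = Path.home() / ".bicameral" / "m2_grounding.jsonl"
MAX_BYTES = 10 * 1024 * 1024  # 10 MB rotation
BACKUP_COUNT = 3              # keep 3 rotated backups


def mirror_path() -> Path:
    """Resolve the mirror path, honoring the test-hook env override."""
    override = os.environ.get("BICAMERAL_M2_LOG_PATH")
    return Path(override) if override else DEFAULT_PATH


def build_mirror_logger(path: Path) -> logging.Logger:
    """Create (once per path) a logger that appends raw lines with rotation."""
    path.parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"m2_mirror_sketch.{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep rows out of the root logger's output
    if not logger.handlers:
        handler = RotatingFileHandler(
            path, maxBytes=MAX_BYTES, backupCount=BACKUP_COUNT, encoding="utf-8"
        )
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


def append_row(logger: logging.Logger, row: Dict[str, Any]) -> None:
    """Serialize one event as a single JSON line."""
    logger.info(json.dumps(row, sort_keys=True))
```

Setting `BICAMERAL_M2_LOG_PATH` to a temporary file before calling `mirror_path()` keeps tests from touching the real home directory.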

Net Δ on the user dashboard surface (assets/dashboard.html, dashboard/server.py): zero.

CI step output preview

Rendered from a synthetic eval JSON to confirm the markdown shape:

M2 grounding precision (caller-LLM bind eval, #280)

| Metric | Value | Gate | |
|---|---|---|---|
| Precision | 85.7% | ≥ 85.0% | ✅ |
| Recall | 78.3% | ≥ 80.0% | ⚠️ |
| Abort rate | 8.7% | ≤ 30.0% | ✅ |

Plus outcome breakdown, per-case-type recall, gate-breach line, and a <details> collapsible miss-list (capped at 25 rows).
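A rough sketch of how a scoreboard like the one previewed above could be rendered. The gate thresholds are the ones shown in the preview; the metric key names (`precision`, `recall`, `abort_rate`) and the function name are assumptions, since the eval JSON schema is not reproduced in this PR.

```python
from typing import Dict, List, Tuple

# (label, metric key, direction, threshold). Thresholds match the preview;
# the metric keys are assumed, not taken from the real eval JSON.
GATES: List[Tuple[str, str, str, float]] = [
    ("Precision", "precision", ">=", 0.85),
    ("Recall", "recall", ">=", 0.80),
    ("Abort rate", "abort_rate", "<=", 0.30),
]


def render_scoreboard(metrics: Dict[str, float]) -> str:
    """Render the three gated metrics as a markdown table."""
    lines = ["| Metric | Value | Gate | |", "|---|---|---|---|"]
    for label, key, direction, threshold in GATES:
        value = metrics[key]
        if direction == ">=":
            passed, symbol = value >= threshold, "≥"
        else:
            passed, symbol = value <= threshold, "≤"
        mark = "✅" if passed else "⚠️"
        lines.append(
            f"| {label} | {value:.1%} | {symbol} {threshold:.1%} | {mark} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # The synthetic values from the preview above.
    print(render_scoreboard({"precision": 0.857, "recall": 0.783, "abort_rate": 0.087}))
```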

Local verification

✅ 11 passed, 3 skipped (resolve_compliance tests skip locally on missing surrealdb; CI runs them) on tests/test_m2_grounding_log.py + tests/test_bind_m2_telemetry.py
✅ ruff check + ruff format --check + mypy all green on touched files
✅ Renderer smoke-tested on a synthetic input — markdown output renders cleanly
✅ bicameral.link_commit clean — 0 drift, 0 pending checks
✅ git diff origin/dev -- assets/dashboard.html → 0 lines

What's NOT in this PR

Per plan-280-grounding-precision-fix.md:

Friction capture (≥ 5 design-partner cases): design-partner work, not engineering scope. Closes #280's last open acceptance criterion via comments on the issue.
PR-2 gate-flip (warn → hard) — separate small follow-up after PR-3 lands and we have a baseline reading. Aligns with Jin's "deliberate not drift" framing — same path the M1 eval should have taken.
attempt_to_ratify_seconds field on the ratification events — deferred. Would need a created_at field on the binds_to edge (schema currently has only confidence + provenance); not worth a schema bump in this PR.

Refs

Closes the engineering work for #280. Friction capture and gate-flip can run in parallel as separate follow-ups.

🤖 Generated with Claude Code

…PR-3)

Three PostHog events now emit from the bind / ratification surfaces, plus a local mirror + dashboard panel that reads from it. Closes the last engineering piece of #280 (PR-3 of 3).

Events
------

m2_grounding_attempt
Fires per `handle_bind` per binding. Carries:
- decision_source (controlled enum: transcript/spec/chat/manual/document)
- diagnostic.success: bool, bound a region cleanly
- diagnostic.handler_rejected: bool, true when #280 PR-1's reject path fired (caller hallucinated a wrong/non-existent symbol on a real file).

The split between {success=False, handler_rejected=True} and {success=False, handler_rejected=False} tells operators whether the failure was the failsafe doing its job vs a ledger / IO bug.

m2_grounding_ratified_correct (verdict == "compliant")
m2_grounding_ratified_incorrect (verdict ∈ {"drifted", "not_relevant"})
Fire per accepted verdict in `handle_resolve_compliance`. Carry:
- decision_source (same controlled enum)
- diagnostic.confidence: int (low=0, medium=1, high=2)

Privacy
-------

The relay contract from telemetry.py:14-37 is non-negotiable: numeric/bool diagnostics only, no decision_id / file_path / symbol_name. The new m2_grounding_log.py owns the split:
- JSONL local mirror at ~/.bicameral/m2_grounding.jsonl (10 MB rotation, 3 backups) carries decision_id for the dashboard panel's drill-down. Always written, regardless of relay consent.
- PostHog relay sees only decision_source + numeric diagnostics; decision_id never crosses that boundary.

A unit test (test_decision_id_never_relayed_to_posthog) pins this invariant.

Files
-----

m2_grounding_log.py (new, 241 LOC)
Owner of the M2 event contract. record_attempt(), record_ratification(), read_recent_events(). Lazy-imports server + telemetry to break the handlers→server circular dependency at server-boot time. Test hook via BICAMERAL_M2_LOG_PATH env override (matches preflight_telemetry pattern).

handlers/bind.py (+73)
_emit_m2_attempt() helper at module scope. Wired to all five terminal paths in the per-binding loop where a decision_id is valid: Branch A symbol-not-found, Branch B file-not-found, the two #280 PR-1 reject paths, the bind_decision exception path, and the success path. API-misuse paths (empty/unknown decision_id) skip emission to keep the metric meaningful.

handlers/resolve_compliance.py (+40)
_emit_m2_ratification() helper, called per accepted verdict. Wraps record_ratification() in try/except so a telemetry failure never breaks the verdict write.

ledger/queries.py (+19)
New get_decision_source(): single-field SELECT, returns the decision's source_type (controlled enum from the ingest contract).

ledger/adapter.py (+10)
Adapter delegation method.

dashboard/server.py (+59)
New GET /m2_grounding endpoint: aggregates the local mirror into rolling-7d per-source counts (attempts / rejects / ratified ✓ / ratified ✕) and computes precision. Read-only, no ledger I/O.

assets/dashboard.html (+60)
New "M2 grounding precision" panel below the main ledger view. Color-codes precision per source: green ≥ 85%, amber ≥ 70%, red below. Refreshes every 30s.

CHANGELOG.md (+2)
Unreleased entry covering all three events + the local mirror contract.

Tests
-----

tests/test_m2_grounding_log.py (9 tests, all green)
Pure unit tests, no ledger dep. Cover JSONL row shape, verdict classification, time-window filtering, and the privacy invariant (decision_id never reaches the relay).

tests/test_bind_m2_telemetry.py (4 tests + 3 skip-on-no-surrealdb)
Helper-level: emit forwards args correctly, skips on empty decision_id, swallows telemetry failures fire-and-forget. Resolve-compliance verdict classification covered behind `pytest.importorskip("surrealdb")` since the handler module imports ledger.queries at top level; runs in CI, skipped local.

Local verification
------------------

- 12 passed, 3 skipped on tests/test_m2_grounding_log.py + tests/test_bind_m2_telemetry.py
- ruff check + ruff format --check + mypy all green on touched files (m2_grounding_log.py, handlers/bind.py, handlers/resolve_compliance.py, ledger/queries.py, ledger/adapter.py, dashboard/server.py, both new test files)

What's NOT in this PR
---------------------

Per plan-280-grounding-precision-fix.md:
- Friction capture (≥ 5 design-partner cases): design-partner work, not engineering scope.
- PR-2 gate-flip (warn → hard): separate small follow-up after PR-3 lands and we have a baseline reading. Aligns with Jin's "deliberate not drift" framing.
- attempt_to_ratify_seconds field: deferred. Would need a `created_at` field on the binds_to edge (schema currently has only `confidence` + `provenance`); not worth a schema bump in this PR.

Closes #280.
Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
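The "telemetry failure never breaks the verdict write" behavior described in the commit message above follows a common fire-and-forget shape, sketched here. The wrapper name `emit_fire_and_forget` and the stand-in recorder are hypothetical; only the try/except idea comes from the commit message.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def emit_fire_and_forget(record: Callable[..., None], **fields: Any) -> bool:
    """Call a telemetry recorder and swallow any failure.

    Returns True when the recorder ran cleanly, False when it raised.
    The caller's primary write must already be done (or not depend on this).
    """
    try:
        record(**fields)
        return True
    except Exception:
        # Telemetry is advisory: log at debug level and carry on.
        logger.debug("M2 telemetry emit failed; continuing", exc_info=True)
        return False


def broken_recorder(**fields: Any) -> None:
    """Stand-in for a recorder whose backing store is unavailable."""
    raise OSError("mirror path is not writable")


# The failure is contained; the surrounding handler keeps going.
delivered = emit_fire_and_forget(broken_recorder, decision_id="d-1", verdict="compliant")
```

Catching bare `Exception` is usually discouraged, but it is the point here: no telemetry bug should be able to fail a bind or a ratification.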

coderabbitai · 2026-05-09T01:28:48Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.



…CI instead (per Jin)

Jin clarified the operator-dashboard scope: it's for users. M2 grounding precision is an engineering quality metric, not user-facing. Reverting the dashboard pieces; adding GitHub Actions step-summary surfacing which is where engineers actually look for these numbers.

Reverted from PR-3's initial shape
----------------------------------

assets/dashboard.html
- Drop the <section id="m2-panel"> block + the renderM2 / loadM2 / setInterval JS. Dashboard returns to pre-#280 user view.

dashboard/server.py
- Drop the GET /m2_grounding route + _serve_m2_grounding handler.

m2_grounding_log.py
- Drop read_recent_events() (only consumer was _serve_m2_grounding; now dead code per Jin's "avoid bloat unless product-justified").
- Drop now-unused `time` import.

tests/test_m2_grounding_log.py
- Drop test_read_recent_events_respects_window (function gone) and now-unused `os` import.

Added (the new piece)
---------------------

tests/eval_grounding_recall_summary.py (new)
Renders the PR-2 eval JSON (test-results/m2-grounding-recall.json) as a markdown block: precision / recall / abort-rate scoreboard, outcome breakdown, per-case-type recall table, gate-breach line, expandable miss-list capped at 25 rows. Fail-quiet: missing/malformed JSON degrades to a one-line note rather than failing CI.

.github/workflows/test-mcp-regression.yml (+10)
New "M2 metrics summary" step after the M2 eval. Pipes the renderer's stdout to $GITHUB_STEP_SUMMARY so the metrics show on the GitHub Actions run page without needing the artifact download. always() guard so the summary appears even when the eval step above warns. continue-on-error keeps it advisory.

Kept from PR-3's initial shape
------------------------------

- The three PostHog events from handle_bind / handle_resolve_compliance.
- The privacy-preserving local mirror at ~/.bicameral/m2_grounding.jsonl (operator support + diagnose CLI surface; never relayed).
- The m2_grounding_log.py module's record_attempt / record_ratification public API.
- All telemetry tests (privacy invariant pin still holds).

Net Δ on PR-3: -119 LOC dashboard pieces, +210 LOC summary renderer + workflow step.

Tests: 11 passed, 3 skipped (resolve_compliance import-or-skip). Ruff + ruff format + mypy all green.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

silongtan · 2026-05-09T02:50:49Z

Updated per Jin's note — the operator dashboard is for users; M2 grounding precision is an engineering quality metric and belongs on GitHub instead.

Reverted

assets/dashboard.html: dropped the <section id="m2-panel"> block + the renderM2 / loadM2 JS. Dashboard returns to its pre-#280 user view.
dashboard/server.py: dropped the GET /m2_grounding route + _serve_m2_grounding handler.
m2_grounding_log.py: dropped read_recent_events() (only consumer was the dashboard endpoint; now dead code).
Companion test removed; unused os / time imports cleaned up.

Added

tests/eval_grounding_recall_summary.py — renders the PR-2 eval JSON as a GitHub-Actions step-summary markdown block. Precision / recall / abort-rate scoreboard, outcome breakdown, per-case-type recall, gate-breach line, expandable miss-list capped at 25 rows. Fail-quiet on missing/malformed JSON.
New CI step in test-mcp-regression.yml: M2 metrics summary runs after the eval and pipes the renderer's stdout to $GITHUB_STEP_SUMMARY. Engineers see the numbers on the run page without downloading the artifact.
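The fail-quiet behavior can be sketched as a loader that degrades to a one-line note instead of raising. This is an illustration under assumptions: the function names, the note wording, and the placeholder heading are hypothetical, and the real renderer emits considerably more than a heading.

```python
import json
import sys
from pathlib import Path
from typing import Optional


def load_eval_json(path: Path) -> Optional[dict]:
    """Return the parsed eval results, or None if missing or malformed."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError: file missing or unreadable.
        # ValueError: covers json.JSONDecodeError and bad encodings.
        return None
    return data if isinstance(data, dict) else None


def render_or_note(path: Path) -> str:
    """Render a summary, or a single advisory line when there is nothing to render."""
    data = load_eval_json(path)
    if data is None:
        return f"_M2 eval results unavailable ({path.name}); summary skipped._"
    return "### M2 grounding precision (caller-LLM bind eval, #280)"


if __name__ == "__main__":
    # The workflow step would append this stdout to $GITHUB_STEP_SUMMARY.
    print(render_or_note(Path(sys.argv[1])))
    sys.exit(0)  # always succeed: the summary is advisory
```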

Kept (no change to telemetry contract)

The three m2_grounding_* PostHog events from handle_bind / handle_resolve_compliance.
The privacy-preserving local mirror at ~/.bicameral/m2_grounding.jsonl (still useful for operator-support / bicameral-mcp diagnose; decision_id still local-only, never relayed).
All telemetry tests, including the privacy-invariant pin.

Net Δ on this PR: −119 LOC dashboard pieces, +210 LOC summary renderer + workflow step. Tests still 11 passed / 3 skipped (resolve_compliance import-or-skip), ruff + format + mypy clean.

The previous revert left an extra blank line where the `<section id="m2-panel">` block lived. Removes it so assets/dashboard.html is byte-identical to origin/dev, confirming Jin's "don't change the user dashboard" intent verbatim.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

silongtan · 2026-05-09T02:53:17Z

Dashboard.html now byte-identical to origin/dev — git diff origin/dev -- assets/dashboard.html returns 0. The previous revert left a leftover blank line where the <section> block lived; cleaned that up.

Net change to the user dashboard from this PR: zero. Confirms Jin's "the dashboard is for users" intent verbatim.

Pushed as e929a42.

Triages 25 dev commits onto main (already on dev as of merge time):
• #289: team-mode remote event-log adapter (#277)
• #285, #284, #283: M2 grounding telemetry, eval harness, precision fix (#280)
• #275: README/SECURITY surface
• plus assorted fixes flowing through dev

Resolved conflicts in CHANGELOG.md (kept dev's [Unreleased] block, inserted v0.14.2's release entry from main below it, then renamed [Unreleased] → v0.14.3) and README.md (kept dev's Solo-vs-Team mode section + extended setup-writes table from #289; main was missing both because PR #289 hadn't backflowed yet).

pyproject.toml: 0.14.2 → 0.14.3
RECOMMENDED_VERSION: 0.14.1 → 0.14.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ne (BicameralAI#280)

PR BicameralAI#285's first CI run produced a clean baseline:

23 cases / precision 0.913 / recall 0.913 / abort_rate 0.000
✓ all gates pass

That's ~7-13 pp of headroom on every gate (≥ 0.85 / ≥ 0.80 / ≤ 0.30). Locking the baseline in before drift sets in.

Two changes to .github/workflows/test-mcp-regression.yml:

1. `--gate-mode warn` → `--gate-mode hard`. Runner exits non-zero on breach instead of warning to step output.
2. Removed `continue-on-error: true` from the eval step. The step now fails CI when the gate breaches.

The metrics-summary step keeps `continue-on-error: true` so a renderer bug never masks the eval result, and the `always()` guard means the breach summary is still rendered inline when the eval fails.

After this lands, PRs that touch the bind handler / bind skill / fixture / dataset must EITHER keep recall ≥ 0.80 / precision ≥ 0.85 / abort_rate ≤ 0.30, OR deliberately re-record the cache by setting BICAMERAL_GROUNDING_EVAL_RECORD=1 after a skill-prompt change.

Aligns with Jin's "deliberate not drift" framing: the same path the M1 eval *should* have taken (M1 has been warn-only forever; M2 is being flipped while the baseline is fresh, days after the eval shipped).

Refs BicameralAI#280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
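The difference between the two gate modes comes down to the runner's exit code on a breach. The sketch below illustrates that under assumptions: the thresholds and the `warn` / `hard` mode names are from the commit message above, while the function names and metric keys are hypothetical and not the eval runner's real interface.

```python
from typing import Dict, List


def gate_breaches(metrics: Dict[str, float]) -> List[str]:
    """List every gate the metrics fail (thresholds from the commit message)."""
    breaches = []
    if metrics["precision"] < 0.85:
        breaches.append(f"precision {metrics['precision']:.3f} < 0.85")
    if metrics["recall"] < 0.80:
        breaches.append(f"recall {metrics['recall']:.3f} < 0.80")
    if metrics["abort_rate"] > 0.30:
        breaches.append(f"abort_rate {metrics['abort_rate']:.3f} > 0.30")
    return breaches


def exit_code(metrics: Dict[str, float], gate_mode: str) -> int:
    """Report breaches in both modes; only 'hard' turns them into a failure."""
    breaches = gate_breaches(metrics)
    for breach in breaches:
        print(f"gate breach: {breach}")
    return 1 if breaches and gate_mode == "hard" else 0


# The recorded baseline passes in hard mode.
baseline = {"precision": 0.913, "recall": 0.913, "abort_rate": 0.000}
baseline_status = exit_code(baseline, "hard")
```

With `continue-on-error: true` removed from the eval step, a non-zero exit code is what fails the job.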

…M_skill_preflight CI surfacing

Closes Parts B and C of BicameralAI#306. Part A (dataset 3→25 + Step-1 baseline) shipped in PR BicameralAI#396; this PR adds the upstream measurement Part A's relevance eval can't see and wires both into CI's GITHUB_STEP_SUMMARY.

## Part B: Step-0 invocation harness

The "implicit tool invocation" failure pattern (OpenAI's eval-skills guidance): does the agent elect to call ``bicameral.history()`` when the preflight handler returns empty? Part A's 100% Step-1 recall is moot if the agent never reaches Step-1.

New files:

- ``tests/eval/_skill_invocation_judge.py``: multi-turn tool-use harness modeled on ``_bind_judge.py``. Exposes ``bicameral_history`` + ``submit_decision_to_proceed`` as tool defs. Same x-api-key auth, same retry envelope (3 attempts / 2-8-32s backoff), same fixture cache discipline (SHA(model | skill_sha | input_sha)). MAX_TURNS=4. Pure outcome classifier ``classify_outcome(should_invoke, invoked)`` lives at module scope so the summary renderer + sociable tests share one truth table.
- ``tests/eval/preflight_skill_invocation_dataset.jsonl``: 15 hand-curated rows balanced 8 should_invoke / 7 should_skip. Should-invoke cases seed vocab-mismatch / ungrounded / cross-cutting policy decisions that only ``bicameral.history()`` can surface. Should-skip cases are negative controls (dark mode, dep bumps, README typo, etc.) per OpenAI's implicit-invocation testing pattern.
- ``tests/eval/run_preflight_skill_invocation_eval.py``: pytest runner with skip-clean-without-cache-or-key. Schema sanity test enforces the 8/7 balance.
- ``tests/test_skill_invocation_judge.py``: sociable unit tests for the 2x2 outcome classifier per CLAUDE.md (no MagicMock, table-driven).

Step-0 baseline (Sonnet 4.5, 15 fixtures committed):

| Outcome | Count | Cell |
|---|---|---|
| invoked_history_correctly | 8 | TP |
| skipped_history_should_have | 0 | FN (load-bearing failure mode) |
| invoked_history_unnecessarily | 1 (S0_dark_mode) | FP (over-fetch) |
| proceeded_without_fetch | 6 | TN |

| Metric | Value | Gate |
|---|---|---|
| Recall (should-invoke axis) | 100.0% (8/8) | ≥ 50% ✅ |
| Precision (TP/(TP+FP)) | 88.9% (8/9) | - |
| FP rate (over-fetch) | 14.3% (1/7) | ≤ 30% ✅ |

The single FP, S0_dark_mode, fetched history "to check for cross-cutting decisions on theming, styling, or UI state management". A reasonable instinct against the strict ground truth, but counts as wasted tokens. Worth surfacing as a soft signal; not severe enough to file the SKILL.md strengthening followup BicameralAI#306 calls out (the FN floor stays clean at 0/8).

The 100% should-invoke recall reframes the BicameralAI#58 architectural question on path C: the v0.10.0 split (handler structural, skill LLM-over-history) works at the Step-0 layer too, not just at Step-1. Combined with Part A's 100% Step-1 recall, the skill-layer architecture is empirically sound on synthetic cases. Confidence interval is the next surface: 15 rows defends ~15pp differences with 80% power per Anthropic's statistical approach to evals.

## Part C: CI surfacing

Three new CLI runners + one summary renderer mirror the M2/M6 pattern:

- ``tests/eval_preflight_skill_step1.py``: CLI runner that drives ``_skill_judge.judge_relevance`` over the 25-row Part A dataset and emits aggregate JSON (per-axis recall + breakdown).
- ``tests/eval_preflight_skill_invocation.py``: CLI runner that drives ``_skill_invocation_judge.run_invocation_judgment`` over the 15-row Step-0 dataset and emits aggregate JSON (2x2 confusion matrix + recall/precision/fp_rate).
- ``tests/eval_preflight_skill_summary.py``: reads both JSONs and renders one combined markdown block (per-axis recall table + invocation 2x2 + FN miss list) to stdout. Fail-quiet on missing JSON. Mirrors ``eval_grounding_recall_summary.py`` (BicameralAI#285) and ``eval_preflight_m6_summary.py`` (BicameralAI#304).

Workflow wiring:

- ``.github/workflows/preflight-eval.yml``: new Phase 2b step running the Step-0 pytest runner with cached fixtures + ANTHROPIC_API_KEY secret fallback. continue-on-error: true.
- ``.github/workflows/test-mcp-regression.yml``: new M_skill_preflight block alongside M2 (line ~231) and M6 (line ~262). Runs both CLI runners with --gate-mode warn, then renders the combined summary to $GITHUB_STEP_SUMMARY. Promote to --gate-mode hard in a followup PR after one stable run (matches BicameralAI#288 M2 warn→hard pattern).

## Cache discipline

- 15 Step-0 fixtures committed under ``tests/eval/fixtures/skill_invocation_judge/``.
- CI runs cache-hits-only after merge.
- Re-record locally with ``BICAMERAL_PREFLIGHT_INVOCATION_EVAL_RECORD=1`` when the bicameral-preflight SKILL.md prompt changes (cache key includes SKILL.md SHA).

## Sample-size note

15 rows defends ~15pp differences with 80% power. Tighter claims (5pp) need ~50 rows. Expansion gated on whether the warn-only signal reveals real drift worth investing in.

## Acceptance touched

- [x] ``tests/eval/_skill_invocation_judge.py`` exists + invoked by ``tests/eval/run_preflight_skill_invocation_eval.py``. Follows the ``_bind_judge.py`` pattern (multi-turn loop + httpx retry + fixture caching keyed on SKILL.md SHA).
- [x] ``tests/eval/preflight_skill_invocation_dataset.jsonl`` contains 15 rows balanced 8/7 across should_invoke / should_skip.
- [x] ``.github/workflows/preflight-eval.yml`` Phase 2b surfaces the new step alongside the existing skill eval. ``continue-on-error: true``.
- [x] Sociable tests per CLAUDE.md (SimpleNamespace-equivalent: real dataclasses + table-driven, no MagicMock for shipped collaborators).
- [x] Baseline numbers recorded (this PR body + BicameralAI#306 first reply).
- [ ] Step-0 invocation rate < 50% on should-invoke axis → file SKILL.md strengthening followup. **NOT triggered**: baseline shows 100% recall, no FNs.

## Verification

```
$ pytest tests/test_skill_invocation_judge.py tests/eval/run_preflight_skill_invocation_eval.py
8 passed, 15 skipped (no API key; fixtures will hit cache on CI)

$ ANTHROPIC_API_KEY=... pytest tests/eval/run_preflight_skill_invocation_eval.py
15/16 pass (1 expected FP on S0_dark_mode, warn-only)

$ python tests/eval_preflight_skill_step1.py -o /tmp/s1.json
$ python tests/eval_preflight_skill_invocation.py -o /tmp/s0.json
$ python tests/eval_preflight_skill_summary.py --step1 /tmp/s1.json --step0 /tmp/s0.json
[renders the full M_skill_preflight markdown block]

$ ruff check + format + mypy → clean
```

## Test plan

- [ ] CI green (cache-hits-only) on both preflight-eval.yml and test-mcp-regression.yml
- [ ] M_skill_preflight block visible on the run summary page after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
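The 2x2 truth table and the three metrics reported in Part B above can be reproduced with a few lines. The `classify_outcome(should_invoke, invoked)` signature and the four outcome names are taken from the commit message; the function body and the `invocation_metrics` helper are a reconstruction for illustration, not the harness's actual code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def classify_outcome(should_invoke: bool, invoked: bool) -> str:
    """Map one (ground truth, observed) pair onto the 2x2 outcome names."""
    if should_invoke and invoked:
        return "invoked_history_correctly"      # TP
    if should_invoke:
        return "skipped_history_should_have"    # FN
    if invoked:
        return "invoked_history_unnecessarily"  # FP
    return "proceeded_without_fetch"            # TN


def invocation_metrics(rows: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute recall, precision, and FP rate from (should_invoke, invoked) pairs."""
    counts = Counter(classify_outcome(s, i) for s, i in rows)
    tp = counts["invoked_history_correctly"]
    fn = counts["skipped_history_should_have"]
    fp = counts["invoked_history_unnecessarily"]
    tn = counts["proceeded_without_fetch"]
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# The reported baseline: 8 TP, 0 FN, 1 FP, 6 TN.
baseline_rows = [(True, True)] * 8 + [(False, True)] * 1 + [(False, False)] * 6
baseline_metrics = invocation_metrics(baseline_rows)
```

On the baseline counts this gives recall 8/8, precision 8/9, and FP rate 1/7, matching the table in the commit message.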

silongtan had a problem deploying to recording-approval May 9, 2026 01:28 — with GitHub Actions Failure

silongtan temporarily deployed to ci-test May 9, 2026 01:28 — with GitHub Actions Inactive

silongtan temporarily deployed to production May 9, 2026 01:28 — with GitHub Actions Inactive

silongtan requested review from Knapp-Kevin and jinhongkuan May 9, 2026 02:15

silongtan temporarily deployed to ci-test May 9, 2026 02:50 — with GitHub Actions Inactive

silongtan temporarily deployed to production May 9, 2026 02:50 — with GitHub Actions Inactive

silongtan temporarily deployed to ci-test May 9, 2026 02:50 — with GitHub Actions Inactive

silongtan had a problem deploying to recording-approval May 9, 2026 02:50 — with GitHub Actions Failure

silongtan had a problem deploying to recording-approval May 9, 2026 02:53 — with GitHub Actions Failure

silongtan temporarily deployed to ci-test May 9, 2026 02:53 — with GitHub Actions Inactive

silongtan temporarily deployed to production May 9, 2026 02:53 — with GitHub Actions Inactive

silongtan temporarily deployed to ci-test May 9, 2026 02:53 — with GitHub Actions Inactive

jinhongkuan approved these changes May 9, 2026


jinhongkuan merged commit 58f0efa into dev May 9, 2026
9 of 10 checks passed

silongtan deleted the 280-grounding-telemetry branch May 9, 2026 03:01

silongtan mentioned this pull request May 9, 2026

ci(m2): flip M2 grounding-recall gate warn → hard after stable baseline (#280) #288

Merged

jinhongkuan mentioned this pull request May 9, 2026

triage: dev → main · v0.14.3 (#277 team-mode + #280 grounding precision) #290

Merged

5 tasks

silongtan mentioned this pull request May 12, 2026

test(eval): expand skill-layer preflight eval for vocab + unbound decision recall (#58 followup) #306

Closed

7 tasks

This was referenced May 12, 2026

M6 preflight handler retrieval: by-design split (handler structural, skill-layer covered by #306) #58

Closed

test(eval): #306 Part B + C — Step-0 invocation harness + M_skill_preflight CI surfacing #397

Merged

jinhongkuan merged 3 commits into dev from 280-grounding-telemetry
