feat(telemetry): M2 grounding-precision events + dashboard panel (#280 PR-3) #285

Merged: jinhongkuan merged 3 commits into dev from 280-grounding-telemetry on May 9, 2026

@silongtan silongtan commented May 9, 2026

Collaborator

Summary

PR-3 of 3 for #280 — the last engineering piece. Wires three PostHog events from the bind / ratification surfaces, adds a privacy-preserving local mirror, and surfaces M2 metrics on the GitHub Actions run summary.

PR-1 (#283) and PR-2 (#284) are already in dev; this branch is forked from f8cd9ee.

Note — the initial PR-3 also added an M2 panel to the operator dashboard (assets/dashboard.html) and a /m2_grounding endpoint. Per Jin's clarification ("the dashboard is for users; CI for it that shows up on github would be good"), those were reverted — assets/dashboard.html is now byte-identical to origin/dev. M2 surfaces on the GitHub Actions run page instead.

Events

| Event | Fires from | Diagnostics |
|---|---|---|
| m2_grounding_attempt | handle_bind (per binding) | success (bool), handler_rejected (bool; true when #280 PR-1's reject path fires) |
| m2_grounding_ratified_correct | handle_resolve_compliance (verdict == compliant) | confidence (int 0/1/2) |
| m2_grounding_ratified_incorrect | same (verdict ∈ {drifted, not_relevant}) | confidence |

The success/handler_rejected split on the attempt event is deliberate: it tells operators whether a failure was the failsafe doing its job (caller hallucinated a wrong/non-existent symbol — PR-1's reject path) vs. an unrelated ledger / IO bug.
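
As a minimal sketch of how an operator-side script might read that split, the two booleans can be folded into three buckets. The `AttemptDiagnostic` class, the `classify_attempt` function, and the bucket names are hypothetical and are not part of this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttemptDiagnostic:
    """Hypothetical container for the two booleans on m2_grounding_attempt."""

    success: bool
    handler_rejected: bool


def classify_attempt(diag: AttemptDiagnostic) -> str:
    """Bucket one attempt the way the success/handler_rejected split is meant to be read."""
    if diag.success:
        return "bound"
    if diag.handler_rejected:
        # The failsafe fired: the caller named a wrong or non-existent symbol.
        return "rejected_by_failsafe"
    # A failure without a rejection points at a ledger / IO problem instead.
    return "infrastructure_failure"
```

Only the third bucket should need engineering attention; the second one means the reject path from PR-1 did its job.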

Privacy contract (per telemetry.py:14-37)

| Field | Where it lands |
|---|---|
| decision_source (controlled enum: transcript / spec / chat / manual / document) | Local mirror + PostHog ✓ |
| diagnostic.success, diagnostic.handler_rejected, diagnostic.confidence | Local mirror + PostHog ✓ |
| decision_id (opaque ledger UUID) | Local mirror ONLY; never relayed to PostHog |
| verdict (string) | Local mirror only |

A unit test (test_decision_id_never_relayed_to_posthog) pins this invariant — stubs telemetry.send_event and asserts the relay payload contains decision_source but NOT decision_id.
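
The split in the table can be sketched as an allow-list filter between the local row and the relay payload. This is an illustration only: `RELAY_ALLOWED_KEYS`, `build_local_row`, and `build_relay_payload` are hypothetical names, and the real contract lives in telemetry.py and m2_grounding_log.py.

```python
from typing import Any, Dict

# Assumption: only these keys may cross to PostHog, per the privacy table.
RELAY_ALLOWED_KEYS = frozenset({"decision_source", "diagnostic"})


def build_local_row(
    event: str,
    decision_id: str,
    decision_source: str,
    diagnostic: Dict[str, Any],
    verdict: str = "",
) -> Dict[str, Any]:
    """Full row for the local JSONL mirror; decision_id and verdict stay local."""
    return {
        "event": event,
        "decision_id": decision_id,
        "decision_source": decision_source,
        "diagnostic": dict(diagnostic),
        "verdict": verdict,
    }


def build_relay_payload(local_row: Dict[str, Any]) -> Dict[str, Any]:
    """Relay payload: keep allow-listed keys only, so new local fields never leak by default."""
    return {k: v for k, v in local_row.items() if k in RELAY_ALLOWED_KEYS}
```

An allow-list rather than a deny-list means a field added to the local row later stays local until someone deliberately opts it in.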

Files

| File | Δ | Role |
|---|---|---|
| m2_grounding_log.py (new) | +211 | Owner of the M2 event contract. record_attempt(), record_ratification(). JSONL local mirror at ~/.bicameral/m2_grounding.jsonl (10 MB rotation, 3 backups) + lazy-imported PostHog relay. Test hook via BICAMERAL_M2_LOG_PATH env override. |
| handlers/bind.py | +73 | _emit_m2_attempt() helper wired to all 5 terminal paths in the per-binding loop where decision_id is valid. API-misuse paths (empty/unknown decision_id) skip emission to keep the metric meaningful. |
| handlers/resolve_compliance.py | +40 | _emit_m2_ratification() helper, called per accepted verdict. |
| ledger/queries.py | +19 | New get_decision_source(): single-field SELECT, returns the controlled enum. |
| ledger/adapter.py | +10 | Adapter delegation method. |
| tests/eval_grounding_recall_summary.py (new) | +173 | Renders the PR-2 eval JSON to $GITHUB_STEP_SUMMARY: precision / recall / abort-rate scoreboard, outcome breakdown, per-case-type recall table, gate-breach line, expandable miss-list. Fail-quiet on missing/malformed JSON. |
| .github/workflows/test-mcp-regression.yml | +10 | New M2 metrics summary step after the eval; it pipes the renderer's stdout to $GITHUB_STEP_SUMMARY. always() + continue-on-error: true so the summary appears even when the warn-only eval flags breaches. |
| tests/test_m2_grounding_log.py (new) | +180 | 8 pure-function unit tests: JSONL row shape, verdict classification, privacy invariant. |
| tests/test_bind_m2_telemetry.py (new) | +141 | 4 helper-level tests + 3 verdict-classification cases (skip-on-no-surrealdb for the resolve_compliance path). |
| CHANGELOG.md | +2 | Unreleased entry. |

Net Δ on the user dashboard surface (assets/dashboard.html, dashboard/server.py): zero.
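
For readers unfamiliar with the local-mirror pattern, here is a rough sketch of a rotating JSONL writer with the same limits (10 MB, 3 backups) and the same env override. It assumes the standard library's `RotatingFileHandler`; whether m2_grounding_log.py uses that class is not stated in this PR, and `mirror_path`, `make_mirror_logger`, and `append_row` are hypothetical names.

```python
import json
import logging
import os
from logging.handlers import RotatingFileHandler
from pathlib import Path

DEFAULT_PATH = Path.home() / ".bicameral" / "m2_grounding.jsonl"


def mirror_path() -> Path:
    """Resolve the mirror location, honoring the BICAMERAL_M2_LOG_PATH test hook."""
    override = os.environ.get("BICAMERAL_M2_LOG_PATH")
    return Path(override) if override else DEFAULT_PATH


def make_mirror_logger(path: Path) -> logging.Logger:
    """Return a logger that appends raw lines to `path`, rotating at 10 MB with 3 backups."""
    path.parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"m2_mirror:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep mirror rows out of the application log
    if not logger.handlers:
        handler = RotatingFileHandler(
            path, maxBytes=10 * 1024 * 1024, backupCount=3, encoding="utf-8"
        )
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


def append_row(row: dict) -> None:
    """Write one event as a single JSON line."""
    make_mirror_logger(mirror_path()).info(json.dumps(row, sort_keys=True))
```

Pointing `BICAMERAL_M2_LOG_PATH` at a temporary file lets a test inspect the rows without touching the real `~/.bicameral` directory.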

CI step output preview

Rendered from a synthetic eval JSON to confirm the markdown shape:

M2 grounding precision (caller-LLM bind eval, #280)

| Metric | Value | Gate |
|---|---|---|
| Precision | 85.7% | ≥ 85.0% |
| Recall | 78.3% | ≥ 80.0% ⚠️ |
| Abort rate | 8.7% | ≤ 30.0% |

Plus outcome breakdown, per-case-type recall, gate-breach line, and a <details> collapsible miss-list (capped at 25 rows).
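
A scoreboard of this shape could be produced along the following lines. This is a sketch, not the code in tests/eval_grounding_recall_summary.py: the JSON keys (`precision`, `recall`, `abort_rate`), the function names, and the fallback message are assumptions; only the gate thresholds come from the table above.

```python
import json
from pathlib import Path
from typing import Dict

# Gate thresholds from the preview table; the direction is ">=" or "<=".
GATES = {
    "precision": (">=", 0.85),
    "recall": (">=", 0.80),
    "abort_rate": ("<=", 0.30),
}


def render_scoreboard(metrics: Dict[str, float]) -> str:
    """Render the three gated metrics as a markdown table, flagging breaches."""
    lines = ["| Metric | Value | Gate |", "|---|---|---|"]
    for name, (op, threshold) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if op == ">=" else value <= threshold
        symbol = "≥" if op == ">=" else "≤"
        flag = "" if ok else " ⚠️"
        label = name.replace("_", " ").capitalize()
        lines.append(
            f"| {label} | {value * 100:.1f}% | {symbol} {threshold * 100:.1f}%{flag} |"
        )
    return "\n".join(lines)


def render_from_file(path: Path) -> str:
    """Fail-quiet wrapper: a missing or malformed file degrades to a one-line note."""
    try:
        return render_scoreboard(json.loads(path.read_text(encoding="utf-8")))
    except (OSError, ValueError, KeyError, TypeError):
        return "_M2 eval results unavailable._"
```

In a workflow step, the output would be appended with something like `python renderer.py >> "$GITHUB_STEP_SUMMARY"`, which is the mechanism GitHub Actions provides for run-page summaries.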

Local verification

  • ✅ 11 passed, 3 skipped on tests/test_m2_grounding_log.py + tests/test_bind_m2_telemetry.py (the resolve_compliance tests skip locally when surrealdb is missing; CI runs them)
  • ruff check + ruff format --check + mypy all green on touched files
  • ✅ Renderer smoke-tested on a synthetic input — markdown output renders cleanly
  • bicameral.link_commit clean — 0 drift, 0 pending checks
  • git diff origin/dev -- assets/dashboard.html → 0 lines

What's NOT in this PR

Per plan-280-grounding-precision-fix.md:

  • Friction capture (≥ 5 design-partner cases): design-partner work, not engineering scope. It closes #280's last open acceptance criterion via comments on the issue.
  • PR-2 gate-flip (warn → hard) — separate small follow-up after PR-3 lands and we have a baseline reading. Aligns with Jin's "deliberate not drift" framing — same path the M1 eval should have taken.
  • attempt_to_ratify_seconds field on the ratification events — deferred. Would need a created_at field on the binds_to edge (schema currently has only confidence + provenance); not worth a schema bump in this PR.

Refs

Closes the engineering work for #280. Friction capture and gate-flip can run in parallel as separate follow-ups.

🤖 Generated with Claude Code

…PR-3)

Three PostHog events now emit from the bind / ratification surfaces,
plus a local mirror + dashboard panel that reads from it. Closes the
last engineering piece of #280 (PR-3 of 3).

Events
------

  m2_grounding_attempt
    Fires per `handle_bind` per binding. Carries:
      - decision_source (controlled enum: transcript/spec/chat/manual/document)
      - diagnostic.success: bool — bound a region cleanly
      - diagnostic.handler_rejected: bool — true when #280 PR-1's
        reject path fired (caller hallucinated a wrong/non-existent
        symbol on a real file). The split between {success=False,
        handler_rejected=True} and {success=False, handler_rejected=False}
        tells operators whether the failure was the failsafe doing its
        job vs a ledger / IO bug.

  m2_grounding_ratified_correct  (verdict == "compliant")
  m2_grounding_ratified_incorrect (verdict ∈ {"drifted", "not_relevant"})
    Fire per accepted verdict in `handle_resolve_compliance`. Carry:
      - decision_source (same controlled enum)
      - diagnostic.confidence: int (low=0, medium=1, high=2)
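
A minimal sketch of the verdict classification and confidence encoding described above. The function names are hypothetical, and returning `None` for unrecognized inputs is an assumption about how such cases would be handled.

```python
from typing import Optional

# Encoding stated in the commit message: low=0, medium=1, high=2.
CONFIDENCE_CODES = {"low": 0, "medium": 1, "high": 2}


def ratification_event(verdict: str) -> Optional[str]:
    """Map an accepted verdict to its M2 event name, or None if it is not an M2 verdict."""
    if verdict == "compliant":
        return "m2_grounding_ratified_correct"
    if verdict in {"drifted", "not_relevant"}:
        return "m2_grounding_ratified_incorrect"
    return None


def encode_confidence(level: str) -> Optional[int]:
    """Turn a confidence label into the numeric diagnostic the relay accepts."""
    return CONFIDENCE_CODES.get(level)
```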

Privacy
-------

The relay contract from telemetry.py:14-37 is non-negotiable: numeric/
bool diagnostics only, no decision_id / file_path / symbol_name. The
new m2_grounding_log.py owns the split:

  - JSONL local mirror at ~/.bicameral/m2_grounding.jsonl (10 MB
    rotation, 3 backups) carries decision_id for the dashboard panel's
    drill-down. Always written, regardless of relay consent.
  - PostHog relay sees only decision_source + numeric diagnostics —
    decision_id never crosses that boundary. A unit test
    (test_decision_id_never_relayed_to_posthog) pins this invariant.

Files
-----

  m2_grounding_log.py (new, 241 LOC)
    Owner of the M2 event contract. record_attempt(), record_ratification(),
    read_recent_events(). Lazy-imports server + telemetry to break the
    handlers→server circular dependency at server-boot time. Test hook
    via BICAMERAL_M2_LOG_PATH env override (matches preflight_telemetry
    pattern).

  handlers/bind.py (+73)
    _emit_m2_attempt() helper at module scope. Wired to all five
    terminal paths in the per-binding loop where a decision_id is
    valid: Branch A symbol-not-found, Branch B file-not-found, the
    two #280 PR-1 reject paths, the bind_decision exception path, and
    the success path. API-misuse paths (empty/unknown decision_id)
    skip emission to keep the metric meaningful.

  handlers/resolve_compliance.py (+40)
    _emit_m2_ratification() helper, called per accepted verdict.
    Wraps record_ratification() in try/except so a telemetry failure
    never breaks the verdict write.

  ledger/queries.py (+19)
    New get_decision_source() — single-field SELECT, returns the
    decision's source_type (controlled enum from the ingest contract).

  ledger/adapter.py (+10)
    Adapter delegation method.

  dashboard/server.py (+59)
    New GET /m2_grounding endpoint — aggregates the local mirror into
    rolling-7d per-source counts (attempts / rejects / ratified ✓ /
    ratified ✕) and computes precision. Read-only, no ledger I/O.

  assets/dashboard.html (+60)
    New "M2 grounding precision" panel below the main ledger view.
    Color-codes precision per source: green ≥ 85%, amber ≥ 70%, red
    below. Refreshes every 30s.

  CHANGELOG.md (+2)
    Unreleased entry covering all three events + the local mirror
    contract.

Tests
-----

  tests/test_m2_grounding_log.py (9 tests, all green)
    Pure unit tests — no ledger dep. Cover JSONL row shape, verdict
    classification, time-window filtering, and the privacy invariant
    (decision_id never reaches the relay).

  tests/test_bind_m2_telemetry.py (4 tests + 3 skip-on-no-surrealdb)
    Helper-level: emit forwards args correctly, skips on empty
    decision_id, swallows telemetry failures fire-and-forget.
    Resolve-compliance verdict classification covered behind
    `pytest.importorskip("surrealdb")` since the handler module
    imports ledger.queries at top level — runs in CI, skipped local.

Local verification
------------------

  - 12 passed, 3 skipped on tests/test_m2_grounding_log.py +
    tests/test_bind_m2_telemetry.py
  - ruff check + ruff format --check + mypy all green on touched
    files (m2_grounding_log.py, handlers/bind.py,
    handlers/resolve_compliance.py, ledger/queries.py,
    ledger/adapter.py, dashboard/server.py, both new test files)

What's NOT in this PR
---------------------

Per plan-280-grounding-precision-fix.md:
  - Friction capture (≥ 5 design-partner cases) — design-partner
    work, not engineering scope.
  - PR-2 gate-flip (warn → hard) — separate small follow-up after
    PR-3 lands and we have a baseline reading. Aligns with Jin's
    "deliberate not drift" framing.
  - attempt_to_ratify_seconds field — deferred. Would need a
    `created_at` field on the binds_to edge (schema currently has
    only `confidence` + `provenance`); not worth a schema bump in
    this PR.

Closes #280.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 9, 2026


Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


…CI instead (per Jin)

Jin clarified the operator-dashboard scope: it's for users. M2 grounding
precision is an engineering quality metric, not user-facing. Reverting
the dashboard pieces; adding GitHub Actions step-summary surfacing
which is where engineers actually look for these numbers.

Reverted from PR-3's initial shape
----------------------------------

  assets/dashboard.html
    - Drop the <section id="m2-panel"> block + the renderM2 / loadM2 /
      setInterval JS. Dashboard returns to pre-#280 user view.

  dashboard/server.py
    - Drop the GET /m2_grounding route + _serve_m2_grounding handler.

  m2_grounding_log.py
    - Drop read_recent_events() (only consumer was _serve_m2_grounding;
      now dead code per Jin's "avoid bloat unless product-justified").
    - Drop now-unused `time` import.

  tests/test_m2_grounding_log.py
    - Drop test_read_recent_events_respects_window (function gone) and
      now-unused `os` import.

Added (the new piece)
---------------------

  tests/eval_grounding_recall_summary.py (new)
    Renders the PR-2 eval JSON (test-results/m2-grounding-recall.json)
    as a markdown block — precision / recall / abort-rate scoreboard,
    outcome breakdown, per-case-type recall table, gate-breach line,
    expandable miss-list capped at 25 rows. Fail-quiet: missing/malformed
    JSON degrades to a one-line note rather than failing CI.

  .github/workflows/test-mcp-regression.yml (+10)
    New "M2 metrics summary" step after the M2 eval. Pipes the
    renderer's stdout to $GITHUB_STEP_SUMMARY so the metrics show on
    the GitHub Actions run page without needing the artifact download.
    always() guard so the summary appears even when the eval step
    above warns. continue-on-error keeps it advisory.

Kept from PR-3's initial shape
------------------------------

  - The three PostHog events from handle_bind / handle_resolve_compliance.
  - The privacy-preserving local mirror at ~/.bicameral/m2_grounding.jsonl
    (operator support + diagnose CLI surface; never relayed).
  - The m2_grounding_log.py module's record_attempt / record_ratification
    public API.
  - All telemetry tests (privacy invariant pin still holds).

Net Δ on PR-3: -119 LOC dashboard pieces, +210 LOC summary renderer
+ workflow step. Tests: 11 passed, 3 skipped (resolve_compliance
import-or-skip). Ruff + ruff format + mypy all green.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@silongtan

Collaborator Author

Updated per Jin's note — the operator dashboard is for users; M2 grounding precision is an engineering quality metric and belongs on GitHub instead.

Reverted

  • assets/dashboard.html: dropped the <section id="m2-panel"> block + the renderM2 / loadM2 JS. Dashboard returns to its pre-#280 user view.
  • dashboard/server.py — dropped the GET /m2_grounding route + _serve_m2_grounding handler.
  • m2_grounding_log.py — dropped read_recent_events() (only consumer was the dashboard endpoint; now dead code).
  • Companion test removed; unused os / time imports cleaned up.

Added

  • tests/eval_grounding_recall_summary.py — renders the PR-2 eval JSON as a GitHub-Actions step-summary markdown block. Precision / recall / abort-rate scoreboard, outcome breakdown, per-case-type recall, gate-breach line, expandable miss-list capped at 25 rows. Fail-quiet on missing/malformed JSON.
  • New CI step in test-mcp-regression.yml: M2 metrics summary runs after the eval and pipes the renderer's stdout to $GITHUB_STEP_SUMMARY. Engineers see the numbers on the run page without downloading the artifact.

Kept (no change to telemetry contract)

  • The three m2_grounding_* PostHog events from handle_bind / handle_resolve_compliance.
  • The privacy-preserving local mirror at ~/.bicameral/m2_grounding.jsonl (still useful for operator-support / bicameral-mcp diagnose; decision_id still local-only, never relayed).
  • All telemetry tests, including the privacy-invariant pin.

Net Δ on this PR: −119 LOC dashboard pieces, +210 LOC summary renderer + workflow step. Tests still 11 passed / 3 skipped (resolve_compliance import-or-skip), ruff + format + mypy clean.

The previous revert left an extra blank line where the `<section
id="m2-panel">` block lived. Removes it so assets/dashboard.html is
byte-identical to origin/dev — confirming Jin's "don't change the
user dashboard" intent verbatim.

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@silongtan

Collaborator Author

assets/dashboard.html is now byte-identical to origin/dev: git diff origin/dev -- assets/dashboard.html returns 0 lines. The previous revert left a blank line where the <section> block lived; cleaned that up.

Net change to the user dashboard from this PR: zero. Confirms Jin's "the dashboard is for users" intent verbatim.

Pushed as e929a42.

@jinhongkuan jinhongkuan merged commit 58f0efa into dev May 9, 2026
9 of 10 checks passed
@silongtan silongtan deleted the 280-grounding-telemetry branch May 9, 2026 03:01
jinhongkuan pushed a commit that referenced this pull request May 9, 2026
Triages 25 dev commits onto main (already on dev as of merge time):
  • #289 — team-mode remote event-log adapter (#277)
  • #285, #284, #283 — M2 grounding telemetry, eval harness, precision fix (#280)
  • #275 — README/SECURITY surface
  • plus assorted fixes flowing through dev

Resolved conflicts in CHANGELOG.md (kept dev's [Unreleased] block,
inserted v0.14.2's release entry from main below it, then renamed
[Unreleased] → v0.14.3) and README.md (kept dev's Solo-vs-Team mode
section + extended setup-writes table from #289 — main was missing
both because PR #289 hadn't backflowed yet).

pyproject.toml: 0.14.2 → 0.14.3
RECOMMENDED_VERSION: 0.14.1 → 0.14.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin pushed a commit to Knapp-Kevin/bicameral-mcp that referenced this pull request May 21, 2026
…ne (BicameralAI#280)

PR BicameralAI#285's first CI run produced a clean baseline:

  23 cases / precision 0.913 / recall 0.913 / abort_rate 0.000
  ✓ all gates pass

That's ~7-13 pp of headroom on every gate (≥ 0.85 / ≥ 0.80 / ≤ 0.30).
Locking the baseline in before drift sets in.

Two changes to .github/workflows/test-mcp-regression.yml:

  1. `--gate-mode warn` → `--gate-mode hard`. Runner exits non-zero
     on breach instead of warning to step output.

  2. Removed `continue-on-error: true` from the eval step. The step
     now fails CI when the gate breaches. The metrics-summary step
     keeps `continue-on-error: true` so a renderer bug never masks
     the eval result — and the `always()` guard means the breach
     summary is still rendered inline when the eval fails.

After this lands, PRs that touch the bind handler / bind skill /
fixture / dataset must EITHER keep recall ≥ 0.80 / precision ≥ 0.85 /
abort_rate ≤ 0.30, OR deliberately re-record the cache by setting
BICAMERAL_GROUNDING_EVAL_RECORD=1 after a skill-prompt change.
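
The difference between the two gate modes can be sketched as follows. The thresholds are the ones quoted above; the function names and metric keys are hypothetical, and the real runner's internals are not shown in this commit.

```python
from typing import Dict, List


def find_breaches(metrics: Dict[str, float]) -> List[str]:
    """Return a human-readable line for every gate the metrics fail."""
    breaches = []
    if metrics["precision"] < 0.85:
        breaches.append(f"precision {metrics['precision']:.3f} < 0.85")
    if metrics["recall"] < 0.80:
        breaches.append(f"recall {metrics['recall']:.3f} < 0.80")
    if metrics["abort_rate"] > 0.30:
        breaches.append(f"abort_rate {metrics['abort_rate']:.3f} > 0.30")
    return breaches


def exit_code(metrics: Dict[str, float], gate_mode: str) -> int:
    """warn: report breaches but exit 0; hard: exit non-zero on any breach."""
    breaches = find_breaches(metrics)
    for line in breaches:
        print(f"gate breach: {line}")
    return 1 if breaches and gate_mode == "hard" else 0
```

With the baseline from PR #285's first CI run (0.913 / 0.913 / 0.000), both modes exit 0; the modes only diverge once a gate is breached.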

Aligns with Jin's "deliberate not drift" framing — same path the M1
eval *should* have taken (M1 has been warn-only forever; M2 is being
flipped while the baseline is fresh, days after the eval shipped).

Refs BicameralAI#280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin pushed a commit to Knapp-Kevin/bicameral-mcp that referenced this pull request May 21, 2026
…M_skill_preflight CI surfacing

Closes Parts B and C of BicameralAI#306. Part A (dataset 3→25 + Step-1 baseline) shipped in PR BicameralAI#396; this PR adds the upstream measurement Part A's relevance eval can't see and wires both into CI's GITHUB_STEP_SUMMARY.

## Part B — Step-0 invocation harness

The "implicit tool invocation" failure pattern (OpenAI's eval-skills guidance): does the agent elect to call ``bicameral.history()`` when the preflight handler returns empty? Part A's 100% Step-1 recall is moot if the agent never reaches Step-1.

New files:

- ``tests/eval/_skill_invocation_judge.py`` — multi-turn tool-use harness modeled on ``_bind_judge.py``. Exposes ``bicameral_history`` + ``submit_decision_to_proceed`` as tool defs. Same x-api-key auth, same retry envelope (3 attempts / 2-8-32s backoff), same fixture cache discipline (SHA(model | skill_sha | input_sha)). MAX_TURNS=4. Pure outcome classifier ``classify_outcome(should_invoke, invoked)`` lives at module scope so the summary renderer + sociable tests share one truth table.
- ``tests/eval/preflight_skill_invocation_dataset.jsonl`` — 15 hand-curated rows balanced 8 should_invoke / 7 should_skip. Should-invoke cases seed vocab-mismatch / ungrounded / cross-cutting policy decisions that only ``bicameral.history()`` can surface. Should-skip cases are negative controls (dark mode, dep bumps, README typo, etc.) per OpenAI's implicit-invocation testing pattern.
- ``tests/eval/run_preflight_skill_invocation_eval.py`` — pytest runner with skip-clean-without-cache-or-key. Schema sanity test enforces the 8/7 balance.
- ``tests/test_skill_invocation_judge.py`` — sociable unit tests for the 2x2 outcome classifier per CLAUDE.md (no MagicMock, table-driven).
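
To make the cache discipline and retry envelope concrete, here is an illustrative sketch. It assumes SHA-256 and a `" | "` separator for the cache key, and a geometric schedule for the backoff; the actual hash function, key layout, and retry code in ``_skill_invocation_judge.py`` may differ, and all function names below are hypothetical.

```python
import hashlib
from typing import List


def sha256_hex(text: str) -> str:
    """Hex digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fixture_cache_key(model: str, skill_text: str, input_text: str) -> str:
    """Key in the spirit of SHA(model | skill_sha | input_sha).

    Any change to the model name, the SKILL.md text, or the input row
    produces a different key, which forces a re-record.
    """
    joined = " | ".join([model, sha256_hex(skill_text), sha256_hex(input_text)])
    return sha256_hex(joined)


def backoff_schedule(attempts: int = 3, base: float = 2.0, factor: float = 4.0) -> List[float]:
    """Delays in seconds for each attempt; the defaults give 2, 8, 32."""
    return [base * factor**i for i in range(attempts)]
```

Because the SKILL.md hash is part of the key, editing the prompt invalidates every cached fixture at once, which is why a prompt change requires a deliberate re-record.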

Step-0 baseline (Sonnet 4.5, 15 fixtures committed):

| Outcome | Count | Cell |
|---|---|---|
| invoked_history_correctly | 8 | TP |
| skipped_history_should_have | 0 | FN — load-bearing failure mode |
| invoked_history_unnecessarily | 1 (S0_dark_mode) | FP — over-fetch |
| proceeded_without_fetch | 6 | TN |

| Metric | Value | Gate |
|---|---|---|
| Recall (should-invoke axis) | 100.0% (8/8) | ≥ 50% ✅ |
| Precision (TP/(TP+FP)) | 88.9% (8/9) | — |
| FP rate (over-fetch) | 14.3% (1/7) | ≤ 30% ✅ |
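
The two tables follow from a 2x2 truth table. The sketch below uses the ``classify_outcome(should_invoke, invoked)`` signature and the outcome names from the tables above; the function body and the `summarize` helper are assumptions written for illustration, not the shipped code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def classify_outcome(should_invoke: bool, invoked: bool) -> str:
    """Map one (ground truth, observed) pair to its cell in the 2x2 matrix."""
    if should_invoke and invoked:
        return "invoked_history_correctly"  # TP
    if should_invoke:
        return "skipped_history_should_have"  # FN
    if invoked:
        return "invoked_history_unnecessarily"  # FP
    return "proceeded_without_fetch"  # TN


def summarize(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute recall, precision, and FP rate from (should_invoke, invoked) pairs."""
    counts = Counter(classify_outcome(s, i) for s, i in pairs)
    tp = counts["invoked_history_correctly"]
    fn = counts["skipped_history_should_have"]
    fp = counts["invoked_history_unnecessarily"]
    tn = counts["proceeded_without_fetch"]
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```

Feeding it the baseline counts (8 TP, 0 FN, 1 FP, 6 TN) gives recall 8/8, precision 8/9, and FP rate 1/7, matching the percentages in the table.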

The single FP — S0_dark_mode — fetched history "to check for cross-cutting decisions on theming, styling, or UI state management". A reasonable instinct against the strict ground truth, but counts as wasted tokens. Worth surfacing as a soft signal; not severe enough to file the SKILL.md strengthening followup BicameralAI#306 calls out (the FN floor stays clean at 0/8).

The 100% should-invoke recall reframes the BicameralAI#58 architectural question on path C: the v0.10.0 split (handler structural, skill LLM-over-history) works at the Step-0 layer too, not just at Step-1. Combined with Part A's 100% Step-1 recall, the skill-layer architecture is empirically sound on synthetic cases. Confidence interval is the next surface — 15 rows defends ~15pp differences with 80% power per Anthropic's statistical approach to evals.

## Part C — CI surfacing

Three new CLI runners + one summary renderer mirror the M2/M6 pattern:

- ``tests/eval_preflight_skill_step1.py`` — CLI runner that drives ``_skill_judge.judge_relevance`` over the 25-row Part A dataset and emits aggregate JSON (per-axis recall + breakdown).
- ``tests/eval_preflight_skill_invocation.py`` — CLI runner that drives ``_skill_invocation_judge.run_invocation_judgment`` over the 15-row Step-0 dataset and emits aggregate JSON (2x2 confusion matrix + recall/precision/fp_rate).
- ``tests/eval_preflight_skill_summary.py`` — reads both JSONs and renders one combined markdown block (per-axis recall table + invocation 2x2 + FN miss list) to stdout. Fail-quiet on missing JSON. Mirrors ``eval_grounding_recall_summary.py`` (BicameralAI#285) and ``eval_preflight_m6_summary.py`` (BicameralAI#304).

Workflow wiring:

- ``.github/workflows/preflight-eval.yml`` — new Phase 2b step running the Step-0 pytest runner with cached fixtures + ANTHROPIC_API_KEY secret fallback. continue-on-error: true.
- ``.github/workflows/test-mcp-regression.yml`` — new M_skill_preflight block alongside M2 (line ~231) and M6 (line ~262). Runs both CLI runners with --gate-mode warn, then renders the combined summary to $GITHUB_STEP_SUMMARY. Promote to --gate-mode hard in a followup PR after one stable run (matches BicameralAI#288 M2 warn→hard pattern).

## Cache discipline

- 15 Step-0 fixtures committed under ``tests/eval/fixtures/skill_invocation_judge/``.
- CI runs cache-hits-only after merge.
- Re-record locally with ``BICAMERAL_PREFLIGHT_INVOCATION_EVAL_RECORD=1`` when the bicameral-preflight SKILL.md prompt changes (cache key includes SKILL.md SHA).

## Sample-size note

15 rows defends ~15pp differences with 80% power. Tighter claims (5pp) need ~50 rows. Expansion gated on whether the warn-only signal reveals real drift worth investing in.

## Acceptance touched

- [x] ``tests/eval/_skill_invocation_judge.py`` exists + invoked by ``tests/eval/run_preflight_skill_invocation_eval.py``. Follows the ``_bind_judge.py`` pattern (multi-turn loop + httpx retry + fixture caching keyed on SKILL.md SHA).
- [x] ``tests/eval/preflight_skill_invocation_dataset.jsonl`` contains 15 rows balanced 8/7 across should_invoke / should_skip.
- [x] ``.github/workflows/preflight-eval.yml`` Phase 2b surfaces the new step alongside the existing skill eval. ``continue-on-error: true``.
- [x] Sociable tests per CLAUDE.md (SimpleNamespace-equivalent: real dataclasses + table-driven, no MagicMock for shipped collaborators).
- [x] Baseline numbers recorded (this PR body + BicameralAI#306 first reply).
- [ ] Step-0 invocation rate < 50% on should-invoke axis → file SKILL.md strengthening followup. **NOT triggered** — baseline shows 100% recall, no FNs.

## Verification

```
$ pytest tests/test_skill_invocation_judge.py tests/eval/run_preflight_skill_invocation_eval.py
8 passed, 15 skipped (no API key; fixtures will hit cache on CI)

$ ANTHROPIC_API_KEY=... pytest tests/eval/run_preflight_skill_invocation_eval.py
15/16 pass (1 expected FP on S0_dark_mode — warn-only)

$ python tests/eval_preflight_skill_step1.py -o /tmp/s1.json
$ python tests/eval_preflight_skill_invocation.py -o /tmp/s0.json
$ python tests/eval_preflight_skill_summary.py --step1 /tmp/s1.json --step0 /tmp/s0.json
[renders the full M_skill_preflight markdown block]

$ ruff check + format + mypy → clean
```

## Test plan

- [ ] CI green (cache-hits-only) on both preflight-eval.yml and test-mcp-regression.yml
- [ ] M_skill_preflight block visible on the run summary page after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>