test(eval): #306 Part A — expand preflight_skill_dataset 3 → 25 rows + 100% baseline by silongtan · Pull Request #396 · BicameralAI/bicameral-mcp

silongtan · 2026-05-16T20:12:29Z

Summary

Closes Part A of #306. Grows tests/eval/preflight_skill_dataset.jsonl from 3 → 25 rows balanced 8/8/6/3 across the four axes the issue calls out, and records the Sonnet 4.5 Step-1 baseline.

Dataset shape

Hand-curated; one new failure submode per row. Schema unchanged (id / axis / title / topic / ledger / expect_relevant / expect_strict_irrelevant / note). Existing 3 rows preserved verbatim.

| Axis | Rows | Submode coverage |
|---|---|---|
| `miss / vocab_mismatch` | 8 | 2 synonym pairs (`throttling↔rate-limit`, `soft-delete↔archive`) · 3 abstraction shifts (`GDPR↔cron`, `trust-boundary↔validator`, `perf-SLA↔cache-warm`) · 2 cross-domain (`SOC2↔countdown-UI`, `CDN↔FMP`) · 1 original M1 |
| `miss / ungrounded` | 8 | 3 policy (PII redaction, dual-control admin, raw-event retention) · 3 cross-cutting (trace-ID propagation, retry policy, p95 observability) · 1 edge (system-action audit) · 1 original M4 |
| `false_fire / irrelevant_drilling` | 6 | dark-mode toggle · `react-query` bump · `prettier` bump · CONTRIBUTING update · README typo · 1 original FF1; strongest negative controls per OpenAI's implicit-invocation pattern |
| `correct / direct_match` | 3 | billing-proration · JWT · permission-scope; sanity anchors |
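
The schema and the 8/8/6/3 balance described above can be checked mechanically. The following is a minimal sketch, not the repository's actual sanity test: the field names come from the schema line above, while the `check_dataset` helper and the short axis labels in `EXPECTED_BALANCE` are assumptions made for illustration (the real `axis` strings in the JSONL may differ).

```python
import json
from collections import Counter

# Field names taken from the schema listed in this PR description.
REQUIRED_FIELDS = {
    "id", "axis", "title", "topic", "ledger",
    "expect_relevant", "expect_strict_irrelevant", "note",
}

# ASSUMPTION: these axis labels are illustrative; the dataset may spell them differently.
EXPECTED_BALANCE = {
    "vocab_mismatch": 8,
    "ungrounded": 8,
    "irrelevant_drilling": 6,
    "direct_match": 3,
}


def check_dataset(lines, expected_balance=EXPECTED_BALANCE):
    """Return a list of human-readable problems; an empty list means the data looks sane."""
    rows = [json.loads(line) for line in lines if line.strip()]
    problems = []
    for i, row in enumerate(rows, start=1):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            problems.append(f"row {i}: missing fields {sorted(missing)}")
        elif not str(row["note"]).strip():
            problems.append(f"row {i}: empty note")
    counts = Counter(row.get("axis") for row in rows)
    for axis, want in expected_balance.items():
        got = counts.get(axis, 0)
        if got != want:
            problems.append(f"axis {axis}: expected {want} rows, found {got}")
    return problems
```

Passing the lines of `tests/eval/preflight_skill_dataset.jsonl` to `check_dataset` would return an empty list when every row carries all eight fields, a non-empty `note`, and the axis counts match.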

Baseline (Sonnet 4.5, cached fixtures committed)

| Axis | Recall | n | Notes |
|---|---|---|---|
| `M1` vocab_mismatch | 100% | 8/8 | Synonym + abstraction + cross-domain all bridged |
| `M4` ungrounded | 100% | 8/8 | Policy + cross-cutting + edge all surfaced |
| `FF1` false_fire | 100% | 6/6 | Zero over-pick on the negative controls |
| `D` direct_match | 100% | 3/3 | Sanity anchors hold |
| Overall | 100% | 25/25 | |
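
As a rough illustration of how per-axis numbers like these could be tallied, here is a sketch under stated assumptions. The pass rule in `row_passes` (every `expect_relevant` item picked, no `expect_strict_irrelevant` item picked) is an assumed reading of the schema, not confirmed by this PR, and both function names are hypothetical.

```python
from collections import defaultdict


def row_passes(picked, expect_relevant, expect_strict_irrelevant):
    """ASSUMED rule: all expected items are picked and no strictly irrelevant item is."""
    picked = set(picked)
    return set(expect_relevant) <= picked and not (picked & set(expect_strict_irrelevant))


def per_axis_recall(results):
    """Aggregate (axis, passed) pairs into {axis: (hits, n, recall)}."""
    tally = defaultdict(lambda: [0, 0])
    for axis, passed in results:
        tally[axis][0] += int(passed)
        tally[axis][1] += 1
    return {axis: (hits, n, hits / n) for axis, (hits, n) in tally.items()}
```

Under this rule a negative-control row with an empty `expect_relevant` passes as long as nothing in `expect_strict_irrelevant` is picked, which is one way to read "zero over-pick".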

What this baseline reframes from #58

The #58 Phase A handler-layer baseline showed 0% recall on vocab and unbound — "by design" because v0.10.0 deliberately removed BM25 from handlers/preflight.py. This skill-layer baseline shows the LLM recovers to 100% when handed the full bicameral.history() payload.

That makes #306 Part B (the Step-0 invocation eval) the load-bearing question: if the agent doesn't actually elect to call bicameral.history() when the handler returns empty, this 100% Step-1 recall is moot. Part B is a separate PR (next).

Cache discipline

- 25 fixtures recorded under tests/eval/fixtures/skill_judge/, keyed on SHA(model | skill_sha | input_sha) per the established _bind_judge pattern.
- CI runs cache-hits-only after merge.
- Re-record locally with BICAMERAL_PREFLIGHT_EVAL_RECORD=1 when the skill prompt changes (the cache key is invalidated automatically by the SKILL.md SHA component).
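
To make the invalidation behaviour concrete, here is a minimal sketch of a key of the form SHA(model | skill_sha | input_sha). It is an illustration only: the choice of SHA-256, the `|` separator encoding, and the `fixture_key` helper name are assumptions, and the actual `_bind_judge` implementation may differ.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest helper. ASSUMPTION: SHA-256; the PR only says 'SHA'."""
    return hashlib.sha256(data).hexdigest()


def fixture_key(model: str, skill_text: str, input_payload: str) -> str:
    """Combine the model name with content hashes of the skill prompt and the input."""
    skill_sha = sha256_hex(skill_text.encode("utf-8"))
    input_sha = sha256_hex(input_payload.encode("utf-8"))
    return sha256_hex(f"{model}|{skill_sha}|{input_sha}".encode("utf-8"))
```

Because the skill prompt's hash is part of the key, any edit to SKILL.md produces a different key, so stale fixtures are simply never looked up rather than needing explicit deletion.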

Sample size note

A 25-row dataset can defend recall differences of roughly 15pp at 80% power, per Anthropic's statistical approach to evals. Tighter claims (5pp differences) need ~50 rows; expansion is gated on whether Part B reveals that a Step-0 invocation gap is the actual bottleneck.
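
A complementary way to see how much uncertainty 25 rows leave is a confidence interval on the observed recall. The sketch below computes a Wilson score interval; it is not the power calculation cited above, and the `wilson_interval` helper is introduced here purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion, clamped to [0, 1]."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

For the 25/25 result, `wilson_interval(25, 25)` gives a 95% lower bound of roughly 0.87, so the baseline is consistent with a true recall anywhere from the high 80s to 100%.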

Acceptance touched

- preflight_skill_dataset.jsonl contains ≥25 rows balanced 8/8/6/3 across the four axes
- Each row has a one-line note field documenting the failure submode
- Baseline recorded (this PR description)
- Step-0 harness (_skill_invocation_judge.py + invocation dataset): Part B, separate PR
- CI step-summary (eval_preflight_skill_summary.py + M_skill_preflight block): Part C, bundled with Part B

Verification

$ BICAMERAL_PREFLIGHT_EVAL_RECORD=1 ANTHROPIC_API_KEY=... \
    pytest tests/eval/run_preflight_skill_eval.py -v
25 passed in 58.23s    # initial record (~$0.02 of API spend)

$ pytest tests/eval/run_preflight_skill_eval.py -v   # without key, cached
25 passed                                            # cache-hits-only path

Test plan

- CI green (cache-hits-only)
- Audit existing M-tier eval summary surfaces (eval_preflight_m6_summary.py, eval_grounding_recall_summary.py) to confirm Part C's renderer slot

🤖 Generated with Claude Code

…+ record baselines

Grows tests/eval/preflight_skill_dataset.jsonl to 25 hand-curated rows balanced 8/8/6/3 across the four axes per the #306 spec:

| Axis | Rows | Submode coverage |
|---|---|---|
| miss / vocab_mismatch | 8 | 2 synonym pairs, 3 abstraction shifts (policy ↔ impl), 2 cross-domain, 1 original M1 |
| miss / ungrounded | 8 | 3 policy (PII, dual-control, retention), 3 cross-cutting (logging, error-handling, observability), 1 edge (audit), 1 original M4 |
| false_fire / irrelevant_drilling | 6 | 1 original FF1, 1 dark-mode, 2 dep bumps, 2 docs-only; strongest negative controls per OpenAI's implicit-invocation pattern |
| correct / direct_match | 3 | sanity anchors: topic literally names the feature |

Each new row carries the required `note` field naming the specific failure submode.

Baseline (Sonnet 4.5, cached fixtures committed):

| Axis | Recall | n |
|---|---|---|
| M1 vocab_mismatch | 100% | 8/8 |
| M4 ungrounded | 100% | 8/8 |
| FF1 false_fire | 100% | 6/6 (zero over-pick) |
| D direct_match | 100% | 3/3 |
| Overall | 100% | 25/25 |

This is the **skill-layer Step-1 baseline**: given a pre-fetched bicameral.history() payload, the LLM correctly identifies relevant feature groups across all four axes. It reframes the #58 Phase A finding: handler 0% recall on vocab + unbound is "by design" (no BM25); skill layer recovers to 100% when it gets to reason over the full ledger.

Part B's Step-0 invocation eval (separate PR) is now the load-bearing question: if the agent doesn't actually call bicameral.history() when the handler returns empty, this 100% recall is moot.

Cache discipline: 25 fixtures recorded under tests/eval/fixtures/skill_judge/, keyed on SHA(model | skill_sha | input_sha) per the established _bind_judge pattern. CI runs cache-hits-only after this lands; re-record with BICAMERAL_PREFLIGHT_EVAL_RECORD=1 when the skill prompt changes.

Sample size note: 25 rows defends ~15pp recall differences with 80% power per Anthropic's statistical approach to evals. Tighter claims (5pp) will need Part B's expansion to ~50 rows or a longer baseline horizon.

Acceptance touched:

- [x] preflight_skill_dataset.jsonl contains ≥25 rows balanced 8/8/6/3
- [x] Each row has a one-line `note` field documenting the failure submode
- [x] Baseline numbers recorded (this PR's body + commit message)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-16T20:12:35Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.



…M_skill_preflight CI surfacing

Closes Parts B and C of BicameralAI#306. Part A (dataset 3→25 + Step-1 baseline) shipped in PR BicameralAI#396; this PR adds the upstream measurement Part A's relevance eval can't see and wires both into CI's GITHUB_STEP_SUMMARY.

## Part B: Step-0 invocation harness

The "implicit tool invocation" failure pattern (OpenAI's eval-skills guidance): does the agent elect to call ``bicameral.history()`` when the preflight handler returns empty? Part A's 100% Step-1 recall is moot if the agent never reaches Step-1.

New files:

- ``tests/eval/_skill_invocation_judge.py``: multi-turn tool-use harness modeled on ``_bind_judge.py``. Exposes ``bicameral_history`` + ``submit_decision_to_proceed`` as tool defs. Same x-api-key auth, same retry envelope (3 attempts / 2-8-32s backoff), same fixture cache discipline (SHA(model | skill_sha | input_sha)). MAX_TURNS=4. Pure outcome classifier ``classify_outcome(should_invoke, invoked)`` lives at module scope so the summary renderer + sociable tests share one truth table.
- ``tests/eval/preflight_skill_invocation_dataset.jsonl``: 15 hand-curated rows balanced 8 should_invoke / 7 should_skip. Should-invoke cases seed vocab-mismatch / ungrounded / cross-cutting policy decisions that only ``bicameral.history()`` can surface. Should-skip cases are negative controls (dark mode, dep bumps, README typo, etc.) per OpenAI's implicit-invocation testing pattern.
- ``tests/eval/run_preflight_skill_invocation_eval.py``: pytest runner with skip-clean-without-cache-or-key. Schema sanity test enforces the 8/7 balance.
- ``tests/test_skill_invocation_judge.py``: sociable unit tests for the 2x2 outcome classifier per CLAUDE.md (no MagicMock, table-driven).
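
For illustration, the 2x2 outcome classifier described above could look roughly like the sketch below. The function name, its two parameters, and the four outcome labels appear in this commit message; the body is an assumption about how they map together, not the shipped code.

```python
def classify_outcome(should_invoke: bool, invoked: bool) -> str:
    """Map (ground truth, observed behaviour) onto one of four named outcomes."""
    if should_invoke and invoked:
        return "invoked_history_correctly"      # TP
    if should_invoke and not invoked:
        return "skipped_history_should_have"    # FN
    if not should_invoke and invoked:
        return "invoked_history_unnecessarily"  # FP
    return "proceeded_without_fetch"            # TN


# Table-driven check in the spirit of the sociable tests: plain data, no mocks.
TRUTH_TABLE = [
    (True, True, "invoked_history_correctly"),
    (True, False, "skipped_history_should_have"),
    (False, True, "invoked_history_unnecessarily"),
    (False, False, "proceeded_without_fetch"),
]
```

Keeping the classifier a pure function of two booleans is what lets a summary renderer and unit tests share the same truth table without touching the API or the fixture cache.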
Step-0 baseline (Sonnet 4.5, 15 fixtures committed):

| Outcome | Count | Cell |
|---|---|---|
| invoked_history_correctly | 8 | TP |
| skipped_history_should_have | 0 | FN (load-bearing failure mode) |
| invoked_history_unnecessarily | 1 (S0_dark_mode) | FP (over-fetch) |
| proceeded_without_fetch | 6 | TN |

| Metric | Value | Gate |
|---|---|---|
| Recall (should-invoke axis) | 100.0% (8/8) | ≥ 50% ✅ |
| Precision (TP/(TP+FP)) | 88.9% (8/9) | none |
| FP rate (over-fetch) | 14.3% (1/7) | ≤ 30% ✅ |

The single FP, S0_dark_mode, fetched history "to check for cross-cutting decisions on theming, styling, or UI state management". A reasonable instinct against the strict ground truth, but counts as wasted tokens. Worth surfacing as a soft signal; not severe enough to file the SKILL.md strengthening followup BicameralAI#306 calls out (the FN floor stays clean at 0/8).

The 100% should-invoke recall reframes the BicameralAI#58 architectural question on path C: the v0.10.0 split (handler structural, skill LLM-over-history) works at the Step-0 layer too, not just at Step-1. Combined with Part A's 100% Step-1 recall, the skill-layer architecture is empirically sound on synthetic cases. Confidence interval is the next surface: 15 rows defends ~15pp differences with 80% power per Anthropic's statistical approach to evals.

## Part C: CI surfacing

Two new CLI runners + one summary renderer mirror the M2/M6 pattern:

- ``tests/eval_preflight_skill_step1.py``: CLI runner that drives ``_skill_judge.judge_relevance`` over the 25-row Part A dataset and emits aggregate JSON (per-axis recall + breakdown).
- ``tests/eval_preflight_skill_invocation.py``: CLI runner that drives ``_skill_invocation_judge.run_invocation_judgment`` over the 15-row Step-0 dataset and emits aggregate JSON (2x2 confusion matrix + recall/precision/fp_rate).
- ``tests/eval_preflight_skill_summary.py``: reads both JSONs and renders one combined markdown block (per-axis recall table + invocation 2x2 + FN miss list) to stdout. Fail-quiet on missing JSON. Mirrors ``eval_grounding_recall_summary.py`` (BicameralAI#285) and ``eval_preflight_m6_summary.py`` (BicameralAI#304).

Workflow wiring:

- ``.github/workflows/preflight-eval.yml``: new Phase 2b step running the Step-0 pytest runner with cached fixtures + ANTHROPIC_API_KEY secret fallback. continue-on-error: true.
- ``.github/workflows/test-mcp-regression.yml``: new M_skill_preflight block alongside M2 (line ~231) and M6 (line ~262). Runs both CLI runners with --gate-mode warn, then renders the combined summary to $GITHUB_STEP_SUMMARY. Promote to --gate-mode hard in a followup PR after one stable run (matches BicameralAI#288 M2 warn→hard pattern).

## Cache discipline

- 15 Step-0 fixtures committed under ``tests/eval/fixtures/skill_invocation_judge/``.
- CI runs cache-hits-only after merge.
- Re-record locally with ``BICAMERAL_PREFLIGHT_INVOCATION_EVAL_RECORD=1`` when the bicameral-preflight SKILL.md prompt changes (cache key includes SKILL.md SHA).

## Sample-size note

15 rows defends ~15pp differences with 80% power. Tighter claims (5pp) need ~50 rows. Expansion gated on whether the warn-only signal reveals real drift worth investing in.

## Acceptance touched

- [x] ``tests/eval/_skill_invocation_judge.py`` exists + invoked by ``tests/eval/run_preflight_skill_invocation_eval.py``. Follows the ``_bind_judge.py`` pattern (multi-turn loop + httpx retry + fixture caching keyed on SKILL.md SHA).
- [x] ``tests/eval/preflight_skill_invocation_dataset.jsonl`` contains 15 rows balanced 8/7 across should_invoke / should_skip.
- [x] ``.github/workflows/preflight-eval.yml`` Phase 2b surfaces the new step alongside the existing skill eval. ``continue-on-error: true``.
- [x] Sociable tests per CLAUDE.md (SimpleNamespace-equivalent: real dataclasses + table-driven, no MagicMock for shipped collaborators).
- [x] Baseline numbers recorded (this PR body + BicameralAI#306 first reply).
- [ ] Step-0 invocation rate < 50% on should-invoke axis → file SKILL.md strengthening followup. **NOT triggered**: baseline shows 100% recall, no FNs.

## Verification

```
$ pytest tests/test_skill_invocation_judge.py tests/eval/run_preflight_skill_invocation_eval.py
8 passed, 15 skipped (no API key; fixtures will hit cache on CI)
$ ANTHROPIC_API_KEY=... pytest tests/eval/run_preflight_skill_invocation_eval.py
15/16 pass (1 expected FP on S0_dark_mode — warn-only)
$ python tests/eval_preflight_skill_step1.py -o /tmp/s1.json
$ python tests/eval_preflight_skill_invocation.py -o /tmp/s0.json
$ python tests/eval_preflight_skill_summary.py --step1 /tmp/s1.json --step0 /tmp/s0.json
[renders the full M_skill_preflight markdown block]
$ ruff check + format + mypy → clean
```

## Test plan

- [ ] CI green (cache-hits-only) on both preflight-eval.yml and test-mcp-regression.yml
- [ ] M_skill_preflight block visible on the run summary page after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
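
The Step-0 metrics and gates reported in the commit message above follow directly from the four confusion-matrix counts. The sketch below reproduces that arithmetic; the `invocation_metrics` and `gates_pass` helpers are hypothetical names for illustration, and only the thresholds (recall ≥ 50%, FP rate ≤ 30%) come from the source.

```python
def invocation_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Recall, precision, and false-positive rate from 2x2 confusion counts."""
    def ratio(num: int, den: int) -> float:
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den else 0.0

    return {
        "recall": ratio(tp, tp + fn),      # should-invoke rows that fetched history
        "precision": ratio(tp, tp + fp),   # fetches that were actually warranted
        "fp_rate": ratio(fp, fp + tn),     # should-skip rows that over-fetched
    }


def gates_pass(metrics: dict, min_recall: float = 0.50, max_fp_rate: float = 0.30) -> bool:
    """Thresholds as stated in the commit message; precision has no gate."""
    return metrics["recall"] >= min_recall and metrics["fp_rate"] <= max_fp_rate
```

With the reported counts (TP=8, FN=0, FP=1, TN=6), recall is 8/8, precision is 8/9, and the FP rate is 1/7, so both gates hold.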

silongtan temporarily deployed to ci-test May 16, 2026 20:12 — with GitHub Actions Inactive

silongtan merged commit 9904658 into dev May 16, 2026
10 checks passed

silongtan deleted the infra/306-skill-preflight-eval-expand branch May 16, 2026 20:16
