test(eval): #306 Part A — expand preflight_skill_dataset 3 → 25 rows + 100% baseline #396
Merged
…+ record baselines

Grows tests/eval/preflight_skill_dataset.jsonl to 25 hand-curated rows balanced 8/8/6/3 across the four axes per the #306 spec:

| Axis | Rows | Submode coverage |
|---|---|---|
| miss / vocab_mismatch | 8 | 2 synonym pairs, 3 abstraction shifts (policy ↔ impl), 2 cross-domain, 1 original M1 |
| miss / ungrounded | 8 | 3 policy (PII, dual-control, retention), 3 cross-cutting (logging, error-handling, observability), 1 edge (audit), 1 original M4 |
| false_fire / irrelevant_drilling | 6 | 1 original FF1, 1 dark-mode, 2 dep bumps, 2 docs-only; strongest negative controls per OpenAI's implicit-invocation pattern |
| correct / direct_match | 3 | sanity anchors: topic literally names the feature |

Each new row carries the required `note` field naming the specific failure submode.

Baseline (Sonnet 4.5, cached fixtures committed):

| Axis | Recall | n |
|---|---|---|
| M1 vocab_mismatch | 100% | 8/8 |
| M4 ungrounded | 100% | 8/8 |
| FF1 false_fire | 100% | 6/6 (zero over-pick) |
| D direct_match | 100% | 3/3 |
| Overall | 100% | 25/25 |

This is the **skill-layer Step-1 baseline**: given a pre-fetched bicameral.history() payload, the LLM correctly identifies relevant feature groups across all four axes. It reframes the #58 Phase A finding: handler 0% recall on vocab + unbound is "by design" (no BM25); skill layer recovers to 100% when it gets to reason over the full ledger.

Part B's Step-0 invocation eval (separate PR) is now the load-bearing question: if the agent doesn't actually call bicameral.history() when the handler returns empty, this 100% recall is moot.

Cache discipline: 25 fixtures recorded under tests/eval/fixtures/skill_judge/, keyed on SHA(model | skill_sha | input_sha) per the established _bind_judge pattern. CI runs cache-hits-only after this lands; re-record with BICAMERAL_PREFLIGHT_EVAL_RECORD=1 when the skill prompt changes.
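The cache key described above, SHA(model | skill_sha | input_sha), could be derived roughly as follows. This is a minimal sketch, not the `_bind_judge` implementation: the helper names `sha256_hex` and `fixture_key`, the choice of SHA-256, the literal `|` separator, and hashing the row as canonical JSON are all assumptions made for illustration.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def fixture_key(model: str, skill_text: str, row: dict) -> str:
    """Hypothetical cache key in the shape SHA(model | skill_sha | input_sha).

    Assumption: the skill prompt and the dataset row are hashed separately,
    then joined with the model name before a final hash.
    """
    skill_sha = sha256_hex(skill_text.encode("utf-8"))
    # Canonical JSON (sorted keys, fixed separators) so key order in the row
    # does not change the digest.
    canonical_row = json.dumps(row, sort_keys=True, separators=(",", ":"))
    input_sha = sha256_hex(canonical_row.encode("utf-8"))
    return sha256_hex(f"{model}|{skill_sha}|{input_sha}".encode("utf-8"))


# Illustrative values only; none of these come from the repository.
row = {"id": "M1", "topic": "rate limiting"}
key_a = fixture_key("example-model", "skill prompt v1", row)
key_b = fixture_key("example-model", "skill prompt v2", row)
```

Because the skill hash feeds the key, any edit to the prompt changes every key, so previously recorded fixtures stop matching and a cache-hits-only CI run would need a fresh recording.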
Sample size note: 25 rows defends ~15pp recall differences with 80% power per Anthropic's statistical approach to evals. Tighter claims (5pp) will need Part B's expansion to ~50 rows or a longer baseline horizon.

Acceptance touched:

- [x] preflight_skill_dataset.jsonl contains ≥25 rows balanced 8/8/6/3
- [x] Each row has a one-line `note` field documenting the failure submode
- [x] Baseline numbers recorded (this PR's body + commit message)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
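The first two acceptance items above lend themselves to a mechanical check. A minimal sketch, assuming each JSONL row carries an `axis` field with one of four label strings and a `note` field; the function name `check_balance` and the exact label spellings in `EXPECTED` are illustrative guesses, not the repository's schema test.

```python
import json
from collections import Counter

# Assumed axis labels, paired with the 8/8/6/3 target from the acceptance list.
EXPECTED = {
    "vocab_mismatch": 8,
    "ungrounded": 8,
    "irrelevant_drilling": 6,
    "direct_match": 3,
}


def check_balance(jsonl_text: str) -> list[str]:
    """Return a list of problems; an empty list means the dataset passes."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    problems = []

    # Per-axis row counts must match the target exactly.
    counts = Counter(row.get("axis") for row in rows)
    for axis, want in EXPECTED.items():
        found = counts.get(axis, 0)
        if found != want:
            problems.append(f"{axis}: expected {want} rows, found {found}")

    # Every row needs a non-empty, single-line note.
    for row in rows:
        note = row.get("note") or ""
        if not note.strip() or "\n" in note:
            problems.append(f"{row.get('id')}: missing or multi-line note")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single run report every imbalance at once, which is convenient when rows are being curated by hand.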
This was referenced May 16, 2026
Merged
Knapp-Kevin pushed a commit to Knapp-Kevin/bicameral-mcp that referenced this pull request on May 21, 2026
…M_skill_preflight CI surfacing

Closes Parts B and C of BicameralAI#306. Part A (dataset 3→25 + Step-1 baseline) shipped in PR BicameralAI#396; this PR adds the upstream measurement Part A's relevance eval can't see and wires both into CI's GITHUB_STEP_SUMMARY.

## Part B: Step-0 invocation harness

The "implicit tool invocation" failure pattern (OpenAI's eval-skills guidance): does the agent elect to call ``bicameral.history()`` when the preflight handler returns empty? Part A's 100% Step-1 recall is moot if the agent never reaches Step-1.

New files:

- ``tests/eval/_skill_invocation_judge.py``: multi-turn tool-use harness modeled on ``_bind_judge.py``. Exposes ``bicameral_history`` + ``submit_decision_to_proceed`` as tool defs. Same x-api-key auth, same retry envelope (3 attempts / 2-8-32s backoff), same fixture cache discipline (SHA(model | skill_sha | input_sha)). MAX_TURNS=4. Pure outcome classifier ``classify_outcome(should_invoke, invoked)`` lives at module scope so the summary renderer + sociable tests share one truth table.
- ``tests/eval/preflight_skill_invocation_dataset.jsonl``: 15 hand-curated rows balanced 8 should_invoke / 7 should_skip. Should-invoke cases seed vocab-mismatch / ungrounded / cross-cutting policy decisions that only ``bicameral.history()`` can surface. Should-skip cases are negative controls (dark mode, dep bumps, README typo, etc.) per OpenAI's implicit-invocation testing pattern.
- ``tests/eval/run_preflight_skill_invocation_eval.py``: pytest runner with skip-clean-without-cache-or-key. Schema sanity test enforces the 8/7 balance.
- ``tests/test_skill_invocation_judge.py``: sociable unit tests for the 2x2 outcome classifier per CLAUDE.md (no MagicMock, table-driven).
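The pure 2x2 classifier named in the file list might look like the sketch below. Only the signature `classify_outcome(should_invoke, invoked)` and the four outcome labels come from this commit message; the string return type, the function body, and the `TRUTH_TABLE` name are assumptions for illustration.

```python
def classify_outcome(should_invoke: bool, invoked: bool) -> str:
    """Map one (expected, observed) pair onto a cell of the 2x2 table."""
    if should_invoke and invoked:
        return "invoked_history_correctly"  # TP
    if should_invoke:
        return "skipped_history_should_have"  # FN
    if invoked:
        return "invoked_history_unnecessarily"  # FP
    return "proceeded_without_fetch"  # TN


# Table-driven check with plain values and no mocks, in the style the
# commit message describes for the sociable tests.
TRUTH_TABLE = [
    (True, True, "invoked_history_correctly"),
    (True, False, "skipped_history_should_have"),
    (False, True, "invoked_history_unnecessarily"),
    (False, False, "proceeded_without_fetch"),
]

for should_invoke, invoked, expected in TRUTH_TABLE:
    assert classify_outcome(should_invoke, invoked) == expected
```

Keeping the function free of I/O is what lets a summary renderer and a unit test import the same truth table without any fixtures.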
Step-0 baseline (Sonnet 4.5, 15 fixtures committed):

| Outcome | Count | Cell |
|---|---|---|
| invoked_history_correctly | 8 | TP |
| skipped_history_should_have | 0 | FN (load-bearing failure mode) |
| invoked_history_unnecessarily | 1 (S0_dark_mode) | FP (over-fetch) |
| proceeded_without_fetch | 6 | TN |

| Metric | Value | Gate |
|---|---|---|
| Recall (should-invoke axis) | 100.0% (8/8) | ≥ 50% ✅ |
| Precision (TP/(TP+FP)) | 88.9% (8/9) | n/a |
| FP rate (over-fetch) | 14.3% (1/7) | ≤ 30% ✅ |

The single FP, S0_dark_mode, fetched history "to check for cross-cutting decisions on theming, styling, or UI state management". A reasonable instinct against the strict ground truth, but counts as wasted tokens. Worth surfacing as a soft signal; not severe enough to file the SKILL.md strengthening followup BicameralAI#306 calls out (the FN floor stays clean at 0/8).

The 100% should-invoke recall reframes the BicameralAI#58 architectural question on path C: the v0.10.0 split (handler structural, skill LLM-over-history) works at the Step-0 layer too, not just at Step-1. Combined with Part A's 100% Step-1 recall, the skill-layer architecture is empirically sound on synthetic cases. Confidence interval is the next surface: 15 rows defends ~15pp differences with 80% power per Anthropic's statistical approach to evals.

## Part C: CI surfacing

Two new CLI runners + one summary renderer mirror the M2/M6 pattern:

- ``tests/eval_preflight_skill_step1.py``: CLI runner that drives ``_skill_judge.judge_relevance`` over the 25-row Part A dataset and emits aggregate JSON (per-axis recall + breakdown).
- ``tests/eval_preflight_skill_invocation.py``: CLI runner that drives ``_skill_invocation_judge.run_invocation_judgment`` over the 15-row Step-0 dataset and emits aggregate JSON (2x2 confusion matrix + recall/precision/fp_rate).
- ``tests/eval_preflight_skill_summary.py``: reads both JSONs and renders one combined markdown block (per-axis recall table + invocation 2x2 + FN miss list) to stdout. Fail-quiet on missing JSON. Mirrors ``eval_grounding_recall_summary.py`` (BicameralAI#285) and ``eval_preflight_m6_summary.py`` (BicameralAI#304).

Workflow wiring:

- ``.github/workflows/preflight-eval.yml``: new Phase 2b step running the Step-0 pytest runner with cached fixtures + ANTHROPIC_API_KEY secret fallback. continue-on-error: true.
- ``.github/workflows/test-mcp-regression.yml``: new M_skill_preflight block alongside M2 (line ~231) and M6 (line ~262). Runs both CLI runners with --gate-mode warn, then renders the combined summary to $GITHUB_STEP_SUMMARY. Promote to --gate-mode hard in a followup PR after one stable run (matches BicameralAI#288 M2 warn→hard pattern).

## Cache discipline

- 15 Step-0 fixtures committed under ``tests/eval/fixtures/skill_invocation_judge/``.
- CI runs cache-hits-only after merge.
- Re-record locally with ``BICAMERAL_PREFLIGHT_INVOCATION_EVAL_RECORD=1`` when the bicameral-preflight SKILL.md prompt changes (cache key includes SKILL.md SHA).

## Sample-size note

15 rows defends ~15pp differences with 80% power. Tighter claims (5pp) need ~50 rows. Expansion gated on whether the warn-only signal reveals real drift worth investing in.

## Acceptance touched

- [x] ``tests/eval/_skill_invocation_judge.py`` exists + invoked by ``tests/eval/run_preflight_skill_invocation_eval.py``. Follows the ``_bind_judge.py`` pattern (multi-turn loop + httpx retry + fixture caching keyed on SKILL.md SHA).
- [x] ``tests/eval/preflight_skill_invocation_dataset.jsonl`` contains 15 rows balanced 8/7 across should_invoke / should_skip.
- [x] ``.github/workflows/preflight-eval.yml`` Phase 2b surfaces the new step alongside the existing skill eval. ``continue-on-error: true``.
- [x] Sociable tests per CLAUDE.md (SimpleNamespace-equivalent: real dataclasses + table-driven, no MagicMock for shipped collaborators).
- [x] Baseline numbers recorded (this PR body + BicameralAI#306 first reply).
- [ ] Step-0 invocation rate < 50% on should-invoke axis → file SKILL.md strengthening followup. **NOT triggered**: baseline shows 100% recall, no FNs.

## Verification

```
$ pytest tests/test_skill_invocation_judge.py tests/eval/run_preflight_skill_invocation_eval.py
8 passed, 15 skipped (no API key; fixtures will hit cache on CI)
$ ANTHROPIC_API_KEY=... pytest tests/eval/run_preflight_skill_invocation_eval.py
15/16 pass (1 expected FP on S0_dark_mode — warn-only)
$ python tests/eval_preflight_skill_step1.py -o /tmp/s1.json
$ python tests/eval_preflight_skill_invocation.py -o /tmp/s0.json
$ python tests/eval_preflight_skill_summary.py --step1 /tmp/s1.json --step0 /tmp/s0.json
[renders the full M_skill_preflight markdown block]
$ ruff check + format + mypy → clean
```

## Test plan

- [ ] CI green (cache-hits-only) on both preflight-eval.yml and test-mcp-regression.yml
- [ ] M_skill_preflight block visible on the run summary page after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
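The Step-0 metrics and gates reported in the commit message above can be re-derived from the four outcome counts. A minimal sketch: the counts (8 / 0 / 1 / 6) and the thresholds (recall ≥ 50%, FP rate ≤ 30%) are taken from the baseline tables, while the function name `invocation_metrics` and the dictionary layout are hypothetical and do not reflect the JSON the real CLI runner emits.

```python
def invocation_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute recall, precision, and FP rate from a 2x2 confusion matrix.

    A ratio with a zero denominator is reported as None rather than guessed.
    """
    def ratio(num: int, den: int):
        return num / den if den else None

    return {
        "recall": ratio(tp, tp + fn),     # share of should-invoke rows that invoked
        "precision": ratio(tp, tp + fp),  # share of invocations that were warranted
        "fp_rate": ratio(fp, fp + tn),    # share of should-skip rows that over-fetched
    }


# Counts from the Step-0 baseline table.
metrics = invocation_metrics(tp=8, fn=0, fp=1, tn=6)

# Gates from the metric table: recall >= 50%, FP rate <= 30%.
recall_gate_ok = metrics["recall"] is not None and metrics["recall"] >= 0.50
fp_gate_ok = metrics["fp_rate"] is not None and metrics["fp_rate"] <= 0.30

print(f"recall={metrics['recall']:.1%} precision={metrics['precision']:.1%} "
      f"fp_rate={metrics['fp_rate']:.1%}")
```

With these counts the sketch reproduces the reported 100.0%, 88.9%, and 14.3%, and both gates pass.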
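## Summary

Closes Part A of #306. Grows `tests/eval/preflight_skill_dataset.jsonl` from 3 → 25 rows balanced 8/8/6/3 across the four axes the issue calls out, and records the Sonnet 4.5 Step-1 baseline.

## Dataset shape

Hand-curated; one new failure submode per row. Schema unchanged (`id/axis/title/topic/ledger/expect_relevant/expect_strict_irrelevant/note`). Existing 3 rows preserved verbatim.

| Axis | Rows | Submode coverage |
|---|---|---|
| `miss / vocab_mismatch` | 8 | 2 synonym pairs (`throttling↔rate-limit`, `soft-delete↔archive`) · 3 abstraction shifts (`GDPR↔cron`, `trust-boundary↔validator`, `perf-SLA↔cache-warm`) · 2 cross-domain (`SOC2↔countdown-UI`, `CDN↔FMP`) · 1 original M1 |
| `miss / ungrounded` | 8 | 3 policy (PII, dual-control, retention) · 3 cross-cutting (logging, error-handling, observability) · 1 edge (audit) · 1 original M4 |
| `false_fire / irrelevant_drilling` | 6 | dark-mode · `react-query` bump · `prettier` bump · CONTRIBUTING update · README typo · 1 original FF1; strongest negative controls per OpenAI's implicit-invocation pattern |
| `correct / direct_match` | 3 | sanity anchors: topic literally names the feature |

## Baseline (Sonnet 4.5, cached fixtures committed)

| Axis | Recall | n |
|---|---|---|
| `M1` `vocab_mismatch` | 100% | 8/8 |
| `M4` `ungrounded` | 100% | 8/8 |
| `FF1` `false_fire` | 100% | 6/6 (zero over-pick) |
| `D` `direct_match` | 100% | 3/3 |
| Overall | 100% | 25/25 |

## What this baseline reframes from #58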
The #58 Phase A handler-layer baseline showed 0% recall on vocab and unbound, "by design" because v0.10.0 deliberately removed BM25 from `handlers/preflight.py`. This skill-layer baseline shows the LLM recovers to 100% when handed the full `bicameral.history()` payload.

That makes #306 Part B (the Step-0 invocation eval) the load-bearing question: if the agent doesn't actually elect to call `bicameral.history()` when the handler returns empty, this 100% Step-1 recall is moot. Part B is a separate PR (next).

## Cache discipline
- 25 fixtures recorded under `tests/eval/fixtures/skill_judge/`, keyed on `SHA(model | skill_sha | input_sha)` per the established `_bind_judge` pattern.
- CI runs cache-hits-only after this lands; re-record with `BICAMERAL_PREFLIGHT_EVAL_RECORD=1` when the skill prompt changes (the cache key is invalidated automatically by the SKILL.md SHA component).

## Sample size note
25 rows defends ~15pp recall differences with 80% power per Anthropic's statistical approach to evals. Tighter claims (5pp differences) need ~50 rows; expansion gated on whether Part B reveals a Step-0 invocation gap that's the actual bottleneck.
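For a rough sense of how much uncertainty 25 rows leaves around a perfect score, one option is a Wilson score interval on the observed proportion. This is a different and simpler view than the power analysis the note cites, not a reproduction of it; the function name `wilson_interval` and the 95% level (z = 1.96) are choices made for this sketch.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# 25/25 correct, as in the Step-1 baseline.
low, high = wilson_interval(25, 25)
print(f"95% Wilson interval for 25/25: [{low:.3f}, {high:.3f}]")  # lower bound ≈ 0.867
```

When every row passes, the lower bound reduces to n / (n + z²), so at n = 25 a perfect score is still compatible with a true recall in the high 80s.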
## Acceptance touched

- [x] `preflight_skill_dataset.jsonl` contains ≥25 rows balanced 8/8/6/3 across the four axes
- [x] Each row has a one-line `note` field documenting the failure submode
- … (`_skill_invocation_judge.py` + invocation dataset): Part B, separate PR
- … (`eval_preflight_skill_summary.py` + `M_skill_preflight` block): Part C, bundled with Part B

## Verification

## Test plan

- … (`eval_preflight_m6_summary.py`, `eval_grounding_recall_summary.py`) to confirm Part C's renderer slot

🤖 Generated with Claude Code