Skip to content

test(eval): #306 Part A — expand preflight_skill_dataset 3 → 25 rows + 100% baseline#396

Merged
silongtan merged 1 commit into
devfrom
infra/306-skill-preflight-eval-expand
May 16, 2026
Merged

test(eval): #306 Part A — expand preflight_skill_dataset 3 → 25 rows + 100% baseline#396
silongtan merged 1 commit into
devfrom
infra/306-skill-preflight-eval-expand

Conversation

@silongtan

Copy link
Copy Markdown
Collaborator

Summary

Closes Part A of #306. Grows tests/eval/preflight_skill_dataset.jsonl from 3 → 25 rows balanced 8/8/6/3 across the four axes the issue calls out, and records the Sonnet 4.5 Step-1 baseline.

Dataset shape

Hand-curated; one new failure submode per row. Schema unchanged (id / axis / title / topic / ledger / expect_relevant / expect_strict_irrelevant / note). Existing 3 rows preserved verbatim.

Axis Rows Submode coverage
miss / vocab_mismatch 8 2 synonym pairs (throttling↔rate-limit, soft-delete↔archive) · 3 abstraction shifts (GDPR↔cron, trust-boundary↔validator, perf-SLA↔cache-warm) · 2 cross-domain (SOC2↔countdown-UI, CDN↔FMP) · 1 original M1
miss / ungrounded 8 3 policy (PII redaction, dual-control admin, raw-event retention) · 3 cross-cutting (trace-ID propagation, retry policy, p95 observability) · 1 edge (system-action audit) · 1 original M4
false_fire / irrelevant_drilling 6 dark-mode toggle · react-query bump · prettier bump · CONTRIBUTING update · README typo · 1 original FF1 — strongest negative controls per OpenAI's implicit-invocation pattern
correct / direct_match 3 billing-proration · JWT · permission-scope — sanity anchors

Baseline (Sonnet 4.5, cached fixtures committed)

Axis Recall n Notes
M1 vocab_mismatch 100% 8/8 Synonym + abstraction + cross-domain all bridged
M4 ungrounded 100% 8/8 Policy + cross-cutting + edge all surfaced
FF1 false_fire 100% 6/6 Zero over-pick on the negative controls
D direct_match 100% 3/3 Sanity anchors hold
Overall 100% 25/25

What this baseline reframes from #58

The #58 Phase A handler-layer baseline showed 0% recall on vocab and unbound — "by design" because v0.10.0 deliberately removed BM25 from handlers/preflight.py. This skill-layer baseline shows the LLM recovers to 100% when handed the full bicameral.history() payload.

That makes #306 Part B (the Step-0 invocation eval) the load-bearing question: if the agent doesn't actually elect to call bicameral.history() when the handler returns empty, this 100% Step-1 recall is moot. Part B is a separate PR (next).

Cache discipline

  • 25 fixtures recorded under tests/eval/fixtures/skill_judge/, keyed on SHA(model | skill_sha | input_sha) per the established _bind_judge pattern.
  • CI runs cache-hits-only after merge.
  • Re-record locally with BICAMERAL_PREFLIGHT_EVAL_RECORD=1 when the skill prompt changes (the cache key is invalidated automatically by the SKILL.md SHA component).

Sample size note

25 rows defends ~15pp recall differences with 80% power per Anthropic's statistical approach to evals. Tighter claims (5pp differences) need ~50 rows; expansion gated on whether Part B reveals a Step-0 invocation gap that's the actual bottleneck.

Acceptance touched

  • preflight_skill_dataset.jsonl contains ≥25 rows balanced 8/8/6/3 across the four axes
  • Each row has a one-line note field documenting the failure submode
  • Baseline recorded (this PR description)
  • Step-0 harness (_skill_invocation_judge.py + invocation dataset) — Part B, separate PR
  • CI step-summary (eval_preflight_skill_summary.py + M_skill_preflight block) — Part C, bundled with Part B

Verification

$ BICAMERAL_PREFLIGHT_EVAL_RECORD=1 ANTHROPIC_API_KEY=... \
    pytest tests/eval/run_preflight_skill_eval.py -v
25 passed in 58.23s    # initial record (~$0.02 of API spend)

$ pytest tests/eval/run_preflight_skill_eval.py -v   # without key, cached
25 passed                                            # cache-hits-only path

Test plan

  • CI green (cache-hits-only)
  • Audit existing M-tier eval summary surfaces (eval_preflight_m6_summary.py, eval_grounding_recall_summary.py) to confirm Part C's renderer slot

🤖 Generated with Claude Code

…+ record baselines

Grows tests/eval/preflight_skill_dataset.jsonl to 25 hand-curated rows
balanced 8/8/6/3 across the four axes per the #306 spec:

| Axis | Rows | Submode coverage |
|---|---|---|
| miss / vocab_mismatch | 8 | 2 synonym pairs, 3 abstraction shifts (policy ↔ impl), 2 cross-domain, 1 original M1 |
| miss / ungrounded | 8 | 3 policy (PII, dual-control, retention), 3 cross-cutting (logging, error-handling, observability), 1 edge (audit), 1 original M4 |
| false_fire / irrelevant_drilling | 6 | 1 original FF1, 1 dark-mode, 2 dep bumps, 2 docs-only — strongest negative controls per OpenAI's implicit-invocation pattern |
| correct / direct_match | 3 | sanity anchors — topic literally names the feature |

Each new row carries the required `note` field naming the specific failure submode.

Baseline (Sonnet 4.5, cached fixtures committed):

| Axis | Recall | n |
|---|---|---|
| M1 vocab_mismatch | 100% | 8/8 |
| M4 ungrounded | 100% | 8/8 |
| FF1 false_fire | 100% | 6/6 (zero over-pick) |
| D direct_match | 100% | 3/3 |
| Overall | 100% | 25/25 |

This is the **skill-layer Step-1 baseline** — given a pre-fetched bicameral.history()
payload, the LLM correctly identifies relevant feature groups across all four axes.

It reframes the #58 Phase A finding: handler 0% recall on vocab + unbound is
"by design" (no BM25); skill layer recovers to 100% when it gets to reason
over the full ledger. Part B's Step-0 invocation eval (separate PR) is now
the load-bearing question — if the agent doesn't actually call
bicameral.history() when the handler returns empty, this 100% recall is moot.

Cache discipline: 25 fixtures recorded under tests/eval/fixtures/skill_judge/,
keyed on SHA(model | skill_sha | input_sha) per the established _bind_judge
pattern. CI runs cache-hits-only after this lands; re-record with
BICAMERAL_PREFLIGHT_EVAL_RECORD=1 when the skill prompt changes.

Sample size note: 25 rows defends ~15pp recall differences with 80% power
per Anthropic's statistical approach to evals. Tighter claims (5pp) will
need Part B's expansion to ~50 rows or a longer baseline horizon.

Acceptance touched:
- [x] preflight_skill_dataset.jsonl contains ≥25 rows balanced 8/8/6/3
- [x] Each row has a one-line `note` field documenting the failure submode
- [x] Baseline numbers recorded (this PR's body + commit message)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 16, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 90e82d2c-7714-421b-b1c4-4e53a675259d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch infra/306-skill-preflight-eval-expand

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@silongtan silongtan merged commit 9904658 into dev May 16, 2026
10 checks passed
@silongtan silongtan deleted the infra/306-skill-preflight-eval-expand branch May 16, 2026 20:16
Knapp-Kevin pushed a commit to Knapp-Kevin/bicameral-mcp that referenced this pull request May 21, 2026
…M_skill_preflight CI surfacing

Closes Parts B and C of BicameralAI#306. Part A (dataset 3→25 + Step-1 baseline) shipped in PR BicameralAI#396; this PR adds the upstream measurement Part A's relevance eval can't see and wires both into CI's GITHUB_STEP_SUMMARY.

## Part B — Step-0 invocation harness

The "implicit tool invocation" failure pattern (OpenAI's eval-skills guidance): does the agent elect to call ``bicameral.history()`` when the preflight handler returns empty? Part A's 100% Step-1 recall is moot if the agent never reaches Step-1.

New files:

- ``tests/eval/_skill_invocation_judge.py`` — multi-turn tool-use harness modeled on ``_bind_judge.py``. Exposes ``bicameral_history`` + ``submit_decision_to_proceed`` as tool defs. Same x-api-key auth, same retry envelope (3 attempts / 2-8-32s backoff), same fixture cache discipline (SHA(model | skill_sha | input_sha)). MAX_TURNS=4. Pure outcome classifier ``classify_outcome(should_invoke, invoked)`` lives at module scope so the summary renderer + sociable tests share one truth table.
- ``tests/eval/preflight_skill_invocation_dataset.jsonl`` — 15 hand-curated rows balanced 8 should_invoke / 7 should_skip. Should-invoke cases seed vocab-mismatch / ungrounded / cross-cutting policy decisions that only ``bicameral.history()`` can surface. Should-skip cases are negative controls (dark mode, dep bumps, README typo, etc.) per OpenAI's implicit-invocation testing pattern.
- ``tests/eval/run_preflight_skill_invocation_eval.py`` — pytest runner with skip-clean-without-cache-or-key. Schema sanity test enforces the 8/7 balance.
- ``tests/test_skill_invocation_judge.py`` — sociable unit tests for the 2x2 outcome classifier per CLAUDE.md (no MagicMock, table-driven).

Step-0 baseline (Sonnet 4.5, 15 fixtures committed):

| Outcome | Count | Cell |
|---|---|---|
| invoked_history_correctly | 8 | TP |
| skipped_history_should_have | 0 | FN — load-bearing failure mode |
| invoked_history_unnecessarily | 1 (S0_dark_mode) | FP — over-fetch |
| proceeded_without_fetch | 6 | TN |

| Metric | Value | Gate |
|---|---|---|
| Recall (should-invoke axis) | 100.0% (8/8) | ≥ 50% ✅ |
| Precision (TP/(TP+FP)) | 88.9% (8/9) | — |
| FP rate (over-fetch) | 14.3% (1/7) | ≤ 30% ✅ |

The single FP — S0_dark_mode — fetched history "to check for cross-cutting decisions on theming, styling, or UI state management". A reasonable instinct against the strict ground truth, but counts as wasted tokens. Worth surfacing as a soft signal; not severe enough to file the SKILL.md strengthening followup BicameralAI#306 calls out (the FN floor stays clean at 0/8).

The 100% should-invoke recall reframes the BicameralAI#58 architectural question on path C: the v0.10.0 split (handler structural, skill LLM-over-history) works at the Step-0 layer too, not just at Step-1. Combined with Part A's 100% Step-1 recall, the skill-layer architecture is empirically sound on synthetic cases. Confidence interval is the next surface — 15 rows defends ~15pp differences with 80% power per Anthropic's statistical approach to evals.

## Part C — CI surfacing

Three new CLI runners + one summary renderer mirror the M2/M6 pattern:

- ``tests/eval_preflight_skill_step1.py`` — CLI runner that drives ``_skill_judge.judge_relevance`` over the 25-row Part A dataset and emits aggregate JSON (per-axis recall + breakdown).
- ``tests/eval_preflight_skill_invocation.py`` — CLI runner that drives ``_skill_invocation_judge.run_invocation_judgment`` over the 15-row Step-0 dataset and emits aggregate JSON (2x2 confusion matrix + recall/precision/fp_rate).
- ``tests/eval_preflight_skill_summary.py`` — reads both JSONs and renders one combined markdown block (per-axis recall table + invocation 2x2 + FN miss list) to stdout. Fail-quiet on missing JSON. Mirrors ``eval_grounding_recall_summary.py`` (BicameralAI#285) and ``eval_preflight_m6_summary.py`` (BicameralAI#304).

Workflow wiring:

- ``.github/workflows/preflight-eval.yml`` — new Phase 2b step running the Step-0 pytest runner with cached fixtures + ANTHROPIC_API_KEY secret fallback. continue-on-error: true.
- ``.github/workflows/test-mcp-regression.yml`` — new M_skill_preflight block alongside M2 (line ~231) and M6 (line ~262). Runs both CLI runners with --gate-mode warn, then renders the combined summary to $GITHUB_STEP_SUMMARY. Promote to --gate-mode hard in a followup PR after one stable run (matches BicameralAI#288 M2 warn→hard pattern).

## Cache discipline

- 15 Step-0 fixtures committed under ``tests/eval/fixtures/skill_invocation_judge/``.
- CI runs cache-hits-only after merge.
- Re-record locally with ``BICAMERAL_PREFLIGHT_INVOCATION_EVAL_RECORD=1`` when the bicameral-preflight SKILL.md prompt changes (cache key includes SKILL.md SHA).

## Sample-size note

15 rows defends ~15pp differences with 80% power. Tighter claims (5pp) need ~50 rows. Expansion gated on whether the warn-only signal reveals real drift worth investing in.

## Acceptance touched

- [x] ``tests/eval/_skill_invocation_judge.py`` exists + invoked by ``tests/eval/run_preflight_skill_invocation_eval.py``. Follows the ``_bind_judge.py`` pattern (multi-turn loop + httpx retry + fixture caching keyed on SKILL.md SHA).
- [x] ``tests/eval/preflight_skill_invocation_dataset.jsonl`` contains 15 rows balanced 8/7 across should_invoke / should_skip.
- [x] ``.github/workflows/preflight-eval.yml`` Phase 2b surfaces the new step alongside the existing skill eval. ``continue-on-error: true``.
- [x] Sociable tests per CLAUDE.md (SimpleNamespace-equivalent: real dataclasses + table-driven, no MagicMock for shipped collaborators).
- [x] Baseline numbers recorded (this PR body + BicameralAI#306 first reply).
- [ ] Step-0 invocation rate < 50% on should-invoke axis → file SKILL.md strengthening followup. **NOT triggered** — baseline shows 100% recall, no FNs.

## Verification

```
$ pytest tests/test_skill_invocation_judge.py tests/eval/run_preflight_skill_invocation_eval.py
8 passed, 15 skipped (no API key; fixtures will hit cache on CI)

$ ANTHROPIC_API_KEY=... pytest tests/eval/run_preflight_skill_invocation_eval.py
15/16 pass (1 expected FP on S0_dark_mode — warn-only)

$ python tests/eval_preflight_skill_step1.py -o /tmp/s1.json
$ python tests/eval_preflight_skill_invocation.py -o /tmp/s0.json
$ python tests/eval_preflight_skill_summary.py --step1 /tmp/s1.json --step0 /tmp/s0.json
[renders the full M_skill_preflight markdown block]

$ ruff check + format + mypy → clean
```

## Test plan

- [ ] CI green (cache-hits-only) on both preflight-eval.yml and test-mcp-regression.yml
- [ ] M_skill_preflight block visible on the run summary page after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant