evals: --benchmark first-class CLI + Benchmark interface #32307
Each public benchmark we run (LongMemEval-V2 next) lives as a peer of personal-intelligence under benchmarks/<id>/. This PR introduces the contract: a Benchmark manifest declares displayName, unitDirName, and unitNoun, and `evals run` accepts a first-class --benchmark flag plus a generic --filter for unit selection.

- Add benchmarks/personal-intelligence/manifest.json declaring unitDirName=tests and unitNoun=test, so this benchmark plays by the same rules as the public ones it sits alongside.
- Add src/lib/benchmark.ts with BenchmarkManifestSchema + loadBenchmark(id).
- Extend src/lib/catalog.ts with getBenchmarksDir, listBenchmarkIds, and listBenchmarkUnitIds. Keep getTestsDir/listTestIds as back-compat shortcuts to the personal-intelligence units dir for legacy callers.
- Extend loadTestDef(id, unitsDir?) so callers can resolve units against any benchmark; default to getTestsDir() preserves existing behavior.
- `evals run` now takes --benchmark (default personal-intelligence) and --filter <ids> (optional; omit to run every unit). The previous required --tests flag is accepted as a deprecated alias with a one-line stderr warn. Passing both --filter and a conflicting --tests rejects with an explicit error rather than silently picking one.
- New `evals benchmarks list` surfaces id, displayName, unitNoun, and unit count for each benchmark.
- README + AGENTS.md updated to document the manifest schema, the new flag shape, and the legacy aliases.

No new benchmark wired yet; that's PR 3.
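As a rough illustration of the manifest contract described above, the sketch below validates the three declared fields and derives a units directory. It is not the PR's code: the PR states that `BenchmarkManifestSchema` is a zod schema, whereas this sketch hand-rolls the check to stay dependency-free. The names `parseBenchmarkManifest` and `describeBenchmark` are hypothetical, and the `benchmarks/<id>/<unitDirName>` layout is inferred from the description rather than confirmed.

```typescript
// Hypothetical stand-in for BenchmarkManifestSchema. The PR uses zod;
// this sketch checks the same three fields by hand.
interface BenchmarkManifest {
  displayName: string;
  unitDirName: string;
  unitNoun: string;
}

const REQUIRED_FIELDS = ["displayName", "unitDirName", "unitNoun"] as const;

function parseBenchmarkManifest(raw: unknown): BenchmarkManifest {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("manifest must be a JSON object");
  }
  const record = raw as Record<string, unknown>;
  // Every declared field must be a non-empty string.
  for (const field of REQUIRED_FIELDS) {
    const value = record[field];
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`manifest.${field} must be a non-empty string`);
    }
  }
  return {
    displayName: record.displayName as string,
    unitDirName: record.unitDirName as string,
    unitNoun: record.unitNoun as string,
  };
}

// Assumed result shape, mirroring the { id, manifest, unitsDir } triple
// that the PR description attributes to loadBenchmark(id).
function describeBenchmark(id: string, benchmarksDir: string, manifestJson: string) {
  const manifest = parseBenchmarkManifest(JSON.parse(manifestJson));
  return {
    id,
    manifest,
    unitsDir: `${benchmarksDir}/${id}/${manifest.unitDirName}`,
  };
}
```

With the personal-intelligence values quoted in this PR (`unitDirName` of `tests`), `describeBenchmark("personal-intelligence", "benchmarks", json)` would yield a `unitsDir` of `benchmarks/personal-intelligence/tests`; a manifest missing `unitNoun` would throw instead of loading partially.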
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aed739678c
    );
  }
  const tests = await Promise.all(
    unitIds.map((id) => loadTestDef(id, benchmark.unitsDir)),
Preserve EVALS_TESTS_DIR for legacy --tests runs
When callers use the deprecated --tests path with the default personal-intelligence benchmark, this now passes benchmark.unitsDir into loadTestDef, which bypasses the legacy getTestsDir() default and therefore ignores EVALS_TESTS_DIR. Existing local scripts that set EVALS_TESTS_DIR to run an out-of-tree/custom test directory and invoke evals run --profiles ... --tests ... will now look under the committed benchmarks/personal-intelligence/tests directory and fail with a missing SPEC.md, even though evals tests list still honors the override. Consider routing the legacy --tests case through getTestsDir() or teaching loadBenchmark("personal-intelligence") to preserve the override.
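One way to read the first suggestion is sketched below: pick the units directory per invocation, and fall back to the `EVALS_TESTS_DIR` override only for legacy `--tests` runs against the default benchmark. This is a sketch under assumptions, not the repo's code. `resolveUnitsDir` and its input shape are hypothetical, and it assumes `getTestsDir()` simply returns `EVALS_TESTS_DIR` when that variable is set, which the review implies but does not show.

```typescript
// Hypothetical helper: choose the units directory for one `evals run` call.
interface UnitsDirInput {
  benchmarkId: string;
  benchmarkUnitsDir: string; // the directory loadBenchmark(id) resolved
  usedLegacyTestsFlag: boolean; // true when the deprecated --tests was passed
  env: Record<string, string | undefined>; // e.g. process.env
}

const LEGACY_BENCHMARK_ID = "personal-intelligence";

function resolveUnitsDir(input: UnitsDirInput): string {
  const override = input.env["EVALS_TESTS_DIR"];
  const isLegacyRun =
    input.usedLegacyTestsFlag && input.benchmarkId === LEGACY_BENCHMARK_ID;
  // Honor the override only on the legacy path, so that runs using
  // --benchmark/--filter keep resolving against the benchmark's own directory.
  if (isLegacyRun && override !== undefined && override.length > 0) {
    return override;
  }
  return input.benchmarkUnitsDir;
}
```

The call site in the diff would then pass `resolveUnitsDir(...)` to `loadTestDef` instead of `benchmark.unitsDir`. Whether non-legacy runs should also honor the override is a design choice this sketch deliberately leaves closed.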
PR 3 of the LongMemEval-V2 integration workstream.

- benchmarks/longmemeval-v2/manifest.json declares the benchmark to the harness; conforms to BenchmarkManifestSchema from #32307.
- README.md + data/download.sh + data/.gitignore document and automate fetching the 7+ GB dataset from Hugging Face.
- src/loader.ts joins questions.jsonl against haystacks/lme_v2_<tier>.json and returns BenchmarkItem[] with a strict join (missing haystack entries raise instead of silently dropping). zod schema uses .passthrough() so forward-compat schema additions don't break the loader.
- 8 fixture-backed loader tests cover happy paths and every failure mode.
- evals/tsconfig.json + knip.json extended to cover benchmarks/*/src/** so per-benchmark TS code is in typecheck + dead-export scope.

The two-conversation runner, GPT-4o judge, Phase 1 wiring, caches, and the aggregator + HTML report land in subsequent PRs against the contract established here.

Co-authored-by: vellum-apollo-bot[bot] <242025090+vellum-apollo-bot[bot]@users.noreply.github.com>
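The strict join mentioned above can be illustrated with a small sketch: every question must find its haystack, and any miss raises rather than being dropped. The field names `questionId` and `haystackId` are hypothetical placeholders, since the dataset's actual schema is not shown here, and `strictJoin` is an illustrative name rather than the loader's real export. The open index signatures loosely mirror the effect of zod's `.passthrough()` by tolerating extra fields.

```typescript
// Field names are assumptions for illustration; the real dataset may differ.
interface Question {
  questionId: string;
  haystackId: string;
  [extra: string]: unknown; // unknown extra fields are tolerated
}

interface Haystack {
  haystackId: string;
  [extra: string]: unknown;
}

interface JoinedItem {
  question: Question;
  haystack: Haystack;
}

function strictJoin(questions: Question[], haystacks: Haystack[]): JoinedItem[] {
  // Index haystacks once so each question lookup is constant time.
  const byId = new Map<string, Haystack>();
  for (const haystack of haystacks) {
    byId.set(haystack.haystackId, haystack);
  }

  const items: JoinedItem[] = [];
  const missing: string[] = [];
  for (const question of questions) {
    const haystack = byId.get(question.haystackId);
    if (haystack === undefined) {
      missing.push(question.questionId);
      continue;
    }
    items.push({ question, haystack });
  }

  // Strict: report every unmatched question instead of returning a partial set.
  if (missing.length > 0) {
    throw new Error(`missing haystack for question(s): ${missing.join(", ")}`);
  }
  return items;
}
```

Collecting all misses before throwing means a single run reports every broken reference, which tends to be more useful than failing on the first one when a large download is incomplete.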
Summary
Make `--benchmark` a first-class CLI flag so we can run public memory benchmarks (LongMemEval-V2 next) alongside the in-house personal-intelligence suite. Personal-Intelligence is no longer privileged: it's a peer of every public benchmark we run, with its own manifest declaring the directory and noun it uses for individual units.
What changes
- `benchmarks/personal-intelligence/manifest.json`: first concrete manifest. Declares `displayName="Personal Intelligence"`, `unitDirName="tests"`, `unitNoun="test"`.
- `src/lib/benchmark.ts`: `BenchmarkManifestSchema` + `loadBenchmark(id)` returning `{ id, manifest, unitsDir }`. Same zod pattern as profile.
- `src/lib/catalog.ts`: adds `getBenchmarksDir`, `listBenchmarkIds`, `listBenchmarkUnitIds(unitsDir)`. Keeps `getTestsDir`/`listTestIds` as back-compat shortcuts to the personal-intelligence units dir for legacy callers and the `evals tests list` surface.
- `src/lib/test-def.ts`: `loadTestDef(id, unitsDir?)` accepts an optional units root; defaults to `getTestsDir()` so existing callers keep working.
- `evals run`: new `--benchmark <id>` (defaults to `personal-intelligence`) and `--filter <ids>` (omit to run every unit). The required `--tests <ids>` from before becomes a deprecated alias that prints a one-line stderr warning and routes to `--filter`. Passing both `--filter` and a conflicting `--tests` rejects with an explicit error.
- `evals benchmarks list`: new subcommand surfacing id, displayName, unitNoun, and unit count per benchmark.
CLI demo
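The `--filter` / `--tests` behavior listed under "What changes" can be sketched as a pure function, shown here as an illustration rather than the PR's implementation. Several details are assumptions: that `<ids>` is comma-separated, that "conflicting" means the two flags name different sets of ids, and the exact warning and error wording. `resolveFilter` and its types are hypothetical names.

```typescript
// Raw flag values as they might arrive from the CLI parser (assumed shape).
interface RunFlags {
  filter?: string;
  tests?: string;
}

interface ResolvedFilter {
  unitIds: string[] | undefined; // undefined means "run every unit"
  warnings: string[]; // lines the caller would print to stderr
}

// Assumption: ids are comma-separated; blank entries are ignored.
function splitIds(raw: string): string[] {
  return raw
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
}

// Order-insensitive, duplicate-insensitive comparison of two id lists.
function sameIds(a: string[], b: string[]): boolean {
  const normalize = (ids: string[]) =>
    ids.filter((id, i) => ids.indexOf(id) === i).sort();
  const left = normalize(a);
  const right = normalize(b);
  return left.length === right.length && left.every((id, i) => id === right[i]);
}

function resolveFilter(flags: RunFlags): ResolvedFilter {
  const warnings: string[] = [];
  const filter = flags.filter !== undefined ? splitIds(flags.filter) : undefined;
  const tests = flags.tests !== undefined ? splitIds(flags.tests) : undefined;

  if (tests !== undefined) {
    // Hypothetical wording; the PR only says a one-line warning is printed.
    warnings.push("--tests is deprecated; use --filter instead");
  }
  if (filter !== undefined && tests !== undefined && !sameIds(filter, tests)) {
    throw new Error("--filter and --tests name different units; pass only --filter");
  }
  return { unitIds: filter ?? tests, warnings };
}
```

Under these assumptions, `resolveFilter({})` selects every unit, `resolveFilter({ tests: "a,b" })` selects `a` and `b` with one warning, and `resolveFilter({ filter: "a", tests: "b" })` throws rather than silently picking one.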
Scope discipline
No new benchmark wired here — LongMemEval-V2 scaffolding lands in PR 3 against the contract this PR establishes. No new pricing/cost concerns, no runner changes, no metric changes.
Test plan
- `bun run lint`: green
- `bun run typecheck`: green
- `bun test`: 194 pass (was 187; +7 new tests across `benchmark.test.ts` and `catalog.test.ts`)
- `benchmarks list`, `benchmarks list --json`, `tests list`, `run --help`, `--tests` deprecation warning fires, `--filter` + `--tests` conflict rejects.
Follow-ups
- `benchmarks/longmemeval-v2/{manifest.json, data/download.sh, src/loader.ts}`.
- `evals benchmarks units <id>` and retire `evals tests list`, once a second benchmark exists to motivate the shape.