
evals: --benchmark first-class CLI + Benchmark interface#32307

Merged
dvargasfuertes merged 2 commits into main from apollo/evals-benchmark-cli
May 27, 2026

@vellum-apollo-bot
Contributor

Summary

Make --benchmark a first-class CLI flag so we can run public memory benchmarks (LongMemEval-V2 next) alongside the in-house personal-intelligence suite. Personal-Intelligence is no longer privileged — it's a peer of every public benchmark we run, with its own manifest declaring the directory and noun it uses for individual units.

What changes

  • benchmarks/personal-intelligence/manifest.json — first concrete manifest. Declares displayName="Personal Intelligence", unitDirName="tests", unitNoun="test".
  • src/lib/benchmark.ts: BenchmarkManifestSchema + loadBenchmark(id) returning { id, manifest, unitsDir }. Same zod pattern as profile.
  • src/lib/catalog.ts — adds getBenchmarksDir, listBenchmarkIds, listBenchmarkUnitIds(unitsDir). Keeps getTestsDir / listTestIds as back-compat shortcuts to the personal-intelligence units dir for legacy callers and the evals tests list surface.
  • src/lib/test-def.ts: loadTestDef(id, unitsDir?) accepts an optional units root; defaults to getTestsDir() so existing callers keep working.
  • evals run — new --benchmark <id> (defaults to personal-intelligence) and --filter <ids> (omit to run every unit). The required --tests <ids> from before becomes a deprecated alias that prints a one-line stderr warning and routes to --filter. Passing both --filter and a conflicting --tests rejects with an explicit error.
  • evals benchmarks list — new subcommand surfacing id, displayName, unitNoun, and unit count per benchmark.
  • README + AGENTS.md — document the manifest schema, new flag shape, and back-compat aliases.
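
A rough sketch of the manifest contract described in the list above, for readers who want to see its shape. The three field names and the { id, manifest, unitsDir } return shape come from this PR; everything else is an assumption. In particular, the PR validates with a zod schema, while this sketch swaps in a hand-rolled check so it stays dependency-free, and `parseManifest` / `describeBenchmark` are hypothetical names, not the functions in src/lib/benchmark.ts.

```typescript
// Hypothetical sketch of the manifest contract. Field names are from the PR
// description; the validation here is hand-rolled, NOT the zod schema the PR uses.
interface BenchmarkManifest {
  displayName: string;
  unitDirName: string;
  unitNoun: string;
}

interface LoadedBenchmark {
  id: string;
  manifest: BenchmarkManifest;
  unitsDir: string;
}

// Validate an already-parsed JSON value and narrow it to BenchmarkManifest.
function parseManifest(raw: unknown): BenchmarkManifest {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("manifest must be a JSON object");
  }
  const record = raw as Record<string, unknown>;
  const readString = (key: string): string => {
    const value = record[key];
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`manifest.${key} must be a non-empty string`);
    }
    return value;
  };
  return {
    displayName: readString("displayName"),
    unitDirName: readString("unitDirName"),
    unitNoun: readString("unitNoun"),
  };
}

// Assumed layout: <benchmarksDir>/<id>/<unitDirName> holds the individual units.
function describeBenchmark(
  benchmarksDir: string,
  id: string,
  rawManifest: unknown,
): LoadedBenchmark {
  const manifest = parseManifest(rawManifest);
  return { id, manifest, unitsDir: `${benchmarksDir}/${id}/${manifest.unitDirName}` };
}
```

With the personal-intelligence manifest values listed above, `describeBenchmark("benchmarks", "personal-intelligence", ...)` would resolve the units directory to benchmarks/personal-intelligence/tests.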

CLI demo

$ evals benchmarks list
benchmark              display                unit  count
---------------------  ---------------------  ----  -----
personal-intelligence  Personal Intelligence  test  1

$ evals run --profiles vellum-bare --filter timeline-recall
# ... same code path as today, just speaking the new flag vocabulary

$ evals run --profiles vellum-bare --tests timeline-recall
[evals] --tests is deprecated; use --benchmark <id> --filter <ids>.
# ... still runs
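
The flag behavior shown in the demo can be sketched roughly as follows. This is an illustration, not the code in evals/src/commands/run.ts: `resolveSelection` and its types are hypothetical, ids are assumed to be comma-separated, and "conflicting" is assumed to mean the two flags carry different id lists. The default benchmark id and the warning text are taken from this PR.

```typescript
// Hypothetical sketch of --benchmark / --filter / --tests resolution.
interface RunFlags {
  benchmark?: string;
  filter?: string; // assumed to be a comma-separated id list
  tests?: string; // deprecated alias for --filter
}

interface ResolvedSelection {
  benchmarkId: string;
  unitIds: string[] | null; // null means "run every unit"
  warnings: string[]; // lines the caller would print to stderr
}

const splitIds = (csv: string): string[] =>
  csv
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);

function resolveSelection(flags: RunFlags): ResolvedSelection {
  const warnings: string[] = [];
  const benchmarkId = flags.benchmark ?? "personal-intelligence";
  const filterIds = flags.filter !== undefined ? splitIds(flags.filter) : null;
  const testIds = flags.tests !== undefined ? splitIds(flags.tests) : null;

  if (testIds !== null) {
    warnings.push("[evals] --tests is deprecated; use --benchmark <id> --filter <ids>.");
  }

  // Assumption: the flags conflict when they name different id lists.
  if (filterIds !== null && testIds !== null) {
    const same =
      filterIds.length === testIds.length &&
      filterIds.every((id, index) => id === testIds[index]);
    if (!same) {
      throw new Error("--filter and --tests disagree; pass only --filter.");
    }
  }

  return { benchmarkId, unitIds: filterIds ?? testIds, warnings };
}
```

Returning the warning instead of printing it keeps the function pure, which makes the deprecation path easy to cover in a unit test.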

Scope discipline

No new benchmark wired here — LongMemEval-V2 scaffolding lands in PR 3 against the contract this PR establishes. No new pricing/cost concerns, no runner changes, no metric changes.

Test plan

  • bun run lint — green
  • bun run typecheck — green
  • bun test — 194 pass (was 187; +7 new tests across benchmark.test.ts and catalog.test.ts)
  • Manual CLI smoke: benchmarks list, benchmarks list --json, tests list, run --help, --tests deprecation warning fires, --filter + --tests conflict rejects.

Follow-ups

  • PR 3: scaffold benchmarks/longmemeval-v2/{manifest.json, data/download.sh, src/loader.ts}.
  • A future PR can unify per-benchmark unit listing into something like evals benchmarks units <id> and retire evals tests list, once a second benchmark exists to motivate the shape.

Each public benchmark we run (LongMemEval-V2 next) lives as a peer of
personal-intelligence under benchmarks/<id>/. This PR introduces the
contract: a Benchmark manifest declares displayName, unitDirName, and
unitNoun, and `evals run` accepts a first-class --benchmark flag plus
a generic --filter for unit selection.

- Add benchmarks/personal-intelligence/manifest.json declaring unitDirName=tests
  and unitNoun=test, so this benchmark plays by the same rules as the public
  ones it sits alongside.
- Add src/lib/benchmark.ts with BenchmarkManifestSchema + loadBenchmark(id).
- Extend src/lib/catalog.ts with getBenchmarksDir, listBenchmarkIds, and
  listBenchmarkUnitIds. Keep getTestsDir/listTestIds as back-compat shortcuts
  to the personal-intelligence units dir for legacy callers.
- Extend loadTestDef(id, unitsDir?) so callers can resolve units against
  any benchmark; default to getTestsDir() preserves existing behavior.
- `evals run` now takes --benchmark (default personal-intelligence) and
  --filter <ids> (optional — omit to run every unit). The previous required
  --tests flag is accepted as a deprecated alias with a one-line stderr warn.
  Passing both --filter and a conflicting --tests rejects with an explicit
  error rather than silently picking one.
- New `evals benchmarks list` surfaces id, displayName, unitNoun, and unit
  count for each benchmark.
- README + AGENTS.md updated to document the manifest schema, the new flag
  shape, and the legacy aliases. No new benchmark wired yet — that's PR 3.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aed739678c


Comment thread evals/src/commands/run.ts
);
}
const tests = await Promise.all(
unitIds.map((id) => loadTestDef(id, benchmark.unitsDir)),

P2: Preserve EVALS_TESTS_DIR for legacy --tests runs

When callers use the deprecated --tests path with the default personal-intelligence benchmark, this now passes benchmark.unitsDir into loadTestDef, which bypasses the legacy getTestsDir() default and therefore ignores EVALS_TESTS_DIR. Existing local scripts that set EVALS_TESTS_DIR to run an out-of-tree/custom test directory and invoke evals run --profiles ... --tests ... will now look under the committed benchmarks/personal-intelligence/tests directory and fail with a missing SPEC.md, even though evals tests list still honors the override. Consider routing the legacy --tests case through getTestsDir() or teaching loadBenchmark("personal-intelligence") to preserve the override.
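
One way the second suggestion might look, as a minimal sketch: the override is applied when resolving the units directory for personal-intelligence. `resolveUnitsDir` is a hypothetical helper, the environment is passed in as a plain record to keep the function testable, and nothing here is confirmed to match what was merged.

```typescript
// Hypothetical sketch: keep honoring EVALS_TESTS_DIR for the default benchmark.
type Env = Record<string, string | undefined>;

function resolveUnitsDir(
  benchmarkId: string,
  manifestUnitsDir: string, // directory derived from the benchmark manifest
  env: Env,
): string {
  const override = env["EVALS_TESTS_DIR"];
  // Assumption: the override only ever applied to the personal-intelligence units.
  if (
    benchmarkId === "personal-intelligence" &&
    override !== undefined &&
    override.length > 0
  ) {
    return override;
  }
  return manifestUnitsDir;
}
```

A caller would pass the process environment, so a script that sets EVALS_TESTS_DIR keeps resolving units from its custom directory while other benchmarks ignore the variable.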


@dvargasfuertes dvargasfuertes merged commit 768563b into main May 27, 2026
5 checks passed
@dvargasfuertes dvargasfuertes deleted the apollo/evals-benchmark-cli branch May 27, 2026 21:30
dvargasfuertes pushed a commit that referenced this pull request May 28, 2026
PR 3 of the LongMemEval-V2 integration workstream.

- benchmarks/longmemeval-v2/manifest.json declares the benchmark to the
  harness; conforms to BenchmarkManifestSchema from #32307.
- README.md + data/download.sh + data/.gitignore document and automate
  fetching the 7+ GB dataset from Hugging Face.
- src/loader.ts joins questions.jsonl against haystacks/lme_v2_<tier>.json
  and returns BenchmarkItem[] with a strict join (missing haystack entries
  raise instead of silently dropping). zod schema uses .passthrough() so
  forward-compat schema additions don't break the loader.
- 8 fixture-backed loader tests cover happy paths and every failure mode.
- evals/tsconfig.json + knip.json extended to cover benchmarks/*/src/**
  so per-benchmark TS code is in typecheck + dead-export scope.
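
The strict join mentioned for src/loader.ts can be illustrated with a small sketch. Only the behavior (missing haystack entries raise instead of being dropped) and the BenchmarkItem name come from the commit message; the field names, the Map-based lookup, and `joinStrict` itself are assumptions made for illustration.

```typescript
// Hypothetical sketch of a strict join; record shapes are assumed, not the real schema.
interface Question {
  questionId: string;
  question: string;
}

interface Haystack {
  sessions: unknown[];
}

interface BenchmarkItem {
  question: Question;
  haystack: Haystack;
}

function joinStrict(
  questions: Question[],
  haystacks: Map<string, Haystack>, // assumed to be keyed by question id
): BenchmarkItem[] {
  const missing: string[] = [];
  const items: BenchmarkItem[] = [];
  for (const question of questions) {
    const haystack = haystacks.get(question.questionId);
    if (haystack === undefined) {
      missing.push(question.questionId);
      continue;
    }
    items.push({ question, haystack });
  }
  // Raise once with every missing id instead of silently shrinking the benchmark.
  if (missing.length > 0) {
    throw new Error(`missing haystack entries for: ${missing.join(", ")}`);
  }
  return items;
}
```

Collecting all missing ids before throwing means a broken download surfaces as a single error that lists every gap.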

The two-conversation runner, GPT-4o judge, Phase 1 wiring, caches, and
the aggregator + HTML report land in subsequent PRs against the contract
established here.

Co-authored-by: vellum-apollo-bot[bot] <242025090+vellum-apollo-bot[bot]@users.noreply.github.com>