evals: --benchmark first-class CLI + Benchmark interface by vellum-apollo-bot[bot] · Pull Request #32307 · vellum-ai/vellum-assistant

vellum-apollo-bot · 2026-05-27T20:09:24Z

Summary

Make --benchmark a first-class CLI flag so we can run public memory benchmarks (LongMemEval-V2 next) alongside the in-house personal-intelligence suite. Personal-Intelligence is no longer privileged — it's a peer of every public benchmark we run, with its own manifest declaring the directory and noun it uses for individual units.

What changes

benchmarks/personal-intelligence/manifest.json — first concrete manifest. Declares displayName="Personal Intelligence", unitDirName="tests", unitNoun="test".
src/lib/benchmark.ts — BenchmarkManifestSchema + loadBenchmark(id) returning { id, manifest, unitsDir }. Same zod pattern as profile.
src/lib/catalog.ts — adds getBenchmarksDir, listBenchmarkIds, listBenchmarkUnitIds(unitsDir). Keeps getTestsDir / listTestIds as back-compat shortcuts to the personal-intelligence units dir for legacy callers and the evals tests list surface.
src/lib/test-def.ts — loadTestDef(id, unitsDir?) accepts an optional units root; defaults to getTestsDir() so existing callers keep working.
evals run — new --benchmark <id> (defaults to personal-intelligence) and --filter <ids> (omit to run every unit). The required --tests <ids> from before becomes a deprecated alias that prints a one-line stderr warning and routes to --filter. Passing both --filter and a conflicting --tests rejects with an explicit error.
evals benchmarks list — new subcommand surfacing id, displayName, unitNoun, and unit count per benchmark.
README + AGENTS.md — document the manifest schema, new flag shape, and back-compat aliases.
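
For readers who want a concrete picture of the manifest contract, here is a minimal sketch. It is not the code in src/lib/benchmark.ts: the PR uses a zod schema, while this version uses a hand-written validator so it stays dependency-free. The names `parseBenchmarkManifest` and `describeBenchmark` are hypothetical, and the `benchmarks/<id>/<unitDirName>` layout is inferred from the paths mentioned in this PR.

```typescript
// Hypothetical shape based on the three manifest fields named in this PR.
interface BenchmarkManifest {
  displayName: string;
  unitDirName: string;
  unitNoun: string;
}

const REQUIRED_FIELDS = ["displayName", "unitDirName", "unitNoun"] as const;

// Validate an already-parsed JSON value; throws naming the offending field.
function parseBenchmarkManifest(raw: unknown): BenchmarkManifest {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("manifest must be a JSON object");
  }
  const record = raw as Record<string, unknown>;
  for (const field of REQUIRED_FIELDS) {
    const value = record[field];
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`manifest.${field} must be a non-empty string`);
    }
  }
  return {
    displayName: record.displayName as string,
    unitDirName: record.unitDirName as string,
    unitNoun: record.unitNoun as string,
  };
}

// Mirrors the { id, manifest, unitsDir } return shape described above.
// Assumption: units live under <benchmarksDir>/<id>/<unitDirName>.
function describeBenchmark(benchmarksDir: string, id: string, rawManifest: unknown) {
  const manifest = parseBenchmarkManifest(rawManifest);
  return { id, manifest, unitsDir: `${benchmarksDir}/${id}/${manifest.unitDirName}` };
}

const pi = describeBenchmark("benchmarks", "personal-intelligence", {
  displayName: "Personal Intelligence",
  unitDirName: "tests",
  unitNoun: "test",
});
console.log(pi.unitsDir);
```

Taking the parsed JSON as an argument, rather than reading the file inside the function, keeps the validation step testable without touching the filesystem.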

CLI demo

$ evals benchmarks list
benchmark              display                unit  count
---------------------  ---------------------  ----  -----
personal-intelligence  Personal Intelligence  test  1

$ evals run --profiles vellum-bare --filter timeline-recall
# ... same code path as today, just speaking the new flag vocabulary

$ evals run --profiles vellum-bare --tests timeline-recall
[evals] --tests is deprecated; use --benchmark <id> --filter <ids>.
# ... still runs
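
The flag handling shown in the demo can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `RunFlags`, `resolveSelection`, and the helper names are hypothetical, ids are assumed to be comma-separated, and "conflicting" is assumed to mean that the two flags name different sets of ids.

```typescript
interface RunFlags {
  benchmark?: string; // --benchmark <id>
  filter?: string;    // --filter <ids> (comma-separated is an assumption)
  tests?: string;     // deprecated --tests <ids>
}

interface ResolvedSelection {
  benchmarkId: string;
  unitIds: string[] | null; // null means "run every unit"
  warnings: string[];       // lines the caller would print to stderr
}

function splitIds(value: string): string[] {
  return value.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
}

// Order-insensitive comparison of two id lists.
function sameIds(a: string[], b: string[]): boolean {
  const left = [...a].sort();
  const right = [...b].sort();
  return left.length === right.length && left.every((id, i) => id === right[i]);
}

function resolveSelection(flags: RunFlags): ResolvedSelection {
  const warnings: string[] = [];
  const filterIds = flags.filter !== undefined ? splitIds(flags.filter) : null;
  const legacyIds = flags.tests !== undefined ? splitIds(flags.tests) : null;

  if (legacyIds !== null) {
    warnings.push("[evals] --tests is deprecated; use --benchmark <id> --filter <ids>.");
  }
  // Reject rather than silently picking one of the two flags.
  if (filterIds !== null && legacyIds !== null && !sameIds(filterIds, legacyIds)) {
    throw new Error("--filter and --tests name different ids; pass only --filter.");
  }
  return {
    benchmarkId: flags.benchmark ?? "personal-intelligence",
    unitIds: filterIds ?? legacyIds,
    warnings,
  };
}

const legacy = resolveSelection({ tests: "timeline-recall" });
console.log(legacy.benchmarkId, legacy.unitIds, legacy.warnings.length);
```

Returning warnings as data instead of writing to stderr directly is a design choice of this sketch; it makes the deprecation path easy to assert on in tests.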

Scope discipline

No new benchmark wired here — LongMemEval-V2 scaffolding lands in PR 3 against the contract this PR establishes. No new pricing/cost concerns, no runner changes, no metric changes.

Test plan

bun run lint — green
bun run typecheck — green
bun test — 194 pass (was 187; +7 new tests across benchmark.test.ts and catalog.test.ts)
Manual CLI smoke: benchmarks list, benchmarks list --json, tests list, run --help, --tests deprecation warning fires, --filter + --tests conflict rejects.

Follow-ups

PR 3: scaffold benchmarks/longmemeval-v2/{manifest.json, data/download.sh, src/loader.ts}.
A future PR can unify per-benchmark unit listing into something like evals benchmarks units <id> and retire evals tests list, once a second benchmark exists to motivate the shape.

Each public benchmark we run (LongMemEval-V2 next) lives as a peer of personal-intelligence under benchmarks/<id>/. This PR introduces the contract: a Benchmark manifest declares displayName, unitDirName, and unitNoun, and `evals run` accepts a first-class --benchmark flag plus a generic --filter for unit selection.

- Add benchmarks/personal-intelligence/manifest.json declaring unitDirName=tests and unitNoun=test, so this benchmark plays by the same rules as the public ones it sits alongside.
- Add src/lib/benchmark.ts with BenchmarkManifestSchema + loadBenchmark(id).
- Extend src/lib/catalog.ts with getBenchmarksDir, listBenchmarkIds, and listBenchmarkUnitIds. Keep getTestsDir/listTestIds as back-compat shortcuts to the personal-intelligence units dir for legacy callers.
- Extend loadTestDef(id, unitsDir?) so callers can resolve units against any benchmark; defaulting to getTestsDir() preserves existing behavior.
- `evals run` now takes --benchmark (default personal-intelligence) and --filter <ids> (optional; omit to run every unit). The previously required --tests flag is accepted as a deprecated alias with a one-line stderr warning. Passing both --filter and a conflicting --tests rejects with an explicit error rather than silently picking one.
- New `evals benchmarks list` surfaces id, displayName, unitNoun, and unit count for each benchmark.
- README + AGENTS.md updated to document the manifest schema, the new flag shape, and the legacy aliases.

No new benchmark is wired yet; that lands in PR 3.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aed739678c


chatgpt-codex-connector · 2026-05-27T20:11:00Z

+          );
+        }
+        const tests = await Promise.all(
+          unitIds.map((id) => loadTestDef(id, benchmark.unitsDir)),


Preserve EVALS_TESTS_DIR for legacy --tests runs

When callers use the deprecated --tests path with the default personal-intelligence benchmark, this now passes benchmark.unitsDir into loadTestDef, which bypasses the legacy getTestsDir() default and therefore ignores EVALS_TESTS_DIR. Existing local scripts that set EVALS_TESTS_DIR to run an out-of-tree/custom test directory and invoke evals run --profiles ... --tests ... will now look under the committed benchmarks/personal-intelligence/tests directory and fail with a missing SPEC.md, even though evals tests list still honors the override. Consider routing the legacy --tests case through getTestsDir() or teaching loadBenchmark("personal-intelligence") to preserve the override.
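
One way to act on this suggestion is sketched below. It illustrates the second option from the comment (preserving the override when resolving personal-intelligence), not what the PR actually merged. `resolveUnitsDir` and its parameters are hypothetical; only the EVALS_TESTS_DIR variable and the benchmark id come from the discussion above.

```typescript
const LEGACY_BENCHMARK_ID = "personal-intelligence";

// The environment is passed in explicitly so the function is easy to test;
// a real caller would presumably pass process.env.
function resolveUnitsDir(
  benchmarkId: string,
  manifestUnitsDir: string,
  env: Record<string, string | undefined>,
): string {
  const override = env["EVALS_TESTS_DIR"];
  // Only the legacy benchmark honors the override, so other benchmarks
  // cannot be redirected by a stale environment variable.
  if (benchmarkId === LEGACY_BENCHMARK_ID && override !== undefined && override !== "") {
    return override;
  }
  return manifestUnitsDir;
}

const committedDir = "benchmarks/personal-intelligence/tests";
console.log(resolveUnitsDir(LEGACY_BENCHMARK_ID, committedDir, { EVALS_TESTS_DIR: "/tmp/my-tests" }));
console.log(resolveUnitsDir(LEGACY_BENCHMARK_ID, committedDir, {}));
```

With this shape, `evals run` and `evals tests list` would resolve the same directory for the legacy benchmark, which is the inconsistency the comment points out.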


PR 3 of the LongMemEval-V2 integration workstream.

- benchmarks/longmemeval-v2/manifest.json declares the benchmark to the harness; conforms to BenchmarkManifestSchema from #32307.
- README.md + data/download.sh + data/.gitignore document and automate fetching the 7+ GB dataset from Hugging Face.
- src/loader.ts joins questions.jsonl against haystacks/lme_v2_<tier>.json and returns BenchmarkItem[] with a strict join (missing haystack entries raise instead of silently dropping). The zod schema uses .passthrough() so forward-compat schema additions don't break the loader.
- 8 fixture-backed loader tests cover happy paths and every failure mode.
- evals/tsconfig.json + knip.json extended to cover benchmarks/*/src/** so per-benchmark TS code is in typecheck + dead-export scope.

The two-conversation runner, GPT-4o judge, Phase 1 wiring, caches, and the aggregator + HTML report land in subsequent PRs against the contract established here.

Co-authored-by: vellum-apollo-bot[bot] <242025090+vellum-apollo-bot[bot]@users.noreply.github.com>
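
The "strict join" idea in the loader description can be illustrated with a small sketch. The field names (`questionId`, `haystackId`, `text`) and the `joinStrict` function are hypothetical; the real dataset schema and the code in src/loader.ts are not shown in this thread, so treat this only as an example of raising on a missing join key instead of dropping the row.

```typescript
// Hypothetical record shapes; the actual LongMemEval-V2 fields may differ.
interface Question {
  questionId: string;
  haystackId: string;
  text: string;
}

interface BenchmarkItem {
  questionId: string;
  text: string;
  haystack: unknown;
}

// Join every question to its haystack; a missing haystack is an error,
// never a silently skipped item.
function joinStrict(questions: Question[], haystacks: Record<string, unknown>): BenchmarkItem[] {
  return questions.map((q) => {
    if (!Object.prototype.hasOwnProperty.call(haystacks, q.haystackId)) {
      throw new Error(`question ${q.questionId} references missing haystack ${q.haystackId}`);
    }
    return { questionId: q.questionId, text: q.text, haystack: haystacks[q.haystackId] };
  });
}

const items = joinStrict(
  [{ questionId: "q1", haystackId: "h1", text: "When did the trip start?" }],
  { h1: { sessions: [] } },
);
console.log(items.length);
```

Failing loudly here means a partial download or a tier mismatch surfaces at load time rather than as a quietly smaller benchmark.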

chatgpt-codex-connector Bot reviewed May 27, 2026


evals: prettier format pass

df23313

dvargasfuertes approved these changes May 27, 2026


dvargasfuertes merged commit 768563b into main May 27, 2026
5 checks passed

dvargasfuertes deleted the apollo/evals-benchmark-cli branch May 27, 2026 21:30

vellum-apollo-bot Bot mentioned this pull request May 27, 2026

evals: scaffold longmemeval-v2 benchmark + data loader #32335

Merged

This was referenced May 28, 2026

evals: add runIngestAsk + two-conversation BaseAgent capabilities #32356

Merged

evals(longmemeval-v2): faithful V2 evaluator port (parse + 4 deterministic + 2 LLM judges) #32363

Merged

api-events: canonicalize interaction-request family (APE.11) #32678

Merged

dvargasfuertes merged 2 commits into main from apollo/evals-benchmark-cli
