evals: scaffold longmemeval-v2 benchmark + data loader by vellum-apollo-bot[bot] · Pull Request #32335 · vellum-ai/vellum-assistant

vellum-apollo-bot · 2026-05-27T21:40:01Z

Summary

PR 3 of the LongMemEval-V2 integration workstream. Lands the directory scaffold + data loader for the first public benchmark we're wiring into the eval harness — the benchmark that anchors our memory-plugin ablations and gives us numbers comparable to the AgentRunbook-R/C baselines published with the paper.

Spec: /workspace/scratch/evals-longmemeval-v2-spec.md.

Scope of this PR: directory layout + manifest + downloader + loader + loader tests. The two-conversation runner, GPT-4o judge, and Phase 1 wiring land in subsequent PRs against this contract.

What's in the box

`evals/benchmarks/longmemeval-v2/`

manifest.json — displayName: "LongMemEval v2", unitDirName: "items", unitNoun: "item". Conforms to the BenchmarkManifestSchema introduced in evals: --benchmark first-class CLI + Benchmark interface #32307.
README.md — describes the benchmark, the five abilities, the on-disk layout, and the loader contract. Explicitly scopes this PR to loader-only.
data/.gitignore — keeps the 7+ GB dataset payload out of the repo. Whitelists download.sh and .gitignore itself.
data/download.sh — idempotent huggingface-cli download xiaowu0162/longmemeval-v2 wrapper. Helpful error if huggingface-cli is missing. Documents the optional screenshot-extraction step and the checksums.sha256 validation.
src/loader.ts — loadLongMemEvalV2({ dataRoot, tier }) joins questions.jsonl against haystacks/lme_v2_<tier>.json and emits camelCase BenchmarkItem[]. Strict join: any question missing a haystack at the requested tier raises rather than silently dropping. RawQuestionSchema uses .passthrough() for forward compatibility with V2 schema additions.
src/__tests__/loader.test.ts + fixtures — 8 tests covering happy paths, missing files, malformed JSONL with line numbers, schema failures, blank-line tolerance, and the strict-join failure mode.
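
The loader behaviour described above (blank-line tolerance, line-numbered JSONL errors, and a join that raises instead of dropping) can be sketched roughly as follows. This is a minimal illustration, not the contents of src/loader.ts: the helper names `parseJsonl` and `strictJoin`, the `JoinedItem` shape, and the assumption that haystacks can be looked up by question id in a `Map` are all hypothetical.

```typescript
// Sketch only: helper names and the Map-based haystack lookup are assumptions,
// not the real src/loader.ts.

/** Parse JSONL text, skipping blank lines; errors carry a 1-based line number. */
function parseJsonl(text: string, sourceName: string): unknown[] {
  const rows: unknown[] = [];
  const lines = text.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i] ?? "";
    if (line.trim() === "") continue; // tolerate blank lines
    try {
      rows.push(JSON.parse(line));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(`${sourceName}:${i + 1}: malformed JSON (${reason})`);
    }
  }
  return rows;
}

interface JoinedItem<Q, H> {
  questionId: string;
  question: Q;
  haystack: H;
}

/** Join questions to haystacks; any question without a haystack is an error. */
function strictJoin<Q, H>(
  questions: readonly Q[],
  haystacks: ReadonlyMap<string, H>,
  idOf: (question: Q) => string,
  tier: string,
): JoinedItem<Q, H>[] {
  const joined: JoinedItem<Q, H>[] = [];
  const missing: string[] = [];
  for (const question of questions) {
    const questionId = idOf(question);
    const haystack = haystacks.get(questionId);
    if (haystack === undefined) {
      missing.push(questionId); // collect every miss so the error is complete
      continue;
    }
    joined.push({ questionId, question, haystack });
  }
  if (missing.length > 0) {
    throw new Error(
      `tier "${tier}": no haystack for question(s) ${missing.join(", ")}`,
    );
  }
  return joined;
}
```

Collecting all missing ids before throwing is a design choice of this sketch; it reports every gap at the requested tier in one error rather than failing on the first.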

Harness config

evals/tsconfig.json — extend include to ["src/**/*", "benchmarks/*/src/**/*"] so per-benchmark TS code is in typecheck scope.
evals/knip.json — symmetric extension so dead-export detection covers benchmark-local code.
evals/README.md — directory tree now shows both personal-intelligence/ and longmemeval-v2/ as peer benchmarks.

Convention established

Each benchmark may carry its own src/ for benchmark-local logic (loaders, custom scorers, format helpers). The harness picks them up via the extended tsconfig/knip globs. ESLint already covers them via its default glob.

CLI smoke

$ bun src/cli.ts benchmarks list
benchmark              display                unit  count
---------------------  ---------------------  ----  -----
longmemeval-v2         LongMemEval v2         item  0
personal-intelligence  Personal Intelligence  test  1

longmemeval-v2 shows count: 0 because items/ doesn't exist on disk yet — items are JSONL-sourced and will be enumerated by the runner in PR 4. The count is honest about current state.
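
For illustration, a count that reports 0 for a missing unit directory could look like the sketch below. The function name `countUnits`, and the assumption that a unit is a subdirectory of `<benchmarkDir>/<unitDirName>`, are hypothetical; the harness's actual counting logic is not shown in this PR.

```typescript
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: counts subdirectories of <benchmarkDir>/<unitDirName>.
// Assumes one unit per subdirectory, which the PR text does not confirm.
function countUnits(benchmarkDir: string, unitDirName: string): number {
  const unitsDir = join(benchmarkDir, unitDirName);
  if (!existsSync(unitsDir)) return 0; // e.g. items/ not created yet
  return readdirSync(unitsDir, { withFileTypes: true }).filter((entry) =>
    entry.isDirectory(),
  ).length;
}
```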

Local gate

bun run lint ✅
bun run format:check ✅
bun run typecheck ✅
bun test ✅ — 202 pass (+8 new loader tests vs PR 2's 194)

Out of scope (next PRs in this workstream)

run-ingest-ask.ts two-conversation runner (PR 4)
GPT-4o paper-faithful judge (PR 5)
Phase 1 smoke wiring against vellum-simple-memory (PR 6)
Ingest + judge caches (PR 7)
Aggregator + HTML report with AgentRunbook-R/C comparison column (PR 8)

AGENTS.md compliance

Grepped evals/AGENTS.md and repos/vellum-assistant/AGENTS.md for benchmark, longmemeval, judge. evals/AGENTS.md already mentions longmemeval-v2 in the architecture paragraph as the canonical example of a peer benchmark with a different unitDirName. No existing conventions are violated; this PR realizes that example.

PR 3 of the LongMemEval-V2 integration workstream.

- benchmarks/longmemeval-v2/manifest.json declares the benchmark to the harness; conforms to BenchmarkManifestSchema from #32307.
- README.md + data/download.sh + data/.gitignore document and automate fetching the 7+ GB dataset from Hugging Face.
- src/loader.ts joins questions.jsonl against haystacks/lme_v2_<tier>.json and returns BenchmarkItem[] with a strict join (missing haystack entries raise instead of silently dropping). zod schema uses .passthrough() so forward-compat schema additions don't break the loader.
- 8 fixture-backed loader tests cover happy paths and every failure mode.
- evals/tsconfig.json + knip.json extended to cover benchmarks/*/src/** so per-benchmark TS code is in typecheck + dead-export scope.

The two-conversation runner, GPT-4o judge, Phase 1 wiring, caches, and the aggregator + HTML report land in subsequent PRs against the contract established here.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3946bd4b6c

chatgpt-codex-connector · 2026-05-27T21:41:53Z

+
+const RawQuestionSchema = z
+  .object({
+    question_id: z.string().min(1),


Read question ids from id

When running against the published LongMemEval-V2 data, the question rows use id rather than question_id (per the dataset SCHEMA.md/questions preview: https://huggingface.co/datasets/xiaowu0162/longmemeval-v2/blob/main/SCHEMA.md). This schema therefore rejects every real questions.jsonl line before the haystack join, so the loader cannot load the benchmark downloaded by data/download.sh; map the dataset id field to questionId and use it for the haystack lookup instead.

vellum-apollo-bot · May 28, 2026

Good catch, verified against SCHEMA.md. V2 uses id; V1 used question_id. I keyed off the V1 README during recon, so this loader rejected every real questions.jsonl row before the haystack join.

Fix in #32354 — switches the schema to id, drops the unused question_date field, updates fixtures to match the real V2 shape, and adds a regression test that rejects V1-shaped rows with a line-numbered error.
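
A rough sketch of the corrected row handling, for readers following the thread: map the dataset's `id` field to `questionId`, keep unknown fields (mirroring the intent of `.passthrough()`), and reject V1-shaped rows with a line-numbered error. The real loader uses a zod schema; this hand-rolled version, including the name `parseQuestionRow`, the `QuestionRow` shape, and the error wording, is an assumption made so the example has no dependencies.

```typescript
// Sketch only: hand-rolled stand-in for the zod schema, with hypothetical names.
interface QuestionRow {
  questionId: string;
  // Unknown fields are preserved, in the spirit of zod's .passthrough().
  extra: Record<string, unknown>;
}

function parseQuestionRow(raw: unknown, lineNumber: number): QuestionRow {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error(`questions.jsonl:${lineNumber}: expected a JSON object`);
  }
  const { id, ...extra } = raw as Record<string, unknown>;
  if (typeof id !== "string" || id.length === 0) {
    // Point at the likely cause when a V1-shaped row is encountered.
    const hint =
      "question_id" in extra
        ? ' (found V1-style "question_id"; V2 rows use "id")'
        : "";
    throw new Error(
      `questions.jsonl:${lineNumber}: missing non-empty string "id"${hint}`,
    );
  }
  return { questionId: id, extra };
}
```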

chatgpt-codex-connector Bot reviewed May 27, 2026

dvargasfuertes approved these changes May 28, 2026

dvargasfuertes merged commit a3625cb into main May 28, 2026
5 checks passed

dvargasfuertes deleted the apollo/evals-longmemeval-v2-scaffold branch May 28, 2026 00:23
