evals: scaffold longmemeval-v2 benchmark + data loader #32335
PR 3 of the LongMemEval-V2 integration workstream.

- benchmarks/longmemeval-v2/manifest.json declares the benchmark to the harness; conforms to BenchmarkManifestSchema from #32307.
- README.md + data/download.sh + data/.gitignore document and automate fetching the 7+ GB dataset from Hugging Face.
- src/loader.ts joins questions.jsonl against haystacks/lme_v2_<tier>.json and returns BenchmarkItem[] with a strict join (missing haystack entries raise instead of being silently dropped). The zod schema uses .passthrough() so forward-compatible schema additions don't break the loader.
- 8 fixture-backed loader tests cover happy paths and every failure mode.
- evals/tsconfig.json + knip.json are extended to cover benchmarks/*/src/** so per-benchmark TS code is in typecheck + dead-export scope.

The two-conversation runner, GPT-4o judge, Phase 1 wiring, caches, and the aggregator + HTML report land in subsequent PRs against the contract established here.
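A minimal sketch of the strict join described above, not the actual src/loader.ts. It assumes the haystack file parses to an object keyed by question id, which this PR text does not confirm. The names joinStrict, Question, and JoinedItem are hypothetical, and the real BenchmarkItem shape may differ.

```typescript
// Hypothetical shapes: the real BenchmarkItem and haystack layout are not shown here.
interface Question {
  questionId: string;
  question: string;
}

interface JoinedItem extends Question {
  tier: string;
  haystack: unknown;
}

// Assumption: haystacks/lme_v2_<tier>.json parses to an object keyed by question id.
function joinStrict(
  questions: Question[],
  haystacks: Record<string, unknown>,
  tier: string,
): JoinedItem[] {
  // Collect every question without a haystack so the error names all of them at once.
  const missing = questions
    .map((q) => q.questionId)
    .filter((id) => !Object.prototype.hasOwnProperty.call(haystacks, id));

  if (missing.length > 0) {
    // Strict join: raise instead of silently dropping the unmatched questions.
    throw new Error(
      `No haystack at tier "${tier}" for ${missing.length} question(s): ${missing.join(", ")}`,
    );
  }

  return questions.map((q) => ({ ...q, tier, haystack: haystacks[q.questionId] }));
}
```

Raising on the first incomplete join keeps a partially downloaded dataset from quietly shrinking the benchmark, which would otherwise inflate or deflate scores without any visible signal.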
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3946bd4b6c
    const RawQuestionSchema = z
      .object({
        question_id: z.string().min(1),
When running against the published LongMemEval-V2 data, the question rows use id rather than question_id (per the dataset SCHEMA.md/questions preview: https://huggingface.co/datasets/xiaowu0162/longmemeval-v2/blob/main/SCHEMA.md). This schema therefore rejects every real questions.jsonl line before the haystack join, so the loader cannot load the benchmark downloaded by data/download.sh; map the dataset id field to questionId and use it for the haystack lookup instead.
Good catch — verified against SCHEMA.md. V2 uses id, V1 used question_id. I keyed off the V1 README during recon; this loader rejected every real questions.jsonl row before the haystack join.
Fix in #32354 — switches the schema to id, drops the unused question_date field, updates fixtures to match the real V2 shape, and adds a regression test that rejects V1-shaped rows with a line-numbered error.
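To make the failure mode in this thread concrete, here is a dependency-free sketch of JSONL parsing that requires id, tolerates blank lines, keeps unknown fields, and reports line numbers. It uses a hand-written check in place of the zod schema so it runs without packages. parseQuestionsJsonl and RawQuestion are hypothetical names, and the error wording is illustrative rather than what #32354 emits.

```typescript
// Hypothetical row type: only the `id` field is confirmed by the review thread.
type RawQuestion = { id: string } & Record<string, unknown>;

function parseQuestionsJsonl(text: string): RawQuestion[] {
  const rows: RawQuestion[] = [];

  text.split("\n").forEach((line, index) => {
    const lineNo = index + 1; // 1-based, to match what an editor shows
    if (line.trim() === "") return; // tolerate blank lines

    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`questions.jsonl line ${lineNo}: invalid JSON`);
    }

    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      throw new Error(`questions.jsonl line ${lineNo}: expected a JSON object`);
    }

    const row = parsed as Record<string, unknown>;
    if (typeof row.id !== "string" || row.id.length === 0) {
      // Point at the likely cause when a V1-shaped row is encountered.
      const hint = "question_id" in row ? " (found V1-style question_id)" : "";
      throw new Error(`questions.jsonl line ${lineNo}: missing string "id"${hint}`);
    }

    // Spread keeps unknown fields, similar in spirit to zod's .passthrough().
    rows.push({ ...row, id: row.id });
  });

  return rows;
}
```

Calling parseQuestionsJsonl on a file whose second row is {"question_id": "q2"} raises an error that names line 2, which is the behavior the regression test in the reply above is described as checking.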
Summary
PR 3 of the LongMemEval-V2 integration workstream. Lands the directory scaffold + data loader for the first public benchmark we're wiring into the eval harness — the benchmark that anchors our memory-plugin ablations and gives us numbers comparable to the AgentRunbook-R/C baselines published with the paper.
Spec: /workspace/scratch/evals-longmemeval-v2-spec.md.

Scope of this PR: directory layout + manifest + downloader + loader + loader tests. The two-conversation runner, GPT-4o judge, and Phase 1 wiring land in subsequent PRs against this contract.
What's in the box
- evals/benchmarks/longmemeval-v2/manifest.json: declares displayName: "LongMemEval v2", unitDirName: "items", unitNoun: "item". Conforms to the BenchmarkManifestSchema introduced in "evals: --benchmark first-class CLI + Benchmark interface" (#32307).
- README.md: describes the benchmark, the five abilities, the on-disk layout, and the loader contract. Explicitly scopes this PR to loader-only.
- data/.gitignore: keeps the 7+ GB dataset payload out of the repo. Whitelists download.sh and .gitignore itself.
- data/download.sh: idempotent wrapper around huggingface-cli download xiaowu0162/longmemeval-v2. Prints a helpful error if huggingface-cli is missing. Documents the optional screenshot-extraction step and the checksums.sha256 validation.
- src/loader.ts: loadLongMemEvalV2({ dataRoot, tier }) joins questions.jsonl against haystacks/lme_v2_<tier>.json and emits camelCase BenchmarkItem[]. Strict join: any question missing a haystack at the requested tier raises rather than being silently dropped. RawQuestionSchema uses .passthrough() for forward compatibility with V2 schema additions.
- src/__tests__/loader.test.ts + fixtures: 8 tests covering happy paths, missing files, malformed JSONL with line numbers, schema failures, blank-line tolerance, and the strict-join failure mode.

Harness config
- evals/tsconfig.json: extends include to ["src/**/*", "benchmarks/*/src/**/*"] so per-benchmark TS code is in typecheck scope.
- evals/knip.json: symmetric extension so dead-export detection covers benchmark-local code.
- evals/README.md: directory tree now shows both personal-intelligence/ and longmemeval-v2/ as peer benchmarks.

Convention established
Each benchmark may carry its own src/ for benchmark-local logic (loaders, custom scorers, format helpers). The harness picks them up via the extended tsconfig/knip globs. ESLint already covers them via its default glob.

CLI smoke
longmemeval-v2 shows count: 0 because items/ doesn't exist on disk yet; items are JSONL-sourced and will be enumerated by the runner in PR 4. The count is honest about current state.

Local gate
- bun run lint ✅
- bun run format:check ✅
- bun run typecheck ✅
- bun test ✅ (202 pass; +8 new loader tests vs PR 2's 194)

Out of scope (next PRs in this workstream)
- run-ingest-ask.ts two-conversation runner (PR 4)
- vellum-simple-memory (PR 6)

AGENTS.md compliance
Grepped evals/AGENTS.md and repos/vellum-assistant/AGENTS.md for benchmark, longmemeval, judge. evals/AGENTS.md already mentions longmemeval-v2 in the architecture paragraph as the canonical example of a peer benchmark with a different unitDirName. No new conventions violated; this PR realizes that example.