evals: scaffold longmemeval-v2 benchmark + data loader #32335

Merged
dvargasfuertes merged 1 commit into main from apollo/evals-longmemeval-v2-scaffold
May 28, 2026

@vellum-apollo-bot

Summary

PR 3 of the LongMemEval-V2 integration workstream. Lands the directory scaffold + data loader for the first public benchmark we're wiring into the eval harness — the benchmark that anchors our memory-plugin ablations and gives us numbers comparable to the AgentRunbook-R/C baselines published with the paper.

Spec: /workspace/scratch/evals-longmemeval-v2-spec.md.

Scope this PR: directory layout + manifest + downloader + loader + loader tests. The two-conversation runner, GPT-4o judge, and Phase 1 wiring land in subsequent PRs against this contract.

What's in the box

evals/benchmarks/longmemeval-v2/

  • manifest.json — displayName: "LongMemEval v2", unitDirName: "items", unitNoun: "item". Conforms to the BenchmarkManifestSchema introduced in "evals: --benchmark first-class CLI + Benchmark interface" (#32307).
  • README.md — describes the benchmark, the five abilities, the on-disk layout, and the loader contract. Explicitly scopes this PR to loader-only.
  • data/.gitignore — keeps the 7+ GB dataset payload out of the repo. Whitelists download.sh and .gitignore itself.
  • data/download.sh — idempotent huggingface-cli download xiaowu0162/longmemeval-v2 wrapper. Helpful error if huggingface-cli is missing. Documents the optional screenshot-extraction step and the checksums.sha256 validation.
  • src/loader.ts — loadLongMemEvalV2({ dataRoot, tier }) joins questions.jsonl against haystacks/lme_v2_<tier>.json and emits camelCase BenchmarkItem[]. Strict join: any question missing a haystack at the requested tier raises rather than silently dropping. RawQuestionSchema uses .passthrough() for forward compatibility with V2 schema additions.
  • src/__tests__/loader.test.ts + fixtures — 8 tests covering happy paths, missing files, malformed JSONL with line numbers, schema failures, blank-line tolerance, and the strict-join failure mode.
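
The strict join described for src/loader.ts can be sketched roughly as follows. This is an illustration only, not the loader's actual code: the RawQuestion and BenchmarkItem shapes, the joinQuestionsToHaystacks name, and the assumption that a haystack file is a map keyed by question id are all hypothetical. The key field is called id here, following the dataset schema cited in the review thread. zod is left out to keep the block self-contained.

```typescript
// Hypothetical shapes for illustration; the real loader's types may differ.
interface RawQuestion {
  id: string;
  question: string;
  [extra: string]: unknown; // unknown fields are tolerated, like .passthrough()
}

interface BenchmarkItem {
  questionId: string;
  question: string;
  haystack: unknown;
}

// Join every question to its haystack. A question without a haystack at the
// requested tier throws instead of being silently dropped from the result.
function joinQuestionsToHaystacks(
  questions: RawQuestion[],
  haystacks: Record<string, unknown>, // assumed: map keyed by question id
  tier: string,
): BenchmarkItem[] {
  return questions.map((q) => {
    if (!Object.prototype.hasOwnProperty.call(haystacks, q.id)) {
      throw new Error(`question "${q.id}" has no haystack at tier "${tier}"`);
    }
    return { questionId: q.id, question: q.question, haystack: haystacks[q.id] };
  });
}
```

Throwing on the first miss keeps a partially downloaded or mismatched tier from quietly shrinking the benchmark, which would otherwise skew any scores computed from it.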

Harness config

  • evals/tsconfig.json — extend include to ["src/**/*", "benchmarks/*/src/**/*"] so per-benchmark TS code is in typecheck scope.
  • evals/knip.json — symmetric extension so dead-export detection covers benchmark-local code.
  • evals/README.md — directory tree now shows both personal-intelligence/ and longmemeval-v2/ as peer benchmarks.

Convention established

Each benchmark may carry its own src/ for benchmark-local logic (loaders, custom scorers, format helpers). The harness picks them up via the extended tsconfig/knip globs. ESLint already covers them via its default glob.

CLI smoke

$ bun src/cli.ts benchmarks list
benchmark              display                unit  count
---------------------  ---------------------  ----  -----
longmemeval-v2         LongMemEval v2         item  0
personal-intelligence  Personal Intelligence  test  1

longmemeval-v2 shows count: 0 because items/ doesn't exist on disk yet — items are JSONL-sourced and will be enumerated by the runner in PR 4. The count is honest about current state.
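
One plausible way a count of 0 arises is sketched below. It assumes the harness counts subdirectories under the manifest's unitDirName and treats a missing directory as zero units. The countUnits name and that counting rule are assumptions for illustration; the actual CLI may enumerate units differently.

```typescript
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: count unit directories under <benchmarkDir>/<unitDirName>.
// A missing unit directory is reported as 0 rather than as an error.
function countUnits(benchmarkDir: string, unitDirName: string): number {
  const unitDir = join(benchmarkDir, unitDirName);
  if (!existsSync(unitDir)) return 0;
  return readdirSync(unitDir, { withFileTypes: true }).filter((entry) =>
    entry.isDirectory(),
  ).length;
}
```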

Local gate

  • bun run lint
  • bun run format:check
  • bun run typecheck
  • bun test ✅ — 202 pass (+8 new loader tests vs PR 2's 194)

Out of scope (next PRs in this workstream)

  • run-ingest-ask.ts two-conversation runner (PR 4)
  • GPT-4o paper-faithful judge (PR 5)
  • Phase 1 smoke wiring against vellum-simple-memory (PR 6)
  • Ingest + judge caches (PR 7)
  • Aggregator + HTML report with AgentRunbook-R/C comparison column (PR 8)

AGENTS.md compliance

Grepped evals/AGENTS.md and repos/vellum-assistant/AGENTS.md for benchmark, longmemeval, judge. evals/AGENTS.md already mentions longmemeval-v2 in the architecture paragraph as the canonical example of a peer benchmark with a different unitDirName. No new conventions violated; this PR realizes that example.

PR 3 of the LongMemEval-V2 integration workstream.

- benchmarks/longmemeval-v2/manifest.json declares the benchmark to the
  harness; conforms to BenchmarkManifestSchema from #32307.
- README.md + data/download.sh + data/.gitignore document and automate
  fetching the 7+ GB dataset from Hugging Face.
- src/loader.ts joins questions.jsonl against haystacks/lme_v2_<tier>.json
  and returns BenchmarkItem[] with a strict join (missing haystack entries
  raise instead of silently dropping). zod schema uses .passthrough() so
  forward-compat schema additions don't break the loader.
- 8 fixture-backed loader tests cover happy paths and every failure mode.
- evals/tsconfig.json + knip.json extended to cover benchmarks/*/src/**
  so per-benchmark TS code is in typecheck + dead-export scope.

The two-conversation runner, GPT-4o judge, Phase 1 wiring, caches, and
the aggregator + HTML report land in subsequent PRs against the contract
established here.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3946bd4b6c


const RawQuestionSchema = z
  .object({
    question_id: z.string().min(1),

P1: Read question ids from id

When running against the published LongMemEval-V2 data, the question rows use id rather than question_id (per the dataset SCHEMA.md/questions preview: https://huggingface.co/datasets/xiaowu0162/longmemeval-v2/blob/main/SCHEMA.md). This schema therefore rejects every real questions.jsonl line before the haystack join, so the loader cannot load the benchmark downloaded by data/download.sh; map the dataset id field to questionId and use it for the haystack lookup instead.

Contributor Author

Good catch — verified against SCHEMA.md. V2 uses id, V1 used question_id. I keyed off the V1 README during recon; this loader rejected every real questions.jsonl row before the haystack join.

Fix in #32354 — switches the schema to id, drops the unused question_date field, updates fixtures to match the real V2 shape, and adds a regression test that rejects V1-shaped rows with a line-numbered error.
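
The behaviour described for the fix (key on id, tolerate blank lines, reject V1-shaped rows with a line-numbered error) might look something like the sketch below. It is not the code from #32354: parseQuestionsJsonl, QuestionRow, and the error wording are hypothetical, and plain checks stand in for the zod schema so the block has no dependencies.

```typescript
// Hypothetical row shape: only `id` is required; extra fields pass through.
interface QuestionRow {
  id: string;
  [extra: string]: unknown;
}

// Parse questions.jsonl text. Blank lines are skipped; any other problem
// throws an error that names the 1-based line number.
function parseQuestionsJsonl(text: string): QuestionRow[] {
  const rows: QuestionRow[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    const lineNo = index + 1;
    if (line.trim() === "") return; // blank-line tolerance
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`questions.jsonl line ${lineNo}: malformed JSON`);
    }
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      throw new Error(`questions.jsonl line ${lineNo}: expected a JSON object`);
    }
    const row = parsed as Record<string, unknown>;
    const id = row["id"];
    // A V1-shaped row carries question_id but no id, so it fails here.
    if (typeof id !== "string" || id.length === 0) {
      throw new Error(`questions.jsonl line ${lineNo}: missing non-empty string "id"`);
    }
    rows.push({ ...row, id });
  });
  return rows;
}
```

Reporting the line number makes a schema mismatch like the V1/V2 field rename visible on the first real row instead of surfacing later as an empty or failed join.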
