evals: nest tests under benchmarks/personal-intelligence/ by vellum-apollo-bot[bot] · Pull Request #32289 · vellum-ai/vellum-assistant

vellum-apollo-bot · 2026-05-27T19:16:01Z

Pure refactor. First step toward a flat namespace of peer benchmarks under evals/benchmarks/<id>/.

What changed

Moved evals/tests/timeline-recall/ → evals/benchmarks/personal-intelligence/tests/timeline-recall/
Updated DEFAULT_TESTS_DIR in evals/src/lib/catalog.ts to resolve the new path
Updated doc comments in test-def.ts, AGENTS.md, and README.md to reflect the new layout
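
For readers unfamiliar with the catalog, the following is a minimal sketch of how a default tests directory with an EVALS_TESTS_DIR override could be resolved after the move. The real catalog.ts is not shown in this PR, so the function name resolveTestsDir, the BENCHMARK_ID constant, and the handling of relative overrides are assumptions made for illustration only.

```typescript
import * as path from "node:path";

// Hypothetical constant; the real catalog.ts is not shown in this PR.
const BENCHMARK_ID = "personal-intelligence";

/**
 * Resolve the directory the catalog scans for tests.
 * A non-empty EVALS_TESTS_DIR override wins over the default location.
 */
function resolveTestsDir(
  evalsRoot: string,
  env: Record<string, string | undefined> = {},
): string {
  const override = env.EVALS_TESTS_DIR;
  if (override !== undefined && override !== "") {
    // Assumption: a relative override is resolved against evalsRoot.
    return path.posix.resolve(evalsRoot, override);
  }
  // New default: evals/benchmarks/<id>/tests instead of evals/tests.
  return path.posix.join(evalsRoot, "benchmarks", BENCHMARK_ID, "tests");
}
```

Because unit tests can pass a tmp dir through the override, a layout change like this one only needs to touch the default branch.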

What did not change

CLI surface: evals tests list still shows timeline-recall, and evals run --tests timeline-recall … still works
EVALS_TESTS_DIR env override semantics
Unit test fixtures (the catalog tests use tmp dirs + the env override; they don't touch the on-disk layout)
Profile layout (evals/profiles/) — profiles are benchmark-orthogonal

Why

Future public benchmarks (LongMemEval-V2, GAIA, etc.) will live as peers to personal-intelligence/ under benchmarks/. This PR sets up the directory shape. Subsequent PRs will introduce a Benchmark abstraction, a --benchmark CLI flag, additional benchmark scaffolding (loader, judge, two-conversation runner), and finally Phase 1 / Phase 2 of LongMemEval-V2 itself.

Verification

bun test src/lib/__tests__/catalog.test.ts → 7 pass
bun run src/cli.ts tests list → shows timeline-recall as before

Plan

This is PR 1 of 8 in the LongMemEval-V2 integration sequence. PRs 2–8 will follow once this lands.

Pure refactor. Moves evals/tests/timeline-recall/ to evals/benchmarks/personal-intelligence/tests/timeline-recall/. Updates DEFAULT_TESTS_DIR in catalog.ts to resolve the new path. No behavior change — same CLI, same env override (EVALS_TESTS_DIR), same outputs. This is the first step toward a flat namespace of peer benchmarks under evals/benchmarks/<id>/. Public benchmarks (longmemeval-v2, GAIA, etc.) will live as peers to personal-intelligence/. Subsequent PRs will introduce a Benchmark abstraction, a --benchmark CLI flag, and additional benchmark scaffolding.

chatgpt-codex-connector

💡 Codex Review

vellum-assistant/evals/benchmarks/personal-intelligence/tests/timeline-recall/metrics/assistant-cost.ts

Line 5 in c7da34d

} from "../../../src/lib/metrics";

Fix metric imports after nesting tests

With these metric files now under evals/benchmarks/personal-intelligence/tests/timeline-recall/metrics, ../../../src/lib/metrics resolves to evals/benchmarks/personal-intelligence/src/lib/metrics, which does not exist. Any evals run --tests timeline-recall ... that reaches scoring will fail when runMetrics() dynamically imports the metric file; the same incorrect relative import is present in date-mentioned.ts as well.
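
The path arithmetic behind this finding can be checked with a short sketch. It resolves the same relative specifier from the old and new metric directories and then computes the specifier a relative import would need from the new location. The /repo root is a hypothetical placeholder, and the PR does not show which fix was actually applied, so the computed specifier is only one possible correction.

```typescript
import * as path from "node:path";

// Hypothetical checkout root, used only to make the paths absolute.
const REPO = "/repo";
const oldMetricsDir = `${REPO}/evals/tests/timeline-recall/metrics`;
const newMetricsDir = `${REPO}/evals/benchmarks/personal-intelligence/tests/timeline-recall/metrics`;
const target = `${REPO}/evals/src/lib/metrics`;

// Resolve a relative import specifier against the importing file's directory.
function resolveSpecifier(fromDir: string, specifier: string): string {
  return path.posix.resolve(fromDir, specifier);
}

// The unchanged specifier pointed at evals/src/lib/metrics before the move...
const before = resolveSpecifier(oldMetricsDir, "../../../src/lib/metrics");
// ...but lands inside the benchmark directory after it.
const after = resolveSpecifier(newMetricsDir, "../../../src/lib/metrics");
// Specifier a relative import would need from the new location.
const corrected = path.posix.relative(newMetricsDir, target);

console.log(before);    // /repo/evals/src/lib/metrics
console.log(after);     // /repo/evals/benchmarks/personal-intelligence/src/lib/metrics
console.log(corrected); // ../../../../../src/lib/metrics
```

Nesting the tests two directories deeper means two more `..` segments are needed to climb back to evals/.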


Three import sites still pointed at the old evals/tests/timeline-recall/... layout:

- evals/src/lib/__tests__/metrics.test.ts (2 imports)
- evals/benchmarks/.../metrics/assistant-cost.ts (1 import)
- evals/benchmarks/.../metrics/date-mentioned.ts (1 import)

The first surfaced as the CI failure; the latter two were transitively hidden because tsc bailed at the unresolved test-file import before descending into the metric files.

Verified: bun run lint, bun run typecheck, bun test all pass.

chatgpt-codex-connector Bot reviewed May 27, 2026

dvargasfuertes approved these changes May 27, 2026

dvargasfuertes merged commit 3febf99 into main May 27, 2026
5 checks passed

dvargasfuertes deleted the apollo/evals-benchmarks-layout branch May 27, 2026 19:57

This was referenced May 28, 2026

evals: add runIngestAsk + two-conversation BaseAgent capabilities #32356

Merged

evals(longmemeval-v2): faithful V2 evaluator port (parse + 4 deterministic + 2 LLM judges) #32363

Merged

dvargasfuertes merged 2 commits into main from apollo/evals-benchmarks-layout
