evals: nest tests under benchmarks/personal-intelligence/ #32289

Merged: dvargasfuertes merged 2 commits into main from apollo/evals-benchmarks-layout on May 27, 2026
@vellum-apollo-bot
Contributor

Pure refactor. First step toward a flat namespace of peer benchmarks under evals/benchmarks/<id>/.

What changed

  • Moved evals/tests/timeline-recall/ → evals/benchmarks/personal-intelligence/tests/timeline-recall/
  • Updated DEFAULT_TESTS_DIR in evals/src/lib/catalog.ts to resolve the new path
  • Updated doc comments in test-def.ts, AGENTS.md, and README.md to reflect the new layout

What did not change

  • CLI surface: evals tests list still shows timeline-recall, evals run --tests timeline-recall … still works
  • EVALS_TESTS_DIR env override semantics
  • Unit test fixtures (the catalog tests use tmp dirs + the env override; they don't touch the on-disk layout)
  • Profile layout (evals/profiles/) — profiles are benchmark-orthogonal
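
As a rough illustration of why the CLI surface stays the same, the lookup can be thought of as "tests directory plus test id", where only the default tests directory moved. The sketch below is hypothetical: the helper names `resolveTestsDir` and `testPath`, the literal value given to `DEFAULT_TESTS_DIR`, and the rule that a non-empty `EVALS_TESTS_DIR` replaces the default are assumptions for illustration, not the actual contents of catalog.ts.

```typescript
// Hypothetical sketch; the real catalog.ts may be structured differently.
// Assumed default after this PR, relative to the evals/ package root.
const DEFAULT_TESTS_DIR = "benchmarks/personal-intelligence/tests";

type Env = Record<string, string | undefined>;

// Assumption: a non-empty EVALS_TESTS_DIR replaces the default entirely.
function resolveTestsDir(env: Env): string {
  const override = env["EVALS_TESTS_DIR"];
  return override !== undefined && override !== "" ? override : DEFAULT_TESTS_DIR;
}

// A test id such as "timeline-recall" names a directory under the tests dir,
// so the id passed to --tests does not change when the default directory moves.
function testPath(env: Env, testId: string): string {
  return `${resolveTestsDir(env)}/${testId}`;
}
```

Under these assumptions, `testPath({}, "timeline-recall")` points into the new nested layout, while a caller that sets `EVALS_TESTS_DIR` (as the unit tests do with tmp dirs) never sees the on-disk move.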

Why

Future public benchmarks (LongMemEval-V2, GAIA, etc.) will live as peers to personal-intelligence/ under benchmarks/. This PR sets up the directory shape. Subsequent PRs will introduce a Benchmark abstraction, a --benchmark CLI flag, additional benchmark scaffolding (loader, judge, two-conversation runner), and finally Phase 1 / Phase 2 of LongMemEval-V2 itself.

Verification

  • bun test src/lib/__tests__/catalog.test.ts → 7 pass
  • bun run src/cli.ts tests list → shows timeline-recall as before

Plan

This is PR 1 of 8 in the LongMemEval-V2 integration sequence. PRs 2–8 will follow once this lands.

Pure refactor. Moves evals/tests/timeline-recall/ to
evals/benchmarks/personal-intelligence/tests/timeline-recall/. Updates
DEFAULT_TESTS_DIR in catalog.ts to resolve the new path. No behavior
change — same CLI, same env override (EVALS_TESTS_DIR), same outputs.

This is the first step toward a flat namespace of peer benchmarks
under evals/benchmarks/<id>/. Public benchmarks (longmemeval-v2,
GAIA, etc.) will live as peers to personal-intelligence/. Subsequent
PRs will introduce a Benchmark abstraction, a --benchmark CLI flag,
and additional benchmark scaffolding.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review


[P2] Fix metric imports after nesting tests

With these metric files now under evals/benchmarks/personal-intelligence/tests/timeline-recall/metrics, ../../../src/lib/metrics resolves to evals/benchmarks/personal-intelligence/src/lib/metrics, which does not exist. Any evals run --tests timeline-recall ... that reaches scoring will fail when runMetrics() dynamically imports the metric file; the same incorrect relative import is present in date-mentioned.ts as well.
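
The path arithmetic behind this comment can be checked with a small sketch. `resolveImport` is an illustrative helper written for this example (not part of the repo) that resolves a relative specifier against the importing file's directory using POSIX-style paths. The five-level specifier is simply one relative path that would reach evals/src/lib/metrics from the new location; the actual fix may have used a different form.

```typescript
// Illustrative helper: resolve a relative import specifier against the
// directory of the importing file (POSIX-style separators only).
function resolveImport(fromFile: string, specifier: string): string {
  const parts = fromFile.split("/").slice(0, -1); // drop the file name
  for (const seg of specifier.split("/")) {
    if (seg === "" || seg === ".") continue; // no-op segments
    if (seg === "..") parts.pop();           // go up one directory
    else parts.push(seg);                    // descend into a directory
  }
  return parts.join("/");
}

const oldFile = "evals/tests/timeline-recall/metrics/date-mentioned.ts";
const newFile =
  "evals/benchmarks/personal-intelligence/tests/timeline-recall/metrics/date-mentioned.ts";

// Old layout: three levels up lands on evals/, so the import resolves.
const before = resolveImport(oldFile, "../../../src/lib/metrics");
// New layout, unchanged specifier: three levels up stops at personal-intelligence/.
const stale = resolveImport(newFile, "../../../src/lib/metrics");
// New layout, two extra levels: back at evals/ again.
const fixed = resolveImport(newFile, "../../../../../src/lib/metrics");

console.log({ before, stale, fixed });
```

Here `before` and `fixed` both come out as evals/src/lib/metrics, while `stale` is evals/benchmarks/personal-intelligence/src/lib/metrics, the nonexistent path named in the review.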

Three files still had imports written for the old
evals/tests/timeline-recall/... layout:

- evals/src/lib/__tests__/metrics.test.ts (2 imports)
- evals/benchmarks/.../metrics/assistant-cost.ts (1 import)
- evals/benchmarks/.../metrics/date-mentioned.ts (1 import)

The first surfaced as the CI failure; the latter two were transitively
hidden because tsc bailed at the unresolved test-file import before
descending into the metric files.

Verified: bun run lint, bun run typecheck, bun test all pass.
@dvargasfuertes dvargasfuertes merged commit 3febf99 into main May 27, 2026
5 checks passed
@dvargasfuertes dvargasfuertes deleted the apollo/evals-benchmarks-layout branch May 27, 2026 19:57