evals: nest tests under benchmarks/personal-intelligence/ #32289
Pure refactor. Moves evals/tests/timeline-recall/ to evals/benchmarks/personal-intelligence/tests/timeline-recall/. Updates DEFAULT_TESTS_DIR in catalog.ts to resolve the new path. No behavior change — same CLI, same env override (EVALS_TESTS_DIR), same outputs. This is the first step toward a flat namespace of peer benchmarks under evals/benchmarks/<id>/. Public benchmarks (longmemeval-v2, GAIA, etc.) will live as peers to personal-intelligence/. Subsequent PRs will introduce a Benchmark abstraction, a --benchmark CLI flag, and additional benchmark scaffolding.
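The default-plus-override behavior described above can be sketched as follows. This is a minimal illustration, not the contents of catalog.ts: the `resolveTestsDir` helper, the repository-relative form of the default, and the rule that an empty override is ignored are all assumptions made for the example. Only the names `DEFAULT_TESTS_DIR` and `EVALS_TESTS_DIR` and the new directory come from the PR.

```typescript
import * as path from "node:path";

// Assumption: the default is expressed relative to the repository root.
// The real catalog.ts may compute it differently (e.g. from the module location).
const DEFAULT_TESTS_DIR = path.posix.join(
  "evals",
  "benchmarks",
  "personal-intelligence",
  "tests",
);

// Hypothetical helper: prefer the EVALS_TESTS_DIR override when it is set and
// non-empty, otherwise fall back to the default directory.
function resolveTestsDir(env: Record<string, string | undefined>): string {
  const override = env["EVALS_TESTS_DIR"];
  if (override !== undefined && override.trim() !== "") {
    return override;
  }
  return DEFAULT_TESTS_DIR;
}

// No override: the nested default is used.
console.log(resolveTestsDir({}));
// Override present: the caller-supplied directory wins.
console.log(resolveTestsDir({ EVALS_TESTS_DIR: "/tmp/my-tests" }));
```

Because only the default value moves, a caller that already sets EVALS_TESTS_DIR would see no difference under this sketch, which is consistent with the "no behavior change" claim.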
💡 Codex Review
With these metric files now under evals/benchmarks/personal-intelligence/tests/timeline-recall/metrics, ../../../src/lib/metrics resolves to evals/benchmarks/personal-intelligence/src/lib/metrics, which does not exist. Any evals run --tests timeline-recall ... that reaches scoring will fail when runMetrics() dynamically imports the metric file; the same incorrect relative import is present in date-mentioned.ts as well.
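The path arithmetic behind this comment can be checked with a short sketch. The `resolveFrom` helper is hypothetical, and the code assumes the intended import target is evals/src/lib/metrics (where the specifier pointed before the move). The specifier printed at the end is one way to reach that target, not necessarily the fix applied in this PR.

```typescript
import * as path from "node:path";

// Directory holding the metric files before and after the move (paths from the PR).
const oldMetricsDir = "evals/tests/timeline-recall/metrics";
const newMetricsDir =
  "evals/benchmarks/personal-intelligence/tests/timeline-recall/metrics";

// The relative specifier the metric files still use.
const staleSpecifier = "../../../src/lib/metrics";

// Hypothetical helper: resolve a relative specifier against the importing
// file's directory, using POSIX separators so the result is platform-independent.
function resolveFrom(dir: string, specifier: string): string {
  return path.posix.join(dir, specifier);
}

// Before the move, three levels up lands in evals/.
console.log(resolveFrom(oldMetricsDir, staleSpecifier));
// After the move, three levels up only reaches personal-intelligence/.
console.log(resolveFrom(newMetricsDir, staleSpecifier));

// Assumption: the intended target is evals/src/lib/metrics.
const intendedTarget = "evals/src/lib/metrics";
const fixedSpecifier = path.posix.relative(newMetricsDir, intendedTarget);
console.log(fixedSpecifier);
```

The move adds two directory levels (benchmarks/ and personal-intelligence/), so a specifier that climbed three levels before needs five afterwards.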
Three import sites still pointed at the old evals/tests/timeline-recall/... layout:
- evals/src/lib/__tests__/metrics.test.ts (2 imports)
- evals/benchmarks/.../metrics/assistant-cost.ts (1 import)
- evals/benchmarks/.../metrics/date-mentioned.ts (1 import)

The first surfaced as the CI failure; the latter two were transitively hidden because tsc bailed at the unresolved test-file import before descending into the metric files.

Verified: bun run lint, bun run typecheck, bun test all pass.
Pure refactor. First step toward a flat namespace of peer benchmarks under evals/benchmarks/<id>/.

What changed
- Moved evals/tests/timeline-recall/ → evals/benchmarks/personal-intelligence/tests/timeline-recall/
- Updated DEFAULT_TESTS_DIR in evals/src/lib/catalog.ts to resolve the new path
- Updated test-def.ts, AGENTS.md, and README.md to reflect the new layout

What did not change
- evals tests list still shows timeline-recall, evals run --tests timeline-recall … still works
- EVALS_TESTS_DIR env override semantics
- evals/profiles/ (profiles are benchmark-orthogonal)

Why
Future public benchmarks (LongMemEval-V2, GAIA, etc.) will live as peers to personal-intelligence/ under benchmarks/. This PR sets up the directory shape. Subsequent PRs will introduce a Benchmark abstraction, a --benchmark CLI flag, additional benchmark scaffolding (loader, judge, two-conversation runner), and finally Phase 1 / Phase 2 of LongMemEval-V2 itself.

Verification
- bun test src/lib/__tests__/catalog.test.ts → 7 pass
- bun run src/cli.ts tests list → shows timeline-recall as before

Plan
This is PR 1 of 8 in the LongMemEval-V2 integration sequence. PRs 2–8 will follow once this lands.