evals(longmemeval-v2): faithful V2 evaluator port (parse + 4 deterministic + 2 LLM judges) by vellum-apollo-bot[bot] · Pull Request #32363 · vellum-ai/vellum-assistant

vellum-apollo-bot · 2026-05-28T10:42:32Z

PR 5 — Faithful V2 evaluator port

Sixth PR in the LongMemEval-V2 integration sequence (PR-1 #32289, PR-2 #32307, PR-3 #32335, PR-3.1 #32354, PR-4 #32356 all merged).

What and why

Port V2's evaluator from evaluation/qa_eval_metrics.py to TypeScript. V2 dispatches per-question on the eval_function spec string (a "name|key=value|..." format) to one of six implementations:

Deterministic (no LLM):

norm_phrase_set_match — phrase-set membership, unordered
norm_phrase_set_match_ordered — phrase-set membership, ordered
mc_choice_match — single multiple-choice letter (handles \boxed{})
mc_choice_set_match — multi-select MC, set equality on letters

LLM judges (default gpt-5.2, reasoning_effort=medium, per V2 run_eval.py):

llm_abstention_checker — flawed-premise questions
llm_gotchas_checker — insight-style gotcha questions

Both LLM judges use a strict system prompt + rubric and expect {"label": 0|1, "reason": "..."} JSON output, parsed by parseLlmBinaryJudgement (strips Markdown fences, tries strict JSON, falls back to regex extraction of label: 0|1).
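The three-stage parse described above (strip fence, strict JSON, regex fallback) can be sketched roughly as follows. This is an illustration, not the code in judgement.ts: the function names come from the PR, but the `BinaryJudgement` type, the `null` return for unparseable output, the exact regexes, and the rule that the label digit must not be followed by another digit or a decimal part are all assumptions of this sketch.

```typescript
type BinaryJudgement = { label: 0 | 1; reason: string };

// Built at runtime so this example never contains a literal fence marker.
const FENCE = "`".repeat(3);
const FENCE_RE = new RegExp(
  "^" + FENCE + "[a-zA-Z]*\\s*([\\s\\S]*?)\\s*" + FENCE + "$",
);

// Remove one surrounding Markdown code fence (optional language tag), if present.
function stripMarkdownCodeFence(text: string): string {
  const trimmed = text.trim();
  const match = FENCE_RE.exec(trimmed);
  return match ? (match[1] ?? "").trim() : trimmed;
}

function parseLlmBinaryJudgement(raw: string): BinaryJudgement | null {
  const body = stripMarkdownCodeFence(raw);

  // Stage 1: strict JSON whose label is exactly 0 or 1.
  try {
    const parsed: unknown = JSON.parse(body);
    if (typeof parsed === "object" && parsed !== null) {
      const obj = parsed as { label?: unknown; reason?: unknown };
      if (obj.label === 0 || obj.label === 1) {
        return {
          label: obj.label === 1 ? 1 : 0,
          reason: typeof obj.reason === "string" ? obj.reason : "",
        };
      }
    }
  } catch {
    // Not valid JSON: fall through to the regex fallback.
  }

  // Stage 2 (assumed shape): find "label: 0|1" anywhere in free text.
  // The lookahead rejects values such as 10 or 0.5.
  const m = /\blabel\b["']?\s*[:=]\s*([01])(?!\d|\.\d)/i.exec(body);
  return m ? { label: m[1] === "1" ? 1 : 0, reason: "" } : null;
}
```

In this sketch the fallback cannot recover a reason, so it returns an empty string; whether the real parser does the same is not stated in the PR.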

Scope correction caught pre-code

The locked spec said "GPT-4o paper-faithful judge with V1's evaluate_qa.py verbatim." Applied the PR-3.1 lesson — fetched the canonical sources from the V2 repo before writing any code:

| | Spec | V2 reality |
| --- | --- | --- |
| Evaluator file | V1 `evaluate_qa.py` | V2 `qa_eval_metrics.py` |
| Model | `gpt-4o-2024-08-06` | `gpt-5.2`, `reasoning_effort=medium` |
| Shape | One LLM judge, yes/no, 5 question-type templates | Dispatch on per-question `eval_function`; 4 deterministic + 2 LLM |

Updated the spec doc and workstream record before the implementation. Pattern earns its place in pr-lifecycle.md.

Architecture choices

Snake_case → camelCase at parse time. V2's spec strings use Python-idiomatic snake_case keys (require_non_empty, strip_chars). The dispatcher matches function names verbatim in snake_case but converts kwarg keys to camelCase so TS callers can spread them alongside other options.
OpenAI transport: direct fetch. Mirrors evals/src/lib/simulator/user-simulator.ts. Tests swap globalThis.fetch. No production wrapper for testing (per unit-testing.md rule #4).
Caller overrides win over spec kwargs. evalFromSpec(spec, inputs, overrides) merges { ...kwargs, ...overrides }, matching V2's eval_from_spec(spec, *args, **overrides) semantics.
EvalResult keeps the reason. V2's Python llm_*_checker discards the reason from _parse_llm_binary_judgement. We preserve it (strictly additive) so Phase 2 reports can show why a judge labeled 0. Function name is echoed for audit.
Deterministic functions return reason: "". Same EvalResult shape across all six.
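A minimal sketch of the parse-time key conversion and the override merge described in the choices above. `parseEvalFunctionSpec` is named in the PR; `snakeToCamel`, `resolveOptions`, and the `ParsedSpec` type are hypothetical helpers for illustration. The sketch keeps every value as a raw string, whereas the real spec.ts also has a `parseEvalValue` step whose rules are not described here, and it assumes segments without `=` are ignored.

```typescript
type ParsedSpec = { name: string; kwargs: Record<string, string> };

// require_non_empty -> requireNonEmpty
const snakeToCamel = (key: string): string =>
  key.replace(/_([a-z0-9])/g, (_match: string, c: string) => c.toUpperCase());

function parseEvalFunctionSpec(spec: string): ParsedSpec {
  const segments = spec.split("|");
  // The function name stays in snake_case so the dispatcher can match it verbatim.
  const name = (segments[0] ?? "").trim();
  const kwargs: Record<string, string> = {};
  for (const part of segments.slice(1)) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // assumption: segments without '=' are skipped
    // Split on the first '=' only, so values such as ">" or "" survive intact.
    kwargs[snakeToCamel(part.slice(0, eq).trim())] = part.slice(eq + 1);
  }
  return { name, kwargs };
}

// Caller overrides win over spec kwargs: later spread entries replace earlier ones.
function resolveOptions(
  spec: string,
  overrides: Record<string, string> = {},
): Record<string, string> {
  return { ...parseEvalFunctionSpec(spec).kwargs, ...overrides };
}
```

For example, `resolveOptions("norm_phrase_set_match|separators=>", { separators: "," })` yields `{ separators: "," }`, because the override is spread last.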

Files

evals/benchmarks/longmemeval-v2/src/judge/
├── index.ts          # evalFromSpec dispatcher + public re-exports
├── spec.ts           # parseEvalFunctionSpec + parseEvalValue
├── normalize.ts      # normalizePhrase + splitPhrases
├── deterministic.ts  # 4 deterministic evaluators
├── judgement.ts      # parseLlmBinaryJudgement + stripMarkdownCodeFence
└── llm.ts            # 2 LLM judges + chat completion transport

Tests at evals/benchmarks/longmemeval-v2/src/__tests__/judge/, one file per module.

Fixture touch-up (small, bundled here)

questions.jsonl had "eval_function": "exact_match" placeholders — not a real V2 function. Replaced with real V2 spec strings exercising three patterns:

q_001: norm_phrase_set_match
q_002: norm_phrase_set_match|separators= (empty separators ⇒ single phrase, for the 12,481 answer)
q_003: norm_phrase_set_match_ordered|separators=> (the Dashboards > New > template > Save workflow sequence)

Loader tests don't pin eval_function values, so this is non-breaking.
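To make the q_002 and q_003 patterns concrete, here is a rough sketch of how the separators kwarg could drive phrase splitting. `normalizePhrase` and `splitPhrases` are names from normalize.ts, but their bodies here are guesses: the normalization (lowercase, trim, collapse whitespace) and the reading of "phrase-set membership" as equality of the normalized phrase sets are assumptions of this sketch, and `phraseSetMatch` / `phraseSetMatchOrdered` are hypothetical stand-ins for the real evaluators.

```typescript
// Assumed normalization: lowercase, trim, collapse runs of whitespace.
const normalizePhrase = (s: string): string =>
  s.toLowerCase().trim().replace(/\s+/g, " ");

// Split on any character in `separators`; empty separators means one phrase.
function splitPhrases(text: string, separators: string): string[] {
  const raw: string[] = [];
  let current = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (separators.indexOf(ch) !== -1) {
      raw.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  raw.push(current);
  return raw.map(normalizePhrase).filter((p) => p.length > 0);
}

// Unordered: compare sorted, de-duplicated phrase lists (assumed semantics).
function phraseSetMatch(answer: string, expected: string, separators: string): boolean {
  const canon = (t: string): string =>
    JSON.stringify(Array.from(new Set(splitPhrases(t, separators))).sort());
  return canon(answer) === canon(expected);
}

// Ordered: same phrases in the same positions (assumed semantics).
function phraseSetMatchOrdered(answer: string, expected: string, separators: string): boolean {
  return (
    JSON.stringify(splitPhrases(answer, separators)) ===
    JSON.stringify(splitPhrases(expected, separators))
  );
}
```

With empty separators, the comma in 12,481 is never treated as a delimiter, so the whole answer is compared as one phrase. With `separators=>`, the workflow string splits into four steps and the ordered variant rejects the same steps in a different order.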

Tests

86 new tests across 6 files: normalize.test.ts (12), spec.test.ts (17), deterministic.test.ts (20), judgement.test.ts (13), llm.test.ts (9), index.test.ts (9). LLM tests mock globalThis.fetch and assert request shape (URL, bearer auth, body fields, message structure) and response handling (JSON, code-fenced, non-2xx errors).
Full evals suite: 306 pass, 0 fail.
Local gate: lint, format:check, typecheck all clean.
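The transport-plus-stub pattern that the LLM tests rely on might look roughly like this. It is a sketch under assumptions: the base URL is the standard OpenAI one (the PR only names the /chat/completions path), and `buildJudgeRequest`, `callJudge`, and `withStubbedFetch` are hypothetical names, not exports of llm.ts.

```typescript
type ChatMessage = { role: "system" | "user"; content: string };
type JudgeRequest = {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
};

// Assumption: default OpenAI base URL.
const OPENAI_BASE_URL = "https://api.openai.com/v1";

// Pure request builder, so the request shape can be asserted without any I/O.
function buildJudgeRequest(apiKey: string, messages: ChatMessage[]): JudgeRequest {
  return {
    url: OPENAI_BASE_URL + "/chat/completions",
    init: {
      method: "POST",
      headers: {
        Authorization: "Bearer " + apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "gpt-5.2", reasoning_effort: "medium", messages }),
    },
  };
}

// Direct fetch, no wrapper: non-2xx responses raise, otherwise return the text.
async function callJudge(apiKey: string, messages: ChatMessage[]): Promise<string> {
  const { url, init } = buildJudgeRequest(apiKey, messages);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error("judge request failed: HTTP " + res.status);
  const data = (await res.json()) as {
    choices?: Array<{ message?: { content?: string } }>;
  };
  return data.choices?.[0]?.message?.content ?? "";
}

// Test-side helper: swap globalThis.fetch for the duration of `run`, then restore it.
async function withStubbedFetch<T>(stub: typeof fetch, run: () => Promise<T>): Promise<T> {
  const original = globalThis.fetch;
  globalThis.fetch = stub;
  try {
    return await run();
  } finally {
    globalThis.fetch = original;
  }
}
```

A test would pass a stub that records its arguments and returns a canned `Response`, then assert on the recorded URL, the bearer header, and the parsed body. Restoring the original in `finally` keeps one failing test from leaking the stub into the next.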

AGENTS.md compliance

evals/AGENTS.md — judge lives under the benchmark's src/; tsconfig + knip already include benchmarks/*/src/**/*.
AGENTS.md: the "Assistant-Driven Judgement" section covers runtime user-facing judgement calls. Benchmark evaluators are, by design, deterministic measurement tooling (a port of an upstream paper-faithful judge), so the rule's "mechanical operations" carve-out applies.

Ports V2's `evaluation/qa_eval_metrics.py` to TypeScript:

- `parseEvalFunctionSpec` parses the "name|key=value|..." spec strings that V2 ships per question in the `eval_function` field. Snake_case kwarg keys are converted to camelCase for TS callers; the function name stays in V2 snake_case so the dispatcher matches verbatim.
- Four deterministic evaluators: `normPhraseSetMatch` (and ordered variant), `mcChoiceMatch`, `mcChoiceSetMatch`, with normalization, \boxed{} extraction, and multi-select filler-word filtering.
- Two LLM judges: `llmAbstentionChecker` (flawed-premise) and `llmGotchasChecker` (insight gotchas). Default model is `gpt-5.2` with `reasoning_effort=medium` (V2 `run_eval.py` defaults). Both expect JSON `{label, reason}` output and tolerate Markdown code fences + regex-fallback parsing via `parseLlmBinaryJudgement`.
- OpenAI transport is a direct `fetch` to /chat/completions, matching the `user-simulator.ts` pattern. Tests swap `globalThis.fetch`; no production wrapper.
- `evalFromSpec(spec, inputs, overrides)` dispatches and returns `{ label, reason, function }` for audit-friendly aggregation.

Fixture touch-up: `questions.jsonl` `eval_function` values switch from the placeholder `exact_match` (not a real V2 function) to real V2 spec strings (`norm_phrase_set_match`, `norm_phrase_set_match|separators=`, `norm_phrase_set_match_ordered|separators=>`).

Tests: 86 new across 6 files. Full evals suite 306 pass.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f28d81d1b9


chatgpt-codex-connector · 2026-05-28T10:46:39Z

+  const patterns: ReadonlyArray<RegExp> = [
+    /"label"\s*:\s*([01])/i,
+    /'label'\s*:\s*([01])/i,
+    /\blabel\b\s*[:=]\s*([01])/i,


Reject non-binary label prefixes in fallback parsing

When the evaluator returns JSON with an out-of-contract numeric label such as {"label": 10} or a score like {"label": 0.5}, the strict JSON branch correctly refuses it, but the fallback regex then matches only the leading 1 or 0 and silently grades the row as pass/fail. This can corrupt benchmark results for malformed judge outputs; require a delimiter/boundary after [01] or skip the regex fallback for JSON objects whose label field was present but invalid.
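One way to apply the first suggested remedy (require a boundary after the digit) is a negative lookahead on each pattern. The sketch below covers only the three patterns visible in the quoted hunk; `extractBinaryLabel` is a hypothetical wrapper, and the choice of lookahead (reject a following digit or a decimal part, but allow a sentence-ending period) is an assumption, not the fix that was merged.

```typescript
// Each pattern now refuses a label digit that is followed by another digit
// or by a fractional part, so 10 and 0.5 no longer match as 1 and 0.
const patterns: ReadonlyArray<RegExp> = [
  /"label"\s*:\s*([01])(?!\d|\.\d)/i,
  /'label'\s*:\s*([01])(?!\d|\.\d)/i,
  /\blabel\b\s*[:=]\s*([01])(?!\d|\.\d)/i,
];

// Returns the first binary label found, or null when no pattern matches.
function extractBinaryLabel(text: string): 0 | 1 | null {
  for (const pattern of patterns) {
    const match = pattern.exec(text);
    if (match) return match[1] === "1" ? 1 : 0;
  }
  return null;
}
```

The alternative remedy from the review, skipping the fallback entirely when strict JSON parsing found a `label` field with an invalid value, is stricter still and would need a change in the caller rather than in the patterns.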


chatgpt-codex-connector Bot reviewed May 28, 2026


dvargasfuertes approved these changes May 28, 2026


dvargasfuertes merged commit a6f4119 into main May 28, 2026
5 checks passed

dvargasfuertes deleted the apollo/evals-v2-judge branch May 28, 2026 10:48

dvargasfuertes merged 1 commit into main from apollo/evals-v2-judge
