evals(longmemeval-v2): faithful V2 evaluator port (parse + 4 deterministic + 2 LLM judges) #32363
Ports V2's `evaluation/qa_eval_metrics.py` to TypeScript:
- `parseEvalFunctionSpec` parses the "name|key=value|..." spec strings
that V2 ships per question in the `eval_function` field. Snake_case
kwarg keys are converted to camelCase for TS callers; the function
name stays in V2 snake_case so the dispatcher matches verbatim.
- Four deterministic evaluators: `normPhraseSetMatch` (and ordered
variant), `mcChoiceMatch`, `mcChoiceSetMatch` — with normalization,
`\boxed{}` extraction, and multi-select filler-word filtering.
- Two LLM judges: `llmAbstentionChecker` (flawed-premise) and
`llmGotchasChecker` (insight gotchas). Default model is `gpt-5.2`
with `reasoning_effort=medium` (V2 `run_eval.py` defaults). Both
expect JSON `{label, reason}` output and tolerate Markdown code
fences + regex-fallback parsing via `parseLlmBinaryJudgement`.
- OpenAI transport is a direct `fetch` to /chat/completions, matching
the `user-simulator.ts` pattern. Tests swap `globalThis.fetch`; no
production wrapper.
- `evalFromSpec(spec, inputs, overrides)` dispatches and returns
`{ label, reason, function }` for audit-friendly aggregation.
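
A minimal sketch of the spec-string handling described above, under stated assumptions: `parseSpecSketch`, `ParsedSpecSketch`, and `snakeToCamel` are hypothetical names (not the PR's `parseEvalFunctionSpec`), values are kept as raw strings, and the first `=` in each segment is assumed to separate key from value. How the real parser types values or handles a literal `|` inside a value is not shown in this PR text.

```typescript
// Illustrative sketch only; all names here are hypothetical.
interface ParsedSpecSketch {
  name: string; // left in V2 snake_case so a dispatcher can match it verbatim
  kwargs: Record<string, string>; // camelCase keys; raw string values (assumption)
}

function snakeToCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_m: string, ch: string) => ch.toUpperCase());
}

function parseSpecSketch(spec: string): ParsedSpecSketch {
  const [rawName = "", ...parts] = spec.split("|");
  const name = rawName.trim();
  if (name === "") {
    throw new Error(`Missing function name in spec: ${JSON.stringify(spec)}`);
  }
  const kwargs: Record<string, string> = {};
  for (const part of parts) {
    const eq = part.indexOf("=");
    if (eq === -1) {
      throw new Error(`Expected key=value, got: ${JSON.stringify(part)}`);
    }
    // Split on the first "=" only, so values such as ">" or "" survive intact.
    kwargs[snakeToCamel(part.slice(0, eq).trim())] = part.slice(eq + 1);
  }
  return { name, kwargs };
}

// The kwarg combination below is chosen for illustration only.
const parsed = parseSpecSketch(
  "norm_phrase_set_match_ordered|separators=>|require_non_empty=true",
);
// Caller-supplied overrides win over spec kwargs, as in `{ ...kwargs, ...overrides }`.
const options = { ...parsed.kwargs, ...{ separators: "," } };
```

Keeping the function name untouched while converting only the keys means the dispatcher can compare against V2's names directly, and callers can spread the kwargs next to other camelCase options.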
Fixture touch-up: `questions.jsonl` `eval_function` values switch from
the placeholder `exact_match` (not a real V2 function) to real V2 spec
strings (`norm_phrase_set_match`, `norm_phrase_set_match|separators=`,
`norm_phrase_set_match_ordered|separators=>`).
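
To illustrate why the three fixture specs differ, here is a rough sketch of how a `separators` kwarg could drive phrase splitting. Everything in it is an assumption: `splitPhrasesSketch` and `phrasesMatchSketch` are hypothetical names, the normalization (trim, lowercase, collapse whitespace) is a guess, matching is treated as exact equality of phrase lists, and the default separators used when the kwarg is omitted are not shown in this PR text.

```typescript
// Hypothetical sketch; V2's real normalization and matching rules may differ.
function normalizePhrase(phrase: string): string {
  return phrase.trim().toLowerCase().replace(/\s+/g, " ");
}

// Split on any single character in `separators`; "" means "do not split".
function splitPhrasesSketch(text: string, separators: string): string[] {
  const phrases: string[] = [];
  let current = "";
  for (const ch of text) {
    if (separators.indexOf(ch) !== -1) {
      phrases.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  phrases.push(current);
  return phrases.map(normalizePhrase).filter((p) => p !== "");
}

function phrasesMatchSketch(
  predicted: string,
  reference: string,
  separators: string,
  ordered: boolean,
): boolean {
  const a = splitPhrasesSketch(predicted, separators);
  const b = splitPhrasesSketch(reference, separators);
  if (ordered) {
    // Ordered variant: same phrases in the same positions.
    return a.length === b.length && a.every((p, i) => p === b[i]);
  }
  // Unordered variant: same set of phrases, order ignored.
  const setA = new Set(a);
  const setB = new Set(b);
  return a.every((p) => setB.has(p)) && b.every((p) => setA.has(p));
}
```

With empty separators, `12,481` stays one phrase instead of being split at the comma, and with `>` as the separator the ordered variant distinguishes `Dashboards > New` from `New > Dashboards`.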
Tests: 86 new across 6 files. Full evals suite 306 pass.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f28d81d1b9
const patterns: ReadonlyArray<RegExp> = [
  /"label"\s*:\s*([01])/i,
  /'label'\s*:\s*([01])/i,
  /\blabel\b\s*[:=]\s*([01])/i,
Reject non-binary label prefixes in fallback parsing
When the evaluator returns JSON with an out-of-contract numeric label such as {"label": 10} or a score like {"label": 0.5}, the strict JSON branch correctly refuses it, but the fallback regex then matches only the leading 1 or 0 and silently grades the row as pass/fail. This can corrupt benchmark results for malformed judge outputs; require a delimiter/boundary after [01] or skip the regex fallback for JSON objects whose label field was present but invalid.
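
One possible shape for the first suggested fix, sketched under assumptions: `extractBinaryLabelSketch` and `LABEL_PATTERNS_SKETCH` are hypothetical names, and the three patterns are the ones quoted in the review with a negative lookahead appended. The lookahead rejects a `0` or `1` that is followed by another digit or by a decimal fraction, while still accepting a trailing sentence period. Other malformed forms (for example exponent notation) are not covered here.

```typescript
// Hypothetical hardening of the fallback patterns; not the PR's actual code.
const LABEL_PATTERNS_SKETCH: ReadonlyArray<RegExp> = [
  /"label"\s*:\s*([01])(?!\d|\.\d)/i,
  /'label'\s*:\s*([01])(?!\d|\.\d)/i,
  /\blabel\b\s*[:=]\s*([01])(?!\d|\.\d)/i,
];

// Returns 0 or 1 when a clean binary label is found, otherwise null.
function extractBinaryLabelSketch(text: string): 0 | 1 | null {
  for (const pattern of LABEL_PATTERNS_SKETCH) {
    const match = pattern.exec(text);
    if (match) {
      return match[1] === "1" ? 1 : 0;
    }
  }
  return null;
}
```

With this lookahead, `{"label": 10}` and `{"label": 0.5}` no longer match, so the caller can surface a parse failure instead of silently grading the row.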
PR 5 — Faithful V2 evaluator port
Sixth PR in the LongMemEval-V2 integration sequence (PR-1 #32289, PR-2 #32307, PR-3 #32335, PR-3.1 #32354, PR-4 #32356 all merged).
What and why
Port V2's evaluator from `evaluation/qa_eval_metrics.py` to TypeScript. V2 dispatches per-question on the `eval_function` spec string (a `"name|key=value|..."` format) to one of six implementations:

Deterministic (no LLM):
- `norm_phrase_set_match`: phrase-set membership, unordered
- `norm_phrase_set_match_ordered`: phrase-set membership, ordered
- `mc_choice_match`: single multiple-choice letter (handles `\boxed{}`)
- `mc_choice_set_match`: multi-select MC, set equality on letters

LLM judges (default `gpt-5.2`, `reasoning_effort=medium`, per V2 `run_eval.py`):
- `llm_abstention_checker`: flawed-premise questions
- `llm_gotchas_checker`: insight-style gotcha questions

Both LLM judges use a strict system prompt + rubric and expect `{"label": 0|1, "reason": "..."}` JSON output, parsed by `parseLlmBinaryJudgement` (strips Markdown fences, tries strict JSON, falls back to regex extraction of `label: 0|1`).

Scope correction caught pre-code
The locked spec said "GPT-4o paper-faithful judge with V1's `evaluate_qa.py` verbatim." Applied the PR-3.1 lesson: fetched the canonical sources from the V2 repo before writing any code.

- Evaluator file: `evaluate_qa.py` (locked spec) vs. `qa_eval_metrics.py` (V2)
- Judge model: `gpt-4o-2024-08-06` (locked spec) vs. `gpt-5.2`, `reasoning_effort=medium` (V2)
- V2 dispatch: `eval_function`; 4 deterministic + 2 LLM

Updated the spec doc and workstream record before the implementation. Pattern earns its place in `pr-lifecycle.md`.

Architecture choices
- Snake_case kwarg keys (`require_non_empty`, `strip_chars`). The dispatcher matches function names verbatim in snake_case but converts kwarg keys to camelCase so TS callers can spread them alongside other options.
- Direct `fetch`. Mirrors `evals/src/lib/simulator/user-simulator.ts`. Tests swap `globalThis.fetch`. No production wrapper for testing (per `unit-testing.md` rule #4).
- `evalFromSpec(spec, inputs, overrides)` merges `{ ...kwargs, ...overrides }`, matching V2's `eval_from_spec(spec, *args, **overrides)` semantics.
- `EvalResult` keeps the `reason`. V2's Python `llm_*_checker` discards the reason from `_parse_llm_binary_judgement`. We preserve it (strictly additive) so Phase 2 reports can show why a judge labeled 0. Function name is echoed for audit.
- Deterministic evaluators return `reason: ""`. Same `EvalResult` shape across all six.

Files
Tests at `evals/benchmarks/longmemeval-v2/src/__tests__/judge/`, one file per module.

Fixture touch-up (small, bundled here)
`questions.jsonl` had `"eval_function": "exact_match"` placeholders, which is not a real V2 function. Replaced with real V2 spec strings exercising three patterns:
- `q_001`: `norm_phrase_set_match`
- `q_002`: `norm_phrase_set_match|separators=` (empty separators ⇒ single phrase, for the `12,481` answer)
- `q_003`: `norm_phrase_set_match_ordered|separators=>` (the `Dashboards > New > template > Save` workflow sequence)

Loader tests don't pin `eval_function` values, so this is non-breaking.

Tests
- `normalize.test.ts` (12), `spec.test.ts` (17), `deterministic.test.ts` (20), `judgement.test.ts` (13), `llm.test.ts` (9), `index.test.ts` (9).
- LLM tests mock `globalThis.fetch` and assert request shape (URL, bearer auth, body fields, message structure) and response handling (JSON, code-fenced, non-2xx errors).
- `lint`, `format:check`, `typecheck` all clean.

AGENTS.md compliance
- `evals/AGENTS.md`: judge lives under the benchmark's `src/`; tsconfig + knip already include `benchmarks/*/src/**/*`.
- The `AGENTS.md` "Assistant-Driven Judgement" section is about runtime user-facing judgement calls. Benchmark evaluators are by purpose deterministic measurement tooling (porting an upstream paper-faithful judge); the rule's "mechanical operations" carve-out applies.

Next
The two-conversation runner (PR-4 #32356) + this judge unblock PR-6: Phase 1 wire (5-item smoke against `vellum-simple-memory`, producing graded scores end-to-end).
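
As a rough illustration of what "graded scores" could look like once results carry `{ label, reason, function }`, the sketch below groups results by evaluator function and computes per-function accuracy. Only the result shape comes from this PR; `EvalResultSketch`, `FunctionScoreSketch`, and `scoreByFunctionSketch` are hypothetical names, and nothing here describes how PR-6 will actually aggregate.

```typescript
// Hypothetical aggregation helper; only the result shape is taken from the PR.
interface EvalResultSketch {
  label: 0 | 1;
  reason: string;
  function: string; // V2 snake_case evaluator name, echoed for audit
}

interface FunctionScoreSketch {
  passed: number;
  total: number;
  accuracy: number;
}

function scoreByFunctionSketch(
  results: ReadonlyArray<EvalResultSketch>,
): Map<string, FunctionScoreSketch> {
  const scores = new Map<string, FunctionScoreSketch>();
  for (const result of results) {
    const entry = scores.get(result.function) ?? { passed: 0, total: 0, accuracy: 0 };
    entry.passed += result.label; // label is 0 or 1, so this counts passes
    entry.total += 1;
    entry.accuracy = entry.passed / entry.total;
    scores.set(result.function, entry);
  }
  return scores;
}

// Made-up sample rows, for illustration only.
const sampleScores = scoreByFunctionSketch([
  { label: 1, reason: "", function: "norm_phrase_set_match" },
  { label: 0, reason: "", function: "norm_phrase_set_match" },
  { label: 1, reason: "premise flagged", function: "llm_abstention_checker" },
]);
```

Grouping by the echoed function name keeps deterministic and LLM-judged questions separable in a report, which is where the preserved `reason` field becomes useful.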