evals(longmemeval-v2): faithful V2 evaluator port (parse + 4 deterministic + 2 LLM judges) #32363

Merged
dvargasfuertes merged 1 commit into main from apollo/evals-v2-judge
May 28, 2026

@vellum-apollo-bot
Contributor

PR 5 — Faithful V2 evaluator port

Sixth PR in the LongMemEval-V2 integration sequence (PR-1 #32289, PR-2 #32307, PR-3 #32335, PR-3.1 #32354, PR-4 #32356 all merged).

What and why

Port V2's evaluator from evaluation/qa_eval_metrics.py to TypeScript. V2 dispatches per-question on the eval_function spec string (a "name|key=value|..." format) to one of six implementations:

Deterministic (no LLM):

  • norm_phrase_set_match — phrase-set membership, unordered
  • norm_phrase_set_match_ordered — phrase-set membership, ordered
  • mc_choice_match — single multiple-choice letter (handles \boxed{})
  • mc_choice_set_match — multi-select MC, set equality on letters

LLM judges (default gpt-5.2, reasoning_effort=medium, per V2 run_eval.py):

  • llm_abstention_checker — flawed-premise questions
  • llm_gotchas_checker — insight-style gotcha questions

Both LLM judges use a strict system prompt + rubric and expect {"label": 0|1, "reason": "..."} JSON output, parsed by parseLlmBinaryJudgement (strips Markdown fences, tries strict JSON, falls back to regex extraction of label: 0|1).
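
The parse order described above (strip the fence, try strict JSON, then fall back to a regex) can be sketched roughly as follows. This is an illustrative sketch, not the code in judgement.ts: the `Judgement` type, the `null` return for unparseable output, the empty reason on the fallback path, and the trailing lookahead in the fallback regex are all assumptions made for the example.

```typescript
// Hypothetical sketch of the three-step parse; not the PR's judgement.ts.
type Judgement = { label: 0 | 1; reason: string };

// Three backticks, built indirectly so this example embeds cleanly in Markdown.
const FENCE = "`".repeat(3);

// Unwrap a reply that is entirely enclosed in a Markdown code fence.
function stripMarkdownCodeFence(text: string): string {
  const trimmed = text.trim();
  const fenced = new RegExp("^" + FENCE + "[a-zA-Z]*\\s*([\\s\\S]*?)\\s*" + FENCE + "$");
  const match = trimmed.match(fenced);
  return match ? (match[1] ?? trimmed) : trimmed;
}

function parseLlmBinaryJudgement(raw: string): Judgement | null {
  const text = stripMarkdownCodeFence(raw);

  // Step 1: strict JSON whose label is exactly the number 0 or 1.
  try {
    const parsed: unknown = JSON.parse(text);
    if (typeof parsed === "object" && parsed !== null) {
      const fields = parsed as Record<string, unknown>;
      const label = fields["label"];
      const reason = fields["reason"];
      if (label === 0 || label === 1) {
        return { label: label === 1 ? 1 : 0, reason: typeof reason === "string" ? reason : "" };
      }
    }
  } catch {
    // Not valid JSON; fall through to the regex fallback.
  }

  // Step 2: regex fallback. The lookahead (an addition in this sketch)
  // refuses values such as 10 or 0.5 that merely start with 0 or 1.
  const match = text.match(/\blabel\b["']?\s*[:=]\s*([01])(?!\d|\.\d)/i);
  if (!match) return null; // Assumption: unparseable output yields null.
  return { label: match[1] === "1" ? 1 : 0, reason: "" };
}
```

Returning `null` rather than guessing keeps malformed judge output visible to the caller, which can then decide whether to retry or count the row as ungraded.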

Scope correction caught pre-code

The locked spec said "GPT-4o paper-faithful judge with V1's evaluate_qa.py verbatim." Applied the PR-3.1 lesson — fetched the canonical sources from the V2 repo before writing any code:

| | Spec | V2 reality |
|---|---|---|
| Evaluator file | V1 evaluate_qa.py | V2 qa_eval_metrics.py |
| Model | gpt-4o-2024-08-06 | gpt-5.2, reasoning_effort=medium |
| Shape | One LLM judge, yes/no, 5 question-type templates | Dispatch on per-question eval_function; 4 deterministic + 2 LLM |

Updated the spec doc and workstream record before starting the implementation. This fetch-the-canonical-source-first pattern earns its place in pr-lifecycle.md.

Architecture choices

  • Snake_case → camelCase at parse time. V2's spec strings use Python-idiomatic snake_case keys (require_non_empty, strip_chars). The dispatcher matches function names verbatim in snake_case but converts kwarg keys to camelCase so TS callers can spread them alongside other options.
  • OpenAI transport: direct fetch. Mirrors evals/src/lib/simulator/user-simulator.ts. Tests swap globalThis.fetch. No production wrapper for testing (per unit-testing.md rule #4).
  • Caller overrides win over spec kwargs. evalFromSpec(spec, inputs, overrides) merges { ...kwargs, ...overrides }, matching V2's eval_from_spec(spec, *args, **overrides) semantics.
  • EvalResult keeps the reason. V2's Python llm_*_checker discards the reason from _parse_llm_binary_judgement. We preserve it (strictly additive) so Phase 2 reports can show why a judge labeled 0. Function name is echoed for audit.
  • Deterministic functions return reason: "". Same EvalResult shape across all six.
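
A minimal sketch of the first and third choices above, assuming a simplified parser: the function name stays in snake_case, kwarg keys become camelCase, and caller overrides are spread last so they win. The `resolveOptions` helper is hypothetical, and values are kept as raw strings here, whereas the PR's parseEvalValue presumably coerces them.

```typescript
// Hypothetical sketch; values stay raw strings, unlike the PR's parseEvalValue.
type Kwargs = Record<string, string>;

function snakeToCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

function parseEvalFunctionSpec(spec: string): { name: string; kwargs: Kwargs } {
  const segments = spec.split("|");
  const name = (segments[0] ?? "").trim(); // kept in snake_case for dispatch
  const kwargs: Kwargs = {};
  for (const segment of segments.slice(1)) {
    const eq = segment.indexOf("=");
    if (eq === -1) continue; // assumption: ignore segments without "="
    // Only the key is converted; everything after the first "=" is the value.
    kwargs[snakeToCamel(segment.slice(0, eq).trim())] = segment.slice(eq + 1);
  }
  return { name, kwargs };
}

// Hypothetical helper showing the merge order: spec kwargs first, overrides last.
function resolveOptions(spec: string, overrides: Kwargs = {}): { name: string; options: Kwargs } {
  const { name, kwargs } = parseEvalFunctionSpec(spec);
  return { name, options: { ...kwargs, ...overrides } };
}
```

For example, `resolveOptions("norm_phrase_set_match|strip_chars=.", { stripChars: "!" })` keeps the snake_case name and resolves `stripChars` to the override value `"!"`.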

Files

evals/benchmarks/longmemeval-v2/src/judge/
├── index.ts          # evalFromSpec dispatcher + public re-exports
├── spec.ts           # parseEvalFunctionSpec + parseEvalValue
├── normalize.ts      # normalizePhrase + splitPhrases
├── deterministic.ts  # 4 deterministic evaluators
├── judgement.ts      # parseLlmBinaryJudgement + stripMarkdownCodeFence
└── llm.ts            # 2 LLM judges + chat completion transport

Tests at evals/benchmarks/longmemeval-v2/src/__tests__/judge/, one file per module.

Fixture touch-up (small, bundled here)

questions.jsonl had "eval_function": "exact_match" placeholders — not a real V2 function. Replaced with real V2 spec strings exercising three patterns:

  • q_001: norm_phrase_set_match
  • q_002: norm_phrase_set_match|separators= (empty separators ⇒ single phrase, for the 12,481 answer)
  • q_003: norm_phrase_set_match_ordered|separators=> (the Dashboards > New > template > Save workflow sequence)

Loader tests don't pin eval_function values, so this is non-breaking.
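
To illustrate why the fixtures need different separators, here is a rough sketch of phrase splitting and matching. The normalization (trim, lowercase, collapse whitespace) and the match rule (exact set or sequence equality) are assumptions chosen for the example, not V2's confirmed semantics; only the separator behavior (empty separators give a single phrase, `>` splits a workflow sequence) comes from the fixture notes above.

```typescript
// Hypothetical sketch; normalization and match rules are assumptions.
function normalizePhrase(phrase: string): string {
  return phrase.trim().toLowerCase().replace(/\s+/g, " ");
}

// Every character in `separators` splits the text; "" keeps it whole.
function splitPhrases(text: string, separators: string): string[] {
  const parts: string[] = [];
  let current = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (separators.includes(ch)) {
      parts.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts.map(normalizePhrase).filter((p) => p !== "");
}

function phraseSetMatch(response: string, reference: string, separators: string, ordered: boolean): boolean {
  const got = splitPhrases(response, separators);
  const want = splitPhrases(reference, separators);
  if (ordered) {
    // Ordered variant: same phrases in the same positions.
    return got.length === want.length && got.every((p, i) => p === want[i]);
  }
  // Unordered variant: same distinct phrases, in any order.
  const gotSet = new Set(got);
  const wantSet = new Set(want);
  return gotSet.size === wantSet.size && want.every((p) => gotSet.has(p));
}
```

With a comma among the separators, `12,481` would split into two phrases, which is why q_002 passes empty separators; q_003 uses `>` so that the order of the workflow steps can be checked.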

Tests

  • 86 new tests across 6 files: normalize.test.ts (12), spec.test.ts (17), deterministic.test.ts (20), judgement.test.ts (13), llm.test.ts (9), index.test.ts (9). LLM tests mock globalThis.fetch and assert request shape (URL, bearer auth, body fields, message structure) and response handling (JSON, code-fenced, non-2xx errors).
  • Full evals suite: 306 pass, 0 fail.
  • Local gate: lint, format:check, typecheck all clean.
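
The fetch-swapping approach used by the LLM tests can be sketched as below. Everything here is illustrative: `buildJudgeRequest`, `callJudge`, `withMockedFetch`, the `FetchLike` type, and the `https://api.openai.com/v1` base URL are assumptions for the example rather than names taken from llm.ts. The request fields follow what the description states (bearer auth, model, reasoning_effort, system and user messages).

```typescript
// Hypothetical sketch; helper names and the base URL are assumptions.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Minimal typed view of the global so the sketch needs no DOM typings.
const globalWithFetch = globalThis as unknown as { fetch: FetchLike };

function buildJudgeRequest(apiKey: string, system: string, user: string, model = "gpt-5.2") {
  return {
    url: "https://api.openai.com/v1/chat/completions",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: "Bearer " + apiKey },
      body: JSON.stringify({
        model,
        reasoning_effort: "medium",
        messages: [
          { role: "system", content: system },
          { role: "user", content: user },
        ],
      }),
    },
  };
}

async function callJudge(apiKey: string, system: string, user: string): Promise<string> {
  const { url, init } = buildJudgeRequest(apiKey, system, user);
  // fetch is looked up at call time, so a test can swap it beforehand.
  const res = await globalWithFetch.fetch(url, init);
  if (!res.ok) throw new Error("judge request failed with status " + res.status);
  const data = (await res.json()) as { choices?: Array<{ message?: { content?: string } }> };
  return data.choices?.[0]?.message?.content ?? "";
}

// Test-style helper: install a mock, run, then always restore the original.
async function withMockedFetch<T>(mock: FetchLike, run: () => Promise<T>): Promise<T> {
  const original = globalWithFetch.fetch;
  globalWithFetch.fetch = mock;
  try {
    return await run();
  } finally {
    globalWithFetch.fetch = original;
  }
}
```

A test would pass a mock that records its `url` and `init` arguments and resolves with a canned completion, then assert on both the recorded request and the string `callJudge` returns.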

AGENTS.md compliance

  • evals/AGENTS.md — judge lives under the benchmark's src/; tsconfig + knip already include benchmarks/*/src/**/*.
  • AGENTS.md: the "Assistant-Driven Judgement" section covers runtime user-facing judgement calls. Benchmark evaluators are by design deterministic measurement tooling (porting an upstream paper-faithful judge), so the rule's "mechanical operations" carve-out applies.

Next

The two-conversation runner (PR-4 #32356) + this judge unblock PR-6: Phase 1 wire — 5-item smoke against vellum-simple-memory producing graded scores end-to-end.

Ports V2's `evaluation/qa_eval_metrics.py` to TypeScript:

- `parseEvalFunctionSpec` parses the "name|key=value|..." spec strings
  that V2 ships per question in the `eval_function` field. Snake_case
  kwarg keys are converted to camelCase for TS callers; the function
  name stays in V2 snake_case so the dispatcher matches verbatim.
- Four deterministic evaluators: `normPhraseSetMatch` (and ordered
  variant), `mcChoiceMatch`, `mcChoiceSetMatch` — with normalization,
  \boxed{} extraction, multi-select filler-word filtering.
- Two LLM judges: `llmAbstentionChecker` (flawed-premise) and
  `llmGotchasChecker` (insight gotchas). Default model is `gpt-5.2`
  with `reasoning_effort=medium` (V2 `run_eval.py` defaults). Both
  expect JSON `{label, reason}` output and tolerate Markdown code
  fences + regex-fallback parsing via `parseLlmBinaryJudgement`.
- OpenAI transport is a direct `fetch` to /chat/completions, matching
  the `user-simulator.ts` pattern. Tests swap `globalThis.fetch`; no
  production wrapper.
- `evalFromSpec(spec, inputs, overrides)` dispatches and returns
  `{ label, reason, function }` for audit-friendly aggregation.

Fixture touch-up: `questions.jsonl` `eval_function` values switch from
the placeholder `exact_match` (not a real V2 function) to real V2 spec
strings (`norm_phrase_set_match`, `norm_phrase_set_match|separators=`,
`norm_phrase_set_match_ordered|separators=>`).

Tests: 86 new across 6 files. Full evals suite 306 pass.

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f28d81d1b9

Comment on lines +55 to +58
const patterns: ReadonlyArray<RegExp> = [
/"label"\s*:\s*([01])/i,
/'label'\s*:\s*([01])/i,
/\blabel\b\s*[:=]\s*([01])/i,

[P2] Reject non-binary label prefixes in fallback parsing

When the evaluator returns JSON with an out-of-contract numeric label such as {"label": 10} or a score like {"label": 0.5}, the strict JSON branch correctly refuses it, but the fallback regex then matches only the leading 1 or 0 and silently grades the row as pass/fail. This can corrupt benchmark results for malformed judge outputs; require a delimiter/boundary after [01] or skip the regex fallback for JSON objects whose label field was present but invalid.
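
One way to apply the suggested boundary, shown as a sketch against the first quoted pattern: a negative lookahead rejects a 0 or 1 that is immediately followed by another digit or by a decimal fraction. The `bounded` pattern and the `extractLabel` helper are hypothetical and are not part of the merged commit.

```typescript
// First fallback pattern as quoted in the review excerpt above.
const unbounded = /"label"\s*:\s*([01])/i;
// Hypothetical fix: the digit must not be followed by a digit or ".digit".
const bounded = /"label"\s*:\s*([01])(?!\d|\.\d)/i;

// Hypothetical helper: returns the captured label, or null when nothing matches.
function extractLabel(text: string, pattern: RegExp): 0 | 1 | null {
  const match = text.match(pattern);
  if (!match) return null;
  return match[1] === "1" ? 1 : 0;
}
```

With the unbounded pattern, `{"label": 10}` is read as 1 and `{"label": 0.5}` as 0; the bounded pattern yields `null` for both while still accepting `{"label": 1, "reason": "..."}`.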

dvargasfuertes merged commit a6f4119 into main on May 28, 2026
5 checks passed
dvargasfuertes deleted the apollo/evals-v2-judge branch on May 28, 2026, 10:48