
Add Arena-Hard-v2 benchmark support#1152

Closed
bzantium wants to merge 2 commits into NVIDIA-NeMo:main from bzantium:feature/#1151

Conversation


@bzantium bzantium commented Jan 6, 2026

Summary

This PR adds support for the Arena-Hard-v2 benchmark, enabling evaluation of chat models using LLM-as-a-judge methodology with o3-mini as the judge.

What's Changed

  • Added arena-hard-v2 dataset with 750 test questions
  • Automated data preparation script that downloads questions and baseline answers from lm-sys/arena-hard-auto
  • Configured LLM-based judge evaluation pipeline
  • Integrated with NeMo Skills' arena metrics system
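The merge performed by the preparation script can be sketched as follows. This is an illustration based on the PR description (the field names question, baseline_answer, and uid come from the PR; the function name and everything else are assumptions), not the actual prepare.py:

```python
import json

def merge_questions_with_baselines(question_lines, baseline_lines):
    """Illustrative sketch of the arena-hard-v2 preparation step:
    index baseline answers by uid, then attach one to each question."""
    baseline_answers = {}
    for line in baseline_lines:
        record = json.loads(line)
        answer = ""
        # Take the first assistant turn as the baseline answer.
        for msg in record.get("messages", []):
            if msg.get("role") == "assistant":
                content = msg.get("content")
                answer = content.get("answer", "") if isinstance(content, dict) else (content or "")
                break
        baseline_answers[record["uid"]] = answer

    merged = []
    for line in question_lines:
        record = json.loads(line)
        record["question"] = record.pop("prompt")
        # .get() avoids a KeyError when a uid has no baseline answer.
        record["baseline_answer"] = baseline_answers.get(record["uid"], "")
        merged.append(record)
    return merged
```

In the real script the two inputs are downloaded JSONL files and the merged records are written to test.jsonl.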

Files Added

  • nemo_skills/dataset/arena-hard-v2/__init__.py - Dataset configuration with judge pipeline settings
  • nemo_skills/dataset/arena-hard-v2/prepare.py - Data preparation script

Technical Details

  • Baseline Model: o3-mini-2025-01-31
  • Default Judge: o3-mini-2025-01-31
  • Metrics Type: Arena-style pairwise comparison
  • Dataset Size: 750 questions
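Arena-style pairwise comparison judges each model answer against the baseline twice, once in each position, to reduce position bias. A minimal sketch of that pairing step (the prompt template below is a stand-in, not the actual NeMo Skills judge prompt):

```python
def make_judge_prompts(question, model_answer, baseline_answer):
    """Build the two position-swapped comparison prompts for one question."""
    template = (
        "Question: {q}\n\n"
        "Assistant A:\n{a}\n\n"
        "Assistant B:\n{b}\n\n"
        "Which assistant answered better? Reply with a verdict such as [[A>B]]."
    )
    # gen-base: model answer in position A; base-gen: baseline in position A.
    gen_base = template.format(q=question, a=model_answer, b=baseline_answer)
    base_gen = template.format(q=question, a=baseline_answer, b=model_answer)
    return gen_base, base_gen
```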

Testing

  • Data preparation script runs successfully
  • Dataset loads correctly in evaluation pipeline
  • Judge integration works with OpenAI API

References

fixes: #1151

Summary by CodeRabbit

  • New Features
    • Arena Hard v2 dataset evaluation now available with pre-configured judge settings (arena metrics evaluation)
    • Automated dataset preparation script downloads and processes Arena Hard v2 benchmark files with baseline answer integration



greptile-apps bot commented Jan 6, 2026

Greptile Overview

Greptile Summary

This PR adds Arena-Hard-v2 benchmark support by creating a new dataset directory that mirrors the existing arena-hard implementation with updated data sources and the o3-mini-2025-01-31 model.

What's Added:

  • Dataset Configuration (__init__.py): Defines evaluation settings using arena metrics with o3-mini-2025-01-31 as both judge and baseline model, matching the pattern established by arena-hard
  • Data Preparation Script (prepare.py): Downloads 750 questions and baseline answers from the official lmarena/arena-hard-auto repository, processes them, and creates the test dataset
  • Documentation: Comprehensive usage instructions added to other-benchmarks.md

Integration Points:

  • Leverages existing arena_judge.py for LLM-as-a-judge evaluation with pairwise comparisons
  • Uses existing ArenaMetrics class for score aggregation and confidence interval calculation
  • Follows established dataset preparation patterns from arena-hard

Repository Update:

  • Updated GitHub URLs from lm-sys to lmarena organization for both arena-hard and arena-hard-v2

Critical Issues Found:
The prepare.py script contains two KeyError vulnerabilities at lines 46 and 51-52 that will cause runtime failures if the upstream data format changes or is malformed. These occur when accessing data["uid"] and data.pop("prompt") without validation.

Confidence Score: 3/5

  • This PR is mostly safe but contains critical bugs that will cause runtime failures if data format changes
  • The implementation correctly follows established patterns and integrates well with existing infrastructure (arena_judge.py, ArenaMetrics). However, the prepare.py script has two critical KeyError bugs that will crash with unhelpful error messages if the upstream data format is malformed or changes. These are production-blocking issues that must be fixed before merge. The configuration and documentation are solid.
  • Pay close attention to nemo_skills/dataset/arena-hard-v2/prepare.py which contains KeyError vulnerabilities that need immediate fixes

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/arena-hard-v2/prepare.py 2/5 Data preparation script with critical KeyError vulnerabilities when processing question and baseline data
nemo_skills/dataset/arena-hard-v2/__init__.py 5/5 Dataset configuration correctly mirrors arena-hard pattern with updated o3-mini model settings
docs/evaluation/other-benchmarks.md 5/5 Documentation added for arena-hard-v2 with proper examples, also updates arena-hard repo URL from lm-sys to lmarena
nemo_skills/dataset/arena-hard/prepare.py 5/5 Updated GitHub repository URLs from lm-sys to lmarena organization

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant GitHub as lmarena/arena-hard-auto
    participant Dataset as test.jsonl
    participant EvalPipeline as NeMo Skills Eval
    participant Judge as arena_judge.py
    participant OpenAI as OpenAI API (o3-mini)
    participant Metrics as ArenaMetrics

    User->>PrepareScript: ns prepare_data arena-hard-v2
    PrepareScript->>GitHub: Download question.jsonl
    GitHub-->>PrepareScript: 750 questions
    PrepareScript->>GitHub: Download o3-mini baseline answers
    GitHub-->>PrepareScript: Baseline answers
    PrepareScript->>PrepareScript: Parse baseline answers<br/>Extract assistant messages
    PrepareScript->>PrepareScript: Merge questions with baselines
    PrepareScript->>Dataset: Write test.jsonl<br/>(question + baseline_answer)
    
    User->>EvalPipeline: ns eval --benchmarks=arena-hard-v2
    EvalPipeline->>Dataset: Read test.jsonl
    EvalPipeline->>EvalPipeline: Generate model responses
    EvalPipeline->>Judge: Process judgements<br/>(generation + baseline_answer)
    
    Judge->>Judge: Create gen-base comparison<br/>(answer_1=generation, answer_2=baseline)
    Judge->>Judge: Create base-gen comparison<br/>(answer_1=baseline, answer_2=generation)
    
    Judge->>OpenAI: Judge gen-base pair
    OpenAI-->>Judge: Judgement [[A>>B]], [[A>B]], etc.
    Judge->>OpenAI: Judge base-gen pair (reversed)
    OpenAI-->>Judge: Judgement [[B>>A]], [[B>A]], etc.
    
    Judge-->>EvalPipeline: Return both judgements
    EvalPipeline->>Metrics: Calculate arena score
    Metrics->>Metrics: Aggregate pairwise comparisons
    Metrics-->>User: Final score with 95% CI
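The final aggregation step in the diagram (pairwise verdicts in, a score with a 95% CI out) can be approximated with a percentile bootstrap. The verdict-to-score weights below are illustrative assumptions, not the exact mapping used by ArenaMetrics:

```python
import random
import statistics

# Assumed verdict weights for illustration only.
VERDICT_SCORES = {"A>>B": 1.0, "A>B": 0.75, "A=B": 0.5, "B>A": 0.25, "B>>A": 0.0}

def score_with_ci(verdicts, n_resamples=1000, seed=0):
    """Mean pairwise score plus a 95% percentile-bootstrap interval."""
    rng = random.Random(seed)
    scores = [VERDICT_SCORES[v] for v in verdicts]
    mean = statistics.fmean(scores)
    resample_means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    low = resample_means[int(0.025 * n_resamples)]
    high = resample_means[int(0.975 * n_resamples)]
    return mean, (low, high)
```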

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. nemo_skills/dataset/arena-hard-v2/prepare.py, line 52 (link)

    Logic: if a question has a uid that doesn't exist in baseline_answers, this will raise a KeyError; add error handling.

2 files reviewed, 1 comment



coderabbitai bot commented Jan 6, 2026

📝 Walkthrough

Walkthrough

Adds a new arena-hard-v2 dataset module with configuration constants for evaluation setup and a data preparation script that downloads question and baseline answer data, then merges them into a test dataset file.

Changes

Cohort / File(s) Summary
Arena-hard-v2 module initialization
nemo_skills/dataset/arena-hard-v2/__init__.py
Introduces module-level constants: DATASET_GROUP ("chat"), METRICS_TYPE ("arena"), GENERATION_ARGS ("++prompt_config=generic/default"), and JUDGE_PIPELINE_ARGS dictionary specifying arena judge model (o3-mini-2025-01-31) and OpenAI server configuration.
Arena-hard-v2 data preparation
nemo_skills/dataset/arena-hard-v2/prepare.py
New script that downloads question and baseline answer JSONL files, extracts assistant answers by uid, remaps prompt keys to question, injects baseline answers, and outputs test.jsonl.
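Based on the constants described in the summary above, the __init__.py module might look roughly like this sketch; the key names inside JUDGE_PIPELINE_ARGS are assumptions, not the actual NeMo Skills schema:

```python
# Sketch of nemo_skills/dataset/arena-hard-v2/__init__.py from the
# constants named above; the JUDGE_PIPELINE_ARGS keys are assumed.
DATASET_GROUP = "chat"
METRICS_TYPE = "arena"
GENERATION_ARGS = "++prompt_config=generic/default"
JUDGE_PIPELINE_ARGS = {
    "model": "o3-mini-2025-01-31",  # arena judge model from the summary
    "server_type": "openai",        # assumed key/value for the OpenAI server config
}
```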

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

  • Add Arena-Hard-v2 benchmark support #1151 — This PR directly implements the arena-hard-v2 dataset module addition, prepare.py script, and arena judge configuration (o3-mini-2025-01-31) described in this issue.

Possibly related PRs

  • Add apex-shortlist dataset #1080 — Both PRs follow the same pattern of adding dataset-specific modules with DATASET_GROUP, METRICS_TYPE, GENERATION_ARGS constants and a prepare.py script that generates test.jsonl.

Suggested reviewers

  • gwarmstrong
  • Jorjeous
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add Arena-Hard-v2 benchmark support' directly and accurately summarizes the main objective of this PR, which is to introduce support for the Arena-Hard-v2 benchmark dataset and evaluation pipeline.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI Agents
In @nemo_skills/dataset/arena-hard-v2/prepare.py:
- Around line 34-46: The loop populating baseline_answers can set answer_text to
None when msg["content"] is None; update the message-handling in the messages
loop (the variables: messages, msg, content, answer_text, baseline_answers) so
that after extracting content you treat None as an empty string—for example, if
content is a dict use content.get("answer", ""), otherwise coerce falsy/None
content to "" before assigning to answer_text—then proceed with the existing
break and baseline_answers[data["uid"]] assignment.
- Around line 48-53: Wrap the per-line processing in try/except to handle
JSONDecodeError and KeyError and validate required keys: before using data,
ensure 'prompt' and 'uid' exist on the parsed dict and that baseline_answers
contains data['uid']; if any check fails, log a warning (including the offending
line or uid) and skip the record instead of raising; when valid, map
data["question"] = data.pop("prompt") and set data["baseline_answer"] =
baseline_answers[data["uid"]] as now; keep processing remaining lines so a
single bad entry or malformed JSON does not abort the whole run.
- Around line 25-32: Wrap the network downloads for URL_QUESTIONS and
URL_BASELINE in robust error handling: replace the direct
urllib.request.urlretrieve calls with a download routine that uses
urllib.request.urlopen(..., timeout=10) (or similar) and writes to the target
files inside a try/except block catching URLError/HTTPError/timeout exceptions,
logging or printing a clear error and exiting non‑zero on failure; also remove
or adjust the unnecessary data_dir.mkdir(...) call since data_dir is the script
parent and should already exist; reference URL_QUESTIONS, URL_BASELINE,
urllib.request.urlopen/urlretrieve, and data_dir in the change.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c039f5 and f28123f.

📒 Files selected for processing (2)
  • nemo_skills/dataset/arena-hard-v2/__init__.py
  • nemo_skills/dataset/arena-hard-v2/prepare.py
🧰 Additional context used
🪛 Ruff (0.14.10)
nemo_skills/dataset/arena-hard-v2/prepare.py

31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


32-32: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🔇 Additional comments (1)
nemo_skills/dataset/arena-hard-v2/__init__.py (1)

22-27: No action needed. The model identifier "o3-mini-2025-01-31" is confirmed as available on OpenAI's API as of January 2026, and the endpoint is correct.

Likely an incorrect or invalid review comment.

Comment on lines +25 to +32
if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)
    questions = str(data_dir / "question.jsonl")
    baseline = str(data_dir / "o3-mini-2025-01-31.jsonl")
    output_file = str(data_dir / "test.jsonl")
    urllib.request.urlretrieve(URL_QUESTIONS, questions)
    urllib.request.urlretrieve(URL_BASELINE, baseline)

⚠️ Potential issue | 🟠 Major

Add error handling for network operations.

The download operations lack error handling and timeout configuration, which could cause the script to hang indefinitely or crash on network failures.

🔎 Proposed fix with error handling and timeout
+import socket
+import urllib.error
+
 if __name__ == "__main__":
     data_dir = Path(__file__).absolute().parent
-    data_dir.mkdir(exist_ok=True)
     questions = str(data_dir / "question.jsonl")
     baseline = str(data_dir / "o3-mini-2025-01-31.jsonl")
     output_file = str(data_dir / "test.jsonl")
-    urllib.request.urlretrieve(URL_QUESTIONS, questions)
-    urllib.request.urlretrieve(URL_BASELINE, baseline)
+    
+    try:
+        # Set a reasonable timeout for downloads
+        socket.setdefaulttimeout(30)
+        urllib.request.urlretrieve(URL_QUESTIONS, questions)
+        urllib.request.urlretrieve(URL_BASELINE, baseline)
+    except (urllib.error.URLError, OSError) as e:
+        print(f"Error downloading files: {e}")
+        raise

Note: Line 27 (data_dir.mkdir(exist_ok=True)) was removed as data_dir is the script's parent directory, which must already exist.



Comment on lines +34 to +46
    baseline_answers = {}
    with open(baseline, "rt", encoding="utf-8") as fin:
        for line in fin:
            data = json.loads(line)
            messages = data.get("messages", [])
            answer_text = ""
            for msg in messages:
                if msg.get("role") == "assistant":
                    content = msg.get("content")
                    answer_text = content.get("answer", "") if isinstance(content, dict) else content
                    break

            baseline_answers[data["uid"]] = answer_text

⚠️ Potential issue | 🟡 Minor

Add defensive handling for None content.

Line 43 may fail to handle the case where content is None. If content is None and not a dict, answer_text would be set to None instead of an empty string, which could cause downstream issues.

🔎 Proposed fix for None handling
             for msg in messages:
                 if msg.get("role") == "assistant":
                     content = msg.get("content")
-                    answer_text = content.get("answer", "") if isinstance(content, dict) else content
+                    if isinstance(content, dict):
+                        answer_text = content.get("answer", "")
+                    else:
+                        answer_text = content if content is not None else ""
                     break

Comment on lines +48 to +53
    with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
        for line in fin:
            data = json.loads(line)
            data["question"] = data.pop("prompt")
            data["baseline_answer"] = baseline_answers[data["uid"]]
            fout.write(json.dumps(data) + "\n")

⚠️ Potential issue | 🟠 Major

Add validation and error handling for data processing.

Lines 51-52 may raise KeyError if the expected keys are missing or if UIDs don't match between the questions and baseline files. This could occur if the upstream data format changes or downloads are corrupted.

🔎 Proposed fix with error handling
     with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
         for line in fin:
             data = json.loads(line)
-            data["question"] = data.pop("prompt")
-            data["baseline_answer"] = baseline_answers[data["uid"]]
+            
+            # Validate expected fields
+            if "prompt" not in data:
+                print(f"Warning: Missing 'prompt' field in question: {data.get('uid', 'unknown')}")
+                continue
+            if "uid" not in data:
+                print(f"Warning: Missing 'uid' field in question data")
+                continue
+            
+            uid = data["uid"]
+            if uid not in baseline_answers:
+                print(f"Warning: No baseline answer found for uid: {uid}")
+                continue
+            
+            data["question"] = data.pop("prompt")
+            data["baseline_answer"] = baseline_answers[uid]
             fout.write(json.dumps(data) + "\n")


Kipok commented Jan 6, 2026

thanks @bzantium ! Did you do any validation that results match officially reported numbers? If so, could you please share the command / summarized output?


bzantium commented Jan 7, 2026

Hi @Kipok,

I have completed the evaluation for Qwen3-32B and Qwen3-30B-A3B using the arena-hard-v2 benchmark. The tests were run using vLLM.

Below is the execution command and the detailed JSON results.

Execution Command

ns eval --model Qwen/[Qwen3-30B-A3B, Qwen3-32B] --benchmarks arena-hard-v2 --server_gpus 2 --server_type vllm --judge_model [o3-mini-25-01-31/gpt-4.1]

Evaluation Results

1) o3-mini-2025-01-31 as judge results

1. Qwen3-32B

{
  "arena-hard-v2": {
    "pass@1": {
      "num_entries": 750,
      "score": 37.88,
      "95_CI": [
        -1.56,
        1.9
      ],
      "invalid_scores": 1,
      "avg_tokens": 5475,
      "gen_seconds": 101
    }
  }
}

2. Qwen3-30B-A3B

{
  "arena-hard-v2": {
    "pass@1": {
      "num_entries": 750,
      "score": 26.15,
      "95_CI": [
        -1.22,
        1.74
      ],
      "invalid_scores": 0,
      "avg_tokens": 5466,
      "gen_seconds": 100
    }
  }
}

2) gpt-4.1 as judge results

1. Qwen3-32B

{
  "arena-hard-v2": {
    "pass@1": {
      "num_entries": 750,
      "score": 40.83,
      "95_CI": [
        -1.75,
        1.84
      ],
      "invalid_scores": 0,
      "avg_tokens": 5475,
      "gen_seconds": 2524
    }
  }
}

2. Qwen3-30B-A3B

{
  "arena-hard-v2": {
    "pass@1": {
      "num_entries": 750,
      "score": 25.2,
      "95_CI": [
        -1.57,
        1.32
      ],
      "invalid_scores": 1,
      "avg_tokens": 5466,
      "gen_seconds": 846
    }
  }
}

Leaderboard Score (gpt-4.1 as judge)

16                                Qwen3-32B        35.8  (-2.1 / +2.2)
17                            Qwen3-30B-A3B        28.7  (-1.4 / +2.1)

Although there is a slight discrepancy between my results and the leaderboard scores, the overall performance aligns well. Considering potential differences in hyperparameters (such as temperature) and inherent variance in generation, I believe these results are sufficiently validated.

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Added Arena-Hard-v2 benchmark support with o3-mini as both baseline model and judge, upgrading from arena-hard v0.1, which used gpt-4-0314.

  • New dataset configuration mirrors existing arena-hard pattern perfectly
  • Data preparation script downloads 751 questions and baseline answers from lm-sys/arena-hard-auto repository
  • Integration with existing arena judge pipeline and metrics system is seamless
  • One logical issue found: missing error handling for potential uid mismatch between questions and baseline answers

Confidence Score: 4/5

  • This PR is safe to merge with one minor fix needed for error handling
  • The implementation closely follows the established arena-hard v0.1 pattern with appropriate model updates. One logical bug was found (KeyError risk) that should be fixed to prevent runtime failures if data sources become misaligned. Otherwise the integration is clean and well-structured.
  • Pay attention to prepare.py - fix the KeyError handling on line 52 before merging

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/arena-hard-v2/__init__.py 5/5 Configuration file for Arena-Hard-v2 benchmark with o3-mini judge settings - follows established pattern from arena-hard v0.1
nemo_skills/dataset/arena-hard-v2/prepare.py 3/5 data preparation script downloads questions and baseline answers - has KeyError risk on line 52 if uid mismatch occurs

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant GitHub as lm-sys/arena-hard-auto
    participant DataDir as Dataset Directory
    participant EvalPipeline as Evaluation Pipeline
    participant Judge as o3-mini Judge
    participant Metrics as Arena Metrics

    User->>PrepareScript: Run prepare.py
    PrepareScript->>GitHub: Download question.jsonl
    GitHub-->>PrepareScript: 751 questions
    PrepareScript->>GitHub: Download o3-mini baseline answers
    GitHub-->>PrepareScript: Baseline model answers
    PrepareScript->>PrepareScript: Parse baseline answers (extract from messages)
    PrepareScript->>PrepareScript: Merge questions with baseline_answer field
    PrepareScript->>DataDir: Write test.jsonl
    
    User->>EvalPipeline: Start evaluation with arena-hard-v2
    EvalPipeline->>DataDir: Load test.jsonl
    EvalPipeline->>EvalPipeline: Generate model answers
    EvalPipeline->>Judge: Send (question, model_answer, baseline_answer)
    Judge->>Judge: Compare answer_1 vs answer_2
    Judge-->>EvalPipeline: Judgement [[A>>B]], [[A>B]], [[A=B]], etc.
    Judge->>Judge: Compare with reversed order (bias check)
    Judge-->>EvalPipeline: Judgement for reversed comparison
    EvalPipeline->>Metrics: Calculate aggregate score
    Metrics-->>User: Final arena metrics (win rate, confidence intervals)

        for line in fin:
            data = json.loads(line)
            data["question"] = data.pop("prompt")
            data["baseline_answer"] = baseline_answers[data["uid"]]

Missing error handling: if a question's uid doesn't exist in baseline_answers, this will raise a KeyError.

Suggested change
- data["baseline_answer"] = baseline_answers[data["uid"]]
+ data["baseline_answer"] = baseline_answers.get(data["uid"], "")


Kipok commented Jan 12, 2026

thanks @bzantium, this is very helpful! We will run a little more validation on our side and can merge after that!

In the meantime, do you want to update the docs to add the command / results? This would need to go in here https://github.com/NVIDIA-NeMo/Skills/blob/main/docs/evaluation/other-benchmarks.md. You can see an example of how this can look, e.g. here https://github.com/NVIDIA-NeMo/Skills/blob/main/docs/evaluation/code.md (rendered page is here https://nvidia-nemo.github.io/Skills/evaluation/code/)

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments

        for line in fin:
            data = json.loads(line)
            data["question"] = data.pop("prompt")
            data["baseline_answer"] = baseline_answers[data["uid"]]

This line will raise a KeyError if a question's uid is not found in the baseline_answers dictionary. The code should use .get() with a default value or add error handling to prevent crashes if the baseline data is incomplete or mismatched with the questions.

Suggested change
- data["baseline_answer"] = baseline_answers[data["uid"]]
+ data["baseline_answer"] = baseline_answers.get(data["uid"], "")

Comment on lines +31 to +32
urllib.request.urlretrieve(URL_QUESTIONS, questions)
urllib.request.urlretrieve(URL_BASELINE, baseline)

Network requests can fail for various reasons (connection issues, 404 errors, server unavailability, etc.), but there's no error handling. If either URL is unreachable or returns an error, the script will crash with an unclear error message. Consider wrapping these calls in a try-except block to provide better error messages to users.

Suggested change
- urllib.request.urlretrieve(URL_QUESTIONS, questions)
- urllib.request.urlretrieve(URL_BASELINE, baseline)
+ try:
+     urllib.request.urlretrieve(URL_QUESTIONS, questions)
+     urllib.request.urlretrieve(URL_BASELINE, baseline)
+ except urllib.error.URLError as e:
+     print(f"Error downloading data: {e}")
+     raise

    with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
        for line in fin:
            data = json.loads(line)
            data["question"] = data.pop("prompt")

Using pop("prompt") will raise a KeyError if the question data doesn't have a "prompt" field. This could happen if the upstream data format changes or is malformed. Consider using .pop("prompt", None) with validation, or .get() with a fallback.

Suggested change
- data["question"] = data.pop("prompt")
+ data["question"] = data.pop("prompt", data.get("question", ""))

Signed-off-by: bzantium <ryumin93@gmail.com>

bzantium commented Jan 13, 2026

@Kipok Sounds good! I've updated the documentation to include the run commands and evaluation results as suggested. I included the benchmark results using the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model. Please let me know if there's anything else needed before the merge.

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 5 comments

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

                    answer_text = content.get("answer", "") if isinstance(content, dict) else content
                    break

            baseline_answers[data["uid"]] = answer_text

This line will raise a KeyError if the baseline data doesn't contain a "uid" field. The script should handle this case gracefully.

Use .get() with a fallback or add error handling:

Suggested change
- baseline_answers[data["uid"]] = answer_text
+ baseline_answers[data.get("uid", "")] = answer_text

Or add validation earlier to ensure required fields exist.

Comment on lines +51 to +52
            data["question"] = data.pop("prompt")
            data["baseline_answer"] = baseline_answers[data["uid"]]

These lines will raise KeyError if the question data is missing "prompt" or "uid" fields, or if a question's uid doesn't exist in baseline_answers. This will cause the script to crash with an unhelpful error message.

Add error handling or use .get() with appropriate defaults:

Suggested change
- data["question"] = data.pop("prompt")
- data["baseline_answer"] = baseline_answers[data["uid"]]
+ prompt = data.pop("prompt", None)
+ if prompt is None:
+     continue  # Skip entries without prompt
+ data["question"] = prompt
+ uid = data.get("uid")
+ if uid not in baseline_answers:
+     continue  # Skip entries without matching baseline
+ data["baseline_answer"] = baseline_answers[uid]


Kipok commented Jan 27, 2026

sorry, been very busy with other random things, but will try to prioritize this verification this week and hopefully we will be able to merge after that

@Kipok Kipok mentioned this pull request Jan 31, 2026

Kipok commented Jan 31, 2026

closing in favor of #1205

@bzantium I found a few issues which I fixed in #1205. Please give it a try and see if it works for you!

@Kipok Kipok closed this Jan 31, 2026
