
Add Arena-Hard-v2 benchmark support#1152

Closed
bzantium wants to merge 2 commits into NVIDIA-NeMo:main from bzantium:feature/#1151

Conversation


@bzantium bzantium commented Jan 6, 2026

Summary

This PR adds support for the Arena-Hard-v2 benchmark, enabling evaluation of chat models using LLM-as-a-judge methodology with o3-mini as the judge.

What's Changed

  • Added arena-hard-v2 dataset with 750 test questions
  • Automated data preparation script that downloads questions and baseline answers from lm-sys/arena-hard-auto
  • Configured LLM-based judge evaluation pipeline
  • Integrated with NeMo Skills' arena metrics system
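The merge performed by the preparation script can be sketched as follows. This is an illustration based on the PR description (the field names question, baseline_answer, and uid come from the PR; the function name and everything else are assumptions), not the actual prepare.py:

```python
import json

def merge_questions_with_baselines(question_lines, baseline_lines):
    """Illustrative sketch of the arena-hard-v2 preparation step:
    index baseline answers by uid, then attach one to each question."""
    baseline_answers = {}
    for line in baseline_lines:
        record = json.loads(line)
        answer = ""
        # Take the first assistant turn as the baseline answer.
        for msg in record.get("messages", []):
            if msg.get("role") == "assistant":
                content = msg.get("content")
                answer = content.get("answer", "") if isinstance(content, dict) else (content or "")
                break
        baseline_answers[record["uid"]] = answer

    merged = []
    for line in question_lines:
        record = json.loads(line)
        record["question"] = record.pop("prompt")
        # .get() avoids a KeyError when a uid has no baseline answer.
        record["baseline_answer"] = baseline_answers.get(record["uid"], "")
        merged.append(record)
    return merged
```

In the real script the two inputs are downloaded JSONL files and the merged records are written to test.jsonl.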

Files Added

  • nemo_skills/dataset/arena-hard-v2/__init__.py - Dataset configuration with judge pipeline settings
  • nemo_skills/dataset/arena-hard-v2/prepare.py - Data preparation script

Technical Details

  • Baseline Model: o3-mini-2025-01-31
  • Default Judge: o3-mini-2025-01-31
  • Metrics Type: Arena-style pairwise comparison
  • Dataset Size: 750 questions
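Arena-style pairwise comparison judges each model answer against the baseline twice, once in each position, to reduce position bias. A minimal sketch of that pairing step (the prompt template below is a stand-in, not the actual NeMo Skills judge prompt):

```python
def make_judge_prompts(question, model_answer, baseline_answer):
    """Build the two position-swapped comparison prompts for one question."""
    template = (
        "Question: {q}\n\n"
        "Assistant A:\n{a}\n\n"
        "Assistant B:\n{b}\n\n"
        "Which assistant answered better? Reply with a verdict such as [[A>B]]."
    )
    # gen-base: model answer in position A; base-gen: baseline in position A.
    gen_base = template.format(q=question, a=model_answer, b=baseline_answer)
    base_gen = template.format(q=question, a=baseline_answer, b=model_answer)
    return gen_base, base_gen
```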

Testing

  • Data preparation script runs successfully
  • Dataset loads correctly in evaluation pipeline
  • Judge integration works with OpenAI API

References

fixes: #1151

Summary by CodeRabbit

  • New Features
    • Arena Hard v2 dataset evaluation now available with pre-configured judge settings (arena metrics evaluation)
    • Automated dataset preparation script downloads and processes Arena Hard v2 benchmark files with baseline answer integration



greptile-apps bot commented Jan 6, 2026

Greptile Overview

Greptile Summary

This PR adds Arena-Hard-v2 benchmark support by creating a new dataset directory that mirrors the existing arena-hard implementation with updated data sources and the o3-mini-2025-01-31 model.

What's Added:

  • Dataset Configuration (__init__.py): Defines evaluation settings using arena metrics with o3-mini-2025-01-31 as both judge and baseline model, matching the pattern established by arena-hard
  • Data Preparation Script (prepare.py): Downloads 750 questions and baseline answers from the official lmarena/arena-hard-auto repository, processes them, and creates the test dataset
  • Documentation: Comprehensive usage instructions added to other-benchmarks.md

Integration Points:

  • Leverages existing arena_judge.py for LLM-as-a-judge evaluation with pairwise comparisons
  • Uses existing ArenaMetrics class for score aggregation and confidence interval calculation
  • Follows established dataset preparation patterns from arena-hard

Repository Update:

  • Updated GitHub URLs from lm-sys to lmarena organization for both arena-hard and arena-hard-v2

Critical Issues Found:
The prepare.py script contains two KeyError vulnerabilities at lines 46 and 51-52 that will cause runtime failures if the upstream data format changes or is malformed. These occur when accessing data["uid"] and data.pop("prompt") without validation.

Confidence Score: 3/5

  • This PR is mostly safe but contains critical bugs that will cause runtime failures if data format changes
  • The implementation correctly follows established patterns and integrates well with existing infrastructure (arena_judge.py, ArenaMetrics). However, the prepare.py script has two critical KeyError bugs that will crash with unhelpful error messages if the upstream data format is malformed or changes. These are production-blocking issues that must be fixed before merge. The configuration and documentation are solid.
  • Pay close attention to nemo_skills/dataset/arena-hard-v2/prepare.py which contains KeyError vulnerabilities that need immediate fixes

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/arena-hard-v2/prepare.py 2/5 Data preparation script with critical KeyError vulnerabilities when processing question and baseline data
nemo_skills/dataset/arena-hard-v2/__init__.py 5/5 Dataset configuration correctly mirrors arena-hard pattern with updated o3-mini model settings
docs/evaluation/other-benchmarks.md 5/5 Documentation added for arena-hard-v2 with proper examples, also updates arena-hard repo URL from lm-sys to lmarena
nemo_skills/dataset/arena-hard/prepare.py 5/5 Updated GitHub repository URLs from lm-sys to lmarena organization

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant GitHub as lmarena/arena-hard-auto
    participant Dataset as test.jsonl
    participant EvalPipeline as NeMo Skills Eval
    participant Judge as arena_judge.py
    participant OpenAI as OpenAI API (o3-mini)
    participant Metrics as ArenaMetrics

    User->>PrepareScript: ns prepare_data arena-hard-v2
    PrepareScript->>GitHub: Download question.jsonl
    GitHub-->>PrepareScript: 750 questions
    PrepareScript->>GitHub: Download o3-mini baseline answers
    GitHub-->>PrepareScript: Baseline answers
    PrepareScript->>PrepareScript: Parse baseline answers<br/>Extract assistant messages
    PrepareScript->>PrepareScript: Merge questions with baselines
    PrepareScript->>Dataset: Write test.jsonl<br/>(question + baseline_answer)
    
    User->>EvalPipeline: ns eval --benchmarks=arena-hard-v2
    EvalPipeline->>Dataset: Read test.jsonl
    EvalPipeline->>EvalPipeline: Generate model responses
    EvalPipeline->>Judge: Process judgements<br/>(generation + baseline_answer)
    
    Judge->>Judge: Create gen-base comparison<br/>(answer_1=generation, answer_2=baseline)
    Judge->>Judge: Create base-gen comparison<br/>(answer_1=baseline, answer_2=generation)
    
    Judge->>OpenAI: Judge gen-base pair
    OpenAI-->>Judge: Judgement [[A>>B]], [[A>B]], etc.
    Judge->>OpenAI: Judge base-gen pair (reversed)
    OpenAI-->>Judge: Judgement [[B>>A]], [[B>A]], etc.
    
    Judge-->>EvalPipeline: Return both judgements
    EvalPipeline->>Metrics: Calculate arena score
    Metrics->>Metrics: Aggregate pairwise comparisons
    Metrics-->>User: Final score with 95% CI
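The final aggregation step in the diagram (pairwise verdicts in, a score with a 95% CI out) can be approximated with a percentile bootstrap. The verdict-to-score weights below are illustrative assumptions, not the exact mapping used by ArenaMetrics:

```python
import random
import statistics

# Assumed verdict weights for illustration only.
VERDICT_SCORES = {"A>>B": 1.0, "A>B": 0.75, "A=B": 0.5, "B>A": 0.25, "B>>A": 0.0}

def score_with_ci(verdicts, n_resamples=1000, seed=0):
    """Mean pairwise score plus a 95% percentile-bootstrap interval."""
    rng = random.Random(seed)
    scores = [VERDICT_SCORES[v] for v in verdicts]
    mean = statistics.fmean(scores)
    resample_means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    low = resample_means[int(0.025 * n_resamples)]
    high = resample_means[int(0.975 * n_resamples)]
    return mean, (low, high)
```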

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. nemo_skills/dataset/arena-hard-v2/prepare.py, line 52 (link)

    Logic: if a question has a uid that doesn't exist in baseline_answers, this will raise a KeyError; add error handling.

2 files reviewed, 1 comment



coderabbitai bot commented Jan 6, 2026

📝 Walkthrough

Walkthrough

Adds a new arena-hard-v2 dataset module with configuration constants for evaluation setup and a data preparation script that downloads question and baseline answer data, then merges them into a test dataset file.

Changes

Cohort / File(s) Summary
Arena-hard-v2 module initialization
nemo_skills/dataset/arena-hard-v2/__init__.py
Introduces module-level constants: DATASET_GROUP ("chat"), METRICS_TYPE ("arena"), GENERATION_ARGS ("++prompt_config=generic/default"), and JUDGE_PIPELINE_ARGS dictionary specifying arena judge model (o3-mini-2025-01-31) and OpenAI server configuration.
Arena-hard-v2 data preparation
nemo_skills/dataset/arena-hard-v2/prepare.py
New script that downloads question and baseline answer JSONL files, extracts assistant answers by uid, remaps prompt keys to question, injects baseline answers, and outputs test.jsonl.
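Based on the constants described in the summary above, the __init__.py module might look roughly like this sketch; the key names inside JUDGE_PIPELINE_ARGS are assumptions, not the actual NeMo Skills schema:

```python
# Sketch of nemo_skills/dataset/arena-hard-v2/__init__.py from the
# constants named above; the JUDGE_PIPELINE_ARGS keys are assumed.
DATASET_GROUP = "chat"
METRICS_TYPE = "arena"
GENERATION_ARGS = "++prompt_config=generic/default"
JUDGE_PIPELINE_ARGS = {
    "model": "o3-mini-2025-01-31",  # arena judge model from the summary
    "server_type": "openai",        # assumed key/value for the OpenAI server config
}
```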

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

  • Add Arena-Hard-v2 benchmark support #1151 — This PR directly implements the arena-hard-v2 dataset module addition, prepare.py script, and arena judge configuration (o3-mini-2025-01-31) described in this issue.

Possibly related PRs

  • Add apex-shortlist dataset #1080 — Both PRs follow the same pattern of adding dataset-specific modules with DATASET_GROUP, METRICS_TYPE, GENERATION_ARGS constants and a prepare.py script that generates test.jsonl.

Suggested reviewers

  • gwarmstrong
  • Jorjeous
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add Arena-Hard-v2 benchmark support' directly and accurately summarizes the main objective of this PR, which is to introduce support for the Arena-Hard-v2 benchmark dataset and evaluation pipeline.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI Agents
In @nemo_skills/dataset/arena-hard-v2/prepare.py:
- Around line 34-46: The loop populating baseline_answers can set answer_text to
None when msg["content"] is None; update the message-handling in the messages
loop (the variables: messages, msg, content, answer_text, baseline_answers) so
that after extracting content you treat None as an empty string—for example, if
content is a dict use content.get("answer", ""), otherwise coerce falsy/None
content to "" before assigning to answer_text—then proceed with the existing
break and baseline_answers[data["uid"]] assignment.
- Around line 48-53: Wrap the per-line processing in try/except to handle
JSONDecodeError and KeyError and validate required keys: before using data,
ensure 'prompt' and 'uid' exist on the parsed dict and that baseline_answers
contains data['uid']; if any check fails, log a warning (including the offending
line or uid) and skip the record instead of raising; when valid, map
data["question"] = data.pop("prompt") and set data["baseline_answer"] =
baseline_answers[data["uid"]] as now; keep processing remaining lines so a
single bad entry or malformed JSON does not abort the whole run.
- Around line 25-32: Wrap the network downloads for URL_QUESTIONS and
URL_BASELINE in robust error handling: replace the direct
urllib.request.urlretrieve calls with a download routine that uses
urllib.request.urlopen(..., timeout=10) (or similar) and writes to the target
files inside a try/except block catching URLError/HTTPError/timeout exceptions,
logging or printing a clear error and exiting non‑zero on failure; also remove
or adjust the unnecessary data_dir.mkdir(...) call since data_dir is the script
parent and should already exist; reference URL_QUESTIONS, URL_BASELINE,
urllib.request.urlopen/urlretrieve, and data_dir in the change.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c039f5 and f28123f.

📒 Files selected for processing (2)
  • nemo_skills/dataset/arena-hard-v2/__init__.py
  • nemo_skills/dataset/arena-hard-v2/prepare.py
🧰 Additional context used
🪛 Ruff (0.14.10)
nemo_skills/dataset/arena-hard-v2/prepare.py

31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


32-32: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🔇 Additional comments (1)
nemo_skills/dataset/arena-hard-v2/__init__.py (1)

22-27: No action needed. The model identifier "o3-mini-2025-01-31" is confirmed as available on OpenAI's API as of January 2026, and the endpoint is correct.

Likely an incorrect or invalid review comment.

Comment on lines +25 to +32
if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)
    questions = str(data_dir / "question.jsonl")
    baseline = str(data_dir / "o3-mini-2025-01-31.jsonl")
    output_file = str(data_dir / "test.jsonl")
    urllib.request.urlretrieve(URL_QUESTIONS, questions)
    urllib.request.urlretrieve(URL_BASELINE, baseline)

⚠️ Potential issue | 🟠 Major

Add error handling for network operations.

The download operations lack error handling and timeout configuration, which could cause the script to hang indefinitely or crash on network failures.

🔎 Proposed fix with error handling and timeout
+import socket
+import urllib.error
+
 if __name__ == "__main__":
     data_dir = Path(__file__).absolute().parent
-    data_dir.mkdir(exist_ok=True)
     questions = str(data_dir / "question.jsonl")
     baseline = str(data_dir / "o3-mini-2025-01-31.jsonl")
     output_file = str(data_dir / "test.jsonl")
-    urllib.request.urlretrieve(URL_QUESTIONS, questions)
-    urllib.request.urlretrieve(URL_BASELINE, baseline)
+    
+    try:
+        # Set a reasonable timeout for downloads
+        socket.setdefaulttimeout(30)
+        urllib.request.urlretrieve(URL_QUESTIONS, questions)
+        urllib.request.urlretrieve(URL_BASELINE, baseline)
+    except (urllib.error.URLError, OSError) as e:
+        print(f"Error downloading files: {e}")
+        raise

Note: Line 27 (data_dir.mkdir(exist_ok=True)) was removed as data_dir is the script's parent directory, which must already exist.



Comment on lines +34 to +46
    baseline_answers = {}
    with open(baseline, "rt", encoding="utf-8") as fin:
        for line in fin:
            data = json.loads(line)
            messages = data.get("messages", [])
            answer_text = ""
            for msg in messages:
                if msg.get("role") == "assistant":
                    content = msg.get("content")
                    answer_text = content.get("answer", "") if isinstance(content, dict) else content
                    break

            baseline_answers[data["uid"]] = answer_text

⚠️ Potential issue | 🟡 Minor

Add defensive handling for None content.

Line 43 may fail to handle the case where content is None. If content is None and not a dict, answer_text would be set to None instead of an empty string, which could cause downstream issues.

🔎 Proposed fix for None handling
             for msg in messages:
                 if msg.get("role") == "assistant":
                     content = msg.get("content")
-                    answer_text = content.get("answer", "") if isinstance(content, dict) else content
+                    if isinstance(content, dict):
+                        answer_text = content.get("answer", "")
+                    else:
+                        answer_text = content if content is not None else ""
                     break

Comment on lines +48 to +53
    with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
        for line in fin:
            data = json.loads(line)
            data["question"] = data.pop("prompt")
            data["baseline_answer"] = baseline_answers[data["uid"]]
            fout.write(json.dumps(data) + "\n")

⚠️ Potential issue | 🟠 Major

Add validation and error handling for data processing.

Lines 51-52 may raise KeyError if the expected keys are missing or if UIDs don't match between the questions and baseline files. This could occur if the upstream data format changes or downloads are corrupted.

🔎 Proposed fix with error handling
     with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
         for line in fin:
             data = json.loads(line)
-            data["question"] = data.pop("prompt")
-            data["baseline_answer"] = baseline_answers[data["uid"]]
+            
+            # Validate expected fields
+            if "prompt" not in data:
+                print(f"Warning: Missing 'prompt' field in question: {data.get('uid', 'unknown')}")
+                continue
+            if "uid" not in data:
+                print(f"Warning: Missing 'uid' field in question data")
+                continue
+            
+            uid = data["uid"]
+            if uid not in baseline_answers:
+                print(f"Warning: No baseline answer found for uid: {uid}")
+                continue
+            
+            data["question"] = data.pop("prompt")
+            data["baseline_answer"] = baseline_answers[uid]
             fout.write(json.dumps(data) + "\n")


Kipok commented Jan 6, 2026

thanks @bzantium ! Did you do any validation that results match officially reported numbers? If so, could you please share the command / summarized output?


bzantium commented Jan 7, 2026

Hi @Kipok,

I have completed the evaluation for Qwen3-32B and Qwen3-30B-A3B using the arena-hard-v2 benchmark. The tests were run using vLLM.

Below is the execution command and the detailed JSON results.

Execution Command

ns eval --model Qwen/[Qwen3-30B-A3B, Qwen3-32B] --benchmarks arena-hard-v2 --server_gpus 2 --server_type vllm --judge_model [o3-mini-25-01-31/gpt-4.1]

Evaluation Results

1) o3-mini-2025-01-31 as judge results

1. Qwen3-32B

{
  "arena-hard-v2": {
    "pass@1": {
      "num_entries": 750,
      "score": 37.88,
      "95_CI": [
        -1.56,
        1.9
      ],
      "invalid_scores": 1,
      "avg_tokens": 5475,
      "gen_seconds": 101
    }
  }
}

2. Qwen3-30B-A3B

{
  "arena-hard-v2": {
    "pass@1": {
      "num_entries": 750,
      "score": 26.15,
      "95_CI": [
        -1.22,
        1.74
      ],
      "invalid_scores": 0,
      "avg_tokens": 5466,
      "gen_seconds": 100
    }
  }
}

2) gpt-4.1 as judge results

1. Qwen3-32B

{
  "arena-hard-v2": {
    "pass@1": {
      "num_entries": 750,
      "score": 40.83,
      "95_CI": [
        -1.75,
        1.84
      ],
      "invalid_scores": 0,
      "avg_tokens": 5475,
      "gen_seconds": 2524
    }
  }
}

2. Qwen3-30B-A3B

{
  "arena-hard-v2": {
    "pass@1": {
      "num_entries": 750,
      "score": 25.2,
      "95_CI": [
        -1.57,
        1.32
      ],
      "invalid_scores": 1,
      "avg_tokens": 5466,
      "gen_seconds": 846
    }
  }
}

Leaderboard Score (gpt-4.1 as judge)

16                                Qwen3-32B        35.8  (-2.1 / +2.2)
17                            Qwen3-30B-A3B        28.7  (-1.4 / +2.1)

Although there is a slight discrepancy between my results and the leaderboard scores, the overall performance aligns well. Considering potential differences in hyperparameters (such as temperature) and inherent variance in generation, I believe these results are sufficiently validated.

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Added Arena-Hard-v2 benchmark support with o3-mini as both baseline model and judge, upgrading from arena-hard v0.1, which used gpt-4-0314.

  • New dataset configuration mirrors existing arena-hard pattern perfectly
  • Data preparation script downloads 751 questions and baseline answers from lm-sys/arena-hard-auto repository
  • Integration with existing arena judge pipeline and metrics system is seamless
  • One logical issue found: missing error handling for potential uid mismatch between questions and baseline answers

Confidence Score: 4/5

  • This PR is safe to merge with one minor fix needed for error handling
  • The implementation closely follows the established arena-hard v0.1 pattern with appropriate model updates. One logical bug was found (KeyError risk) that should be fixed to prevent runtime failures if data sources become misaligned. Otherwise the integration is clean and well-structured.
  • Pay attention to prepare.py - fix the KeyError handling on line 52 before merging

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/arena-hard-v2/__init__.py 5/5 Configuration file for Arena-Hard-v2 benchmark with o3-mini judge settings - follows established pattern from arena-hard v0.1
nemo_skills/dataset/arena-hard-v2/prepare.py 3/5 data preparation script downloads questions and baseline answers - has KeyError risk on line 52 if uid mismatch occurs

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant GitHub as lm-sys/arena-hard-auto
    participant DataDir as Dataset Directory
    participant EvalPipeline as Evaluation Pipeline
    participant Judge as o3-mini Judge
    participant Metrics as Arena Metrics

    User->>PrepareScript: Run prepare.py
    PrepareScript->>GitHub: Download question.jsonl
    GitHub-->>PrepareScript: 751 questions
    PrepareScript->>GitHub: Download o3-mini baseline answers
    GitHub-->>PrepareScript: Baseline model answers
    PrepareScript->>PrepareScript: Parse baseline answers (extract from messages)
    PrepareScript->>PrepareScript: Merge questions with baseline_answer field
    PrepareScript->>DataDir: Write test.jsonl
    
    User->>EvalPipeline: Start evaluation with arena-hard-v2
    EvalPipeline->>DataDir: Load test.jsonl
    EvalPipeline->>EvalPipeline: Generate model answers
    EvalPipeline->>Judge: Send (question, model_answer, baseline_answer)
    Judge->>Judge: Compare answer_1 vs answer_2
    Judge-->>EvalPipeline: Judgement [[A>>B]], [[A>B]], [[A=B]], etc.
    Judge->>Judge: Compare with reversed order (bias check)
    Judge-->>EvalPipeline: Judgement for reversed comparison
    EvalPipeline->>Metrics: Calculate aggregate score
    Metrics-->>User: Final arena metrics (win rate, confidence intervals)

        for line in fin:
            data = json.loads(line)
            data["question"] = data.pop("prompt")
            data["baseline_answer"] = baseline_answers[data["uid"]]

Missing error handling: if a question's uid doesn't exist in baseline_answers, this will raise a KeyError.

Suggested change
- data["baseline_answer"] = baseline_answers[data["uid"]]
+ data["baseline_answer"] = baseline_answers.get(data["uid"], "")


Kipok commented Jan 12, 2026

thanks @bzantium, this is very helpful! We will run a little more validation on our side and can merge after that!

In the meantime, do you want to update the docs to add the command / results? This would need to go in here https://github.com/NVIDIA-NeMo/Skills/blob/main/docs/evaluation/other-benchmarks.md. You can see an example of how this can look, e.g. here https://github.com/NVIDIA-NeMo/Skills/blob/main/docs/evaluation/code.md (rendered page is here https://nvidia-nemo.github.io/Skills/evaluation/code/)

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments

        for line in fin:
            data = json.loads(line)
            data["question"] = data.pop("prompt")
            data["baseline_answer"] = baseline_answers[data["uid"]]

This line will raise a KeyError if a question's uid is not found in the baseline_answers dictionary. The code should use .get() with a default value or add error handling to prevent crashes if the baseline data is incomplete or mismatched with the questions.

Suggested change
- data["baseline_answer"] = baseline_answers[data["uid"]]
+ data["baseline_answer"] = baseline_answers.get(data["uid"], "")

Comment on lines +31 to +32
urllib.request.urlretrieve(URL_QUESTIONS, questions)
urllib.request.urlretrieve(URL_BASELINE, baseline)

Network requests can fail for various reasons (connection issues, 404 errors, server unavailability, etc.), but there's no error handling. If either URL is unreachable or returns an error, the script will crash with an unclear error message. Consider wrapping these calls in a try-except block to provide better error messages to users.

Suggested change
- urllib.request.urlretrieve(URL_QUESTIONS, questions)
- urllib.request.urlretrieve(URL_BASELINE, baseline)
+ try:
+     urllib.request.urlretrieve(URL_QUESTIONS, questions)
+     urllib.request.urlretrieve(URL_BASELINE, baseline)
+ except urllib.error.URLError as e:
+     print(f"Error downloading data: {e}")
+     raise

    with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
        for line in fin:
            data = json.loads(line)
            data["question"] = data.pop("prompt")

Using pop("prompt") will raise a KeyError if the question data doesn't have a "prompt" field. This could happen if the upstream data format changes or is malformed. Consider using .pop("prompt", None) with validation, or .get() with a fallback.

Suggested change
- data["question"] = data.pop("prompt")
+ data["question"] = data.pop("prompt", data.get("question", ""))

Signed-off-by: bzantium <ryumin93@gmail.com>

bzantium commented Jan 13, 2026

@Kipok Sounds good! I've updated the documentation to include the run commands and evaluation results as suggested. I included the benchmark results using the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model. Please let me know if there's anything else needed before the merge.

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 5 comments

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

                    answer_text = content.get("answer", "") if isinstance(content, dict) else content
                    break

            baseline_answers[data["uid"]] = answer_text

This line will raise a KeyError if the baseline data doesn't contain a "uid" field. The script should handle this case gracefully.

Use .get() with a fallback or add error handling:

Suggested change
- baseline_answers[data["uid"]] = answer_text
+ baseline_answers[data.get("uid", "")] = answer_text

Or add validation earlier to ensure required fields exist.

Comment on lines +51 to +52
            data["question"] = data.pop("prompt")
            data["baseline_answer"] = baseline_answers[data["uid"]]

These lines will raise KeyError if the question data is missing "prompt" or "uid" fields, or if a question's uid doesn't exist in baseline_answers. This will cause the script to crash with an unhelpful error message.

Add error handling or use .get() with appropriate defaults:

Suggested change
- data["question"] = data.pop("prompt")
- data["baseline_answer"] = baseline_answers[data["uid"]]
+ prompt = data.pop("prompt", None)
+ if prompt is None:
+     continue  # Skip entries without prompt
+ data["question"] = prompt
+ uid = data.get("uid")
+ if uid not in baseline_answers:
+     continue  # Skip entries without matching baseline
+ data["baseline_answer"] = baseline_answers[uid]


Kipok commented Jan 27, 2026

sorry, been very busy with other random things, but will try to prioritize this verification this week and hopefully we will be able to merge after that

@Kipok Kipok mentioned this pull request Jan 31, 2026

Kipok commented Jan 31, 2026

closing in favor of #1205

@bzantium I found a few issues which I fixed in #1205. Please give it a try and see if it works for you!

@Kipok Kipok closed this Jan 31, 2026
