Add Arena-Hard-v2 benchmark support#1152
Conversation
Greptile Overview

Greptile Summary

This PR adds Arena-Hard-v2 benchmark support by creating a new dataset directory that mirrors the existing arena-hard setup.

What's Added:
Integration Points:
Repository Update:
Critical Issues Found:

Confidence Score: 3/5

Important Files Changed

File Analysis
Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant GitHub as lmarena/arena-hard-auto
    participant Dataset as test.jsonl
    participant EvalPipeline as NeMo Skills Eval
    participant Judge as arena_judge.py
    participant OpenAI as OpenAI API (o3-mini)
    participant Metrics as ArenaMetrics

    User->>PrepareScript: ns prepare_data arena-hard-v2
    PrepareScript->>GitHub: Download question.jsonl
    GitHub-->>PrepareScript: 750 questions
    PrepareScript->>GitHub: Download o3-mini baseline answers
    GitHub-->>PrepareScript: Baseline answers
    PrepareScript->>PrepareScript: Parse baseline answers<br/>Extract assistant messages
    PrepareScript->>PrepareScript: Merge questions with baselines
    PrepareScript->>Dataset: Write test.jsonl<br/>(question + baseline_answer)
    User->>EvalPipeline: ns eval --benchmarks=arena-hard-v2
    EvalPipeline->>Dataset: Read test.jsonl
    EvalPipeline->>EvalPipeline: Generate model responses
    EvalPipeline->>Judge: Process judgements<br/>(generation + baseline_answer)
    Judge->>Judge: Create gen-base comparison<br/>(answer_1=generation, answer_2=baseline)
    Judge->>Judge: Create base-gen comparison<br/>(answer_1=baseline, answer_2=generation)
    Judge->>OpenAI: Judge gen-base pair
    OpenAI-->>Judge: Judgement [[A>>B]], [[A>B]], etc.
    Judge->>OpenAI: Judge base-gen pair (reversed)
    OpenAI-->>Judge: Judgement [[B>>A]], [[B>A]], etc.
    Judge-->>EvalPipeline: Return both judgements
    EvalPipeline->>Metrics: Calculate arena score
    Metrics->>Metrics: Aggregate pairwise comparisons
    Metrics-->>User: Final score with 95% CI
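The last two diagram steps (aggregate pairwise comparisons, report a final score with a 95% CI) can be sketched as a simple bootstrap. This is an illustrative sketch only, not the actual ArenaMetrics code; the verdict-to-score mapping and function name are assumptions.

```python
import random

# Hypothetical mapping from judge verdicts to per-game scores for the
# candidate model (1.0 = decisive win, 0.5 = tie); not the real ArenaMetrics table.
VERDICT_SCORES = {"A>>B": 1.0, "A>B": 0.75, "A=B": 0.5, "B>A": 0.25, "B>>A": 0.0}

def arena_score(verdicts, n_boot=1000, seed=0):
    """Mean win score plus a bootstrap 95% CI over per-question verdicts."""
    scores = [VERDICT_SCORES[v] for v in verdicts]
    mean = sum(scores) / len(scores)
    rng = random.Random(seed)
    boots = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_boot)
    )
    lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
    return mean, (lo, hi)

mean, ci = arena_score(["A>B", "A>>B", "A=B", "B>A"] * 50)
```

The real pipeline additionally merges the forward and reversed judgements per question before aggregating, but the CI construction follows the same resampling idea.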
Additional Comments (1)

- nemo_skills/dataset/arena-hard-v2/prepare.py, line 52 (link): logic: if a question has a uid that doesn't exist in baseline_answers, this will raise KeyError. Add error handling.
2 files reviewed, 1 comment
📝 Walkthrough

Adds a new arena-hard-v2 dataset module with configuration constants for evaluation setup and a data preparation script that downloads question and baseline answer data, then merges them into a test dataset file.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🤖 Fix all issues with AI Agents
In @nemo_skills/dataset/arena-hard-v2/prepare.py:
- Around line 34-46: The loop populating baseline_answers can set answer_text to
None when msg["content"] is None; update the message-handling in the messages
loop (the variables: messages, msg, content, answer_text, baseline_answers) so
that after extracting content you treat None as an empty string—for example, if
content is a dict use content.get("answer", ""), otherwise coerce falsy/None
content to "" before assigning to answer_text—then proceed with the existing
break and baseline_answers[data["uid"]] assignment.
- Around line 48-53: Wrap the per-line processing in try/except to handle
JSONDecodeError and KeyError and validate required keys: before using data,
ensure 'prompt' and 'uid' exist on the parsed dict and that baseline_answers
contains data['uid']; if any check fails, log a warning (including the offending
line or uid) and skip the record instead of raising; when valid, map
data["question"] = data.pop("prompt") and set data["baseline_answer"] =
baseline_answers[data["uid"]] as now; keep processing remaining lines so a
single bad entry or malformed JSON does not abort the whole run.
- Around line 25-32: Wrap the network downloads for URL_QUESTIONS and
URL_BASELINE in robust error handling: replace the direct
urllib.request.urlretrieve calls with a download routine that uses
urllib.request.urlopen(..., timeout=10) (or similar) and writes to the target
files inside a try/except block catching URLError/HTTPError/timeout exceptions,
logging or printing a clear error and exiting non‑zero on failure; also remove
or adjust the unnecessary data_dir.mkdir(...) call since data_dir is the script
parent and should already exist; reference URL_QUESTIONS, URL_BASELINE,
urllib.request.urlopen/urlretrieve, and data_dir in the change.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_skills/dataset/arena-hard-v2/__init__.py
nemo_skills/dataset/arena-hard-v2/prepare.py
🧰 Additional context used
🪛 Ruff (0.14.10)
nemo_skills/dataset/arena-hard-v2/prepare.py
31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
32-32: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🔇 Additional comments (1)
nemo_skills/dataset/arena-hard-v2/__init__.py (1)
22-27: No action needed. The model identifier "o3-mini-2025-01-31" is confirmed as available on OpenAI's API as of January 2026, and the endpoint is correct.

Likely an incorrect or invalid review comment.
if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)
    questions = str(data_dir / "question.jsonl")
    baseline = str(data_dir / "o3-mini-2025-01-31.jsonl")
    output_file = str(data_dir / "test.jsonl")
    urllib.request.urlretrieve(URL_QUESTIONS, questions)
    urllib.request.urlretrieve(URL_BASELINE, baseline)
Add error handling for network operations.
The download operations lack error handling and timeout configuration, which could cause the script to hang indefinitely or crash on network failures.
🔎 Proposed fix with error handling and timeout
+import socket
+import urllib.error
+
 if __name__ == "__main__":
     data_dir = Path(__file__).absolute().parent
-    data_dir.mkdir(exist_ok=True)
     questions = str(data_dir / "question.jsonl")
     baseline = str(data_dir / "o3-mini-2025-01-31.jsonl")
     output_file = str(data_dir / "test.jsonl")
-    urllib.request.urlretrieve(URL_QUESTIONS, questions)
-    urllib.request.urlretrieve(URL_BASELINE, baseline)
+
+    try:
+        # Set a reasonable timeout for downloads
+        socket.setdefaulttimeout(30)
+        urllib.request.urlretrieve(URL_QUESTIONS, questions)
+        urllib.request.urlretrieve(URL_BASELINE, baseline)
+    except (urllib.error.URLError, OSError) as e:
+        print(f"Error downloading files: {e}")
+        raise

Note: Line 27 (data_dir.mkdir(exist_ok=True)) was removed as data_dir is the script's parent directory, which must already exist.
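The same fix can be packaged as a small helper built on urllib.request.urlopen with an explicit per-request timeout, in the spirit of the review suggestion. This is a sketch under those assumptions, not code from the PR; the helper name is made up.

```python
import sys
import urllib.error
import urllib.request

def download(url: str, dest: str, timeout: float = 30.0) -> None:
    """Stream url to dest, exiting non-zero on network errors or timeouts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
            while chunk := resp.read(64 * 1024):  # read in 64 KiB chunks
                out.write(chunk)
    except (urllib.error.URLError, OSError) as e:
        print(f"Error downloading {url}: {e}", file=sys.stderr)
        sys.exit(1)
```

prepare.py's main block would then call download(URL_QUESTIONS, questions) and download(URL_BASELINE, baseline), with no process-wide socket default needed.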
baseline_answers = {}
with open(baseline, "rt", encoding="utf-8") as fin:
    for line in fin:
        data = json.loads(line)
        messages = data.get("messages", [])
        answer_text = ""
        for msg in messages:
            if msg.get("role") == "assistant":
                content = msg.get("content")
                answer_text = content.get("answer", "") if isinstance(content, dict) else content
                break

        baseline_answers[data["uid"]] = answer_text
Add defensive handling for None content.
Line 43 does not handle the case where content is None: since None is not a dict, answer_text would be set to None instead of an empty string, which could cause downstream issues.
🔎 Proposed fix for None handling

     for msg in messages:
         if msg.get("role") == "assistant":
             content = msg.get("content")
-            answer_text = content.get("answer", "") if isinstance(content, dict) else content
+            if isinstance(content, dict):
+                answer_text = content.get("answer", "")
+            else:
+                answer_text = content if content is not None else ""
             break

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

baseline_answers = {}
with open(baseline, "rt", encoding="utf-8") as fin:
    for line in fin:
        data = json.loads(line)
        messages = data.get("messages", [])
        answer_text = ""
        for msg in messages:
            if msg.get("role") == "assistant":
                content = msg.get("content")
                answer_text = content.get("answer", "") if isinstance(content, dict) else content
                break
        baseline_answers[data["uid"]] = answer_text

After:

baseline_answers = {}
with open(baseline, "rt", encoding="utf-8") as fin:
    for line in fin:
        data = json.loads(line)
        messages = data.get("messages", [])
        answer_text = ""
        for msg in messages:
            if msg.get("role") == "assistant":
                content = msg.get("content")
                if isinstance(content, dict):
                    answer_text = content.get("answer", "")
                else:
                    answer_text = content if content is not None else ""
                break
        baseline_answers[data["uid"]] = answer_text
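The None-tolerant extraction discussed above can be exercised end-to-end on a synthetic record shaped like a line of the baseline file. The record contents here are invented for illustration; only the messages/role/content layout reflects the file format the script parses.

```python
import json

def extract_assistant_answer(record: dict) -> str:
    """Return the first assistant message's text, coercing None content to ""."""
    for msg in record.get("messages", []):
        if msg.get("role") == "assistant":
            content = msg.get("content")
            if isinstance(content, dict):
                return content.get("answer", "")
            return content or ""  # None (or "") becomes ""
    return ""

line = json.dumps({
    "uid": "q1",
    "messages": [
        {"role": "user", "content": "What is 6 * 7?"},
        {"role": "assistant", "content": {"answer": "42"}},
    ],
})
answer = extract_assistant_answer(json.loads(line))  # → "42"
empty = extract_assistant_answer(
    {"messages": [{"role": "assistant", "content": None}]}
)  # → "" rather than None
```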
with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
    for line in fin:
        data = json.loads(line)
        data["question"] = data.pop("prompt")
        data["baseline_answer"] = baseline_answers[data["uid"]]
        fout.write(json.dumps(data) + "\n")
Add validation and error handling for data processing.
Lines 51-52 may raise KeyError if the expected keys are missing or if UIDs don't match between the questions and baseline files. This could occur if the upstream data format changes or downloads are corrupted.
🔎 Proposed fix with error handling
 with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
     for line in fin:
         data = json.loads(line)
-        data["question"] = data.pop("prompt")
-        data["baseline_answer"] = baseline_answers[data["uid"]]
+
+        # Validate expected fields
+        if "prompt" not in data:
+            print(f"Warning: Missing 'prompt' field in question: {data.get('uid', 'unknown')}")
+            continue
+        if "uid" not in data:
+            print("Warning: Missing 'uid' field in question data")
+            continue
+
+        uid = data["uid"]
+        if uid not in baseline_answers:
+            print(f"Warning: No baseline answer found for uid: {uid}")
+            continue
+
+        data["question"] = data.pop("prompt")
+        data["baseline_answer"] = baseline_answers[uid]
         fout.write(json.dumps(data) + "\n")

📝 Committable suggestion
Before:

with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
    for line in fin:
        data = json.loads(line)
        data["question"] = data.pop("prompt")
        data["baseline_answer"] = baseline_answers[data["uid"]]
        fout.write(json.dumps(data) + "\n")

After:

with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
    for line in fin:
        data = json.loads(line)
        # Validate expected fields
        if "prompt" not in data:
            print(f"Warning: Missing 'prompt' field in question: {data.get('uid', 'unknown')}")
            continue
        if "uid" not in data:
            print("Warning: Missing 'uid' field in question data")
            continue
        uid = data["uid"]
        if uid not in baseline_answers:
            print(f"Warning: No baseline answer found for uid: {uid}")
            continue
        data["question"] = data.pop("prompt")
        data["baseline_answer"] = baseline_answers[uid]
        fout.write(json.dumps(data) + "\n")
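The validated merge described above can be demonstrated on in-memory sample lines. The uids and prompts below are invented for illustration; q2 deliberately has no baseline answer so the skip path is exercised.

```python
import json

questions_jsonl = [
    '{"uid": "q1", "prompt": "What is 2+2?"}',
    '{"uid": "q2", "prompt": "Name a prime."}',  # no baseline answer for q2
]
baseline_answers = {"q1": "4"}

merged = []
for line in questions_jsonl:
    data = json.loads(line)
    if "prompt" not in data or "uid" not in data:
        continue  # skip malformed records instead of raising KeyError
    if data["uid"] not in baseline_answers:
        continue  # skip questions with no matching baseline
    data["question"] = data.pop("prompt")
    data["baseline_answer"] = baseline_answers[data["uid"]]
    merged.append(json.dumps(data))
# merged now holds only the q1 record
```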
thanks @bzantium! Did you do any validation that results match officially reported numbers? If so, could you please share the command / summarized output?
Hi @Kipok, I have completed the evaluation for Qwen3-32B and Qwen3-30B-A3B using the arena-hard-v2 benchmark. Below is the execution command and the detailed JSON results.

Execution Command

ns eval --model Qwen/[Qwen3-30B-A3B, Qwen3-32B] --benchmarks arena-hard-v2 --server_gpus 2 --server_type vllm --judge_model [o3-mini-25-01-31/gpt-4.1]

Evaluation Results

1)
Greptile Overview
Greptile Summary
Added Arena-Hard-v2 benchmark support with o3-mini as both baseline model and judge, upgrading from arena-hard v0.1 which used gpt-4-0314.
- New dataset configuration mirrors existing arena-hard pattern perfectly
- Data preparation script downloads 751 questions and baseline answers from lm-sys/arena-hard-auto repository
- Integration with existing arena judge pipeline and metrics system is seamless
- One logical issue found: missing error handling for potential uid mismatch between questions and baseline answers
Confidence Score: 4/5
- This PR is safe to merge with one minor fix needed for error handling
- The implementation closely follows the established arena-hard v0.1 pattern with appropriate model updates. One logical bug was found (KeyError risk) that should be fixed to prevent runtime failures if data sources become misaligned. Otherwise the integration is clean and well-structured.
Pay attention to prepare.py: fix the KeyError handling on line 52 before merging
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/arena-hard-v2/__init__.py | 5/5 | configuration file for Arena-Hard-v2 benchmark with o3-mini judge settings - follows established pattern from arena-hard v0.1 |
| nemo_skills/dataset/arena-hard-v2/prepare.py | 3/5 | data preparation script downloads questions and baseline answers - has KeyError risk on line 52 if uid mismatch occurs |
Sequence Diagram
sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant GitHub as lm-sys/arena-hard-auto
    participant DataDir as Dataset Directory
    participant EvalPipeline as Evaluation Pipeline
    participant Judge as o3-mini Judge
    participant Metrics as Arena Metrics

    User->>PrepareScript: Run prepare.py
    PrepareScript->>GitHub: Download question.jsonl
    GitHub-->>PrepareScript: 751 questions
    PrepareScript->>GitHub: Download o3-mini baseline answers
    GitHub-->>PrepareScript: Baseline model answers
    PrepareScript->>PrepareScript: Parse baseline answers (extract from messages)
    PrepareScript->>PrepareScript: Merge questions with baseline_answer field
    PrepareScript->>DataDir: Write test.jsonl
    User->>EvalPipeline: Start evaluation with arena-hard-v2
    EvalPipeline->>DataDir: Load test.jsonl
    EvalPipeline->>EvalPipeline: Generate model answers
    EvalPipeline->>Judge: Send (question, model_answer, baseline_answer)
    Judge->>Judge: Compare answer_1 vs answer_2
    Judge-->>EvalPipeline: Judgement [[A>>B]], [[A>B]], [[A=B]], etc.
    Judge->>Judge: Compare with reversed order (bias check)
    Judge-->>EvalPipeline: Judgement for reversed comparison
    EvalPipeline->>Metrics: Calculate aggregate score
    Metrics-->>User: Final arena metrics (win rate, confidence intervals)
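The two judging passes in the diagram (forward and reversed answer order) exist to cancel the judge's position bias. Below is a sketch of building both orderings and parsing the [[A>B]]-style verdicts; the function names and the flip table are illustrative assumptions about the general technique, not the actual arena_judge.py internals.

```python
import re

# Map a verdict from the reversed game back to the generation-as-A view.
FLIP = {"A>>B": "B>>A", "A>B": "B>A", "A=B": "A=B", "B>A": "A>B", "B>>A": "A>>B"}

def make_pairs(question, generation, baseline):
    """Build both orderings so each question is judged twice."""
    gen_base = {"question": question, "answer_1": generation, "answer_2": baseline}
    base_gen = {"question": question, "answer_1": baseline, "answer_2": generation}
    return gen_base, base_gen

def parse_verdict(judgement: str):
    """Extract the bracketed label, e.g. '[[A>>B]]' -> 'A>>B'."""
    m = re.search(r"\[\[([AB][>=]{1,2}[AB])\]\]", judgement)
    return m.group(1) if m else None

gen_base, base_gen = make_pairs("Q", "model answer", "baseline answer")
fwd = parse_verdict("Verdict: [[A>B]]")        # forward game
rev = FLIP[parse_verdict("Verdict: [[B>A]]")]  # reversed game, flipped back
# fwd == rev == "A>B": both games say the generation won narrowly
```

Averaging the forward score with the flipped reversed score gives a position-debiased result per question.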
    for line in fin:
        data = json.loads(line)
        data["question"] = data.pop("prompt")
        data["baseline_answer"] = baseline_answers[data["uid"]]
missing error handling - if a question's uid doesn't exist in baseline_answers, this will raise KeyError
- data["baseline_answer"] = baseline_answers[data["uid"]]
+ data["baseline_answer"] = baseline_answers.get(data["uid"], "")
thanks @bzantium, this is very helpful! We will run a little more validation on our side and can merge after that! In the meantime, do you want to update docs to add the command / results? Would need to update in here https://github.com/NVIDIA-NeMo/Skills/blob/main/docs/evaluation/other-benchmarks.md. You can see an example of how this can look like e.g. here https://github.com/NVIDIA-NeMo/Skills/blob/main/docs/evaluation/code.md (rendered page is here https://nvidia-nemo.github.io/Skills/evaluation/code/)
force-pushed from 420d789 to 11217e0 (Compare)
    for line in fin:
        data = json.loads(line)
        data["question"] = data.pop("prompt")
        data["baseline_answer"] = baseline_answers[data["uid"]]
this line will raise a KeyError if a question's uid is not found in baseline_answers dictionary. The code should use .get() with a default value or add error handling to prevent crashes if the baseline data is incomplete or mismatched with the questions
- data["baseline_answer"] = baseline_answers[data["uid"]]
+ data["baseline_answer"] = baseline_answers.get(data["uid"], "")
    urllib.request.urlretrieve(URL_QUESTIONS, questions)
    urllib.request.urlretrieve(URL_BASELINE, baseline)
network requests can fail for various reasons (connection issues, 404 errors, server unavailability, etc.) but there's no error handling. If either URL is unreachable or returns an error, the script will crash with an unclear error message. Consider wrapping these calls in a try-except block to provide better error messages to users
- urllib.request.urlretrieve(URL_QUESTIONS, questions)
- urllib.request.urlretrieve(URL_BASELINE, baseline)
+ try:
+     urllib.request.urlretrieve(URL_QUESTIONS, questions)
+     urllib.request.urlretrieve(URL_BASELINE, baseline)
+ except urllib.error.URLError as e:
+     print(f"Error downloading data: {e}")
+     raise
with open(questions, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
    for line in fin:
        data = json.loads(line)
        data["question"] = data.pop("prompt")
using pop("prompt") will raise a KeyError if the question data doesn't have a "prompt" field. This could happen if the upstream data format changes or is malformed. Consider using .pop("prompt", None) with validation, or .get() with a fallback
- data["question"] = data.pop("prompt")
+ data["question"] = data.pop("prompt", data.get("question", ""))
Signed-off-by: bzantium <ryumin93@gmail.com>
force-pushed from 11217e0 to 9ab83b6 (Compare)
@Kipok Sounds good! I've updated the documentation to include the run commands and evaluation results as suggested. I included the benchmark results using the
                answer_text = content.get("answer", "") if isinstance(content, dict) else content
                break

        baseline_answers[data["uid"]] = answer_text
This line will raise a KeyError if the baseline data doesn't contain a "uid" field. The script should handle this case gracefully.
Use .get() with a fallback or add error handling:
- baseline_answers[data["uid"]] = answer_text
+ baseline_answers[data.get("uid", "")] = answer_text
Or add validation earlier to ensure required fields exist.
        data["question"] = data.pop("prompt")
        data["baseline_answer"] = baseline_answers[data["uid"]]
These lines will raise KeyError if the question data is missing "prompt" or "uid" fields, or if a question's uid doesn't exist in baseline_answers. This will cause the script to crash with an unhelpful error message.
Add error handling or use .get() with appropriate defaults:
- data["question"] = data.pop("prompt")
- data["baseline_answer"] = baseline_answers[data["uid"]]
+ prompt = data.pop("prompt", None)
+ if prompt is None:
+     continue  # Skip entries without prompt
+ data["question"] = prompt
+ uid = data.get("uid")
+ if uid not in baseline_answers:
+     continue  # Skip entries without matching baseline
+ data["baseline_answer"] = baseline_answers[uid]
|
sorry, been very busy with other random things, but will try to prioritize this verification this week and hopefully we will be able to merge after that
Summary
This PR adds support for the Arena-Hard-v2 benchmark, enabling evaluation of chat models using LLM-as-a-judge methodology with o3-mini as the judge.
What's Changed
- arena-hard-v2 dataset with 750 test questions

Files Added

- nemo_skills/dataset/arena-hard-v2/__init__.py - Dataset configuration with judge pipeline settings
- nemo_skills/dataset/arena-hard-v2/prepare.py - Data preparation script

Technical Details
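The thread never shows the __init__.py contents, only that it holds judge pipeline settings and the o3-mini-2025-01-31 identifier (per the review notes above). A purely hypothetical sketch of what such a config module might hold follows; every constant name and value is an assumption for illustration, not the PR's actual code.

```python
# Hypothetical dataset config; constant names and values are illustrative
# assumptions, not the real nemo_skills settings.
JUDGE_MODEL = "o3-mini-2025-01-31"  # same model as the baseline answers file
JUDGE_SERVER_TYPE = "openai"        # judgements go through the OpenAI API
BASELINE_FILE = "o3-mini-2025-01-31.jsonl"
NUM_QUESTIONS = 750
```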
Testing
References
fixes: #1151