Add arena-hard v2 by Kipok · Pull Request #1205 · NVIDIA-NeMo/Skills

Kipok · 2026-01-31T02:20:00Z

Adds the following fixes on top of #1152:

- Handle the different baselines for hard prompts and writing prompts
- Add a judge prompt for writing that doesn't ask the judge to give its own response first
- Add category to metrics reporting and tests

Summary by CodeRabbit

Release Notes

New Features
- Added evaluation support for the arena-hard-v2 benchmark with per-category metrics tracking and aggregation
- Implemented category-specific judging for improved evaluation accuracy across different task types including hard prompts and creative writing

Documentation
- Added comprehensive setup and execution guides for arena-hard and arena-hard-v2 evaluations, including sample commands, environment configuration, and results examples


Signed-off-by: bzantium <ryumin93@gmail.com>

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps

_{1 file reviewed, 1 comment}


greptile-apps · 2026-01-31T02:22:28Z

nemo_skills/inference/eval/arena_judge.py

+            # will fail if category not in category_prompts as this is unexpected
+            prompt = self.category_prompts[category]


will raise KeyError for hard_prompt category since category_prompts only contains creative_writing

Suggested change

-            # will fail if category not in category_prompts as this is unexpected
-            prompt = self.category_prompts[category]
+        else:
+            # use category-specific prompt if available, otherwise fall back to default
+            prompt = self.category_prompts.get(category, self.prompt)
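
The fallback behavior this comment asks for can be sketched in isolation. This is a minimal illustration, not the code in `arena_judge.py`: `select_prompt`, `DEFAULT_PROMPT`, `CATEGORY_PROMPTS`, and the prompt strings are hypothetical stand-ins for `self.prompt` and `self.category_prompts`.

```python
# Hypothetical stand-ins for self.prompt and self.category_prompts.
DEFAULT_PROMPT = "default judge prompt"
CATEGORY_PROMPTS = {"creative_writing": "creative writing judge prompt"}


def select_prompt(data_point: dict) -> str:
    """Return the category-specific prompt, or the default if none is mapped."""
    category = data_point.get("category")
    # dict.get never raises: a missing, None, or unmapped category
    # falls through to the default prompt.
    return CATEGORY_PROMPTS.get(category, DEFAULT_PROMPT)


print(select_prompt({"category": "creative_writing"}))
print(select_prompt({"category": "hard_prompt"}))
print(select_prompt({}))
```

The trade-off is that an unexpected category (for example, a typo in the dataset) is judged with the default prompt and no error is raised, which may or may not be desirable.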

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps

_{3 files reviewed, 3 comments}


greptile-apps · 2026-01-31T02:27:43Z

nemo_skills/evaluation/metrics/arena_metrics.py

        super().update(predictions)
        self.scores.append([])
        self.agg_mode = f"pass@{len(predictions)}"
+
+        # Track category for per-category scoring (defaults to None for v1 compatibility)
+        category = predictions[0].get("category")
+        self.categories.append(category)
+


ArenaMetrics.update() now unconditionally appends predictions[0].get("category") to self.categories, but get_metrics() only uses per-category aggregation when len(set(self.categories)) > 1. If a mixed run includes some rows without category (or category=None) alongside real categories, unique_categories becomes {None, "..."} and you’ll emit a category_None bucket in metrics, which is usually unintended and makes downstream consumers brittle.

This shows up when evaluating older/partially-prepared datasets or when a pipeline forgets to propagate category for some items.

A safer approach is to only create per-category buckets for non-None categories (or to require that all entries have a non-null category before enabling breakdown).
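
A rough sketch of the "skip `None`" approach, under simplifying assumptions: `per_category_means` is a hypothetical helper, scores are plain floats, and `statistics.mean` stands in for the repository's `get_aggregate_score`, whose output format is not shown here.

```python
from collections import defaultdict
from statistics import mean


def per_category_means(scores, categories):
    """Group scores by category, ignoring entries whose category is None."""
    buckets = defaultdict(list)
    for score, category in zip(scores, categories):
        if category is not None:
            buckets[category].append(score)
    # Only report a breakdown when more than one real category is present.
    if len(buckets) <= 1:
        return {}
    return {
        f"category_{name}": {"num_entries": len(vals), "mean": mean(vals)}
        for name, vals in buckets.items()
    }


result = per_category_means(
    [1.0, 0.0, 1.0, 0.5],
    ["hard_prompt", "creative_writing", None, "hard_prompt"],
)
print(sorted(result))
```

With this shape, a run that mixes uncategorized rows with a single real category produces no breakdown at all, and a `category_None` key is never emitted.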

greptile-apps · 2026-01-31T02:27:44Z

nemo_skills/inference/eval/arena_judge.py

+    def setup_prompt(self):
+        if self.cfg.prompt_format == "openai":
+            return None
+


setup_prompt() returns None for prompt_format == "openai", but fill_prompt() for openai delegates to super().fill_prompt(...) without any category-specific override. If arena-hard-v2 is run with prompt_format=openai, the creative_writing category will silently use the default judge prompt (and may reintroduce the “judge generates its own answer first” behavior this PR is trying to avoid).

If category-specific judging is required for v2, this likely needs an equivalent branch for the OpenAI prompt path (e.g., selecting different message templates/configs per category) or explicitly disallowing openai format for v2.
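
The "explicitly disallow" option could look roughly like the guard below. Everything here is an assumption for illustration: `check_prompt_format`, the `CATEGORIES_WITH_CUSTOM_PROMPT` set, and the `"ns"` value used as an example of a non-openai format are not taken from the PR.

```python
# Hypothetical: categories that need their own judge prompt.
CATEGORIES_WITH_CUSTOM_PROMPT = {"creative_writing"}


def check_prompt_format(prompt_format: str, categories: set) -> None:
    """Reject the openai format when category-specific judge prompts are needed."""
    needs_custom = categories & CATEGORIES_WITH_CUSTOM_PROMPT
    if prompt_format == "openai" and needs_custom:
        raise ValueError(
            "prompt_format='openai' has no category-specific judge prompt "
            f"for {sorted(needs_custom)}"
        )


# Passes: no category in this run needs a custom prompt.
check_prompt_format("openai", {"hard_prompt"})
# Passes: a non-openai format can select prompts per category.
check_prompt_format("ns", {"hard_prompt", "creative_writing"})
```

Failing at setup time would surface the mismatch before any judge requests are sent, rather than silently scoring creative writing with the default prompt.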

greptile-apps · 2026-01-31T02:27:45Z

nemo_skills/dataset/arena-hard-v2/prepare.py

+    # Create test.jsonl with category-specific baseline answers
+    with open(questions_file, "rt", encoding="utf-8") as fin, open(output_file, "wt", encoding="utf-8") as fout:
+        for line in fin:
+            data = json.loads(line)
+            data["question"] = data.pop("prompt")
+            category = data["category"]
+            data["baseline_answer"] = baseline_answers[data["uid"]][category]
+            fout.write(json.dumps(data) + "\n")


prepare.py assumes every question uid exists in baseline_answers and that it has an entry for the question’s category (baseline_answers[data["uid"]][category]). If the upstream dataset adds a new category, or if a baseline file is missing/partial, this will raise KeyError and stop dataset preparation.

Given v2 explicitly has multiple baselines by category, it would be safer to fail with a clearer error that prints the missing (uid, category) (or to handle unknown categories explicitly) so users can debug mismatched dataset/baseline versions.
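
One way to get a clearer failure is to collect every missing `(uid, category)` pair up front and report them together. The sketch below assumes the `baseline_answers[uid][category]` layout quoted above; `find_missing_baselines` and the sample records are hypothetical.

```python
def find_missing_baselines(questions, baseline_answers):
    """Return every (uid, category) pair that has no baseline answer."""
    missing = []
    for question in questions:
        uid, category = question["uid"], question["category"]
        # .get(uid, {}) covers both an unknown uid and an unknown category.
        if category not in baseline_answers.get(uid, {}):
            missing.append((uid, category))
    return missing


# Hypothetical sample data: the second question has no baseline for its category.
questions = [
    {"uid": "a1", "category": "hard_prompt"},
    {"uid": "b2", "category": "creative_writing"},
]
baseline_answers = {"a1": {"hard_prompt": "..."}, "b2": {}}

missing = find_missing_baselines(questions, baseline_answers)
if missing:
    print(f"Missing baseline answers for (uid, category): {missing}")
```

A preparation script could raise `ValueError` with the same message instead of printing, so a mismatched dataset/baseline version stops the run with all offending pairs listed at once.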

coderabbitai · 2026-01-31T02:30:07Z

📝 Walkthrough


This PR introduces arena-hard-v2 benchmark evaluation support, including new data preparation scripts, dataset configuration, per-category metrics tracking in arena metrics, creative writing prompt support in the arena judge, and comprehensive test coverage for the new functionality.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation `docs/evaluation/other-benchmarks.md` | Added evaluation setup details for arena-hard and arena-hard-v2, including default judge models, data preparation commands, execution examples with environment variables, sample results with nested category metrics, and vllm server configuration. |
| Arena-Hard-V2 Dataset `nemo_skills/dataset/arena-hard-v2/__init__.py`, `nemo_skills/dataset/arena-hard-v2/prepare.py` | Introduced new arena-hard-v2 module with default evaluation configuration (dataset_group, metrics_type, generation args, judge pipeline) and automated data preparation script that downloads questions and category-specific baselines, extracts answers, and generates enriched test dataset. |
| Arena-Hard Data Source `nemo_skills/dataset/arena-hard/prepare.py` | Updated URL constants for questions and baseline data to point to new repository location (lmarena instead of lm-sys). |
| Arena Metrics Enhancement `nemo_skills/evaluation/metrics/arena_metrics.py` | Added per-category metrics tracking and aggregation; categories are now extracted from predictions and metrics are computed separately for each category when multiple categories exist. |
| Arena Judge & Prompts `nemo_skills/inference/eval/arena_judge.py`, `nemo_skills/prompt/config/judge/arena_creative.yaml` | Introduced category-aware prompt loading in arena judge with creative writing prompt override; added setup_prompt and fill_prompt methods to handle category-specific prompt selection and rendering. |
| Arena Metrics Tests `tests/test_arena_metrics.py` | Added comprehensive test suite validating per-category scoring for arena-hard-v2, single-category handling for v1, data without categories, score parsing, and invalid score handling. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

run GPU tests

Suggested reviewers

ekmb
titu1994

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 64.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add arena-hard v2' directly and clearly describes the main change: introducing arena-hard v2 support to the codebase. |


coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@nemo_skills/inference/eval/arena_judge.py`:
- Around line 104-110: The code currently does direct lookup
self.category_prompts[category], which raises KeyError for categories like
"hard_prompt" that are present but unmapped; change the selection to safely
fallback to the default prompt by using a membership check or dict.get so that
prompt = self.category_prompts.get(category, self.prompt) (or if you prefer
check `if category in self.category_prompts` then assign accordingly) ensuring
data_point.get("category"), self.category_prompts, and self.prompt are used as
the referenced symbols.

🧹 Nitpick comments (4)

nemo_skills/evaluation/metrics/arena_metrics.py (1)
103-105: Consider adding strict=True to the zip call for defensive programming.

Since self.scores and self.categories are populated together in update(), their lengths should always match. Adding strict=True would catch any future bugs where they diverge.
Proposed fix
-        for score, category in zip(self.scores, self.categories):
+        for score, category in zip(self.scores, self.categories, strict=True):
docs/evaluation/other-benchmarks.md (1)
45-49: Add language specifier to the fenced code block.

The code block at line 45 is missing a language specifier for the results table.
Proposed fix
-```
+```text
 ------------------------------------------- arena-hard -------------------------------------------
tests/test_arena_metrics.py (1)

31-154: Consider adding arena-hard-v2 to GPU and slurm test suites.

The new per-category scoring logic and category-specific baselines (hard_prompt and creative_writing) represent significant evaluation changes. Based on CONTRIBUTING.md guidance, new benchmarks with complicated evaluation/metrics logic should be added to slurm tests for comprehensive evaluation, and GPU tests should be run in CI.
nemo_skills/dataset/arena-hard-v2/prepare.py (1)
31-38: Returning empty string silently masks missing assistant messages.

If no assistant message is found, returning "" will silently propagate an empty baseline answer. Per coding guidelines, errors should surface rather than fail silently. Consider raising an error or removing the fallback entirely to let the StopIteration or KeyError surface naturally.
♻️ Proposed refactor to let errors surface
 def extract_answer_text(data):
     """Extract the answer text from the baseline model's response format."""
     messages = data["messages"]
     for msg in messages:
         if msg["role"] == "assistant":
             content = msg["content"]
             return content["answer"] if isinstance(content, dict) else content
-    return ""
+    raise ValueError(f"No assistant message found in baseline data: {data.get('uid', 'unknown')}")
As per coding guidelines: "Follow the Zen of Python principles: prioritize... letting errors surface rather than silently failing".

coderabbitai · 2026-01-31T02:30:10Z

nemo_skills/inference/eval/arena_judge.py

+        # Select the appropriate prompt based on category. If not defined, forcing fall-back to default prompt
+        category = data_point.get("category")
+        if not category:
+            prompt = self.prompt
+        else:
+            # will fail if category not in category_prompts as this is unexpected
+            prompt = self.category_prompts[category]


⚠️ Potential issue | 🔴 Critical

Bug: hard_prompt category will raise KeyError instead of falling back to default prompt.

The current logic falls back to default prompt only when category is falsy (None/empty). However, for categories like hard_prompt that exist but aren't in category_prompts, line 110 will raise a KeyError.

Based on the comment at line 75-76 ("default prompt used for most categories including hard_prompt"), the intent is to fall back to default for unmapped categories.

Proposed fix

         # Select the appropriate prompt based on category. If not defined, forcing fall-back to default prompt
         category = data_point.get("category")
-        if not category:
-            prompt = self.prompt
-        else:
-            # will fail if category not in category_prompts as this is unexpected
-            prompt = self.category_prompts[category]
+        if category and category in self.category_prompts:
+            prompt = self.category_prompts[category]
+        else:
+            prompt = self.prompt


Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps

_{3 files reviewed, 3 comments}


greptile-apps · 2026-01-31T05:21:54Z

nemo_skills/dataset/arena-hard-v2/prepare.py

+            data = json.loads(line)
+            data["question"] = data.pop("prompt")
+            category = data["category"]
+            data["baseline_answer"] = baseline_answers[data["uid"]][category]


assumes every uid has a baseline for its category, will raise KeyError if baseline data is missing/partial

Suggested change

-            data["baseline_answer"] = baseline_answers[data["uid"]][category]
+            if data["uid"] not in baseline_answers or category not in baseline_answers[data["uid"]]:
+                raise ValueError(f"Missing baseline answer for uid={data['uid']}, category={category}")
+            data["baseline_answer"] = baseline_answers[data["uid"]][category]

greptile-apps · 2026-01-31T05:21:55Z

nemo_skills/inference/eval/arena_judge.py

+            prompt = self.prompt
+        else:
+            # will fail if category not in category_prompts as this is unexpected
+            prompt = self.category_prompts[category]


will raise KeyError if category isn't in category_prompts (e.g., unknown category from dataset)

Per CONTRIBUTING.md guidelines: "Don't be overly defensive" - let it fail with a clear error. However, the error message should indicate which category is missing.

Suggested change

-            prompt = self.category_prompts[category]
+            # will fail if category not in category_prompts as this is unexpected
+            if category not in self.category_prompts:
+                raise KeyError(f"Category '{category}' not found in category_prompts. Available: {list(self.category_prompts.keys())}")
+            prompt = self.category_prompts[category]


greptile-apps · 2026-01-31T05:21:56Z

nemo_skills/evaluation/metrics/arena_metrics.py

+        if len(unique_categories) > 1:
+            for category, scores in category_scores.items():
+                cat_metrics = {"num_entries": len(scores)}
+                cat_metrics.update(get_aggregate_score(scores))
+                overall_metrics[f"category_{category}"] = cat_metrics


when mixed data includes category=None, creates category_None bucket in metrics which is brittle for downstream consumers

If arena-hard-v1 data (no category) is mixed with v2 data (with categories), unique_categories becomes {None, "hard_prompt", "creative_writing"} and you'll emit category_None in the output.

Consider only creating per-category buckets for non-None categories:

Suggested change

-        if len(unique_categories) > 1:
-            for category, scores in category_scores.items():
-                cat_metrics = {"num_entries": len(scores)}
-                cat_metrics.update(get_aggregate_score(scores))
-                overall_metrics[f"category_{category}"] = cat_metrics
+        # If we have multiple categories, compute per-category metrics
+        unique_categories = set(self.categories)
+        if len(unique_categories) > 1:
+            for category, scores in category_scores.items():
+                if category is not None:  # Skip None category to avoid brittle category_None buckets
+                    cat_metrics = {"num_entries": len(scores)}
+                    cat_metrics.update(get_aggregate_score(scores))
+                    overall_metrics[f"category_{category}"] = cat_metrics

gwarmstrong

Looks pretty good--one minor question/comment

nemo_skills/inference/eval/arena_judge.py

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date: Tue Feb 3 10:43:55 2026 -0800

Add arena-hard v2 (#1205)

Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: bzantium <ryumin93@gmail.com>

Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: bzantium <ryumin93@gmail.com> Signed-off-by: dgitman <dgitman@nvidia.com>

bzantium and others added 9 commits January 13, 2026 15:40

feat: add Arena-Hard-v2 benchmark support

9ab83b6

Signed-off-by: bzantium <ryumin93@gmail.com>

Merge branch 'main' into feature/#1151

d55bf4a

Merge branch 'main' into igitman/arena-hard-v2

9509a8c

Fix problems in arena-hard-v2

9f8a345

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add category metrics

6b4f9dc

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add arena metrics tests

ca1ad87

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Adjust docs

ddd2da4

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add failure if no category

b26c858

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Update docs

443c6aa

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok requested a review from gwarmstrong January 31, 2026 02:20

Kipok mentioned this pull request Jan 31, 2026

Add Arena-Hard-v2 benchmark support #1152

Closed

3 tasks

greptile-apps bot reviewed Jan 31, 2026

Fixes

0e8b35f

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps bot reviewed Jan 31, 2026

coderabbitai bot reviewed Jan 31, 2026

Kipok added the run GPU tests label Jan 31, 2026

Kipok added 2 commits January 30, 2026 21:15

Merge branch 'main' into igitman/arena-hard-v2

a6886fe

Add more strict checks

d4aff02

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok added run GPU tests and removed run GPU tests labels Jan 31, 2026

greptile-apps bot reviewed Jan 31, 2026

gwarmstrong approved these changes Feb 3, 2026

nemo_skills/inference/eval/arena_judge.py (resolved)

Kipok merged commit d820200 into main Feb 3, 2026
5 of 6 checks passed

Kipok deleted the igitman/arena-hard-v2 branch February 3, 2026 18:43

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Add arena-hard v2 (#1205)

e23d988

Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: bzantium <ryumin93@gmail.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Add arena-hard v2 (#1205)

d6c0dc6

Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: bzantium <ryumin93@gmail.com> Signed-off-by: dgitman <dgitman@nvidia.com>

		# will fail if category not in category_prompts as this is unexpected
		prompt = self.category_prompts[category]
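
The two lines above show a strict lookup: a category with no entry in `category_prompts` raises `KeyError` rather than silently using the default judge prompt. A minimal sketch of that trade-off follows. The `JudgePromptSelector` class, its `strict` flag, and the prompt strings are hypothetical and are not taken from `arena_judge.py`; only the `category_prompts` name, the `hard_prompt` and `creative_writing` category names, and the `.get(category, self.prompt)` fallback come from the review thread.

```python
from typing import Dict


class JudgePromptSelector:
    """Hypothetical helper illustrating strict vs. lenient prompt lookup."""

    def __init__(self, default_prompt: str, category_prompts: Dict[str, str], strict: bool = True):
        self.prompt = default_prompt
        self.category_prompts = category_prompts
        self.strict = strict

    def select(self, category: str) -> str:
        if self.strict:
            # Fails loudly with KeyError when the category is unexpected.
            return self.category_prompts[category]
        # Lenient variant suggested in review: fall back to the default prompt.
        return self.category_prompts.get(category, self.prompt)


# Placeholder prompt strings, for illustration only.
prompts = {"creative_writing": "writing-judge-prompt"}

strict = JudgePromptSelector("default-judge-prompt", prompts, strict=True)
lenient = JudgePromptSelector("default-judge-prompt", prompts, strict=False)

print(strict.select("creative_writing"))  # writing-judge-prompt
print(lenient.select("hard_prompt"))      # default-judge-prompt
```

With the strict variant, every category present in the data needs its own entry in the mapping. The lenient variant tolerates unknown categories but can hide a misconfigured benchmark.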

Conversation

3 participants

Kipok commented Jan 31, 2026 (edited by coderabbitai bot)