
first version of 3 imo-bench datasets #1047

Merged
Kipok merged 12 commits into main from stepheng/imo-bench
Jan 22, 2026
Conversation

@stephencge (Collaborator) commented Nov 19, 2025

Summary by CodeRabbit

  • New Features
    • Added three new mathematical evaluation datasets for IMO problem benchmarking: AnswerBench (answer verification), GradingBench (solution grading), and ProofBench (proof evaluation)
    • Included automated dataset preparation and configuration tools
    • Added evaluation configurations supporting deterministic autograding and structured scoring workflows


@stephencge stephencge requested a review from smahdavi4 November 19, 2025 02:27
Comment on lines +30 to +31
import io
import urllib.request
Collaborator

move to top?

Comment on lines +49 to +50
except Exception as e:
raise RuntimeError(f"Failed to download or process file from {source_url}: {e}")
Collaborator

it might be better to isolate the sources of failure?
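One way to isolate the failure sources, as suggested here, is to catch the network and parsing errors separately and chain them. A minimal sketch (download_csv_rows is a hypothetical helper, not code from this PR):

```python
import csv
import io
import urllib.error
import urllib.request


def download_csv_rows(source_url, timeout=30):
    # Network failures surface separately from CSV parsing failures,
    # and `from e` preserves the original traceback.
    try:
        with urllib.request.urlopen(source_url, timeout=timeout) as response:
            content = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        raise RuntimeError(f"Failed to download {source_url}") from e
    try:
        return list(csv.DictReader(io.StringIO(content)))
    except csv.Error as e:
        raise RuntimeError(f"Failed to parse CSV from {source_url}") from e
```

With this split, a traceback immediately tells the reader whether the download or the CSV parsing failed.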

Collaborator

same elsewhere

Collaborator

it might be worthwhile having a shared utility somewhere, since the pattern looks similar across all the benchmarks?
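A shared utility along these lines might look like the following. This is a sketch under the assumption that only the column-to-field mapping differs between benchmarks; csv_to_jsonl and field_map are hypothetical names, not code from this PR:

```python
import csv
import io
import json


def csv_to_jsonl(csv_text, field_map, output_file):
    # field_map maps output JSONL keys to CSV column headers,
    # e.g. {"problem_id": "Problem ID", "problem": "Problem"}.
    reader = csv.DictReader(io.StringIO(csv_text))
    count = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for row in reader:
            entry = {key: row[column] for key, column in field_map.items()}
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

Each prepare.py would then only supply its URL and its field_map, so fixes like timeout handling land in one place.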

Collaborator

or we could potentially not catch the exception at all, since it would most likely be due to a network error, and a network error is self-explanatory.

@@ -0,0 +1,26 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Collaborator

Suggested change
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

also for other files

"problem_id": row["Problem ID"],
"problem_statement": row["Problem"],
"solution": row["Solution"],
"grading_guidelines": row["Grading guidelines"],
Collaborator

can we rename this to rubric? I used rubric in one of our recipes, hoping to unify the naming.

"grading_id": row["Grading ID"],
"problem_id": row["Problem ID"],
"problem_statement": row["Problem"],
"solution": row["Solution"],
@smahdavi4 (Collaborator) Nov 19, 2025

Is this a model solution or a reference solution? If it's a reference solution, we could rename it to reference_solution for consistency with other benchmarks, or ground_truth_proof if we want to keep final-answer benchmarks separate from proof benchmarks.

"problem_statement": row["Problem"],
"solution": row["Solution"],
"grading_guidelines": row["Grading guidelines"],
"response": row["Response"],
Collaborator

maybe rename to proof?

GENERATION_ARGS = "++prompt_config=generic/math ++eval_type=math"
# Judge configuration: Use the AnswerAutoGrader prompt.
# Recommended model: Gemini 2.5 Pro (or similar strong reasoner)
JUDGE_ARGS = "++prompt_config=judge/imo_answerbench ++generation_key=judgement"
Collaborator

The Gemini client uses no reasoning effort by default; we need to add ++inference.reasoning_effort=dynamic for Gemini to enable unlimited reasoning. But that would be Gemini-specific, so a better solution might be to just switch to OpenAI models for the default?
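For illustration, the judge arguments with the Gemini-specific flag appended might look like this (a sketch only; whether this should be the default is exactly the open question above):

```python
# Sketch: JUDGE_ARGS with the Gemini-specific reasoning flag appended.
JUDGE_ARGS = (
    "++prompt_config=judge/imo_answerbench "
    "++generation_key=judgement "
    "++inference.reasoning_effort=dynamic"
)
```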

# See the License for the specific language governing permissions and
# limitations under the License.

DATASET_GROUP = "judge"
@smahdavi4 (Collaborator) Nov 19, 2025

We have an answer-judgement data type for binary judgement; we could potentially expand that, since imo-grading-bench has 4 different possible states (or maybe we could binarize the imo-proof-bench instead).

Collaborator

yes, we should unify, let's create an issue for this, but we don't need to do it in this pr


coderabbitai bot commented Dec 23, 2025

📝 Walkthrough

This PR introduces three new IMO-related math datasets (answerbench, gradingbench, proofbench) to the nemo_skills framework. Each dataset includes module configuration constants, a CSV-to-JSONL data preparation script, and associated prompt configurations for automated grading evaluation.

Changes

  • Dataset configuration modules (nemo_skills/dataset/imo-answerbench/__init__.py, nemo_skills/dataset/imo-gradingbench/__init__.py, nemo_skills/dataset/imo-proofbench/__init__.py): Define module-level constants (DATASET_GROUP, METRICS_TYPE, GENERATION_ARGS, JUDGE_ARGS, JUDGE_PIPELINE_ARGS) configuring dataset grouping, metric type, prompt configurations, and judge pipeline model/server settings.
  • Data preparation scripts (nemo_skills/dataset/imo-answerbench/prepare.py, nemo_skills/dataset/imo-gradingbench/prepare.py, nemo_skills/dataset/imo-proofbench/prepare.py): Download CSV files from remote GitHub sources, parse rows using csv.DictReader, transform fields into JSON objects, and write line-delimited JSON (JSONL) output files. Includes error handling to raise RuntimeError on download/processing failures.
  • Prompt configurations (nemo_skills/prompt/config/judge/imo_answerbench.yaml, nemo_skills/prompt/config/judge/imo_proofbench.yaml): Define grading prompt templates specifying system roles, evaluation criteria, input/output formats, and example grading workflows for automated mathematical equivalence checking and IMO-style solution scoring.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • Add apex-shortlist dataset #1080: Both PRs add new dataset packages under nemo_skills/dataset with identical module-level configuration constants and prepare.py scripts for CSV-to-JSONL conversion, following the same architectural pattern.

Suggested reviewers

  • gwarmstrong
  • Kipok

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes the main change: introducing three new IMO benchmark datasets (imo-answerbench, imo-gradingbench, imo-proofbench) with their configurations and preparation scripts.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage, so the check was skipped.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (3)
nemo_skills/dataset/imo-gradingbench/prepare.py (1)

27-28: Move imports to the top of the file.

Imports of io and urllib.request should be moved to the top-level imports section (lines 15-17) for better readability and standard Python conventions.

🔎 Proposed fix
 import csv
+import io
 import json
+import urllib.request
 from pathlib import Path

 if __name__ == "__main__":
     data_dir = Path(__file__).absolute().parent
     data_dir.mkdir(exist_ok=True)

     base_url = "https://raw.githubusercontent.com/google-deepmind/superhuman/c1ee02e03d4cdb2ab21cd01ac927d895f5287fc8/imobench"
     source_url = f"{base_url}/gradingbench.csv"
     output_file = data_dir / "test.jsonl"

-    import io
-    import urllib.request
-
     try:
nemo_skills/dataset/imo-answerbench/prepare.py (1)

30-31: Move imports to the top of the file.

Imports of io and urllib.request should be moved to the top-level imports section (lines 15-17) for better readability and standard Python conventions.

🔎 Proposed fix
 import csv
+import io
 import json
+import urllib.request
 from pathlib import Path

 if __name__ == "__main__":
     data_dir = Path(__file__).absolute().parent
     data_dir.mkdir(exist_ok=True)

     # Download from the repo with specific commit hash
     base_url = "https://raw.githubusercontent.com/google-deepmind/superhuman/c1ee02e03d4cdb2ab21cd01ac927d895f5287fc8/imobench"
     source_url = f"{base_url}/answerbench.csv"
     output_file = data_dir / "test.jsonl"

     # Download the file to memory or temp file, then process
     # Using urllib for minimal dependencies
-    import io
-    import urllib.request
-
     try:
nemo_skills/dataset/imo-proofbench/prepare.py (1)

27-28: Move imports to the top of the file.

Imports of io and urllib.request should be moved to the top-level imports section (lines 15-17) for better readability and standard Python conventions.

🔎 Proposed fix
 import csv
+import io
 import json
+import urllib.request
 from pathlib import Path

 if __name__ == "__main__":
     data_dir = Path(__file__).absolute().parent
     data_dir.mkdir(exist_ok=True)

     base_url = "https://raw.githubusercontent.com/google-deepmind/superhuman/c1ee02e03d4cdb2ab21cd01ac927d895f5287fc8/imobench"
     source_url = f"{base_url}/proofbench.csv"
     output_file = data_dir / "test.jsonl"

-    import io
-    import urllib.request
-
     try:
🧹 Nitpick comments (5)
nemo_skills/dataset/imo-gradingbench/prepare.py (1)

51-52: Improve exception handling.

The exception handler should:

  1. Chain exceptions using raise ... from e to preserve the original traceback (B904).
  2. Consider catching more specific exceptions like urllib.error.URLError and csv.Error instead of bare Exception.
🔎 Proposed fix
-    except Exception as e:
-        raise RuntimeError(f"Failed to download or process file from {source_url}: {e}")
+    except Exception as e:
+        raise RuntimeError(f"Failed to download or process file from {source_url}") from e
nemo_skills/dataset/imo-answerbench/prepare.py (2)

49-50: Improve exception handling.

The exception handler should chain exceptions using raise ... from e to preserve the original traceback (B904).

🔎 Proposed fix
-    except Exception as e:
-        raise RuntimeError(f"Failed to download or process file from {source_url}: {e}")
+    except Exception as e:
+        raise RuntimeError(f"Failed to download or process file from {source_url}") from e

19-50: Consider extracting shared CSV download logic.

All three prepare.py scripts (imo-answerbench, imo-gradingbench, imo-proofbench) follow nearly identical patterns for downloading CSVs from the same base URL and converting to JSONL. This presents a code duplication opportunity.

Consider creating a shared utility function to reduce duplication, which would make it easier to apply fixes (like timeout handling) consistently across all datasets.

nemo_skills/dataset/imo-proofbench/prepare.py (1)

48-49: Improve exception handling.

The exception handler should chain exceptions using raise ... from e to preserve the original traceback (B904).

🔎 Proposed fix
-    except Exception as e:
-        raise RuntimeError(f"Failed to download or process file from {source_url}: {e}")
+    except Exception as e:
+        raise RuntimeError(f"Failed to download or process file from {source_url}") from e
nemo_skills/dataset/imo-proofbench/__init__.py (1)

18-19: Potentially outdated TODO comment.

Line 18's TODO suggests updating the prompt_config to point to a "specific ProofAutoGrader prompt once available," but Line 22 already uses judge/imo_proofbench, and the corresponding prompt file (nemo_skills/prompt/config/judge/imo_proofbench.yaml) exists in this PR. Consider removing or updating this TODO if the intended configuration is already in place.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7205c43 and 1beabda.

📒 Files selected for processing (8)
  • nemo_skills/dataset/imo-answerbench/__init__.py
  • nemo_skills/dataset/imo-answerbench/prepare.py
  • nemo_skills/dataset/imo-gradingbench/__init__.py
  • nemo_skills/dataset/imo-gradingbench/prepare.py
  • nemo_skills/dataset/imo-proofbench/__init__.py
  • nemo_skills/dataset/imo-proofbench/prepare.py
  • nemo_skills/prompt/config/judge/imo_answerbench.yaml
  • nemo_skills/prompt/config/judge/imo_proofbench.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.

Applied to files:

  • nemo_skills/prompt/config/judge/imo_proofbench.yaml
  • nemo_skills/prompt/config/judge/imo_answerbench.yaml
🪛 Ruff (0.14.10)
nemo_skills/dataset/imo-answerbench/prepare.py

34-34: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


49-49: Do not catch blind exception: Exception

(BLE001)


50-50: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


50-50: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/dataset/imo-proofbench/prepare.py

31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


48-48: Do not catch blind exception: Exception

(BLE001)


49-49: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


49-49: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/dataset/imo-gradingbench/prepare.py

31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


51-51: Do not catch blind exception: Exception

(BLE001)


52-52: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


52-52: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests

import urllib.request

try:
with urllib.request.urlopen(source_url) as response:
Contributor

⚠️ Potential issue | 🟠 Major

Add timeout to network request.

The urlopen call lacks a timeout parameter, which could cause the script to hang indefinitely if the remote server is unresponsive.

🔎 Proposed fix
-        with urllib.request.urlopen(source_url) as response:
+        with urllib.request.urlopen(source_url, timeout=30) as response:
             content = response.read().decode("utf-8")
(This comment applies to nemo_skills/dataset/imo-answerbench/prepare.py, line 34.)

The same timeout comment was posted on nemo_skills/dataset/imo-gradingbench/prepare.py (line 31).

The same timeout comment was posted on nemo_skills/dataset/imo-proofbench/prepare.py (line 31).

@Kipok (Collaborator) commented Jan 8, 2026

@stephencge let's try to complete this when you get a chance. These are important benchmarks to have in main; we should try to merge this soon.

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR adds support for three IMO-Bench datasets from Google DeepMind's superhuman benchmark suite:

AnswerBench: Evaluates answer verification by having models determine if a proposed answer is mathematically equivalent to the golden answer. Uses a deterministic autograder prompt that strictly checks algebraic, numerical, and set equivalence.

GradingBench: Evaluates solution grading capabilities where models assign scores (incorrect/partial/almost/correct) to mathematical solutions. Includes custom metrics computing exact accuracy, binarized accuracy, and mean absolute error.
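The three aggregates named here can be sketched as follows. This is illustrative only: the ordinal mapping and the binarization threshold are assumptions, not the repo's actual GradingBenchMetrics implementation:

```python
def gradingbench_metrics(predicted, gold):
    # Assumed ordinal mapping for the four grade levels.
    scale = {"incorrect": 0, "partial": 1, "almost": 2, "correct": 3}
    pairs = [(scale[p], scale[g]) for p, g in zip(predicted, gold)]
    # Exact accuracy: predicted grade matches the gold grade.
    exact = sum(p == g for p, g in pairs) / len(pairs)
    # Binarized accuracy: assumed here to treat "correct" as the positive
    # class and everything else as negative.
    binarized = sum((p == 3) == (g == 3) for p, g in pairs) / len(pairs)
    # Mean absolute error on the ordinal scale.
    mae = sum(abs(p - g) for p, g in pairs) / len(pairs)
    return {"exact_accuracy": exact, "binarized_accuracy": binarized, "mae": mae}
```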

ProofBench: Evaluates rigorous IMO-style proof grading on a 0-7 point scale. Models assess proof completeness, correctness, and logical rigor using reference solutions and specific grading guidelines.

Supporting changes include:

  • Added GradingBenchMetrics class with grade extraction and MAE computation
  • Enhanced Gemini model integration to support custom base URLs and thinking parameter configuration
  • Added GEMINI_API_KEY to cluster environment variables
  • Added MathArena prompt configurations

All three datasets download from the same DeepMind GitHub repository commit and convert CSV data to JSONL format for the nemo_skills pipeline.

Confidence Score: 5/5

  • Safe to merge - well-structured addition of new datasets with appropriate integration
  • The implementation is clean and follows existing patterns. All three datasets are properly structured with matching prepare scripts, configs, and prompt templates. The GradingBench metrics implementation is robust with proper error handling. The Gemini model enhancements correctly handle base_url parameter passing. No breaking changes or bugs detected.
  • No files require special attention

Important Files Changed

File Analysis

  • nemo_skills/dataset/imo-answerbench/__init__.py (5/5): New dataset config for IMO AnswerBench with judge pipeline configuration
  • nemo_skills/dataset/imo-answerbench/prepare.py (5/5): Dataset preparation script downloading and converting AnswerBench CSV to JSONL format
  • nemo_skills/dataset/imo-gradingbench/__init__.py (5/5): New dataset config for IMO GradingBench with judge evaluation type
  • nemo_skills/dataset/imo-gradingbench/prepare.py (5/5): Dataset preparation script downloading and converting GradingBench CSV to JSONL format
  • nemo_skills/dataset/imo-proofbench/__init__.py (5/5): New dataset config for IMO ProofBench with judge pipeline configuration
  • nemo_skills/dataset/imo-proofbench/prepare.py (5/5): Dataset preparation script downloading and converting ProofBench CSV to JSONL format
  • nemo_skills/evaluation/metrics/gradingbench_metrics.py (5/5): New metrics class for GradingBench computing exact accuracy, binarized accuracy, and MAE
  • nemo_skills/evaluation/metrics/map_metrics.py (5/5): Added GradingBenchMetrics to METRICS_MAP registry
  • nemo_skills/inference/model/gemini.py (5/5): Enhanced Gemini model to support custom base_url and fixed thinking parameter configuration
  • nemo_skills/pipeline/utils/cluster.py (5/5): Added GEMINI_API_KEY to auto-included environment variables list
  • nemo_skills/prompt/config/eval/matharena/aime.yaml (5/5): New MathArena AIME prompt config with answer constraint
  • nemo_skills/prompt/config/generic/matharena.yaml (5/5): New generic MathArena prompt config
  • nemo_skills/prompt/config/judge/imo_answerbench.yaml (5/5): Comprehensive autograder prompt for verifying mathematical answer equivalence
  • nemo_skills/prompt/config/judge/imo_gradingbench.yaml (5/5): Grading prompt for evaluating solution correctness with 4-level scoring
  • nemo_skills/prompt/config/judge/imo_proofbench.yaml (5/5): Rigorous IMO-style grading rubric for evaluating mathematical proofs on 0-7 scale

Sequence Diagram

sequenceDiagram
    participant User
    participant Prepare as Dataset Prepare Scripts
    participant DataFiles as Dataset JSONL Files
    participant Inference as Inference Pipeline
    participant Judge as Judge Prompts
    participant Metrics as Metrics System

    User->>Prepare: Run prepare.py for each dataset
    Prepare->>Prepare: Download CSV from DeepMind repo
    Prepare->>DataFiles: Convert to JSONL format
    Note over DataFiles: test.jsonl with fields:<br/>- AnswerBench: problem, expected_answer<br/>- GradingBench: problem_statement, proof, reward<br/>- ProofBench: problem, reference_solution, rubric
    
    User->>Inference: Generate solutions/judgements
    Inference->>DataFiles: Read dataset entries
    Inference->>Judge: Use judge prompt configs
    Note over Judge: Three judge types:<br/>- imo_answerbench: Answer verification<br/>- imo_gradingbench: Solution grading<br/>- imo_proofbench: Proof evaluation
    Judge->>Inference: Return judgement/score
    
    Inference->>Metrics: Pass predictions for evaluation
    Note over Metrics: GradingBench uses custom metrics<br/>AnswerBench/ProofBench use math metrics
    Metrics->>Metrics: Compute accuracy, MAE, etc.
    Metrics->>User: Report final metrics

@Kipok (Collaborator) commented Jan 12, 2026

@stephencge looks like some conflicts, could you resolve please?

@Kipok Kipok left a comment

could you update the docs please?

last_word = words[-1]

# Handle </think> tags
if "</think>" in last_word:
Collaborator

is this really needed? This should ideally be handled by the generation part, not in metrics. The metrics should just see the final summary.
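For reference, the tag handling under discussion amounts to something like the following sketch (the actual diff operates on the last word, and as noted above this may belong in generation rather than metrics):

```python
def strip_think(text):
    # Keep only the content after the last closing think tag, if any.
    marker = "</think>"
    if marker in text:
        return text.rsplit(marker, 1)[-1].strip()
    return text.strip()
```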

These two sets are not equivalent. The empty set contains no elements, while the Golden Answer contains an infinite number of elements.
4. **Conclusion:** Incorrect
</thinking>
\boxed{Incorrect}
Collaborator

does this work? Shouldn't it be necessary to escape this with {{Incorrect}}?
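If the prompt file is rendered with Python str.format-style templating (an assumption; nemo_skills may use a different engine), unescaped braces would indeed be treated as placeholders, which is what this comment is asking about:

```python
# A single-braced literal collides with format-field syntax...
template = "Output \\boxed{Incorrect} for problem {problem}"
try:
    template.format(problem="1+1")
    raise AssertionError("expected KeyError")
except KeyError:
    pass  # {Incorrect} was parsed as a placeholder named "Incorrect"

# ...while doubled braces render as literal braces.
escaped = "Output \\boxed{{Incorrect}} for problem {problem}"
assert escaped.format(problem="1+1") == "Output \\boxed{Incorrect} for problem 1+1"
```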


stephencge and others added 7 commits January 12, 2026 12:50
Signed-off-by: Stephen Ge <stephen.ge@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
- Add GradingBenchMetrics class with exact_accuracy, binarized_accuracy, and MAE
- Update copyright to 2025, move imports to top, add timeout to urlopen
- Rename fields: grading_guidelines->rubric, solution->reference_solution, response->proof
- Add reasoning_effort=dynamic to JUDGE_ARGS
- Fix METRICS_TYPE from invalid "judge" to "math"
- Add logging for grade extraction failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
greptile-apps bot commented Jan 12, 2026

Greptile Summary

Added three new IMO-Bench datasets (AnswerBench, ProofBench, GradingBench) for evaluating mathematical reasoning at IMO level, with automated dataset preparation scripts, specialized judge prompts, and comprehensive metrics infrastructure.

Key changes:

  • New datasets: Three dataset modules with preparation scripts that download from Google DeepMind's IMO-Bench repository
  • Judge prompt configurations: Custom prompts for each benchmark type - deterministic autograding for AnswerBench, 0-7 point rubric-based grading for ProofBench, and meta-evaluation for GradingBench
  • Metrics infrastructure: Extended is_correct_judgement() to parse \boxed{Correct/Incorrect} and <points>N out of 7</points> formats; created GradingBenchMetrics class for specialized evaluation
  • Gemini integration improvements: Fixed base_url handling to support custom endpoints and enabled thinking thoughts inclusion for better reasoning
  • Documentation: Added comprehensive documentation for all three datasets in natural-math.md

All previous thread issues have been addressed: the parsing logic now correctly handles the new output formats.
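A sketch of the parsing described in this summary (parse_imo_judgement is a hypothetical stand-in for the repo's is_correct_judgement(); the 6-7 = correct and 0-1 = incorrect thresholds follow the file overview below, with middle scores left undecided):

```python
import re


def parse_imo_judgement(text):
    # AnswerBench format: \boxed{Correct} or \boxed{Incorrect}.
    m = re.search(r"\\boxed\{(Correct|Incorrect)\}", text)
    if m:
        return m.group(1) == "Correct"
    # ProofBench format: <points>N out of 7</points>.
    m = re.search(r"<points>\s*(\d+)\s*out of 7\s*</points>", text)
    if m:
        points = int(m.group(1))
        if points >= 6:
            return True
        if points <= 1:
            return False
    return None
```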

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured and comprehensive, with proper parsing logic to handle all three benchmark formats, specialized metrics classes, and thorough documentation. All concerns from previous review threads have been addressed with the addition of format-specific parsing in is_correct_judgement() and GradingBenchMetrics.
  • No files require special attention

Important Files Changed

  • nemo_skills/evaluation/metrics/utils.py: Extended is_correct_judgement() to parse IMO-bench formats: \boxed{Correct/Incorrect} for AnswerBench and <points>N out of 7</points> for ProofBench
  • nemo_skills/prompt/config/judge/imo_answerbench.yaml: Defines deterministic autograding prompt that outputs \boxed{Correct/Incorrect} format, correctly handled by updated is_correct_judgement()
  • nemo_skills/prompt/config/judge/imo_proofbench.yaml: Defines IMO proof grading prompt with 0-7 point scoring, correctly parsed by is_correct_judgement() (6-7 = correct, 0-1 = incorrect)
  • nemo_skills/prompt/config/judge/imo_gradingbench.yaml: Defines grading evaluation prompt that outputs correct/almost/partial/incorrect, parsed by GradingBenchMetrics class
  • nemo_skills/evaluation/metrics/gradingbench_metrics.py: Implements specialized metrics for IMO GradingBench with exact/binarized accuracy and MAE, correctly extracts grades from judgement output
  • nemo_skills/dataset/imo-answerbench/__init__.py: Configures AnswerBench dataset with math metrics, Gemini 2.5 Pro judge, and dynamic reasoning effort
  • nemo_skills/dataset/imo-proofbench/__init__.py: Configures ProofBench dataset with math metrics, Gemini 2.5 Pro judge, and proof evaluation workflow
  • nemo_skills/dataset/imo-gradingbench/__init__.py: Configures GradingBench dataset with specialized gradingbench metrics for evaluating model grading ability

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant Dataset as Dataset __init__.py
    participant Inference as Inference Pipeline
    participant Judge as LLM Judge (Gemini 2.5)
    participant Metrics as Metrics System
    
    User->>PrepareScript: Run dataset preparation
    PrepareScript->>PrepareScript: Download CSV from GitHub
    PrepareScript->>PrepareScript: Convert to JSONL format
    PrepareScript-->>User: test.jsonl created
    
    User->>Inference: Run evaluation
    Inference->>Dataset: Load config (METRICS_TYPE, JUDGE_ARGS)
    Dataset-->>Inference: Return dataset config
    
    alt AnswerBench or ProofBench
        Inference->>Judge: Send problem + answer/proof
        Note over Judge: Uses judge prompt<br/>(imo_answerbench.yaml<br/>or imo_proofbench.yaml)
        Judge->>Judge: Evaluate with reasoning
        Judge-->>Inference: Return judgement<br/>(\boxed{Correct/Incorrect}<br/>or <points>N out of 7</points>)
        Inference->>Metrics: Pass judgement to MathMetrics
        Metrics->>Metrics: Parse with is_correct_judgement()
        Metrics-->>User: Binary accuracy metrics
    else GradingBench
        Inference->>Judge: Send problem + proof + rubric
        Note over Judge: Uses judge prompt<br/>(imo_gradingbench.yaml)
        Judge->>Judge: Grade solution
        Judge-->>Inference: Return grade<br/>(correct/almost/partial/incorrect)
        Inference->>Metrics: Pass to GradingBenchMetrics
        Metrics->>Metrics: Extract grade from last word
        Metrics->>Metrics: Compute exact/binarized accuracy + MAE
        Metrics-->>User: Grading evaluation metrics
    end
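The GradingBench branch of the diagram above computes exact/binarized accuracy and MAE over the four-way grade labels. A minimal standalone sketch of how such metrics could be computed — the numeric grade-to-score mapping and the function name here are illustrative assumptions, not the repository's actual GradingBenchMetrics implementation:

```python
# Hypothetical numeric mapping for the four grades; chosen for illustration only.
GRADE_TO_SCORE = {"correct": 1.0, "almost": 0.75, "partial": 0.25, "incorrect": 0.0}


def gradingbench_metrics(pairs):
    """pairs: list of (predicted_grade, expected_grade) label strings."""
    n = len(pairs)
    # Exact accuracy: predicted label matches expected label verbatim.
    exact = sum(p == e for p, e in pairs) / n
    # Binarized accuracy: collapse to correct-vs-not before comparing.
    binarized = sum((p == "correct") == (e == "correct") for p, e in pairs) / n
    # MAE over the numeric scores assigned to each grade.
    mae = sum(abs(GRADE_TO_SCORE[p] - GRADE_TO_SCORE[e]) for p, e in pairs) / n
    return {"exact_accuracy": exact, "binarized_accuracy": binarized, "mae": mae}
```

With a different score mapping the MAE values shift, but exact and binarized accuracy depend only on the labels themselves.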

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments

Comment on lines +29 to +43
with urllib.request.urlopen(source_url, timeout=30) as response:
    content = response.read().decode("utf-8")
reader = csv.DictReader(io.StringIO(content))

with open(output_file, "w", encoding="utf-8") as out:
    for row in reader:
        entry = {
            "problem_id": row["Problem ID"],
            "problem": row["Problem"],
            "expected_answer": row["Short Answer"],
            "category": row["Category"],
            "subcategory": row["Subcategory"],
            "source": row["Source"],
        }
        out.write(json.dumps(entry) + "\n")

The urllib.request.urlopen() call and CSV parsing lack error handling. Network failures, HTTP errors, timeouts, or missing CSV columns will cause the script to crash with an unhelpful error message.

Consider adding try-except blocks to handle potential errors gracefully:

Suggested change
with urllib.request.urlopen(source_url, timeout=30) as response:
    content = response.read().decode("utf-8")
reader = csv.DictReader(io.StringIO(content))
with open(output_file, "w", encoding="utf-8") as out:
    for row in reader:
        entry = {
            "problem_id": row["Problem ID"],
            "problem": row["Problem"],
            "expected_answer": row["Short Answer"],
            "category": row["Category"],
            "subcategory": row["Subcategory"],
            "source": row["Source"],
        }
        out.write(json.dumps(entry) + "\n")
try:
    with urllib.request.urlopen(source_url, timeout=30) as response:
        content = response.read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(content))
    with open(output_file, "w", encoding="utf-8") as out:
        for row in reader:
            try:
                entry = {
                    "problem_id": row["Problem ID"],
                    "problem": row["Problem"],
                    "expected_answer": row["Short Answer"],
                    "category": row["Category"],
                    "subcategory": row["Subcategory"],
                    "source": row["Source"],
                }
                out.write(json.dumps(entry) + "\n")
            except KeyError as e:
                raise KeyError(f"Missing expected column in CSV: {e}")
except urllib.error.URLError as e:
    raise RuntimeError(f"Failed to download dataset from {source_url}: {e}")

This matches the error handling pattern used in other dataset preparation scripts like nemo_skills/dataset/mmlu-prox/prepare.py.
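Since the same download-and-convert pattern recurs across all three prepare scripts, one way to address both this comment and the duplication is a small shared helper. This is a hypothetical sketch — the helper name, signature, and placement are assumptions, not code that exists in the repository:

```python
import csv
import io
import json
import urllib.error
import urllib.request


def download_csv_to_jsonl(source_url, output_file, row_to_entry, timeout=30):
    """Download a CSV and write one JSON line per row.

    row_to_entry maps a csv.DictReader row dict to the output entry dict;
    a KeyError raised by it is re-raised naming the missing column.
    """
    try:
        with urllib.request.urlopen(source_url, timeout=timeout) as response:
            content = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        raise RuntimeError(f"Failed to download dataset from {source_url}: {e}")
    with open(output_file, "w", encoding="utf-8") as out:
        for row in csv.DictReader(io.StringIO(content)):
            try:
                entry = row_to_entry(row)
            except KeyError as e:
                raise KeyError(f"Missing expected column in CSV: {e}")
            out.write(json.dumps(entry) + "\n")
```

Each benchmark's prepare.py would then only supply its own row_to_entry mapping, keeping the download and error-handling logic in one place.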

Comment on lines +29 to +48
with urllib.request.urlopen(source_url, timeout=30) as response:
    content = response.read().decode("utf-8")
reader = csv.DictReader(io.StringIO(content))

with open(output_file, "w", encoding="utf-8") as out:
    for row in reader:
        entry = {
            "grading_id": row["Grading ID"],
            "problem_id": row["Problem ID"],
            "problem_statement": row["Problem"],
            "reference_solution": row["Solution"],
            "rubric": row["Grading guidelines"],
            "proof": row["Response"],
            "points": row["Points"],
            "reward": row["Reward"],
            "source": row["Problem Source"],
            # We set 'expected_answer' to the reward/points for evaluation
            "expected_answer": row["Reward"],
        }
        out.write(json.dumps(entry) + "\n")

The urllib.request.urlopen() call and CSV parsing lack error handling. Network failures, HTTP errors, timeouts, or missing CSV columns will cause the script to crash with an unhelpful error message.

Consider adding try-except blocks similar to the pattern used in nemo_skills/dataset/mmlu-prox/prepare.py which handles errors gracefully when downloading from external sources.

Comment on lines +29 to +45
with urllib.request.urlopen(source_url, timeout=30) as response:
    content = response.read().decode("utf-8")
reader = csv.DictReader(io.StringIO(content))

with open(output_file, "w", encoding="utf-8") as out:
    for row in reader:
        entry = {
            "problem_id": row["Problem ID"],
            "problem": row["Problem"],
            "reference_solution": row["Solution"],
            "rubric": row["Grading guidelines"],
            "category": row["Category"],
            "level": row["Level"],
            "expected_answer": row["Short Answer"],
            "source": row["Source"],
        }
        out.write(json.dumps(entry) + "\n")

The urllib.request.urlopen() call and CSV parsing lack error handling. Network failures, HTTP errors, timeouts, or missing CSV columns will cause the script to crash with an unhelpful error message.

Consider adding try-except blocks similar to the pattern used in nemo_skills/dataset/mmlu-prox/prepare.py which handles errors gracefully when downloading from external sources.

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

4. **Score Determination:** Compare the findings against the Specific Grading Guidelines and the General Rubric to determine the final score.

### Output Requirements
You must provide your final score in the format <points>N out of 7</points>. Ensure the `<points>` block is used **only once**, as your answer will be parsed based on the first <points> </points> block that appears in your whole response.

The prompt asks judges to output scores in <points>N out of 7</points> format, but there's no corresponding extraction logic in the codebase to parse this format. The imo-proofbench dataset uses METRICS_TYPE = "math" which relies on MathMetrics that expects is_correct_judgement() to parse binary judgements, not numeric scores. This means the 0-7 point scores will not be properly extracted or evaluated.


def _get_grades(self, prediction: dict) -> tuple[str | None, str | None]:
"""Extract predicted and expected grades from a prediction."""
judgement = prediction.get("judgement", "") or prediction.get("generation", "")

The code tries to extract the judgement from either prediction.get("judgement", "") or prediction.get("generation", "") using an or fallback. If the "judgement" key exists but holds an empty string, the or incorrectly falls back to "generation". Use an explicit key check instead: judgement = prediction.get("judgement") if "judgement" in prediction else prediction.get("generation", "")
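The pitfall this comment describes is easy to demonstrate in isolation; a small standalone illustration, not code from the repository:

```python
# A prediction where the judge ran but produced an empty judgement.
prediction = {"judgement": "", "generation": "Judgement: Yes"}

# Falsy-or fallback: the empty judgement silently falls through to generation.
with_or = prediction.get("judgement", "") or prediction.get("generation", "")

# Explicit key check: an existing-but-empty judgement is kept as-is.
explicit = (
    prediction.get("judgement")
    if "judgement" in prediction
    else prediction.get("generation", "")
)

print(repr(with_or))
print(repr(explicit))
```

With the or version, an empty judgement could be scored against the raw generation text instead of being treated as an absent verdict.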

@greptile-apps greptile-apps bot left a comment

16 files reviewed, 4 comments

Comment on lines +28 to +29
* If Correct: \boxed{{Correct}}
* If Incorrect: \boxed{{Incorrect}}

logic: The prompt outputs \boxed{Correct} or \boxed{Incorrect}, but the is_correct_judgement() function in nemo_skills/evaluation/metrics/utils.py:37 expects Judgement: Yes or Judgement: No format. This means judgements won't be parsed correctly, resulting in all answers being marked as incorrect.

Either:

  1. Change prompt to output "Judgement: Yes" / "Judgement: No" format (like judge/math.yaml), or
  2. Update is_correct_judgement() to also parse \boxed{Correct} / \boxed{Incorrect} format

4. **Score Determination:** Compare the findings against the Specific Grading Guidelines and the General Rubric to determine the final score.

### Output Requirements
You must provide your final score in the format <points>N out of 7</points>. Ensure the `<points>` block is used **only once**, as your answer will be parsed based on the first <points> </points> block that appears in your whole response.

logic: The prompt expects judges to output <points>N out of 7</points> format, but there's no extraction logic to parse this. The imo-proofbench/__init__.py:17 uses METRICS_TYPE = "math", which relies on is_correct_judgement() that only parses binary "Judgement: Yes/No" format. The 0-7 point scores won't be extracted or evaluated correctly.


def _get_grades(self, prediction: dict) -> tuple[str | None, str | None]:
"""Extract predicted and expected grades from a prediction."""
judgement = prediction.get("judgement", "") or prediction.get("generation", "")

logic: Using or for fallback incorrectly handles empty string. If "judgement" key exists but has empty string value, the or operator will fall back to "generation" instead of using the empty string. Use explicit None check:

Suggested change
judgement = prediction.get("judgement", "") or prediction.get("generation", "")
judgement = prediction.get("judgement") if "judgement" in prediction else prediction.get("generation", "")

Comment on lines +30 to +31
# Use empty string only if no base_url provided (uses litellm's default Gemini endpoint)
# Otherwise pass through the custom base_url

logic: Using a ternary with a falsy if base_url check is incorrect. If base_url="" is passed explicitly, it is treated as falsy and replaced with "" again; the check cannot distinguish an explicit empty string from None. Use an explicit None check:

Suggested change
# Use empty string only if no base_url provided (uses litellm's default Gemini endpoint)
# Otherwise pass through the custom base_url
super().__init__(base_url="" if base_url is None else base_url, *args, **kwargs)

The imo-gradingbench dataset was using judge/math prompt which expects
{problem}, {predicted_answer}, {expected_answer} fields, but the data
has {problem_statement} and {proof} fields. This mismatch caused the
GPU test to fail with no output files being generated.

Changed to use judge/imo_gradingbench prompt which uses the correct
field names.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>

greptile-apps bot commented Jan 22, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

Add support for additional judgement formats used by IMO-Bench:
- \boxed{Correct/Incorrect} for IMO AnswerBench
- <points>N out of 7</points> for IMO ProofBench (6-7 = correct, 0-1 = incorrect)

This preserves the original prompt formats from the IMO-Bench paper
while enabling proper parsing in MathMetrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
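The parsing this commit describes could look roughly like the sketch below — a standalone illustration of the two formats and the 6-7/0-1 thresholds. The function name is hypothetical, and the actual is_correct_judgement() in nemo_skills/evaluation/metrics/utils.py may handle additional formats and edge cases:

```python
import re


def parse_imo_judgement(judgement: str):
    r"""Return True/False for IMO-bench judgement formats, else None.

    \boxed{Correct} / \boxed{Incorrect}  -> AnswerBench binary verdict.
    <points>N out of 7</points>          -> ProofBench score; 6-7 counts as
                                            correct, 0-1 as incorrect, and
                                            2-5 is left undecided (None)
                                            in this sketch.
    """
    boxed = re.search(r"\\boxed\{(Correct|Incorrect)\}", judgement)
    if boxed:
        return boxed.group(1) == "Correct"
    points = re.search(r"<points>\s*(\d+)\s*out of 7\s*</points>", judgement)
    if points:
        score = int(points.group(1))
        if score >= 6:
            return True
        if score <= 1:
            return False
    return None
```

Only the first matching pattern is used, which mirrors the prompt's instruction that the first <points> block in the response is the one that gets parsed.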
@greptile-apps greptile-apps bot left a comment

17 files reviewed, 1 comment


def _get_grades(self, prediction: dict) -> tuple[str | None, str | None]:
"""Extract predicted and expected grades from a prediction."""
judgement = prediction.get("judgement", "") or prediction.get("generation", "")

logic: If "judgement" key exists with empty string value, the or will incorrectly fall back to "generation". Use explicit None check:

Suggested change
judgement = prediction.get("judgement", "") or prediction.get("generation", "")
judgement = prediction.get("judgement") if "judgement" in prediction else prediction.get("generation", "")

@Kipok Kipok merged commit dc04ea9 into main Jan 22, 2026
7 checks passed
@Kipok Kipok deleted the stepheng/imo-bench branch January 22, 2026 21:10
@coderabbitai coderabbitai bot mentioned this pull request Feb 19, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Stephen Ge <stephen.ge@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Stephen Ge <stephen.ge@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>