
first version of 3 imo-bench datasets #1047

Merged
Kipok merged 12 commits into main from stepheng/imo-bench
Jan 22, 2026
Conversation

@stephencge (Collaborator) commented Nov 19, 2025

Summary by CodeRabbit

  • New Features
    • Added three new mathematical evaluation datasets for IMO problem benchmarking: AnswerBench (answer verification), GradingBench (solution grading), and ProofBench (proof evaluation)
    • Included automated dataset preparation and configuration tools
    • Added evaluation configurations supporting deterministic autograding and structured scoring workflows


@stephencge stephencge requested a review from smahdavi4 November 19, 2025 02:27
Comment on lines +30 to +31
import io
import urllib.request
Collaborator

move to top?

Comment on lines +49 to +50
except Exception as e:
raise RuntimeError(f"Failed to download or process file from {source_url}: {e}")
Collaborator

it might be better to isolate the sources of failure?
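One way to isolate the failure sources, as suggested here, is to catch the network and parsing errors separately and chain them. A minimal sketch (download_csv_rows is a hypothetical helper, not code from this PR):

```python
import csv
import io
import urllib.error
import urllib.request


def download_csv_rows(source_url, timeout=30):
    # Network failures surface separately from CSV parsing failures,
    # and `from e` preserves the original traceback.
    try:
        with urllib.request.urlopen(source_url, timeout=timeout) as response:
            content = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        raise RuntimeError(f"Failed to download {source_url}") from e
    try:
        return list(csv.DictReader(io.StringIO(content)))
    except csv.Error as e:
        raise RuntimeError(f"Failed to parse CSV from {source_url}") from e
```

With this split, a traceback immediately tells the reader whether the download or the CSV parsing failed.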

Collaborator

same elsewhere

Collaborator

it might be worthwhile having a shared utility somewhere, since the pattern looks similar across all the benchmarks?
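A shared utility along these lines might look like the following. This is a sketch under the assumption that only the column-to-field mapping differs between benchmarks; csv_to_jsonl and field_map are hypothetical names, not code from this PR:

```python
import csv
import io
import json


def csv_to_jsonl(csv_text, field_map, output_file):
    # field_map maps output JSONL keys to CSV column headers,
    # e.g. {"problem_id": "Problem ID", "problem": "Problem"}.
    reader = csv.DictReader(io.StringIO(csv_text))
    count = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for row in reader:
            entry = {key: row[column] for key, column in field_map.items()}
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

Each prepare.py would then only supply its URL and its field_map, so fixes like timeout handling land in one place.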

Collaborator

or we could potentially not catch the exception at all, since it would most likely be due to a network error, and a network error is self-explanatory.

@@ -0,0 +1,26 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Collaborator

Suggested change
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

also for other files

"problem_id": row["Problem ID"],
"problem_statement": row["Problem"],
"solution": row["Solution"],
"grading_guidelines": row["Grading guidelines"],
Collaborator

can we rename this to rubric? I used rubric in one of our recipes, hoping to unify the naming.

"grading_id": row["Grading ID"],
"problem_id": row["Problem ID"],
"problem_statement": row["Problem"],
"solution": row["Solution"],
@smahdavi4 (Collaborator) Nov 19, 2025

Is this a model solution or a reference solution? If it's a reference solution, we could rename it to reference_solution for consistency with other benchmarks, or ground_truth_proof if we want to keep final-answer benchmarks separate from proof benchmarks.

"problem_statement": row["Problem"],
"solution": row["Solution"],
"grading_guidelines": row["Grading guidelines"],
"response": row["Response"],
Collaborator

maybe rename to proof?

GENERATION_ARGS = "++prompt_config=generic/math ++eval_type=math"
# Judge configuration: Use the AnswerAutoGrader prompt.
# Recommended model: Gemini 2.5 Pro (or similar strong reasoner)
JUDGE_ARGS = "++prompt_config=judge/imo_answerbench ++generation_key=judgement"
Collaborator

The Gemini client uses no reasoning effort by default; we need to add ++inference.reasoning_effort=dynamic for Gemini to enable unlimited reasoning. But that would be Gemini-specific, so a better solution might be to just switch to OpenAI models for the default?
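For illustration, the judge arguments with the Gemini-specific flag appended might look like this (a sketch only; whether this should be the default is exactly the open question above):

```python
# Sketch: JUDGE_ARGS with the Gemini-specific reasoning flag appended.
JUDGE_ARGS = (
    "++prompt_config=judge/imo_answerbench "
    "++generation_key=judgement "
    "++inference.reasoning_effort=dynamic"
)
```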

# See the License for the specific language governing permissions and
# limitations under the License.

DATASET_GROUP = "judge"
@smahdavi4 (Collaborator) Nov 19, 2025

We have an answer-judgement data type for binary judgement; we could potentially expand that, since imo-grading-bench has 4 different possible states (or maybe we could binarize the imo-proof-bench instead).

Collaborator

yes, we should unify, let's create an issue for this, but we don't need to do it in this pr


coderabbitai bot commented Dec 23, 2025

📝 Walkthrough

This PR introduces three new IMO-related math datasets (answerbench, gradingbench, proofbench) to the nemo_skills framework. Each dataset includes module configuration constants, a CSV-to-JSONL data preparation script, and associated prompt configurations for automated grading evaluation.

Changes

  • Dataset configuration modules (nemo_skills/dataset/imo-answerbench/__init__.py, nemo_skills/dataset/imo-gradingbench/__init__.py, nemo_skills/dataset/imo-proofbench/__init__.py): Define module-level constants (DATASET_GROUP, METRICS_TYPE, GENERATION_ARGS, JUDGE_ARGS, JUDGE_PIPELINE_ARGS) configuring dataset grouping, metric type, prompt configurations, and judge pipeline model/server settings.
  • Data preparation scripts (nemo_skills/dataset/imo-answerbench/prepare.py, nemo_skills/dataset/imo-gradingbench/prepare.py, nemo_skills/dataset/imo-proofbench/prepare.py): Download CSV files from remote GitHub sources, parse rows using csv.DictReader, transform fields into JSON objects, and write line-delimited JSON (JSONL) output files. Includes error handling to raise RuntimeError on download/processing failures.
  • Prompt configurations (nemo_skills/prompt/config/judge/imo_answerbench.yaml, nemo_skills/prompt/config/judge/imo_proofbench.yaml): Define grading prompt templates specifying system roles, evaluation criteria, input/output formats, and example grading workflows for automated mathematical equivalence checking and IMO-style solution scoring.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • Add apex-shortlist dataset #1080: Both PRs add new dataset packages under nemo_skills/dataset with identical module-level configuration constants and prepare.py scripts for CSV-to-JSONL conversion, following the same architectural pattern.

Suggested reviewers

  • gwarmstrong
  • Kipok

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes the main change: introducing three new IMO benchmark datasets (imo-answerbench, imo-gradingbench, imo-proofbench) with their configurations and preparation scripts.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage, so the check was skipped.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (3)
nemo_skills/dataset/imo-gradingbench/prepare.py (1)

27-28: Move imports to the top of the file.

Imports of io and urllib.request should be moved to the top-level imports section (lines 15-17) for better readability and standard Python conventions.

🔎 Proposed fix
 import csv
+import io
 import json
+import urllib.request
 from pathlib import Path

 if __name__ == "__main__":
     data_dir = Path(__file__).absolute().parent
     data_dir.mkdir(exist_ok=True)

     base_url = "https://raw.githubusercontent.com/google-deepmind/superhuman/c1ee02e03d4cdb2ab21cd01ac927d895f5287fc8/imobench"
     source_url = f"{base_url}/gradingbench.csv"
     output_file = data_dir / "test.jsonl"

-    import io
-    import urllib.request
-
     try:
nemo_skills/dataset/imo-answerbench/prepare.py (1)

30-31: Move imports to the top of the file.

Imports of io and urllib.request should be moved to the top-level imports section (lines 15-17) for better readability and standard Python conventions.

🔎 Proposed fix
 import csv
+import io
 import json
+import urllib.request
 from pathlib import Path

 if __name__ == "__main__":
     data_dir = Path(__file__).absolute().parent
     data_dir.mkdir(exist_ok=True)

     # Download from the repo with specific commit hash
     base_url = "https://raw.githubusercontent.com/google-deepmind/superhuman/c1ee02e03d4cdb2ab21cd01ac927d895f5287fc8/imobench"
     source_url = f"{base_url}/answerbench.csv"
     output_file = data_dir / "test.jsonl"

     # Download the file to memory or temp file, then process
     # Using urllib for minimal dependencies
-    import io
-    import urllib.request
-
     try:
nemo_skills/dataset/imo-proofbench/prepare.py (1)

27-28: Move imports to the top of the file.

Imports of io and urllib.request should be moved to the top-level imports section (lines 15-17) for better readability and standard Python conventions.

🔎 Proposed fix
 import csv
+import io
 import json
+import urllib.request
 from pathlib import Path

 if __name__ == "__main__":
     data_dir = Path(__file__).absolute().parent
     data_dir.mkdir(exist_ok=True)

     base_url = "https://raw.githubusercontent.com/google-deepmind/superhuman/c1ee02e03d4cdb2ab21cd01ac927d895f5287fc8/imobench"
     source_url = f"{base_url}/proofbench.csv"
     output_file = data_dir / "test.jsonl"

-    import io
-    import urllib.request
-
     try:
🧹 Nitpick comments (5)
nemo_skills/dataset/imo-gradingbench/prepare.py (1)

51-52: Improve exception handling.

The exception handler should:

  1. Chain exceptions using raise ... from e to preserve the original traceback (B904).
  2. Consider catching more specific exceptions like urllib.error.URLError and csv.Error instead of bare Exception.
🔎 Proposed fix
-    except Exception as e:
-        raise RuntimeError(f"Failed to download or process file from {source_url}: {e}")
+    except Exception as e:
+        raise RuntimeError(f"Failed to download or process file from {source_url}") from e
nemo_skills/dataset/imo-answerbench/prepare.py (2)

49-50: Improve exception handling.

The exception handler should chain exceptions using raise ... from e to preserve the original traceback (B904).

🔎 Proposed fix
-    except Exception as e:
-        raise RuntimeError(f"Failed to download or process file from {source_url}: {e}")
+    except Exception as e:
+        raise RuntimeError(f"Failed to download or process file from {source_url}") from e

19-50: Consider extracting shared CSV download logic.

All three prepare.py scripts (imo-answerbench, imo-gradingbench, imo-proofbench) follow nearly identical patterns for downloading CSVs from the same base URL and converting to JSONL. This presents a code duplication opportunity.

Consider creating a shared utility function to reduce duplication, which would make it easier to apply fixes (like timeout handling) consistently across all datasets.

nemo_skills/dataset/imo-proofbench/prepare.py (1)

48-49: Improve exception handling.

The exception handler should chain exceptions using raise ... from e to preserve the original traceback (B904).

🔎 Proposed fix
-    except Exception as e:
-        raise RuntimeError(f"Failed to download or process file from {source_url}: {e}")
+    except Exception as e:
+        raise RuntimeError(f"Failed to download or process file from {source_url}") from e
nemo_skills/dataset/imo-proofbench/__init__.py (1)

18-19: Potentially outdated TODO comment.

Line 18's TODO suggests updating the prompt_config to point to a "specific ProofAutoGrader prompt once available," but Line 22 already uses judge/imo_proofbench, and the corresponding prompt file (nemo_skills/prompt/config/judge/imo_proofbench.yaml) exists in this PR. Consider removing or updating this TODO if the intended configuration is already in place.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7205c43 and 1beabda.

📒 Files selected for processing (8)
  • nemo_skills/dataset/imo-answerbench/__init__.py
  • nemo_skills/dataset/imo-answerbench/prepare.py
  • nemo_skills/dataset/imo-gradingbench/__init__.py
  • nemo_skills/dataset/imo-gradingbench/prepare.py
  • nemo_skills/dataset/imo-proofbench/__init__.py
  • nemo_skills/dataset/imo-proofbench/prepare.py
  • nemo_skills/prompt/config/judge/imo_answerbench.yaml
  • nemo_skills/prompt/config/judge/imo_proofbench.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.

Applied to files:

  • nemo_skills/prompt/config/judge/imo_proofbench.yaml
  • nemo_skills/prompt/config/judge/imo_answerbench.yaml
🪛 Ruff (0.14.10)
nemo_skills/dataset/imo-answerbench/prepare.py

34-34: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


49-49: Do not catch blind exception: Exception

(BLE001)


50-50: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


50-50: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/dataset/imo-proofbench/prepare.py

31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


48-48: Do not catch blind exception: Exception

(BLE001)


49-49: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


49-49: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/dataset/imo-gradingbench/prepare.py

31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


51-51: Do not catch blind exception: Exception

(BLE001)


52-52: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


52-52: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests

import urllib.request

try:
with urllib.request.urlopen(source_url) as response:
Contributor

⚠️ Potential issue | 🟠 Major

Add timeout to network request.

The urlopen call lacks a timeout parameter, which could cause the script to hang indefinitely if the remote server is unresponsive.

🔎 Proposed fix
-        with urllib.request.urlopen(source_url) as response:
+        with urllib.request.urlopen(source_url, timeout=30) as response:
             content = response.read().decode("utf-8")
(This comment applies to nemo_skills/dataset/imo-answerbench/prepare.py, line 34.)

The same timeout comment was posted on nemo_skills/dataset/imo-gradingbench/prepare.py (line 31).

The same timeout comment was posted on nemo_skills/dataset/imo-proofbench/prepare.py (line 31).

@Kipok (Collaborator) commented Jan 8, 2026

@stephencge let's try to complete this when you get a chance. These are important benchmarks to have in main; we should try to merge this soon.

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR adds support for three IMO-Bench datasets from Google DeepMind's superhuman benchmark suite:

AnswerBench: Evaluates answer verification by having models determine if a proposed answer is mathematically equivalent to the golden answer. Uses a deterministic autograder prompt that strictly checks algebraic, numerical, and set equivalence.

GradingBench: Evaluates solution grading capabilities where models assign scores (incorrect/partial/almost/correct) to mathematical solutions. Includes custom metrics computing exact accuracy, binarized accuracy, and mean absolute error.
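The three aggregates named here can be sketched as follows. This is illustrative only: the ordinal mapping and the binarization threshold are assumptions, not the repo's actual GradingBenchMetrics implementation:

```python
def gradingbench_metrics(predicted, gold):
    # Assumed ordinal mapping for the four grade levels.
    scale = {"incorrect": 0, "partial": 1, "almost": 2, "correct": 3}
    pairs = [(scale[p], scale[g]) for p, g in zip(predicted, gold)]
    # Exact accuracy: predicted grade matches the gold grade.
    exact = sum(p == g for p, g in pairs) / len(pairs)
    # Binarized accuracy: assumed here to treat "correct" as the positive
    # class and everything else as negative.
    binarized = sum((p == 3) == (g == 3) for p, g in pairs) / len(pairs)
    # Mean absolute error on the ordinal scale.
    mae = sum(abs(p - g) for p, g in pairs) / len(pairs)
    return {"exact_accuracy": exact, "binarized_accuracy": binarized, "mae": mae}
```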

ProofBench: Evaluates rigorous IMO-style proof grading on a 0-7 point scale. Models assess proof completeness, correctness, and logical rigor using reference solutions and specific grading guidelines.

Supporting changes include:

  • Added GradingBenchMetrics class with grade extraction and MAE computation
  • Enhanced Gemini model integration to support custom base URLs and thinking parameter configuration
  • Added GEMINI_API_KEY to cluster environment variables
  • Added MathArena prompt configurations

All three datasets download from the same DeepMind GitHub repository commit and convert CSV data to JSONL format for the nemo_skills pipeline.

Confidence Score: 5/5

  • Safe to merge - well-structured addition of new datasets with appropriate integration
  • The implementation is clean and follows existing patterns. All three datasets are properly structured with matching prepare scripts, configs, and prompt templates. The GradingBench metrics implementation is robust with proper error handling. The Gemini model enhancements correctly handle base_url parameter passing. No breaking changes or bugs detected.
  • No files require special attention

Important Files Changed

File Analysis

  • nemo_skills/dataset/imo-answerbench/__init__.py (5/5): New dataset config for IMO AnswerBench with judge pipeline configuration
  • nemo_skills/dataset/imo-answerbench/prepare.py (5/5): Dataset preparation script downloading and converting AnswerBench CSV to JSONL format
  • nemo_skills/dataset/imo-gradingbench/__init__.py (5/5): New dataset config for IMO GradingBench with judge evaluation type
  • nemo_skills/dataset/imo-gradingbench/prepare.py (5/5): Dataset preparation script downloading and converting GradingBench CSV to JSONL format
  • nemo_skills/dataset/imo-proofbench/__init__.py (5/5): New dataset config for IMO ProofBench with judge pipeline configuration
  • nemo_skills/dataset/imo-proofbench/prepare.py (5/5): Dataset preparation script downloading and converting ProofBench CSV to JSONL format
  • nemo_skills/evaluation/metrics/gradingbench_metrics.py (5/5): New metrics class for GradingBench computing exact accuracy, binarized accuracy, and MAE
  • nemo_skills/evaluation/metrics/map_metrics.py (5/5): Added GradingBenchMetrics to METRICS_MAP registry
  • nemo_skills/inference/model/gemini.py (5/5): Enhanced Gemini model to support custom base_url and fixed thinking parameter configuration
  • nemo_skills/pipeline/utils/cluster.py (5/5): Added GEMINI_API_KEY to auto-included environment variables list
  • nemo_skills/prompt/config/eval/matharena/aime.yaml (5/5): New MathArena AIME prompt config with answer constraint
  • nemo_skills/prompt/config/generic/matharena.yaml (5/5): New generic MathArena prompt config
  • nemo_skills/prompt/config/judge/imo_answerbench.yaml (5/5): Comprehensive autograder prompt for verifying mathematical answer equivalence
  • nemo_skills/prompt/config/judge/imo_gradingbench.yaml (5/5): Grading prompt for evaluating solution correctness with 4-level scoring
  • nemo_skills/prompt/config/judge/imo_proofbench.yaml (5/5): Rigorous IMO-style grading rubric for evaluating mathematical proofs on 0-7 scale

Sequence Diagram

sequenceDiagram
    participant User
    participant Prepare as Dataset Prepare Scripts
    participant DataFiles as Dataset JSONL Files
    participant Inference as Inference Pipeline
    participant Judge as Judge Prompts
    participant Metrics as Metrics System

    User->>Prepare: Run prepare.py for each dataset
    Prepare->>Prepare: Download CSV from DeepMind repo
    Prepare->>DataFiles: Convert to JSONL format
    Note over DataFiles: test.jsonl with fields:<br/>- AnswerBench: problem, expected_answer<br/>- GradingBench: problem_statement, proof, reward<br/>- ProofBench: problem, reference_solution, rubric
    
    User->>Inference: Generate solutions/judgements
    Inference->>DataFiles: Read dataset entries
    Inference->>Judge: Use judge prompt configs
    Note over Judge: Three judge types:<br/>- imo_answerbench: Answer verification<br/>- imo_gradingbench: Solution grading<br/>- imo_proofbench: Proof evaluation
    Judge->>Inference: Return judgement/score
    
    Inference->>Metrics: Pass predictions for evaluation
    Note over Metrics: GradingBench uses custom metrics<br/>AnswerBench/ProofBench use math metrics
    Metrics->>Metrics: Compute accuracy, MAE, etc.
    Metrics->>User: Report final metrics

@Kipok (Collaborator) commented Jan 12, 2026

@stephencge looks like some conflicts, could you resolve please?

@Kipok Kipok left a comment

could you update the docs please?

last_word = words[-1]

# Handle </think> tags
if "</think>" in last_word:
Collaborator

is this really needed? This should ideally be handled by the generation part, not in metrics. The metrics should just see the final summary.
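For reference, the tag handling under discussion amounts to something like the following sketch (the actual diff operates on the last word, and as noted above this may belong in generation rather than metrics):

```python
def strip_think(text):
    # Keep only the content after the last closing think tag, if any.
    marker = "</think>"
    if marker in text:
        return text.rsplit(marker, 1)[-1].strip()
    return text.strip()
```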

These two sets are not equivalent. The empty set contains no elements, while the Golden Answer contains an infinite number of elements.
4. **Conclusion:** Incorrect
</thinking>
\boxed{Incorrect}
Collaborator

does this work? Shouldn't it be necessary to escape this with {{Incorrect}}?
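If the prompt file is rendered with Python str.format-style templating (an assumption; nemo_skills may use a different engine), unescaped braces would indeed be treated as placeholders, which is what this comment is asking about:

```python
# A single-braced literal collides with format-field syntax...
template = "Output \\boxed{Incorrect} for problem {problem}"
try:
    template.format(problem="1+1")
    raise AssertionError("expected KeyError")
except KeyError:
    pass  # {Incorrect} was parsed as a placeholder named "Incorrect"

# ...while doubled braces render as literal braces.
escaped = "Output \\boxed{{Incorrect}} for problem {problem}"
assert escaped.format(problem="1+1") == "Output \\boxed{Incorrect} for problem 1+1"
```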


stephencge and others added 7 commits January 12, 2026 12:50
Signed-off-by: Stephen Ge <stephen.ge@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
- Add GradingBenchMetrics class with exact_accuracy, binarized_accuracy, and MAE
- Update copyright to 2025, move imports to top, add timeout to urlopen
- Rename fields: grading_guidelines->rubric, solution->reference_solution, response->proof
- Add reasoning_effort=dynamic to JUDGE_ARGS
- Fix METRICS_TYPE from invalid "judge" to "math"
- Add logging for grade extraction failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
greptile-apps bot commented Jan 12, 2026

Greptile Summary

Added three new IMO-Bench datasets (AnswerBench, ProofBench, GradingBench) for evaluating mathematical reasoning at IMO level, with automated dataset preparation scripts, specialized judge prompts, and comprehensive metrics infrastructure.

Key changes:

  • New datasets: Three dataset modules with preparation scripts that download from Google DeepMind's IMO-Bench repository
  • Judge prompt configurations: Custom prompts for each benchmark type - deterministic autograding for AnswerBench, 0-7 point rubric-based grading for ProofBench, and meta-evaluation for GradingBench
  • Metrics infrastructure: Extended is_correct_judgement() to parse \boxed{Correct/Incorrect} and <points>N out of 7</points> formats; created GradingBenchMetrics class for specialized evaluation
  • Gemini integration improvements: Fixed base_url handling to support custom endpoints and enabled thinking thoughts inclusion for better reasoning
  • Documentation: Added comprehensive documentation for all three datasets in natural-math.md

All previous thread issues have been addressed: the parsing logic now correctly handles the new output formats.
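A sketch of the parsing described in this summary (parse_imo_judgement is a hypothetical stand-in for the repo's is_correct_judgement(); the 6-7 = correct and 0-1 = incorrect thresholds follow the file overview below, with middle scores left undecided):

```python
import re


def parse_imo_judgement(text):
    # AnswerBench format: \boxed{Correct} or \boxed{Incorrect}.
    m = re.search(r"\\boxed\{(Correct|Incorrect)\}", text)
    if m:
        return m.group(1) == "Correct"
    # ProofBench format: <points>N out of 7</points>.
    m = re.search(r"<points>\s*(\d+)\s*out of 7\s*</points>", text)
    if m:
        points = int(m.group(1))
        if points >= 6:
            return True
        if points <= 1:
            return False
    return None
```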

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured and comprehensive, with proper parsing logic to handle all three benchmark formats, specialized metrics classes, and thorough documentation. All concerns from previous review threads have been addressed with the addition of format-specific parsing in is_correct_judgement() and GradingBenchMetrics.
  • No files require special attention

Important Files Changed

  • nemo_skills/evaluation/metrics/utils.py: Extended is_correct_judgement() to parse IMO-bench formats: \boxed{Correct/Incorrect} for AnswerBench and <points>N out of 7</points> for ProofBench
  • nemo_skills/prompt/config/judge/imo_answerbench.yaml: Defines deterministic autograding prompt that outputs \boxed{Correct/Incorrect} format, correctly handled by updated is_correct_judgement()
  • nemo_skills/prompt/config/judge/imo_proofbench.yaml: Defines IMO proof grading prompt with 0-7 point scoring, correctly parsed by is_correct_judgement() (6-7 = correct, 0-1 = incorrect)
  • nemo_skills/prompt/config/judge/imo_gradingbench.yaml: Defines grading evaluation prompt that outputs correct/almost/partial/incorrect, parsed by GradingBenchMetrics class
  • nemo_skills/evaluation/metrics/gradingbench_metrics.py: Implements specialized metrics for IMO GradingBench with exact/binarized accuracy and MAE, correctly extracts grades from judgement output
  • nemo_skills/dataset/imo-answerbench/__init__.py: Configures AnswerBench dataset with math metrics, Gemini 2.5 Pro judge, and dynamic reasoning effort
  • nemo_skills/dataset/imo-proofbench/__init__.py: Configures ProofBench dataset with math metrics, Gemini 2.5 Pro judge, and proof evaluation workflow
  • nemo_skills/dataset/imo-gradingbench/__init__.py: Configures GradingBench dataset with specialized gradingbench metrics for evaluating model grading ability

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant Dataset as Dataset __init__.py
    participant Inference as Inference Pipeline
    participant Judge as LLM Judge (Gemini 2.5)
    participant Metrics as Metrics System
    
    User->>PrepareScript: Run dataset preparation
    PrepareScript->>PrepareScript: Download CSV from GitHub
    PrepareScript->>PrepareScript: Convert to JSONL format
    PrepareScript-->>User: test.jsonl created
    
    User->>Inference: Run evaluation
    Inference->>Dataset: Load config (METRICS_TYPE, JUDGE_ARGS)
    Dataset-->>Inference: Return dataset config
    
    alt AnswerBench or ProofBench
        Inference->>Judge: Send problem + answer/proof
        Note over Judge: Uses judge prompt<br/>(imo_answerbench.yaml<br/>or imo_proofbench.yaml)
        Judge->>Judge: Evaluate with reasoning
        Judge-->>Inference: Return judgement<br/>(\boxed{Correct/Incorrect}<br/>or <points>N out of 7</points>)
        Inference->>Metrics: Pass judgement to MathMetrics
        Metrics->>Metrics: Parse with is_correct_judgement()
        Metrics-->>User: Binary accuracy metrics
    else GradingBench
        Inference->>Judge: Send problem + proof + rubric
        Note over Judge: Uses judge prompt<br/>(imo_gradingbench.yaml)
        Judge->>Judge: Grade solution
        Judge-->>Inference: Return grade<br/>(correct/almost/partial/incorrect)
        Inference->>Metrics: Pass to GradingBenchMetrics
        Metrics->>Metrics: Extract grade from last word
        Metrics->>Metrics: Compute exact/binarized accuracy + MAE
        Metrics-->>User: Grading evaluation metrics
    end
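The GradingBench branch of the diagram above computes exact/binarized accuracy and MAE over the four-way grade labels. A minimal standalone sketch of how such metrics could be computed — the numeric grade-to-score mapping and the function name here are illustrative assumptions, not the repository's actual GradingBenchMetrics implementation:

```python
# Hypothetical numeric mapping for the four grades; chosen for illustration only.
GRADE_TO_SCORE = {"correct": 1.0, "almost": 0.75, "partial": 0.25, "incorrect": 0.0}


def gradingbench_metrics(pairs):
    """pairs: list of (predicted_grade, expected_grade) label strings."""
    n = len(pairs)
    # Exact accuracy: predicted label matches expected label verbatim.
    exact = sum(p == e for p, e in pairs) / n
    # Binarized accuracy: collapse to correct-vs-not before comparing.
    binarized = sum((p == "correct") == (e == "correct") for p, e in pairs) / n
    # MAE over the numeric scores assigned to each grade.
    mae = sum(abs(GRADE_TO_SCORE[p] - GRADE_TO_SCORE[e]) for p, e in pairs) / n
    return {"exact_accuracy": exact, "binarized_accuracy": binarized, "mae": mae}
```

With a different score mapping the MAE values shift, but exact and binarized accuracy depend only on the labels themselves.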

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments

Comment on lines +29 to +43
with urllib.request.urlopen(source_url, timeout=30) as response:
    content = response.read().decode("utf-8")
reader = csv.DictReader(io.StringIO(content))

with open(output_file, "w", encoding="utf-8") as out:
    for row in reader:
        entry = {
            "problem_id": row["Problem ID"],
            "problem": row["Problem"],
            "expected_answer": row["Short Answer"],
            "category": row["Category"],
            "subcategory": row["Subcategory"],
            "source": row["Source"],
        }
        out.write(json.dumps(entry) + "\n")

The urllib.request.urlopen() call and CSV parsing lack error handling. Network failures, HTTP errors, timeouts, or missing CSV columns will cause the script to crash with an unhelpful error message.

Consider adding try-except blocks to handle potential errors gracefully:

Suggested change
with urllib.request.urlopen(source_url, timeout=30) as response:
    content = response.read().decode("utf-8")
reader = csv.DictReader(io.StringIO(content))
with open(output_file, "w", encoding="utf-8") as out:
    for row in reader:
        entry = {
            "problem_id": row["Problem ID"],
            "problem": row["Problem"],
            "expected_answer": row["Short Answer"],
            "category": row["Category"],
            "subcategory": row["Subcategory"],
            "source": row["Source"],
        }
        out.write(json.dumps(entry) + "\n")
try:
    with urllib.request.urlopen(source_url, timeout=30) as response:
        content = response.read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(content))
    with open(output_file, "w", encoding="utf-8") as out:
        for row in reader:
            try:
                entry = {
                    "problem_id": row["Problem ID"],
                    "problem": row["Problem"],
                    "expected_answer": row["Short Answer"],
                    "category": row["Category"],
                    "subcategory": row["Subcategory"],
                    "source": row["Source"],
                }
                out.write(json.dumps(entry) + "\n")
            except KeyError as e:
                raise KeyError(f"Missing expected column in CSV: {e}")
except urllib.error.URLError as e:
    raise RuntimeError(f"Failed to download dataset from {source_url}: {e}")

This matches the error handling pattern used in other dataset preparation scripts like nemo_skills/dataset/mmlu-prox/prepare.py.
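Since the same download-and-convert pattern recurs across all three prepare scripts, one way to address both this comment and the duplication is a small shared helper. This is a hypothetical sketch — the helper name, signature, and placement are assumptions, not code that exists in the repository:

```python
import csv
import io
import json
import urllib.error
import urllib.request


def download_csv_to_jsonl(source_url, output_file, row_to_entry, timeout=30):
    """Download a CSV and write one JSON line per row.

    row_to_entry maps a csv.DictReader row dict to the output entry dict;
    a KeyError raised by it is re-raised naming the missing column.
    """
    try:
        with urllib.request.urlopen(source_url, timeout=timeout) as response:
            content = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        raise RuntimeError(f"Failed to download dataset from {source_url}: {e}")
    with open(output_file, "w", encoding="utf-8") as out:
        for row in csv.DictReader(io.StringIO(content)):
            try:
                entry = row_to_entry(row)
            except KeyError as e:
                raise KeyError(f"Missing expected column in CSV: {e}")
            out.write(json.dumps(entry) + "\n")
```

Each benchmark's prepare.py would then only supply its own row_to_entry mapping, keeping the download and error-handling logic in one place.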

Comment on lines +29 to +48
with urllib.request.urlopen(source_url, timeout=30) as response:
    content = response.read().decode("utf-8")
reader = csv.DictReader(io.StringIO(content))

with open(output_file, "w", encoding="utf-8") as out:
    for row in reader:
        entry = {
            "grading_id": row["Grading ID"],
            "problem_id": row["Problem ID"],
            "problem_statement": row["Problem"],
            "reference_solution": row["Solution"],
            "rubric": row["Grading guidelines"],
            "proof": row["Response"],
            "points": row["Points"],
            "reward": row["Reward"],
            "source": row["Problem Source"],
            # We set 'expected_answer' to the reward/points for evaluation
            "expected_answer": row["Reward"],
        }
        out.write(json.dumps(entry) + "\n")

The urllib.request.urlopen() call and CSV parsing lack error handling. Network failures, HTTP errors, timeouts, or missing CSV columns will cause the script to crash with an unhelpful error message.

Consider adding try-except blocks similar to the pattern used in nemo_skills/dataset/mmlu-prox/prepare.py which handles errors gracefully when downloading from external sources.

Comment on lines +29 to +45
with urllib.request.urlopen(source_url, timeout=30) as response:
    content = response.read().decode("utf-8")
reader = csv.DictReader(io.StringIO(content))

with open(output_file, "w", encoding="utf-8") as out:
    for row in reader:
        entry = {
            "problem_id": row["Problem ID"],
            "problem": row["Problem"],
            "reference_solution": row["Solution"],
            "rubric": row["Grading guidelines"],
            "category": row["Category"],
            "level": row["Level"],
            "expected_answer": row["Short Answer"],
            "source": row["Source"],
        }
        out.write(json.dumps(entry) + "\n")

The urllib.request.urlopen() call and CSV parsing lack error handling. Network failures, HTTP errors, timeouts, or missing CSV columns will cause the script to crash with an unhelpful error message.

Consider adding try-except blocks similar to the pattern used in nemo_skills/dataset/mmlu-prox/prepare.py which handles errors gracefully when downloading from external sources.

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

4. **Score Determination:** Compare the findings against the Specific Grading Guidelines and the General Rubric to determine the final score.

### Output Requirements
You must provide your final score in the format <points>N out of 7</points>. Ensure the `<points>` block is used **only once**, as your answer will be parsed based on the first <points> </points> block that appears in your whole response.

The prompt asks judges to output scores in <points>N out of 7</points> format, but there's no corresponding extraction logic in the codebase to parse this format. The imo-proofbench dataset uses METRICS_TYPE = "math" which relies on MathMetrics that expects is_correct_judgement() to parse binary judgements, not numeric scores. This means the 0-7 point scores will not be properly extracted or evaluated.


def _get_grades(self, prediction: dict) -> tuple[str | None, str | None]:
"""Extract predicted and expected grades from a prediction."""
judgement = prediction.get("judgement", "") or prediction.get("generation", "")

The code tries to extract the judgement from either prediction.get("judgement", "") or prediction.get("generation", "") using an or fallback. If the "judgement" key exists but holds an empty string, the or incorrectly falls back to "generation". Use an explicit key check instead: judgement = prediction.get("judgement") if "judgement" in prediction else prediction.get("generation", "")
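The pitfall this comment describes is easy to demonstrate in isolation; a small standalone illustration, not code from the repository:

```python
# A prediction where the judge ran but produced an empty judgement.
prediction = {"judgement": "", "generation": "Judgement: Yes"}

# Falsy-or fallback: the empty judgement silently falls through to generation.
with_or = prediction.get("judgement", "") or prediction.get("generation", "")

# Explicit key check: an existing-but-empty judgement is kept as-is.
explicit = (
    prediction.get("judgement")
    if "judgement" in prediction
    else prediction.get("generation", "")
)

print(repr(with_or))
print(repr(explicit))
```

With the or version, an empty judgement could be scored against the raw generation text instead of being treated as an absent verdict.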

@greptile-apps greptile-apps bot left a comment

16 files reviewed, 4 comments

Comment on lines +28 to +29
* If Correct: \boxed{{Correct}}
* If Incorrect: \boxed{{Incorrect}}

logic: The prompt outputs \boxed{Correct} or \boxed{Incorrect}, but the is_correct_judgement() function in nemo_skills/evaluation/metrics/utils.py:37 expects Judgement: Yes or Judgement: No format. This means judgements won't be parsed correctly, resulting in all answers being marked as incorrect.

Either:

  1. Change prompt to output "Judgement: Yes" / "Judgement: No" format (like judge/math.yaml), or
  2. Update is_correct_judgement() to also parse \boxed{Correct} / \boxed{Incorrect} format

4. **Score Determination:** Compare the findings against the Specific Grading Guidelines and the General Rubric to determine the final score.

### Output Requirements
You must provide your final score in the format <points>N out of 7</points>. Ensure the `<points>` block is used **only once**, as your answer will be parsed based on the first <points> </points> block that appears in your whole response.

logic: The prompt expects judges to output <points>N out of 7</points> format, but there's no extraction logic to parse this. The imo-proofbench/__init__.py:17 uses METRICS_TYPE = "math", which relies on is_correct_judgement() that only parses binary "Judgement: Yes/No" format. The 0-7 point scores won't be extracted or evaluated correctly.


def _get_grades(self, prediction: dict) -> tuple[str | None, str | None]:
"""Extract predicted and expected grades from a prediction."""
judgement = prediction.get("judgement", "") or prediction.get("generation", "")

logic: Using or for fallback incorrectly handles empty string. If "judgement" key exists but has empty string value, the or operator will fall back to "generation" instead of using the empty string. Use explicit None check:

Suggested change
judgement = prediction.get("judgement", "") or prediction.get("generation", "")
judgement = prediction.get("judgement") if "judgement" in prediction else prediction.get("generation", "")

Comment on lines +30 to +31
# Use empty string only if no base_url provided (uses litellm's default Gemini endpoint)
# Otherwise pass through the custom base_url

logic: Using a ternary with a falsy if base_url check is incorrect. If base_url="" is passed explicitly, it is treated as falsy and replaced with "" again; the check cannot distinguish an explicit empty string from None. Use an explicit None check:

Suggested change
# Use empty string only if no base_url provided (uses litellm's default Gemini endpoint)
# Otherwise pass through the custom base_url
super().__init__(base_url="" if base_url is None else base_url, *args, **kwargs)

The imo-gradingbench dataset was using judge/math prompt which expects
{problem}, {predicted_answer}, {expected_answer} fields, but the data
has {problem_statement} and {proof} fields. This mismatch caused the
GPU test to fail with no output files being generated.

Changed to use judge/imo_gradingbench prompt which uses the correct
field names.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>

greptile-apps bot commented Jan 22, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

Add support for additional judgement formats used by IMO-Bench:
- \boxed{Correct/Incorrect} for IMO AnswerBench
- <points>N out of 7</points> for IMO ProofBench (6-7 = correct, 0-1 = incorrect)

This preserves the original prompt formats from the IMO-Bench paper
while enabling proper parsing in MathMetrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
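The parsing this commit describes could look roughly like the sketch below — a standalone illustration of the two formats and the 6-7/0-1 thresholds. The function name is hypothetical, and the actual is_correct_judgement() in nemo_skills/evaluation/metrics/utils.py may handle additional formats and edge cases:

```python
import re


def parse_imo_judgement(judgement: str):
    r"""Return True/False for IMO-bench judgement formats, else None.

    \boxed{Correct} / \boxed{Incorrect}  -> AnswerBench binary verdict.
    <points>N out of 7</points>          -> ProofBench score; 6-7 counts as
                                            correct, 0-1 as incorrect, and
                                            2-5 is left undecided (None)
                                            in this sketch.
    """
    boxed = re.search(r"\\boxed\{(Correct|Incorrect)\}", judgement)
    if boxed:
        return boxed.group(1) == "Correct"
    points = re.search(r"<points>\s*(\d+)\s*out of 7\s*</points>", judgement)
    if points:
        score = int(points.group(1))
        if score >= 6:
            return True
        if score <= 1:
            return False
    return None
```

Only the first matching pattern is used, which mirrors the prompt's instruction that the first <points> block in the response is the one that gets parsed.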
@greptile-apps greptile-apps bot left a comment

17 files reviewed, 1 comment


def _get_grades(self, prediction: dict) -> tuple[str | None, str | None]:
"""Extract predicted and expected grades from a prediction."""
judgement = prediction.get("judgement", "") or prediction.get("generation", "")

logic: If "judgement" key exists with empty string value, the or will incorrectly fall back to "generation". Use explicit None check:

Suggested change
judgement = prediction.get("judgement", "") or prediction.get("generation", "")
judgement = prediction.get("judgement") if "judgement" in prediction else prediction.get("generation", "")

@Kipok Kipok merged commit dc04ea9 into main Jan 22, 2026
7 checks passed
@Kipok Kipok deleted the stepheng/imo-bench branch January 22, 2026 21:10
@coderabbitai coderabbitai bot mentioned this pull request Feb 19, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Stephen Ge <stephen.ge@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Stephen Ge <stephen.ge@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>