Conversation
| import io | ||
| import urllib.request |
| except Exception as e: | ||
| raise RuntimeError(f"Failed to download or process file from {source_url}: {e}") |
There was a problem hiding this comment.
it might be better to isolate the sources of failure?
There was a problem hiding this comment.
it might be worthwile having a shared utility somewhere since the pattern looks similar across all the benchmarks?
There was a problem hiding this comment.
or we could potentially not catch an exception at all since it would be due to network error most likely, and the network error is self-explanatory.
| @@ -0,0 +1,26 @@ | |||
| # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. | |||
There was a problem hiding this comment.
| # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. | |
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. |
also for other files
| "problem_id": row["Problem ID"], | ||
| "problem_statement": row["Problem"], | ||
| "solution": row["Solution"], | ||
| "grading_guidelines": row["Grading guidelines"], |
There was a problem hiding this comment.
can we rename this to rubric? I used rubric in one of our recipes, hoping to unify the namings.
| "grading_id": row["Grading ID"], | ||
| "problem_id": row["Problem ID"], | ||
| "problem_statement": row["Problem"], | ||
| "solution": row["Solution"], |
There was a problem hiding this comment.
Is this model solution or reference solution, if reference solution, could rename to reference_solution for consistency to other benchmarks. Or ground_truth_proof if we want to keep final-answer benchmarks separate from proof benchmarks.
| "problem_statement": row["Problem"], | ||
| "solution": row["Solution"], | ||
| "grading_guidelines": row["Grading guidelines"], | ||
| "response": row["Response"], |
| GENERATION_ARGS = "++prompt_config=generic/math ++eval_type=math" | ||
| # Judge configuration: Use the AnswerAutoGrader prompt. | ||
| # Recommended model: Gemini 2.5 Pro (or similar strong reasoner) | ||
| JUDGE_ARGS = "++prompt_config=judge/imo_answerbench ++generation_key=judgement" |
There was a problem hiding this comment.
Gemini client uses no reasoning effort by default, need to add ++inference.reasoning_effort=dynamic for gemini to enable unlimited reasoning. But, that would be gemini-specific, so a better solution might be just to switch to openai models for default?
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| DATASET_GROUP = "judge" |
There was a problem hiding this comment.
We have answer-judgement data type for binary judgement, we could potentially expand that since imo-grading-bench has 4 different possible states (or maybe we could binarize the imo-proof-bench instead).
There was a problem hiding this comment.
yes, we should unify, let's create an issue for this, but we don't need to do it in this pr
📝 WalkthroughWalkthroughThis PR introduces three new IMO-related math datasets (answerbench, gradingbench, proofbench) to the nemo_skills framework. Each dataset includes module configuration constants, a CSV-to-JSONL data preparation script, and associated prompt configurations for automated grading evaluation. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (3)
nemo_skills/dataset/imo-gradingbench/prepare.py (1)
27-28: Move imports to the top of the file.Imports of
ioandurllib.requestshould be moved to the top-level imports section (lines 15-17) for better readability and standard Python conventions.🔎 Proposed fix
import csv +import io import json +import urllib.request from pathlib import Path if __name__ == "__main__": data_dir = Path(__file__).absolute().parent data_dir.mkdir(exist_ok=True) base_url = "https://raw.githubusercontent.com/google-deepmind/superhuman/c1ee02e03d4cdb2ab21cd01ac927d895f5287fc8/imobench" source_url = f"{base_url}/gradingbench.csv" output_file = data_dir / "test.jsonl" - import io - import urllib.request - try:nemo_skills/dataset/imo-answerbench/prepare.py (1)
30-31: Move imports to the top of the file.Imports of
ioandurllib.requestshould be moved to the top-level imports section (lines 15-17) for better readability and standard Python conventions.🔎 Proposed fix
import csv +import io import json +import urllib.request from pathlib import Path if __name__ == "__main__": data_dir = Path(__file__).absolute().parent data_dir.mkdir(exist_ok=True) # Download from the repo with specific commit hash base_url = "https://raw.githubusercontent.com/google-deepmind/superhuman/c1ee02e03d4cdb2ab21cd01ac927d895f5287fc8/imobench" source_url = f"{base_url}/answerbench.csv" output_file = data_dir / "test.jsonl" # Download the file to memory or temp file, then process # Using urllib for minimal dependencies - import io - import urllib.request - try:nemo_skills/dataset/imo-proofbench/prepare.py (1)
27-28: Move imports to the top of the file.Imports of
ioandurllib.requestshould be moved to the top-level imports section (lines 15-17) for better readability and standard Python conventions.🔎 Proposed fix
import csv +import io import json +import urllib.request from pathlib import Path if __name__ == "__main__": data_dir = Path(__file__).absolute().parent data_dir.mkdir(exist_ok=True) base_url = "https://raw.githubusercontent.com/google-deepmind/superhuman/c1ee02e03d4cdb2ab21cd01ac927d895f5287fc8/imobench" source_url = f"{base_url}/proofbench.csv" output_file = data_dir / "test.jsonl" - import io - import urllib.request - try:
🧹 Nitpick comments (5)
nemo_skills/dataset/imo-gradingbench/prepare.py (1)
51-52: Improve exception handling.The exception handler should:
- Chain exceptions using
raise ... from eto preserve the original traceback (B904).- Consider catching more specific exceptions like
urllib.error.URLErrorandcsv.Errorinstead of bareException.🔎 Proposed fix
- except Exception as e: - raise RuntimeError(f"Failed to download or process file from {source_url}: {e}") + except Exception as e: + raise RuntimeError(f"Failed to download or process file from {source_url}") from enemo_skills/dataset/imo-answerbench/prepare.py (2)
49-50: Improve exception handling.The exception handler should chain exceptions using
raise ... from eto preserve the original traceback (B904).🔎 Proposed fix
- except Exception as e: - raise RuntimeError(f"Failed to download or process file from {source_url}: {e}") + except Exception as e: + raise RuntimeError(f"Failed to download or process file from {source_url}") from e
19-50: Consider extracting shared CSV download logic.All three prepare.py scripts (imo-answerbench, imo-gradingbench, imo-proofbench) follow nearly identical patterns for downloading CSVs from the same base URL and converting to JSONL. This presents a code duplication opportunity.
Consider creating a shared utility function to reduce duplication, which would make it easier to apply fixes (like timeout handling) consistently across all datasets.
nemo_skills/dataset/imo-proofbench/prepare.py (1)
48-49: Improve exception handling.The exception handler should chain exceptions using
raise ... from eto preserve the original traceback (B904).🔎 Proposed fix
- except Exception as e: - raise RuntimeError(f"Failed to download or process file from {source_url}: {e}") + except Exception as e: + raise RuntimeError(f"Failed to download or process file from {source_url}") from enemo_skills/dataset/imo-proofbench/__init__.py (1)
18-19: Potentially outdated TODO comment.Line 18's TODO suggests updating the prompt_config to point to a "specific ProofAutoGrader prompt once available," but Line 22 already uses
judge/imo_proofbench, and the corresponding prompt file (nemo_skills/prompt/config/judge/imo_proofbench.yaml) exists in this PR. Consider removing or updating this TODO if the intended configuration is already in place.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
nemo_skills/dataset/imo-answerbench/__init__.pynemo_skills/dataset/imo-answerbench/prepare.pynemo_skills/dataset/imo-gradingbench/__init__.pynemo_skills/dataset/imo-gradingbench/prepare.pynemo_skills/dataset/imo-proofbench/__init__.pynemo_skills/dataset/imo-proofbench/prepare.pynemo_skills/prompt/config/judge/imo_answerbench.yamlnemo_skills/prompt/config/judge/imo_proofbench.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
Applied to files:
nemo_skills/prompt/config/judge/imo_proofbench.yamlnemo_skills/prompt/config/judge/imo_answerbench.yaml
🪛 Ruff (0.14.10)
nemo_skills/dataset/imo-answerbench/prepare.py
34-34: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
49-49: Do not catch blind exception: Exception
(BLE001)
50-50: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
50-50: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/dataset/imo-proofbench/prepare.py
31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
48-48: Do not catch blind exception: Exception
(BLE001)
49-49: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
49-49: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/dataset/imo-gradingbench/prepare.py
31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
51-51: Do not catch blind exception: Exception
(BLE001)
52-52: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
52-52: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
| import urllib.request | ||
|
|
||
| try: | ||
| with urllib.request.urlopen(source_url) as response: |
There was a problem hiding this comment.
Add timeout to network request.
The urlopen call lacks a timeout parameter, which could cause the script to hang indefinitely if the remote server is unresponsive.
🔎 Proposed fix
- with urllib.request.urlopen(source_url) as response:
+ with urllib.request.urlopen(source_url, timeout=30) as response:
content = response.read().decode("utf-8")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| with urllib.request.urlopen(source_url) as response: | |
| with urllib.request.urlopen(source_url, timeout=30) as response: | |
| content = response.read().decode("utf-8") |
🧰 Tools
🪛 Ruff (0.14.10)
34-34: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🤖 Prompt for AI Agents
In nemo_skills/dataset/imo-answerbench/prepare.py around line 34, the
urllib.request.urlopen call has no timeout which can hang the script; add a
timeout parameter to the urlopen call (e.g., timeout=10) and wrap the request in
a try/except that catches urllib.error.URLError and socket.timeout to log or
re-raise a clear error so the script fails fast on network stalls.
| import urllib.request | ||
|
|
||
| try: | ||
| with urllib.request.urlopen(source_url) as response: |
There was a problem hiding this comment.
Add timeout to network request.
The urlopen call lacks a timeout parameter, which could cause the script to hang indefinitely if the remote server is unresponsive.
🔎 Proposed fix
- with urllib.request.urlopen(source_url) as response:
+ with urllib.request.urlopen(source_url, timeout=30) as response:
content = response.read().decode("utf-8")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| with urllib.request.urlopen(source_url) as response: | |
| with urllib.request.urlopen(source_url, timeout=30) as response: | |
| content = response.read().decode("utf-8") |
🧰 Tools
🪛 Ruff (0.14.10)
31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🤖 Prompt for AI Agents
In nemo_skills/dataset/imo-gradingbench/prepare.py around line 31 the
urllib.request.urlopen call is missing a timeout which can cause the process to
hang; update the call to pass a reasonable timeout (e.g., 10 seconds) and wrap
the request in exception handling for urllib.error.URLError/socket.timeout to
log or raise a clear error on timeout/failure so the script fails fast instead
of hanging indefinitely.
| import urllib.request | ||
|
|
||
| try: | ||
| with urllib.request.urlopen(source_url) as response: |
There was a problem hiding this comment.
Add timeout to network request.
The urlopen call lacks a timeout parameter, which could cause the script to hang indefinitely if the remote server is unresponsive.
🔎 Proposed fix
- with urllib.request.urlopen(source_url) as response:
+ with urllib.request.urlopen(source_url, timeout=30) as response:
content = response.read().decode("utf-8")🧰 Tools
🪛 Ruff (0.14.10)
31-31: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🤖 Prompt for AI Agents
In nemo_skills/dataset/imo-proofbench/prepare.py around line 31, the
urllib.request.urlopen call has no timeout and can hang; add a timeout parameter
(e.g., pass timeout=10 or use a module-level TIMEOUT constant) to the urlopen
call and, if not already present, wrap the request in a try/except to catch
socket.timeout and urllib.error.URLError so you can log or raise a clear error
instead of blocking indefinitely.
|
@stephencge let's try to complete this when you get a chance. These are important benchmarks to have in main, we should try to merge this soon |
1beabda to
66a4557
Compare
There was a problem hiding this comment.
Greptile Overview
Greptile Summary
This PR adds support for three IMO-Bench datasets from Google DeepMind's superhuman benchmark suite:
AnswerBench: Evaluates answer verification by having models determine if a proposed answer is mathematically equivalent to the golden answer. Uses a deterministic autograder prompt that strictly checks algebraic, numerical, and set equivalence.
GradingBench: Evaluates solution grading capabilities where models assign scores (incorrect/partial/almost/correct) to mathematical solutions. Includes custom metrics computing exact accuracy, binarized accuracy, and mean absolute error.
ProofBench: Evaluates rigorous IMO-style proof grading on a 0-7 point scale. Models assess proof completeness, correctness, and logical rigor using reference solutions and specific grading guidelines.
Supporting changes include:
- Added
GradingBenchMetricsclass with grade extraction and MAE computation - Enhanced Gemini model integration to support custom base URLs and thinking parameter configuration
- Added
GEMINI_API_KEYto cluster environment variables - Added MathArena prompt configurations
All three datasets download from the same DeepMind GitHub repository commit and convert CSV data to JSONL format for the nemo_skills pipeline.
Confidence Score: 5/5
- Safe to merge - well-structured addition of new datasets with appropriate integration
- The implementation is clean and follows existing patterns. All three datasets are properly structured with matching prepare scripts, configs, and prompt templates. The GradingBench metrics implementation is robust with proper error handling. The Gemini model enhancements correctly handle base_url parameter passing. No breaking changes or bugs detected.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/imo-answerbench/init.py | 5/5 | New dataset config for IMO AnswerBench with judge pipeline configuration |
| nemo_skills/dataset/imo-answerbench/prepare.py | 5/5 | Dataset preparation script downloading and converting AnswerBench CSV to JSONL format |
| nemo_skills/dataset/imo-gradingbench/init.py | 5/5 | New dataset config for IMO GradingBench with judge evaluation type |
| nemo_skills/dataset/imo-gradingbench/prepare.py | 5/5 | Dataset preparation script downloading and converting GradingBench CSV to JSONL format |
| nemo_skills/dataset/imo-proofbench/init.py | 5/5 | New dataset config for IMO ProofBench with judge pipeline configuration |
| nemo_skills/dataset/imo-proofbench/prepare.py | 5/5 | Dataset preparation script downloading and converting ProofBench CSV to JSONL format |
| nemo_skills/evaluation/metrics/gradingbench_metrics.py | 5/5 | New metrics class for GradingBench computing exact accuracy, binarized accuracy, and MAE |
| nemo_skills/evaluation/metrics/map_metrics.py | 5/5 | Added GradingBenchMetrics to METRICS_MAP registry |
| nemo_skills/inference/model/gemini.py | 5/5 | Enhanced Gemini model to support custom base_url and fixed thinking parameter configuration |
| nemo_skills/pipeline/utils/cluster.py | 5/5 | Added GEMINI_API_KEY to auto-included environment variables list |
| nemo_skills/prompt/config/eval/matharena/aime.yaml | 5/5 | New MathArena AIME prompt config with answer constraint |
| nemo_skills/prompt/config/generic/matharena.yaml | 5/5 | New generic MathArena prompt config |
| nemo_skills/prompt/config/judge/imo_answerbench.yaml | 5/5 | Comprehensive autograder prompt for verifying mathematical answer equivalence |
| nemo_skills/prompt/config/judge/imo_gradingbench.yaml | 5/5 | Grading prompt for evaluating solution correctness with 4-level scoring |
| nemo_skills/prompt/config/judge/imo_proofbench.yaml | 5/5 | Rigorous IMO-style grading rubric for evaluating mathematical proofs on 0-7 scale |
Sequence Diagram
sequenceDiagram
participant User
participant Prepare as Dataset Prepare Scripts
participant DataFiles as Dataset JSONL Files
participant Inference as Inference Pipeline
participant Judge as Judge Prompts
participant Metrics as Metrics System
User->>Prepare: Run prepare.py for each dataset
Prepare->>Prepare: Download CSV from DeepMind repo
Prepare->>DataFiles: Convert to JSONL format
Note over DataFiles: test.jsonl with fields:<br/>- AnswerBench: problem, expected_answer<br/>- GradingBench: problem_statement, proof, reward<br/>- ProofBench: problem, reference_solution, rubric
User->>Inference: Generate solutions/judgements
Inference->>DataFiles: Read dataset entries
Inference->>Judge: Use judge prompt configs
Note over Judge: Three judge types:<br/>- imo_answerbench: Answer verification<br/>- imo_gradingbench: Solution grading<br/>- imo_proofbench: Proof evaluation
Judge->>Inference: Return judgement/score
Inference->>Metrics: Pass predictions for evaluation
Note over Metrics: GradingBench uses custom metrics<br/>AnswerBench/ProofBench use math metrics
Metrics->>Metrics: Compute accuracy, MAE, etc.
Metrics->>User: Report final metrics
|
@stephencge looks like some conflicts, could you resolve please? |
Kipok
left a comment
There was a problem hiding this comment.
could you update the docs please?
| last_word = words[-1] | ||
|
|
||
| # Handle </think> tags | ||
| if "</think>" in last_word: |
There was a problem hiding this comment.
is this really needed? This should ideally be handled by the generation part, not in metrics. The metrics should just see final summary
| These two sets are not equivalent. The empty set contains no elements, while the Golden Answer contains an infinite number of elements. | ||
| 4. **Conclusion:** Incorrect | ||
| </thinking> | ||
| \boxed{Incorrect} |
There was a problem hiding this comment.
does this work? Shouldn't it be necessary to escape this with {{Incorrect}}?
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| DATASET_GROUP = "judge" |
There was a problem hiding this comment.
yes, we should unify, let's create an issue for this, but we don't need to do it in this pr
Signed-off-by: Stephen Ge <stephen.ge@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
66a4557 to
6d82eda
Compare
- Add GradingBenchMetrics class with exact_accuracy, binarized_accuracy, and MAE - Update copyright to 2025, move imports to top, add timeout to urlopen - Rename fields: grading_guidelines->rubric, solution->reference_solution, response->proof - Add reasoning_effort=dynamic to JUDGE_ARGS - Fix METRICS_TYPE from invalid "judge" to "math" - Add logging for grade extraction failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Greptile SummaryAdded three new IMO-Bench datasets (AnswerBench, ProofBench, GradingBench) for evaluating mathematical reasoning at IMO level, with automated dataset preparation scripts, specialized judge prompts, and comprehensive metrics infrastructure. Key changes:
All previous thread issues have been addressed - the parsing logic now correctly handles the new output formats. Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant PrepareScript as prepare.py
participant Dataset as Dataset __init__.py
participant Inference as Inference Pipeline
participant Judge as LLM Judge (Gemini 2.5)
participant Metrics as Metrics System
User->>PrepareScript: Run dataset preparation
PrepareScript->>PrepareScript: Download CSV from GitHub
PrepareScript->>PrepareScript: Convert to JSONL format
PrepareScript-->>User: test.jsonl created
User->>Inference: Run evaluation
Inference->>Dataset: Load config (METRICS_TYPE, JUDGE_ARGS)
Dataset-->>Inference: Return dataset config
alt AnswerBench or ProofBench
Inference->>Judge: Send problem + answer/proof
Note over Judge: Uses judge prompt<br/>(imo_answerbench.yaml<br/>or imo_proofbench.yaml)
Judge->>Judge: Evaluate with reasoning
Judge-->>Inference: Return judgement<br/>(\boxed{Correct/Incorrect}<br/>or <points>N out of 7</points>)
Inference->>Metrics: Pass judgement to MathMetrics
Metrics->>Metrics: Parse with is_correct_judgement()
Metrics-->>User: Binary accuracy metrics
else GradingBench
Inference->>Judge: Send problem + proof + rubric
Note over Judge: Uses judge prompt<br/>(imo_gradingbench.yaml)
Judge->>Judge: Grade solution
Judge-->>Inference: Return grade<br/>(correct/almost/partial/incorrect)
Inference->>Metrics: Pass to GradingBenchMetrics
Metrics->>Metrics: Extract grade from last word
Metrics->>Metrics: Compute exact/binarized accuracy + MAE
Metrics-->>User: Grading evaluation metrics
end
|
| with urllib.request.urlopen(source_url, timeout=30) as response: | ||
| content = response.read().decode("utf-8") | ||
| reader = csv.DictReader(io.StringIO(content)) | ||
|
|
||
| with open(output_file, "w", encoding="utf-8") as out: | ||
| for row in reader: | ||
| entry = { | ||
| "problem_id": row["Problem ID"], | ||
| "problem": row["Problem"], | ||
| "expected_answer": row["Short Answer"], | ||
| "category": row["Category"], | ||
| "subcategory": row["Subcategory"], | ||
| "source": row["Source"], | ||
| } | ||
| out.write(json.dumps(entry) + "\n") |
There was a problem hiding this comment.
The urllib.request.urlopen() call and CSV parsing lack error handling. Network failures, HTTP errors, timeouts, or missing CSV columns will cause the script to crash with an unhelpful error message.
Consider adding try-except blocks to handle potential errors gracefully:
| with urllib.request.urlopen(source_url, timeout=30) as response: | |
| content = response.read().decode("utf-8") | |
| reader = csv.DictReader(io.StringIO(content)) | |
| with open(output_file, "w", encoding="utf-8") as out: | |
| for row in reader: | |
| entry = { | |
| "problem_id": row["Problem ID"], | |
| "problem": row["Problem"], | |
| "expected_answer": row["Short Answer"], | |
| "category": row["Category"], | |
| "subcategory": row["Subcategory"], | |
| "source": row["Source"], | |
| } | |
| out.write(json.dumps(entry) + "\n") | |
| try: | |
| with urllib.request.urlopen(source_url, timeout=30) as response: | |
| content = response.read().decode("utf-8") | |
| reader = csv.DictReader(io.StringIO(content)) | |
| with open(output_file, "w", encoding="utf-8") as out: | |
| for row in reader: | |
| try: | |
| entry = { | |
| "problem_id": row["Problem ID"], | |
| "problem": row["Problem"], | |
| "expected_answer": row["Short Answer"], | |
| "category": row["Category"], | |
| "subcategory": row["Subcategory"], | |
| "source": row["Source"], | |
| } | |
| out.write(json.dumps(entry) + "\n") | |
| except KeyError as e: | |
| raise KeyError(f"Missing expected column in CSV: {e}") | |
| except urllib.error.URLError as e: | |
| raise RuntimeError(f"Failed to download dataset from {source_url}: {e}") |
This matches the error handling pattern used in other dataset preparation scripts like nemo_skills/dataset/mmlu-prox/prepare.py.
| with urllib.request.urlopen(source_url, timeout=30) as response: | ||
| content = response.read().decode("utf-8") | ||
| reader = csv.DictReader(io.StringIO(content)) | ||
|
|
||
| with open(output_file, "w", encoding="utf-8") as out: | ||
| for row in reader: | ||
| entry = { | ||
| "grading_id": row["Grading ID"], | ||
| "problem_id": row["Problem ID"], | ||
| "problem_statement": row["Problem"], | ||
| "reference_solution": row["Solution"], | ||
| "rubric": row["Grading guidelines"], | ||
| "proof": row["Response"], | ||
| "points": row["Points"], | ||
| "reward": row["Reward"], | ||
| "source": row["Problem Source"], | ||
| # We set 'expected_answer' to the reward/points for evaluation | ||
| "expected_answer": row["Reward"], | ||
| } | ||
| out.write(json.dumps(entry) + "\n") |
There was a problem hiding this comment.
The urllib.request.urlopen() call and CSV parsing lack error handling. Network failures, HTTP errors, timeouts, or missing CSV columns will cause the script to crash with an unhelpful error message.
Consider adding try-except blocks similar to the pattern used in nemo_skills/dataset/mmlu-prox/prepare.py which handles errors gracefully when downloading from external sources.
| with urllib.request.urlopen(source_url, timeout=30) as response: | ||
| content = response.read().decode("utf-8") | ||
| reader = csv.DictReader(io.StringIO(content)) | ||
|
|
||
| with open(output_file, "w", encoding="utf-8") as out: | ||
| for row in reader: | ||
| entry = { | ||
| "problem_id": row["Problem ID"], | ||
| "problem": row["Problem"], | ||
| "reference_solution": row["Solution"], | ||
| "rubric": row["Grading guidelines"], | ||
| "category": row["Category"], | ||
| "level": row["Level"], | ||
| "expected_answer": row["Short Answer"], | ||
| "source": row["Source"], | ||
| } | ||
| out.write(json.dumps(entry) + "\n") |
There was a problem hiding this comment.
The urllib.request.urlopen() call and CSV parsing lack error handling. Network failures, HTTP errors, timeouts, or missing CSV columns will cause the script to crash with an unhelpful error message.
Consider adding try-except blocks similar to the pattern used in nemo_skills/dataset/mmlu-prox/prepare.py which handles errors gracefully when downloading from external sources.
6d82eda to
dd6fbc5
Compare
| 4. **Score Determination:** Compare the findings against the Specific Grading Guidelines and the General Rubric to determine the final score. | ||
|
|
||
| ### Output Requirements | ||
| You must provide your final score in the format <points>N out of 7</points>. Ensure the `<points>` block is used **only once**, as your answer will be parsed based on the first <points> </points> block that appears in your whole response. |
There was a problem hiding this comment.
The prompt asks judges to output scores in <points>N out of 7</points> format, but there's no corresponding extraction logic in the codebase to parse this format. The imo-proofbench dataset uses METRICS_TYPE = "math" which relies on MathMetrics that expects is_correct_judgement() to parse binary judgements, not numeric scores. This means the 0-7 point scores will not be properly extracted or evaluated.
|
|
||
| def _get_grades(self, prediction: dict) -> tuple[str | None, str | None]: | ||
| """Extract predicted and expected grades from a prediction.""" | ||
| judgement = prediction.get("judgement", "") or prediction.get("generation", "") |
There was a problem hiding this comment.
The code tries to extract judgement from either prediction.get("judgement", "") or prediction.get("generation", "") using fallback with or. If "judgement" key exists but has an empty string value, the or will incorrectly fall back to "generation". Use explicit None check: judgement = prediction.get("judgement") if "judgement" in prediction else prediction.get("generation", "")
| * If Correct: \boxed{{Correct}} | ||
| * If Incorrect: \boxed{{Incorrect}} |
There was a problem hiding this comment.
logic: The prompt outputs \boxed{Correct} or \boxed{Incorrect}, but the is_correct_judgement() function in nemo_skills/evaluation/metrics/utils.py:37 expects Judgement: Yes or Judgement: No format. This means judgements won't be parsed correctly, resulting in all answers being marked as incorrect.
Either:
- Change prompt to output "Judgement: Yes" / "Judgement: No" format (like
judge/math.yaml), or - Update
is_correct_judgement()to also parse\boxed{Correct}/\boxed{Incorrect}format
| 4. **Score Determination:** Compare the findings against the Specific Grading Guidelines and the General Rubric to determine the final score. | ||
|
|
||
| ### Output Requirements | ||
| You must provide your final score in the format <points>N out of 7</points>. Ensure the `<points>` block is used **only once**, as your answer will be parsed based on the first <points> </points> block that appears in your whole response. |
There was a problem hiding this comment.
logic: The prompt expects judges to output <points>N out of 7</points> format, but there's no extraction logic to parse this. The imo-proofbench/__init__.py:17 uses METRICS_TYPE = "math", which relies on is_correct_judgement() that only parses binary "Judgement: Yes/No" format. The 0-7 point scores won't be extracted or evaluated correctly.
|
|
||
| def _get_grades(self, prediction: dict) -> tuple[str | None, str | None]: | ||
| """Extract predicted and expected grades from a prediction.""" | ||
| judgement = prediction.get("judgement", "") or prediction.get("generation", "") |
There was a problem hiding this comment.
logic: Using or for fallback incorrectly handles empty string. If "judgement" key exists but has empty string value, the or operator will fall back to "generation" instead of using the empty string. Use explicit None check:
| judgement = prediction.get("judgement", "") or prediction.get("generation", "") | |
| judgement = prediction.get("judgement") if "judgement" in prediction else prediction.get("generation", "") |
| # Use empty string only if no base_url provided (uses litellm's default Gemini endpoint) | ||
| # Otherwise pass through the custom base_url |
There was a problem hiding this comment.
logic: Using ternary with falsy check if base_url is incorrect. If base_url="" is explicitly passed, it will be treated as falsy and replaced with "" again. Use explicit None check:
| # Use empty string only if no base_url provided (uses litellm's default Gemini endpoint) | |
| # Otherwise pass through the custom base_url | |
| super().__init__(base_url="" if base_url is None else base_url, *args, **kwargs) |
The imo-gradingbench dataset was using judge/math prompt which expects
{problem}, {predicted_answer}, {expected_answer} fields, but the data
has {problem_statement} and {proof} fields. This mismatch caused the
GPU test to fail with no output files being generated.
Changed to use judge/imo_gradingbench prompt which uses the correct
field names.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Greptile found no issues!From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR". |
Add support for additional judgement formats used by IMO-Bench:
- \boxed{Correct/Incorrect} for IMO AnswerBench
- <points>N out of 7</points> for IMO ProofBench (6-7 = correct, 0-1 = incorrect)
This preserves the original prompt formats from the IMO-Bench paper
while enabling proper parsing in MathMetrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
|
|
||
| def _get_grades(self, prediction: dict) -> tuple[str | None, str | None]: | ||
| """Extract predicted and expected grades from a prediction.""" | ||
| judgement = prediction.get("judgement", "") or prediction.get("generation", "") |
There was a problem hiding this comment.
logic: If "judgement" key exists with empty string value, the or will incorrectly fall back to "generation". Use explicit None check:
| judgement = prediction.get("judgement", "") or prediction.get("generation", "") | |
| judgement = prediction.get("judgement") if "judgement" in prediction else prediction.get("generation", "") |
Signed-off-by: Stephen Ge <stephen.ge@gmail.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Stephen Ge <stephen.ge@gmail.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.