Conversation
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
📝 Walkthrough

This change introduces a new ProofBench127 dataset module with infrastructure for generating and evaluating mathematical proofs. It includes a module configuration with pipeline settings, a generation task class that processes outputs by stripping self-evaluation content, a data preparation script for converting raw problem entries to JSONL format, and two prompt configurations: one for DeepSeek math proving and one for proof evaluation.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🧹 Nitpick comments (1)
nemo_skills/dataset/proofbench127/__init__.py (1)
15-29: Please add a slurm smoke test for this benchmark wiring.

This registration now depends on a custom generation module, prompt pair, judge args, and metrics path. A small end-to-end slurm job for `proofbench127` would catch `prompt_config`/`generation_key` regressions early.

Based on learnings: "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/proofbench127/__init__.py` around lines 15 - 29, Add a Slurm smoke test that runs the proofbench127 benchmark end-to-end to catch regressions in METRICS_TYPE, GENERATION_MODULE, GENERATION_ARGS, JUDGE_ARGS, and JUDGE_PIPELINE_ARGS wiring: create a minimal Slurm job that imports/loads the proofbench127 dataset/registration, invokes the generation pipeline using GENERATION_ARGS (ensure ++prompt_config=generic/deepseek-math-prover and ++generation_key=proof are used) and the judge pipeline using JUDGE_ARGS/JUDGE_PIPELINE_ARGS (ensure ++generation_key=judgement and inferences settings are applied), run only a tiny sample with limited resources and short timeout, and assert the expected metrics file (METRICS_TYPE path) is produced; place this smoke test alongside other Slurm dataset tests and name it to reference proofbench127 so CI will pick it up.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: af67f1c0-e6b1-4985-90fa-0377b19b2c3a
📒 Files selected for processing (6)
- nemo_skills/dataset/proofbench127/__init__.py
- nemo_skills/dataset/proofbench127/data/problems.txt
- nemo_skills/dataset/proofbench127/generation.py
- nemo_skills/dataset/proofbench127/prepare.py
- nemo_skills/prompt/config/generic/deepseek-math-prover.yaml
- nemo_skills/prompt/config/judge/math-proof-judge-gt-proof.yaml
```python
async def postprocess_single_output(self, output, original_data_point):
    generation = output.get("generation", "")
    if generation:
        output["generation"] = strip_self_eval(generation)
```
Fail fast if the generation payload is missing.
output["generation"] should exist here. Using .get() on Line 34 turns an upstream schema problem into a silent no-op, so malformed outputs can slip through without being stripped or flagged.
Suggested fix

```diff
 async def postprocess_single_output(self, output, original_data_point):
-    generation = output.get("generation", "")
+    generation = output["generation"]
     if generation:
         output["generation"] = strip_self_eval(generation)
```

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/proofbench127/generation.py` around lines 33 - 36, The
postprocess_single_output function currently uses output.get("generation", "")
which hides upstream schema errors; change it to access the key directly
(generation = output["generation"]) so a missing generation raises immediately,
then keep the existing truthiness check and call strip_self_eval(generation) to
update output["generation"]; reference the function postprocess_single_output,
the output dict and the strip_self_eval call when making this change.
```python
def format_entry(raw_entry):
    subset = raw_entry["subset_for_metrics"]

    entry = {
        "problem": raw_entry["problem"],
        "problem_id": None,
        "ground_truth_proof": NOT_AVAILABLE,
        "subset_for_metrics": subset,
    }

    if subset == "imoproofbench":
        entry["problem_id"] = raw_entry["problem_id"]
        entry["ground_truth_proof"] = raw_entry["reference_solution"]
        entry["metadata"] = {"category": raw_entry["category"], "level": raw_entry["level"], "source": raw_entry["source"]}
    elif subset == "putnam2025":
        entry["problem_id"] = raw_entry["problem_name"]
        entry["ground_truth_proof"] = raw_entry.get("informal_solution", NOT_AVAILABLE)
        entry["metadata"] = {}
    elif subset == "olympiads2025":
        entry["problem_id"] = raw_entry["problem_id"]
        entry["metadata"] = raw_entry["metadata"]

    return entry
```
Reject unknown subsets instead of emitting half-populated rows.
If subset_for_metrics is mistyped or a new subset shows up, this function falls through with problem_id=None and placeholder proof text, which silently corrupts the JSONL. Add an explicit else so unsupported subset values fail fast.
Suggested fix

```diff
 elif subset == "olympiads2025":
     entry["problem_id"] = raw_entry["problem_id"]
     entry["metadata"] = raw_entry["metadata"]
+else:
+    raise ValueError(f"Unsupported subset_for_metrics: {subset}")
```

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/proofbench127/prepare.py` around lines 22 - 44,
format_entry currently silently returns a half-populated entry when
subset_for_metrics is unknown; modify format_entry to validate subset (the
variable subset derived from raw_entry["subset_for_metrics"]) and add an
explicit else branch that raises a clear exception (e.g., ValueError) naming the
unsupported subset value and the offending raw_entry keys
(problem/problem_id/problem_name) so the caller fails fast instead of emitting
entries with problem_id=None and ground_truth_proof=NOT_AVAILABLE; keep existing
handling for "imoproofbench", "putnam2025", and "olympiads2025" and ensure
metadata is set in each branch before returning.
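As a standalone illustration of the fail-fast check this comment asks for (the function name and error wording below are illustrative, not the actual `prepare.py` code):

```python
KNOWN_SUBSETS = {"imoproofbench", "putnam2025", "olympiads2025"}

def validate_subset(raw_entry):
    # A mistyped or newly added subset raises immediately, instead of
    # falling through and emitting a half-populated JSONL row.
    subset = raw_entry["subset_for_metrics"]
    if subset not in KNOWN_SUBSETS:
        raise ValueError(
            f"Unsupported subset_for_metrics: {subset!r} "
            f"(problem_id={raw_entry.get('problem_id')!r})"
        )
    return subset
```

Including the offending `problem_id` in the message makes the bad input line easy to locate in `problems.txt`.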
```python
if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    input_file = data_dir / "data" / "problems.txt"
    output_file = data_dir / "test.jsonl"

    with open(input_file, "r", encoding="utf-8") as fin, open(output_file, "w", encoding="utf-8") as fout:
        for line in fin:
            raw_entry = json.loads(line.strip())
            entry = format_entry(raw_entry)
            fout.write(json.dumps(entry) + "\n")

    print(f"Wrote {output_file}")
```
Build every entry before truncating test.jsonl.
Opening the output file in "w" mode before parsing and formatting completes means one bad line can destroy the previous dataset and leave a partial file behind. Read/validate everything first, then open the destination for writing.
Suggested fix

```diff
-    with open(input_file, "r", encoding="utf-8") as fin, open(output_file, "w", encoding="utf-8") as fout:
-        for line in fin:
-            raw_entry = json.loads(line.strip())
-            entry = format_entry(raw_entry)
-            fout.write(json.dumps(entry) + "\n")
+    entries = []
+    with open(input_file, "r", encoding="utf-8") as fin:
+        for line in fin:
+            raw_entry = json.loads(line.strip())
+            entries.append(format_entry(raw_entry))
+
+    with open(output_file, "w", encoding="utf-8") as fout:
+        for entry in entries:
+            fout.write(json.dumps(entry) + "\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    input_file = data_dir / "data" / "problems.txt"
    output_file = data_dir / "test.jsonl"

    entries = []
    with open(input_file, "r", encoding="utf-8") as fin:
        for line in fin:
            raw_entry = json.loads(line.strip())
            entries.append(format_entry(raw_entry))

    with open(output_file, "w", encoding="utf-8") as fout:
        for entry in entries:
            fout.write(json.dumps(entry) + "\n")

    print(f"Wrote {output_file}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/proofbench127/prepare.py` around lines 47 - 58, The
current main block opens output_file in "w" before parsing, risking data loss if
processing fails; modify the flow in the __main__ block so you first open
input_file, read and validate all lines by calling format_entry for each,
collecting results into a list (e.g., entries), and only after successful
processing open output_file for writing and iterate over entries to write
json.dumps(entry) + "\n"; reference the existing symbols format_entry,
input_file, and output_file when making this change (optionally write to a
temporary file and rename for extra safety).
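The parenthetical "temporary file and rename" option can be sketched as below. This is a generic pattern rather than code from the PR; the helper name is illustrative:

```python
import json
import os
import tempfile
from pathlib import Path

def write_jsonl_atomic(entries, output_file):
    # Serialize every entry to a temp file in the same directory, then
    # atomically replace the destination. If any entry fails to serialize,
    # the existing test.jsonl is left untouched.
    output_file = Path(output_file)
    fd, tmp_path = tempfile.mkstemp(dir=output_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fout:
            for entry in entries:
                fout.write(json.dumps(entry) + "\n")
        os.replace(tmp_path, output_file)  # atomic within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Creating the temp file in the destination's own directory matters: `os.replace` is only atomic when source and target are on the same filesystem.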
```text
Your final response should be in the following format:

## Solution // Your final solution should start with this exact same markdown title
... // Your final solution to the problem here. You should try your best to optimize the quality of your solution according to the evaluation instruction above before finalizing it here.

## Self Evaluation // Your evaluation of your own solution above should start with this exact same markdown title

Here is my evaluation of the solution: // Your analysis should start with this exact same phrase
... // Your evaluation here. You are required to present in detail the key steps of the solution or the steps for which you had doubts regarding their correctness, and explicitly analyze whether each step is accurate: for correct steps, explain why you initially doubted their correctness and why they are indeed correct; for erroneous steps, explain the reason for the error and the impact of that error on the solution. You should analyze your solution faithfully. E.g., if there are issues in your final solution, you should point it out.

Based on my evaluation, the final overall score should be:
\boxed{{...}} // where ... should be the final overall score (0, 0.5, or 1, and nothing else) based on the evaluation instruction above. You should reach this score ONLY AFTER careful RE-examination of your own solution above
```
Keep the literal output template free of inline annotations.
nemo_skills/dataset/proofbench127/generation.py strips on the bare ## Solution / ## Self Evaluation markers only. If the model copies Lines 28-37 literally, the // ... text after ## Solution becomes part of the saved proof and leaks into judging.
Suggested cleanup

```diff
-## Solution // Your final solution should start with this exact same markdown title
-... // Your final solution to the problem here. You should try your best to optimize the quality of your solution according to the evaluation instruction above before finalizing it here.
+Use exactly this structure:
+
+## Solution
+...
-## Self Evaluation // Your evaluation of your own solution above should start with this exact same markdown title
+## Self Evaluation
-Here is my evaluation of the solution: // Your analysis should start with this exact same phrase
-... // Your evaluation here. You are required to present in detail the key steps of the solution or the steps for which you had doubts regarding their correctness, and explicitly analyze whether each step is accurate: for correct steps, explain why you initially doubted their correctness and why they are indeed correct; for erroneous steps, explain the reason for the error and the impact of that error on the solution. You should analyze your solution faithfully. E.g., if there are issues in your final solution, you should point it out.
+Here is my evaluation of the solution:
+...
```

🤖 Prompt for AI Agents
+ ...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/prompt/config/generic/deepseek-math-prover.yaml` around lines 26
- 37, The prompt template includes inline annotations after the literal markers
(e.g., the "// ..." trailing text after "## Solution" and "## Self Evaluation")
which will be saved by nemo_skills/dataset/proofbench127/generation.py because
it only strips bare "## Solution" / "## Self Evaluation" markers; remove any
inline comments or annotations that follow those exact markers in
deepseek-math-prover.yaml and replace them with clean placeholders or external
documentation (keep only the exact lines "## Solution" and "## Self Evaluation"
followed by empty/newline and the intended user-visible content) so the saved
proofs do not include stray comment text.
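To see why the inline annotations matter, consider a toy stripper (hypothetical, standing in for the real post-processing) that only recognizes the exact marker line. An annotated marker such as `## Self Evaluation // note` never matches, so nothing is stripped:

```python
def strip_on_bare_marker(text: str) -> str:
    # Illustrates the failure mode the review describes: only the exact
    # line "## Self Evaluation" triggers stripping, so a marker with a
    # trailing "// ..." annotation passes through untouched.
    lines = text.splitlines()
    if "## Self Evaluation" in lines:
        return "\n".join(lines[: lines.index("## Self Evaluation")]).rstrip()
    return text
```

A model that copies the annotated template literally therefore ships its entire self-evaluation, annotations and all, into the judged proof.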