
Add proofbench127 #1307

Closed

smahdavi4 wants to merge 1 commit into main from proofbench127

Conversation


@smahdavi4 smahdavi4 commented Mar 14, 2026

Summary by CodeRabbit

Release Notes

New Features

  • Added ProofBench127 dataset with comprehensive generation and evaluation pipelines for mathematical proofs.
  • Supports problem sources from IMO Proof Bench, Putnam 2025, and Olympiad 2025 competitions.
  • Includes automated data preparation to standardize problem entries.

Chores

  • Added configuration files for proof generation prompts and judging workflows.

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

coderabbitai bot commented Mar 14, 2026

📝 Walkthrough

This change introduces a new ProofBench127 dataset module with infrastructure for generating and evaluating mathematical proofs. It includes a module configuration with pipeline settings, a generation task class that postprocesses outputs by stripping self-evaluation content, a data preparation script for converting raw problem entries to JSONL format, and two prompt configurations: one for DeepSeek-style math proving and one for proof evaluation.

Changes

Cohort / File(s): Summary

  • ProofBench127 Module Configuration (nemo_skills/dataset/proofbench127/__init__.py)
    Defines constants for the metrics type, generation module path, and pipeline arguments, including the generation prompt config (deepseek-math-prover), judge arguments, and OpenAI model/server configuration for inference.
  • Generation Task Implementation (nemo_skills/dataset/proofbench127/generation.py)
    Introduces a ProofbenchGenerationTask class extending GenerationTask with custom postprocessing that strips self-evaluation content via a strip_self_eval() helper. Defines solution and self-eval header constants, exposes a task class alias, and provides a Hydra-driven generate() entry point with an auto-generated help message.
  • Data Preparation (nemo_skills/dataset/proofbench127/prepare.py)
    Implements a format_entry() function to convert raw problem entries into a standardized JSONL format, with conditional logic for three problem subsets (imoproofbench, putnam2025, olympiads2025). Includes a __main__ block that reads from data/problems.txt and writes formatted entries to test.jsonl.
  • Prompt Configurations (nemo_skills/prompt/config/generic/deepseek-math-prover.yaml, nemo_skills/prompt/config/judge/math-proof-judge-gt-proof.yaml)
    Adds two YAML prompt templates: deepseek-math-prover instructs comprehensive solution generation with self-evaluation and a scoring rubric; math-proof-judge-gt-proof guides proof validation with step-by-step reasoning checks, LaTeX requirements, and structured judgment output.
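The strip_self_eval() helper referenced above is not shown in this walkthrough. A minimal sketch of what such a helper might look like, assuming the header constants match the "## Solution" / "## Self Evaluation" markers from the prompt template quoted later in this review, is:

```python
# Hypothetical sketch; the real helper lives in
# nemo_skills/dataset/proofbench127/generation.py and may differ.
SOLUTION_HEADER = "## Solution"
SELF_EVAL_HEADER = "## Self Evaluation"


def strip_self_eval(generation: str) -> str:
    """Drop everything from the self-evaluation header onward,
    keeping only the proof text that precedes it."""
    idx = generation.find(SELF_EVAL_HEADER)
    if idx == -1:
        # No self-evaluation section; return the generation unchanged.
        return generation
    return generation[:idx].rstrip()
```

This keeps the judged artifact limited to the proof itself, which is why the review below flags inline annotations in the prompt template that would survive this stripping.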

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Add proofbench127' directly corresponds to the main change in the pull request, which introduces a new proofbench127 dataset module with configuration, generation, and data preparation components.


coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
nemo_skills/dataset/proofbench127/__init__.py (1)

15-29: Please add a slurm smoke test for this benchmark wiring.

This registration now depends on a custom generation module, prompt pair, judge args, and metrics path. A small end-to-end slurm job for proofbench127 would catch prompt_config / generation_key regressions early.

Based on learnings "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/proofbench127/__init__.py` around lines 15 - 29, Add a
Slurm smoke test that runs the proofbench127 benchmark end-to-end to catch
regressions in METRICS_TYPE, GENERATION_MODULE, GENERATION_ARGS, JUDGE_ARGS, and
JUDGE_PIPELINE_ARGS wiring: create a minimal Slurm job that imports/loads the
proofbench127 dataset/registration, invokes the generation pipeline using
GENERATION_ARGS (ensure ++prompt_config=generic/deepseek-math-prover and
++generation_key=proof are used) and the judge pipeline using
JUDGE_ARGS/JUDGE_PIPELINE_ARGS (ensure ++generation_key=judgement and inferences
settings are applied), run only a tiny sample with limited resources and short
timeout, and assert the expected metrics file (METRICS_TYPE path) is produced;
place this smoke test alongside other Slurm dataset tests and name it to
reference proofbench127 so CI will pick it up.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: af67f1c0-e6b1-4985-90fa-0377b19b2c3a

📥 Commits

Reviewing files that changed from the base of the PR and between 86071c1 and 628dc82.

📒 Files selected for processing (6)
  • nemo_skills/dataset/proofbench127/__init__.py
  • nemo_skills/dataset/proofbench127/data/problems.txt
  • nemo_skills/dataset/proofbench127/generation.py
  • nemo_skills/dataset/proofbench127/prepare.py
  • nemo_skills/prompt/config/generic/deepseek-math-prover.yaml
  • nemo_skills/prompt/config/judge/math-proof-judge-gt-proof.yaml

Comment on lines +33 to +36
async def postprocess_single_output(self, output, original_data_point):
    generation = output.get("generation", "")
    if generation:
        output["generation"] = strip_self_eval(generation)

⚠️ Potential issue | 🟡 Minor

Fail fast if the generation payload is missing.

output["generation"] should exist here. Using .get() on Line 34 turns an upstream schema problem into a silent no-op, so malformed outputs can slip through without being stripped or flagged.

Suggested fix
     async def postprocess_single_output(self, output, original_data_point):
-        generation = output.get("generation", "")
+        generation = output["generation"]
         if generation:
             output["generation"] = strip_self_eval(generation)

As per coding guidelines "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/proofbench127/generation.py` around lines 33 - 36, The
postprocess_single_output function currently uses output.get("generation", "")
which hides upstream schema errors; change it to access the key directly
(generation = output["generation"]) so a missing generation raises immediately,
then keep the existing truthiness check and call strip_self_eval(generation) to
update output["generation"]; reference the function postprocess_single_output,
the output dict and the strip_self_eval call when making this change.

Comment on lines +22 to +44
def format_entry(raw_entry):
    subset = raw_entry["subset_for_metrics"]

    entry = {
        "problem": raw_entry["problem"],
        "problem_id": None,
        "ground_truth_proof": NOT_AVAILABLE,
        "subset_for_metrics": subset,
    }

    if subset == "imoproofbench":
        entry["problem_id"] = raw_entry["problem_id"]
        entry["ground_truth_proof"] = raw_entry["reference_solution"]
        entry["metadata"] = {"category": raw_entry["category"], "level": raw_entry["level"], "source": raw_entry["source"]}
    elif subset == "putnam2025":
        entry["problem_id"] = raw_entry["problem_name"]
        entry["ground_truth_proof"] = raw_entry.get("informal_solution", NOT_AVAILABLE)
        entry["metadata"] = {}
    elif subset == "olympiads2025":
        entry["problem_id"] = raw_entry["problem_id"]
        entry["metadata"] = raw_entry["metadata"]

    return entry

⚠️ Potential issue | 🟠 Major

Reject unknown subsets instead of emitting half-populated rows.

If subset_for_metrics is mistyped or a new subset shows up, this function falls through with problem_id=None and placeholder proof text, which silently corrupts the JSONL. Add an explicit else so unsupported subset values fail fast.

Suggested fix
     elif subset == "olympiads2025":
         entry["problem_id"] = raw_entry["problem_id"]
         entry["metadata"] = raw_entry["metadata"]
+    else:
+        raise ValueError(f"Unsupported subset_for_metrics: {subset}")

As per coding guidelines "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/proofbench127/prepare.py` around lines 22 - 44,
format_entry currently silently returns a half-populated entry when
subset_for_metrics is unknown; modify format_entry to validate subset (the
variable subset derived from raw_entry["subset_for_metrics"]) and add an
explicit else branch that raises a clear exception (e.g., ValueError) naming the
unsupported subset value and the offending raw_entry keys
(problem/problem_id/problem_name) so the caller fails fast instead of emitting
entries with problem_id=None and ground_truth_proof=NOT_AVAILABLE; keep existing
handling for "imoproofbench", "putnam2025", and "olympiads2025" and ensure
metadata is set in each branch before returning.

Comment on lines +47 to +58
if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    input_file = data_dir / "data" / "problems.txt"
    output_file = data_dir / "test.jsonl"

    with open(input_file, "r", encoding="utf-8") as fin, open(output_file, "w", encoding="utf-8") as fout:
        for line in fin:
            raw_entry = json.loads(line.strip())
            entry = format_entry(raw_entry)
            fout.write(json.dumps(entry) + "\n")

    print(f"Wrote {output_file}")

⚠️ Potential issue | 🟠 Major

Build every entry before truncating test.jsonl.

Opening the output file in "w" mode before parsing and formatting completes means one bad line can destroy the previous dataset and leave a partial file behind. Read/validate everything first, then open the destination for writing.

Suggested fix
-    with open(input_file, "r", encoding="utf-8") as fin, open(output_file, "w", encoding="utf-8") as fout:
-        for line in fin:
-            raw_entry = json.loads(line.strip())
-            entry = format_entry(raw_entry)
-            fout.write(json.dumps(entry) + "\n")
+    entries = []
+    with open(input_file, "r", encoding="utf-8") as fin:
+        for line in fin:
+            raw_entry = json.loads(line.strip())
+            entries.append(format_entry(raw_entry))
+
+    with open(output_file, "w", encoding="utf-8") as fout:
+        for entry in entries:
+            fout.write(json.dumps(entry) + "\n")

As per coding guidelines "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    input_file = data_dir / "data" / "problems.txt"
    output_file = data_dir / "test.jsonl"

    entries = []
    with open(input_file, "r", encoding="utf-8") as fin:
        for line in fin:
            raw_entry = json.loads(line.strip())
            entries.append(format_entry(raw_entry))

    with open(output_file, "w", encoding="utf-8") as fout:
        for entry in entries:
            fout.write(json.dumps(entry) + "\n")

    print(f"Wrote {output_file}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/proofbench127/prepare.py` around lines 47 - 58, The
current main block opens output_file in "w" before parsing, risking data loss if
processing fails; modify the flow in the __main__ block so you first open
input_file, read and validate all lines by calling format_entry for each,
collecting results into a list (e.g., entries), and only after successful
processing open output_file for writing and iterate over entries to write
json.dumps(entry) + "\n"; reference the existing symbols format_entry,
input_file, and output_file when making this change (optionally write to a
temporary file and rename for extra safety).
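The temp-file-and-rename option mentioned above can be sketched as follows. The helper name write_entries_atomically is hypothetical, and it assumes entries is the list already built by calling format_entry on every input line:

```python
import json
import os
import tempfile
from pathlib import Path


def write_entries_atomically(entries, output_file: Path) -> None:
    """Write all entries to a temp file in the target directory, then
    atomically swap it into place so a failure never truncates the target."""
    fd, tmp_path = tempfile.mkstemp(dir=output_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fout:
            for entry in entries:
                fout.write(json.dumps(entry) + "\n")
        os.replace(tmp_path, output_file)  # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # remove the partial temp file before re-raising
        raise
```

Because os.replace is atomic on a single filesystem, readers of test.jsonl see either the complete old file or the complete new one, never a half-written mix.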

Comment on lines +26 to +37
Your final response should be in the following format:

## Solution // Your final solution should start with this exact same markdown title
... // Your final solution to the problem here. You should try your best to optimize the quality of your solution according to the evaluation instruction above before finalizing it here.

## Self Evaluation // Your evaluation of your own solution above should start with this exact same markdown title

Here is my evaluation of the solution: // Your analysis should start with this exact same phrase
... // Your evaluation here. You are required to present in detail the key steps of the solution or the steps for which you had doubts regarding their correctness, and explicitly analyze whether each step is accurate: for correct steps, explain why you initially doubted their correctness and why they are indeed correct; for erroneous steps, explain the reason for the error and the impact of that error on the solution. You should analyze your solution faithfully. E.g., if there are issues in your final solution, you should point it out.

Based on my evaluation, the final overall score should be:
\boxed{{...}} // where ... should be the final overall score (0, 0.5, or 1, and nothing else) based on the evaluation instruction above. You should reach this score ONLY AFTER careful RE-examination of your own solution above

⚠️ Potential issue | 🟡 Minor

Keep the literal output template free of inline annotations.

nemo_skills/dataset/proofbench127/generation.py strips on the bare ## Solution / ## Self Evaluation markers only. If the model copies Lines 28-37 literally, the // ... text after ## Solution becomes part of the saved proof and leaks into judging.

Suggested cleanup
-  ## Solution // Your final solution should start with this exact same markdown title
-  ... // Your final solution to the problem here. You should try your best to optimize the quality of your solution according to the evaluation instruction above before finalizing it here.
+  Use exactly this structure:
+
+  ## Solution
+  ...

-  ## Self Evaluation // Your evaluation of your own solution above should start with this exact same markdown title
+  ## Self Evaluation

-  Here is my evaluation of the solution: // Your analysis should start with this exact same phrase
-  ... // Your evaluation here. You are required to present in detail the key steps of the solution or the steps for which you had doubts regarding their correctness, and explicitly analyze whether each step is accurate: for correct steps, explain why you initially doubted their correctness and why they are indeed correct; for erroneous steps, explain the reason for the error and the impact of that error on the solution. You should analyze your solution faithfully. E.g., if there are issues in your final solution, you should point it out.
+  Here is my evaluation of the solution:
+  ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/prompt/config/generic/deepseek-math-prover.yaml` around lines 26
- 37, The prompt template includes inline annotations after the literal markers
(e.g., the "// ..." trailing text after "## Solution" and "## Self Evaluation")
which will be saved by nemo_skills/dataset/proofbench127/generation.py because
it only strips bare "## Solution" / "## Self Evaluation" markers; remove any
inline comments or annotations that follow those exact markers in
deepseek-math-prover.yaml and replace them with clean placeholders or external
documentation (keep only the exact lines "## Solution" and "## Self Evaluation"
followed by empty/newline and the intended user-visible content) so the saved
proofs do not include stray comment text.

@Kipok Kipok closed this Mar 16, 2026
@Kipok Kipok deleted the proofbench127 branch March 16, 2026 23:43