
Add proofbench127 #1307

Closed

smahdavi4 wants to merge 1 commit into main from proofbench127

Conversation


@smahdavi4 smahdavi4 commented Mar 14, 2026

Summary by CodeRabbit

Release Notes

New Features

  • Added ProofBench127 dataset with comprehensive generation and evaluation pipelines for mathematical proofs.
  • Supports problem sources from IMO Proof Bench, Putnam 2025, and Olympiad 2025 competitions.
  • Includes automated data preparation to standardize problem entries.

Chores

  • Added configuration files for proof generation prompts and judging workflows.

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

coderabbitai bot commented Mar 14, 2026

📝 Walkthrough

This change introduces a new ProofBench127 dataset module with infrastructure for generating and evaluating mathematical proofs. It includes a module configuration with pipeline settings, a generation task class that postprocesses outputs by stripping self-evaluation content, a data preparation script for converting raw problem entries to JSONL format, and two prompt configurations: one for DeepSeek-style math proving and one for proof evaluation.

Changes

Cohort / File(s): Summary

  • ProofBench127 Module Configuration (nemo_skills/dataset/proofbench127/__init__.py)
    Defines constants for the metrics type, generation module path, and pipeline arguments, including the generation prompt config (deepseek-math-prover), judge arguments, and OpenAI model/server configuration for inference.
  • Generation Task Implementation (nemo_skills/dataset/proofbench127/generation.py)
    Introduces a ProofbenchGenerationTask class extending GenerationTask with custom postprocessing that strips self-evaluation content via a strip_self_eval() helper. Defines solution and self-eval header constants, exposes a task class alias, and provides a Hydra-driven generate() entry point with an auto-generated help message.
  • Data Preparation (nemo_skills/dataset/proofbench127/prepare.py)
    Implements a format_entry() function to convert raw problem entries into a standardized JSONL format, with conditional logic for three problem subsets (imoproofbench, putnam2025, olympiads2025). Includes a __main__ block that reads from data/problems.txt and writes formatted entries to test.jsonl.
  • Prompt Configurations (nemo_skills/prompt/config/generic/deepseek-math-prover.yaml, nemo_skills/prompt/config/judge/math-proof-judge-gt-proof.yaml)
    Adds two YAML prompt templates: deepseek-math-prover instructs comprehensive solution generation with self-evaluation and a scoring rubric; math-proof-judge-gt-proof guides proof validation with step-by-step reasoning checks, LaTeX requirements, and structured judgment output.
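The strip_self_eval() helper referenced above is not shown in this walkthrough. A minimal sketch of what such a helper might look like, assuming the header constants match the "## Solution" / "## Self Evaluation" markers from the prompt template quoted later in this review, is:

```python
# Hypothetical sketch; the real helper lives in
# nemo_skills/dataset/proofbench127/generation.py and may differ.
SOLUTION_HEADER = "## Solution"
SELF_EVAL_HEADER = "## Self Evaluation"


def strip_self_eval(generation: str) -> str:
    """Drop everything from the self-evaluation header onward,
    keeping only the proof text that precedes it."""
    idx = generation.find(SELF_EVAL_HEADER)
    if idx == -1:
        # No self-evaluation section; return the generation unchanged.
        return generation
    return generation[:idx].rstrip()
```

This keeps the judged artifact limited to the proof itself, which is why the review below flags inline annotations in the prompt template that would survive this stripping.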

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Add proofbench127' directly corresponds to the main change in the pull request, which introduces a new proofbench127 dataset module with configuration, generation, and data preparation components.


coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
nemo_skills/dataset/proofbench127/__init__.py (1)

15-29: Please add a slurm smoke test for this benchmark wiring.

This registration now depends on a custom generation module, prompt pair, judge args, and metrics path. A small end-to-end slurm job for proofbench127 would catch prompt_config / generation_key regressions early.

Based on learnings "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/proofbench127/__init__.py` around lines 15 - 29, Add a
Slurm smoke test that runs the proofbench127 benchmark end-to-end to catch
regressions in METRICS_TYPE, GENERATION_MODULE, GENERATION_ARGS, JUDGE_ARGS, and
JUDGE_PIPELINE_ARGS wiring: create a minimal Slurm job that imports/loads the
proofbench127 dataset/registration, invokes the generation pipeline using
GENERATION_ARGS (ensure ++prompt_config=generic/deepseek-math-prover and
++generation_key=proof are used) and the judge pipeline using
JUDGE_ARGS/JUDGE_PIPELINE_ARGS (ensure ++generation_key=judgement and inferences
settings are applied), run only a tiny sample with limited resources and short
timeout, and assert the expected metrics file (METRICS_TYPE path) is produced;
place this smoke test alongside other Slurm dataset tests and name it to
reference proofbench127 so CI will pick it up.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: af67f1c0-e6b1-4985-90fa-0377b19b2c3a

📥 Commits

Reviewing files that changed from the base of the PR and between 86071c1 and 628dc82.

📒 Files selected for processing (6)
  • nemo_skills/dataset/proofbench127/__init__.py
  • nemo_skills/dataset/proofbench127/data/problems.txt
  • nemo_skills/dataset/proofbench127/generation.py
  • nemo_skills/dataset/proofbench127/prepare.py
  • nemo_skills/prompt/config/generic/deepseek-math-prover.yaml
  • nemo_skills/prompt/config/judge/math-proof-judge-gt-proof.yaml

Comment on lines +33 to +36
async def postprocess_single_output(self, output, original_data_point):
    generation = output.get("generation", "")
    if generation:
        output["generation"] = strip_self_eval(generation)

⚠️ Potential issue | 🟡 Minor

Fail fast if the generation payload is missing.

output["generation"] should exist here. Using .get() on Line 34 turns an upstream schema problem into a silent no-op, so malformed outputs can slip through without being stripped or flagged.

Suggested fix
     async def postprocess_single_output(self, output, original_data_point):
-        generation = output.get("generation", "")
+        generation = output["generation"]
         if generation:
             output["generation"] = strip_self_eval(generation)

As per coding guidelines "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/proofbench127/generation.py` around lines 33 - 36, The
postprocess_single_output function currently uses output.get("generation", "")
which hides upstream schema errors; change it to access the key directly
(generation = output["generation"]) so a missing generation raises immediately,
then keep the existing truthiness check and call strip_self_eval(generation) to
update output["generation"]; reference the function postprocess_single_output,
the output dict and the strip_self_eval call when making this change.

Comment on lines +22 to +44
def format_entry(raw_entry):
    subset = raw_entry["subset_for_metrics"]

    entry = {
        "problem": raw_entry["problem"],
        "problem_id": None,
        "ground_truth_proof": NOT_AVAILABLE,
        "subset_for_metrics": subset,
    }

    if subset == "imoproofbench":
        entry["problem_id"] = raw_entry["problem_id"]
        entry["ground_truth_proof"] = raw_entry["reference_solution"]
        entry["metadata"] = {"category": raw_entry["category"], "level": raw_entry["level"], "source": raw_entry["source"]}
    elif subset == "putnam2025":
        entry["problem_id"] = raw_entry["problem_name"]
        entry["ground_truth_proof"] = raw_entry.get("informal_solution", NOT_AVAILABLE)
        entry["metadata"] = {}
    elif subset == "olympiads2025":
        entry["problem_id"] = raw_entry["problem_id"]
        entry["metadata"] = raw_entry["metadata"]

    return entry

⚠️ Potential issue | 🟠 Major

Reject unknown subsets instead of emitting half-populated rows.

If subset_for_metrics is mistyped or a new subset shows up, this function falls through with problem_id=None and placeholder proof text, which silently corrupts the JSONL. Add an explicit else so unsupported subset values fail fast.

Suggested fix
     elif subset == "olympiads2025":
         entry["problem_id"] = raw_entry["problem_id"]
         entry["metadata"] = raw_entry["metadata"]
+    else:
+        raise ValueError(f"Unsupported subset_for_metrics: {subset}")

As per coding guidelines "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/proofbench127/prepare.py` around lines 22 - 44,
format_entry currently silently returns a half-populated entry when
subset_for_metrics is unknown; modify format_entry to validate subset (the
variable subset derived from raw_entry["subset_for_metrics"]) and add an
explicit else branch that raises a clear exception (e.g., ValueError) naming the
unsupported subset value and the offending raw_entry keys
(problem/problem_id/problem_name) so the caller fails fast instead of emitting
entries with problem_id=None and ground_truth_proof=NOT_AVAILABLE; keep existing
handling for "imoproofbench", "putnam2025", and "olympiads2025" and ensure
metadata is set in each branch before returning.

Comment on lines +47 to +58
if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    input_file = data_dir / "data" / "problems.txt"
    output_file = data_dir / "test.jsonl"

    with open(input_file, "r", encoding="utf-8") as fin, open(output_file, "w", encoding="utf-8") as fout:
        for line in fin:
            raw_entry = json.loads(line.strip())
            entry = format_entry(raw_entry)
            fout.write(json.dumps(entry) + "\n")

    print(f"Wrote {output_file}")

⚠️ Potential issue | 🟠 Major

Build every entry before truncating test.jsonl.

Opening the output file in "w" mode before parsing and formatting completes means one bad line can destroy the previous dataset and leave a partial file behind. Read/validate everything first, then open the destination for writing.

Suggested fix
-    with open(input_file, "r", encoding="utf-8") as fin, open(output_file, "w", encoding="utf-8") as fout:
-        for line in fin:
-            raw_entry = json.loads(line.strip())
-            entry = format_entry(raw_entry)
-            fout.write(json.dumps(entry) + "\n")
+    entries = []
+    with open(input_file, "r", encoding="utf-8") as fin:
+        for line in fin:
+            raw_entry = json.loads(line.strip())
+            entries.append(format_entry(raw_entry))
+
+    with open(output_file, "w", encoding="utf-8") as fout:
+        for entry in entries:
+            fout.write(json.dumps(entry) + "\n")

As per coding guidelines "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    input_file = data_dir / "data" / "problems.txt"
    output_file = data_dir / "test.jsonl"

    entries = []
    with open(input_file, "r", encoding="utf-8") as fin:
        for line in fin:
            raw_entry = json.loads(line.strip())
            entries.append(format_entry(raw_entry))

    with open(output_file, "w", encoding="utf-8") as fout:
        for entry in entries:
            fout.write(json.dumps(entry) + "\n")

    print(f"Wrote {output_file}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/proofbench127/prepare.py` around lines 47 - 58, The
current main block opens output_file in "w" before parsing, risking data loss if
processing fails; modify the flow in the __main__ block so you first open
input_file, read and validate all lines by calling format_entry for each,
collecting results into a list (e.g., entries), and only after successful
processing open output_file for writing and iterate over entries to write
json.dumps(entry) + "\n"; reference the existing symbols format_entry,
input_file, and output_file when making this change (optionally write to a
temporary file and rename for extra safety).
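The temp-file-and-rename option mentioned above can be sketched as follows. The helper name write_entries_atomically is hypothetical, and it assumes entries is the list already built by calling format_entry on every input line:

```python
import json
import os
import tempfile
from pathlib import Path


def write_entries_atomically(entries, output_file: Path) -> None:
    """Write all entries to a temp file in the target directory, then
    atomically swap it into place so a failure never truncates the target."""
    fd, tmp_path = tempfile.mkstemp(dir=output_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fout:
            for entry in entries:
                fout.write(json.dumps(entry) + "\n")
        os.replace(tmp_path, output_file)  # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # remove the partial temp file before re-raising
        raise
```

Because os.replace is atomic on a single filesystem, readers of test.jsonl see either the complete old file or the complete new one, never a half-written mix.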

Comment on lines +26 to +37
Your final response should be in the following format:

## Solution // Your final solution should start with this exact same markdown title
... // Your final solution to the problem here. You should try your best to optimize the quality of your solution according to the evaluation instruction above before finalizing it here.

## Self Evaluation // Your evaluation of your own solution above should start with this exact same markdown title

Here is my evaluation of the solution: // Your analysis should start with this exact same phrase
... // Your evaluation here. You are required to present in detail the key steps of the solution or the steps for which you had doubts regarding their correctness, and explicitly analyze whether each step is accurate: for correct steps, explain why you initially doubted their correctness and why they are indeed correct; for erroneous steps, explain the reason for the error and the impact of that error on the solution. You should analyze your solution faithfully. E.g., if there are issues in your final solution, you should point it out.

Based on my evaluation, the final overall score should be:
\boxed{{...}} // where ... should be the final overall score (0, 0.5, or 1, and nothing else) based on the evaluation instruction above. You should reach this score ONLY AFTER careful RE-examination of your own solution above

⚠️ Potential issue | 🟡 Minor

Keep the literal output template free of inline annotations.

nemo_skills/dataset/proofbench127/generation.py strips on the bare ## Solution / ## Self Evaluation markers only. If the model copies Lines 28-37 literally, the // ... text after ## Solution becomes part of the saved proof and leaks into judging.

Suggested cleanup
-  ## Solution // Your final solution should start with this exact same markdown title
-  ... // Your final solution to the problem here. You should try your best to optimize the quality of your solution according to the evaluation instruction above before finalizing it here.
+  Use exactly this structure:
+
+  ## Solution
+  ...

-  ## Self Evaluation // Your evaluation of your own solution above should start with this exact same markdown title
+  ## Self Evaluation

-  Here is my evaluation of the solution: // Your analysis should start with this exact same phrase
-  ... // Your evaluation here. You are required to present in detail the key steps of the solution or the steps for which you had doubts regarding their correctness, and explicitly analyze whether each step is accurate: for correct steps, explain why you initially doubted their correctness and why they are indeed correct; for erroneous steps, explain the reason for the error and the impact of that error on the solution. You should analyze your solution faithfully. E.g., if there are issues in your final solution, you should point it out.
+  Here is my evaluation of the solution:
+  ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/prompt/config/generic/deepseek-math-prover.yaml` around lines 26
- 37, The prompt template includes inline annotations after the literal markers
(e.g., the "// ..." trailing text after "## Solution" and "## Self Evaluation")
which will be saved by nemo_skills/dataset/proofbench127/generation.py because
it only strips bare "## Solution" / "## Self Evaluation" markers; remove any
inline comments or annotations that follow those exact markers in
deepseek-math-prover.yaml and replace them with clean placeholders or external
documentation (keep only the exact lines "## Solution" and "## Self Evaluation"
followed by empty/newline and the intended user-visible content) so the saved
proofs do not include stray comment text.

@Kipok Kipok closed this Mar 16, 2026
@Kipok Kipok deleted the proofbench127 branch March 16, 2026 23:43