373 changes: 373 additions & 0 deletions docs/releases/nemotronmathproofs/index.md
---
date: 2025-12-15
---

# Nemotron-Math-Proofs

## Dataset Overview

Using our pipelines, we created [Nemotron-Math-Proofs-v1](https://huggingface.co/datasets/nvidia/Nemotron-Math-Proofs-v1),
a large-scale mathematical reasoning dataset containing approximately 580k natural language proof problems,
550k formalizations into theorem statements in Lean 4, and 900k model-generated reasoning trajectories
culminating in Lean 4 proofs.

The dataset integrates human-authored problems with systematically generated formalizations and solution traces:

* **Natural Language Problems**: ~580k proof problems sourced from AoPS forums, Math StackExchange, and MathOverflow,
semantically deduplicated and decontaminated against popular benchmarks
* **Formal Statements**: ~550k Lean 4 theorem statements generated via autoformalization
* **Proof Trajectories**: ~900k verified reasoning traces and proofs
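
To browse the records, here is a minimal sketch using the Hugging Face `datasets` library; the `train` split name is an assumption, and printing the column names reveals the actual schema:

```python
# Minimal sketch for browsing the dataset with the Hugging Face `datasets`
# library; the "train" split name is an assumption.
from datasets import load_dataset

ds = load_dataset("nvidia/Nemotron-Math-Proofs-v1", split="train")
print(ds.column_names)  # inspect the actual schema
print(ds[0])            # one record with problem, statement, and proof data
```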

Each natural language problem is formalized by [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) into
Lean 4 theorem statements. Reasoning traces and proofs are generated by
[Goedel-Prover-V2-32B](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B) and verified by the Lean 4 compiler.

This dataset was used as part of the SFT data for [Nemotron-Nano-v3](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3).

## Training Results

We compare a Qwen3-8B model fine-tuned on this dataset against Goedel-Prover-V2-8B on miniF2F:

| Model | pass@32 (no self-correction) | pass@32 (with self-correction) |
|-------|------------------------------|-------------------------------|
| Goedel-Prover-V2-8B | 84.6% | 86.7% |
| Qwen3-8B SFT on Nemotron-Math-Proofs-v1 | 85.3% | 90.2% |

Nemotron-Nano-v3 (which includes this dataset in its training) achieves the following on miniF2F:

| Model | pass@32 (no self-correction) | pass@32 (with self-correction) |
|-------|------------------------------|-------------------------------|
| Nemotron-Nano-v3 | 79.92% | 86.89% |
| gpt-oss-20b | 43.03% | 59.42% |
| Qwen3-30B-A3B-Thinking | 16.80% | 32.37% |

## How to Reproduce

Browse the sections below to see commands for autoformalization, theorem proving, training, and evaluation.

!!! note

These commands assume you have `/workspace` defined in your [cluster config](../../basics/cluster-configs.md).
Adjust paths and cluster settings according to your environment.

### Autoformalization

The autoformalization pipeline translates natural language theorems into Lean 4 formal statements using an iterative
refinement process with backtranslation verification. The input is natural language math problems—see
[OpenMathReasoning dataset construction](../openmathreasoning/dataset.md) for how to prepare these.
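
As an illustration of the expected input shape, each line of `problems.jsonl` holds one JSON object with the problem text; the `problem` key follows common nemo_skills conventions but is an assumption to verify against your version:

```python
# Hypothetical input record; the "problem" key is an assumption to verify
# against your nemo_skills version.
import json

record = {"problem": "Prove that the sum of two even integers is even."}
with open("/workspace/data/problems.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```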

=== "CLI"

```bash
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.autoformalize \
--model=openai/gpt-oss-120b \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--input_file=/workspace/data/problems.jsonl \
--output_dir=/workspace/data/autoformalize_output \
--with_sandbox \
--num_random_seeds=1 \
++prompt_config=lean4/autoformalization \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++adaptive_reasoning=True \
++refinement=True \
++judge_enabled=True \
++refinement_max_turns=8 \
++backtranslation_prompt_config=lean4/backtranslation \
++judge_prompt_config=lean4/judge-backtranslation \
++refine_consistent_error_prompt_config=lean4/refinement_consistent_error \
++refine_parsing_error_prompt_config=lean4/refinement_parsing_error \
++refine_code_error_prompt_config=lean4/refinement_code_error
```

=== "Python"

```python
from nemo_skills.pipeline.cli import generate, wrap_arguments

generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=120000 "
"++prompt_config=lean4/autoformalization "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
"++adaptive_reasoning=True "
"++refinement=True "
"++judge_enabled=True "
"++refinement_max_turns=8 "
"++backtranslation_prompt_config=lean4/backtranslation "
"++judge_prompt_config=lean4/judge-backtranslation "
"++refine_consistent_error_prompt_config=lean4/refinement_consistent_error "
"++refine_parsing_error_prompt_config=lean4/refinement_parsing_error "
"++refine_code_error_prompt_config=lean4/refinement_code_error "
),
generation_module="nemo_skills.inference.autoformalize",
cluster="slurm",
server_gpus="<NUM_GPUS>",
input_file="/workspace/data/problems.jsonl",
output_dir="/workspace/data/autoformalize_output",
server_type="vllm",
model="openai/gpt-oss-120b",
with_sandbox=True,
num_random_seeds=1,
)
```

The pipeline includes:

* **Initial Formalization**: LLM translates natural language to Lean 4 code
* **Compilation Check**: Lean 4 sandbox verifies syntactic correctness
* **Backtranslation & Verification**: Formal code is backtranslated and compared to the original
* **Iterative Refinement**: Up to 8 iterations to fix parsing, compilation, or semantic errors
* **Adaptive Reasoning**: Automatically reduces reasoning effort when hitting context limits
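
Putting these stages together, the following is a hedged pseudocode sketch of the control flow; every helper callable is a hypothetical stand-in, not the real nemo_skills API:

```python
# Hedged sketch of the refinement loop described above; the callables are
# hypothetical stand-ins, not the real nemo_skills API.
from typing import Callable, Optional

def autoformalize(
    problem: str,
    formalize: Callable[[str], str],      # LLM: NL problem -> Lean 4 statement
    compile_check: Callable[[str], str],  # sandbox: "" on success, else errors
    refine: Callable[[str, str], str],    # LLM: statement + feedback -> new statement
    backtranslate: Callable[[str], str],  # LLM: Lean 4 statement -> NL
    judge: Callable[[str, str], bool],    # LLM: does the NL match the original?
    max_turns: int = 8,
) -> Optional[str]:
    statement = formalize(problem)                    # initial formalization
    for _ in range(max_turns):
        errors = compile_check(statement)             # compilation check
        if errors:
            statement = refine(statement, errors)     # fix parsing/code errors
            continue
        if judge(problem, backtranslate(statement)):  # backtranslation verification
            return statement                          # accepted
        statement = refine(statement, "statement does not match the problem")
    return None                                       # unresolved after max_turns
```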

### Theorem Proving

The prover pipeline generates proofs for formalized statements with iterative error correction.
Input: formal statements from the autoformalization step.
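
As a hedged illustration, one input record might carry a single Lean 4 statement; the key name and Lean header below are assumptions to check against your autoformalization output:

```python
# Hypothetical shape of one prover input record; the key name and the Lean
# header are assumptions to check against your autoformalization output.
import json

record = {
    "formal_statement": (
        "import Mathlib\n"
        "theorem sum_of_evens (a b : ℤ) (ha : Even a) (hb : Even b) :\n"
        "    Even (a + b) := by sorry"
    )
}
with open("/workspace/data/formal_statements.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```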

=== "CLI"

```bash
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.prover \
--model=Goedel-LM/Goedel-Prover-V2-32B \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--server_args="--max-model-len 40960" \
--input_file=/workspace/data/formal_statements.jsonl \
--output_dir=/workspace/data/proofs_output \
--with_sandbox \
--num_random_seeds=1 \
++prompt_config=lean4/goedel-prover-v2 \
++inference.tokens_to_generate=38912 \
++inference.temperature=1.0 \
++inference.top_p=0.95 \
++refinement=True \
++refinement_max_turns=8 \
++remove_cot=True \
++n_pass=4 \
++refinement_prompt_config=lean4/goedel-prover-v2-refinement \
++delete_wrong_turns=True \
++max_concurrent_requests=512
```

=== "Python"

```python
from nemo_skills.pipeline.cli import generate, wrap_arguments

generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=38912 "
"++inference.temperature=1.0 "
"++inference.top_p=0.95 "
"++prompt_config=lean4/goedel-prover-v2 "
"++refinement=True "
"++refinement_max_turns=8 "
"++remove_cot=True "
"++n_pass=4 "
"++refinement_prompt_config=lean4/goedel-prover-v2-refinement "
"++delete_wrong_turns=True "
"++max_concurrent_requests=512 "
),
generation_module="nemo_skills.inference.prover",
cluster="slurm",
input_file="/workspace/data/formal_statements.jsonl",
output_dir="/workspace/data/proofs_output",
model="Goedel-LM/Goedel-Prover-V2-32B",
server_type="vllm",
server_gpus="<NUM_GPUS>",
server_args="--max-model-len 40960",
num_random_seeds=1,
with_sandbox=True,
)
```

The proving strategy includes:

* **Chain-of-thought removal**: Strips reasoning, keeping only formal proof code
* **Wrong turn deletion**: Discards failed attempts to prevent context pollution
* **Structured error feedback**: Compiler errors annotated with `<error></error>` tags
* **Pass@N with refinement**: Multiple independent attempts, each with iterative refinement
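
The following hedged pseudocode sketches how these pieces combine, with a pass@N outer loop and a refinement inner loop; the callables are hypothetical placeholders:

```python
# Hedged sketch of the proving strategy listed above; the callables are
# hypothetical placeholders, not the real nemo_skills prover internals.
from typing import Callable, Optional

def prove(
    statement: str,
    generate: Callable[[str], str],       # LLM call: prompt -> proof attempt
    strip_cot: Callable[[str], str],      # keep only the formal proof code
    compile_check: Callable[[str], str],  # Lean 4 sandbox: "" on success
    n_pass: int = 4,
    max_turns: int = 8,
) -> Optional[str]:
    for _ in range(n_pass):                      # independent attempts (pass@N)
        prompt = statement
        for _ in range(max_turns):               # iterative refinement
            proof = strip_cot(generate(prompt))  # chain-of-thought removal
            errors = compile_check(proof)
            if not errors:
                return proof                     # verified by the compiler
            # wrong turns are discarded: the next prompt keeps only the
            # statement plus the latest compiler feedback in <error> tags
            prompt = f"{statement}\n<error>{errors}</error>"
    return None
```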

### Model Training

The training step fine-tunes a model on the Nemotron-Math-Proofs dataset.
Input: processed SFT data from the theorem proving step.
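
The exact SFT schema comes from the data-processing step; purely as an illustration, a prompt/completion record could look like the following (field names are assumptions, not the verified `ns sft_nemo_rl` contract):

```python
# Illustrative only: the field names are assumptions, not the verified
# schema expected by `ns sft_nemo_rl`.
import json

record = {
    "input": "Complete the following Lean 4 proof: ...",    # prompt side
    "output": "<think>...</think> ... final Lean 4 proof",  # completion side
}
with open("/workspace/data/sft_data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```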

=== "CLI"

```bash
ns sft_nemo_rl \
--cluster=slurm \
--expname=qwen3-8b-lean-sft \
--output_dir=/workspace/training/qwen3-8b-lean-sft \
--hf_model=Qwen/Qwen3-8B \
--training_data=/workspace/data/sft_data.jsonl \
--num_nodes=<NUM_NODES> \
--num_gpus=<NUM_GPUS> \
--backend=megatron \
++checkpointing.save_period=250 \
++sft.max_num_epochs=2000 \
++sft.max_num_steps=1000 \
++policy.megatron_cfg.tensor_model_parallel_size=4 \
++policy.megatron_cfg.pipeline_model_parallel_size=1 \
++policy.megatron_cfg.context_parallel_size=4 \
++policy.train_global_batch_size=2048 \
++policy.max_total_sequence_length=49152 \
++policy.megatron_cfg.optimizer.lr=1e-4 \
++policy.megatron_cfg.optimizer.min_lr=1e-4 \
++policy.megatron_cfg.scheduler.lr_warmup_iters=0
```

=== "Python"

```python
from nemo_skills.pipeline.cli import sft_nemo_rl, wrap_arguments

sft_nemo_rl(
ctx=wrap_arguments(
"++checkpointing.save_period=250 "
"++sft.max_num_epochs=2000 "
"++sft.max_num_steps=1000 "
"++policy.megatron_cfg.tensor_model_parallel_size=4 "
"++policy.megatron_cfg.pipeline_model_parallel_size=1 "
"++policy.megatron_cfg.context_parallel_size=4 "
"++policy.train_global_batch_size=2048 "
"++policy.max_total_sequence_length=49152 "
"++policy.megatron_cfg.optimizer.lr=1e-4 "
"++policy.megatron_cfg.optimizer.min_lr=1e-4 "
"++policy.megatron_cfg.scheduler.lr_warmup_iters=0 "
),
cluster="slurm",
expname="qwen3-8b-lean-sft",
backend="megatron",
output_dir="/workspace/training/qwen3-8b-lean-sft",
hf_model="Qwen/Qwen3-8B",
training_data="/workspace/data/sft_data.jsonl",
num_gpus="<NUM_GPUS>",
num_nodes="<NUM_NODES>",
)
```

### Model Evaluation on miniF2F

To evaluate a model on the miniF2F benchmark with 32 samples per problem (without self-correction):

=== "CLI"

```bash
ns eval \
--cluster=slurm \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--output_dir=/workspace/evals/nemotron-nano-3-minif2f \
--benchmarks=minif2f:32 \
--with_sandbox \
++prompt_config=lean4/goedel-prover-v2-nemotron \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
--extra_eval_args="++eval_config.timeout=600"
```

=== "Python"

```python
from nemo_skills.pipeline.cli import eval, wrap_arguments

eval(
ctx=wrap_arguments(
"++prompt_config=lean4/goedel-prover-v2-nemotron "
"++inference.tokens_to_generate=120000 "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
),
cluster="slurm",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
server_type="vllm",
server_gpus="<NUM_GPUS>",
benchmarks="minif2f:32",
output_dir="/workspace/evals/nemotron-nano-3-minif2f",
extra_eval_args="++eval_config.timeout=600",
with_sandbox=True,
)
```

To evaluate with self-correction (iterative refinement based on compiler feedback):

=== "CLI"

```bash
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.prover \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--input_file=/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl \
--output_dir=/workspace/evals/nemotron-nano-3-minif2f-self-correction \
--with_sandbox \
--num_random_seeds=32 \
++prompt_config=lean4/goedel-prover-v2-nemotron \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++refinement=True \
++refinement_max_turns=8 \
++remove_cot=True \
++n_pass=1 \
++refinement_prompt_config=lean4/goedel-prover-v2-refinement \
++delete_wrong_turns=True \
++max_concurrent_requests=512
```

=== "Python"

```python
from nemo_skills.pipeline.cli import generate, wrap_arguments

generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=120000 "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
"++prompt_config=lean4/goedel-prover-v2-nemotron "
"++refinement=True "
"++refinement_max_turns=8 "
"++remove_cot=True "
"++n_pass=1 "
"++refinement_prompt_config=lean4/goedel-prover-v2-refinement "
"++delete_wrong_turns=True "
"++max_concurrent_requests=512 "
),
generation_module="nemo_skills.inference.prover",
cluster="slurm",
input_file="/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl",
output_dir="/workspace/evals/nemotron-nano-3-minif2f-self-correction",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
server_type="vllm",
server_gpus="<NUM_GPUS>",
num_random_seeds=32,
with_sandbox=True,
)
```

To summarize evaluation results:

```bash
ns summarize_results /workspace/evals/nemotron-nano-3-minif2f/eval-results/minif2f --cluster slurm
```
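
If you prefer to aggregate pass@32 yourself, the sketch below counts a problem as solved when any of the 32 seeds yields a verified proof; the glob pattern and field names are assumptions about the output layout:

```python
# Hedged sketch: pass@32 as the fraction of problems where any of the 32
# seeds produced a verified proof. The glob pattern and the "name" and
# "proof_status" fields are assumptions about the output schema.
import json
from collections import defaultdict
from glob import glob

solved: dict[str, bool] = defaultdict(bool)
files = glob(
    "/workspace/evals/nemotron-nano-3-minif2f-self-correction/**/output*.jsonl",
    recursive=True,
)
for path in files:  # one file per random seed
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            solved[rec["name"]] |= rec["proof_status"] == "completed"

if solved:
    print(f"pass@32: {sum(solved.values()) / len(solved):.2%}")
```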

## Known Limitations

* **Difficulty balance**: Problem difficulty is not explicitly normalized, and the pipeline stages introduce implicit selection biases
* **Compiler warnings**: Some solutions contain warnings (e.g., unused hypotheses), since verification only checks for compilation errors
* **Placeholder definitions**: Some formalizations use trivial placeholder definitions instead of mathlib (e.g., `def MyContinuous... Prop := True`)
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@ nav:
- tutorials/index.md
- Papers & Releases:
- releases/index.md
- Nemotron-Math-Proofs: releases/nemotronmathproofs/index.md
- OpenReasoning:
- releases/openreasoning/index.md
- Model Evaluation: releases/openreasoning/evaluation.md
Expand Down