373 changes: 373 additions & 0 deletions docs/releases/nemotronmathproofs/index.md
---
date: 2025-12-15
---

# Nemotron-Math-Proofs

## Dataset Overview

Using our pipelines, we created [Nemotron-Math-Proofs-v1](https://huggingface.co/datasets/nvidia/Nemotron-Math-Proofs-v1),
a large-scale mathematical reasoning dataset containing approximately 580k natural language proof problems,
550k formalizations into theorem statements in Lean 4, and 900k model-generated reasoning trajectories
culminating in Lean 4 proofs.

The dataset integrates human-authored problems with systematically generated formalizations and solution traces:

* **Natural Language Problems**: ~580k proof problems sourced from AoPS forums, Math StackExchange, and MathOverflow,
semantically deduplicated and decontaminated against popular benchmarks
* **Formal Statements**: ~550k Lean 4 theorem statements generated via autoformalization
* **Proof Trajectories**: ~900k verified reasoning traces and proofs
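
To browse the records, here is a minimal sketch using the Hugging Face `datasets` library; the `train` split name is an assumption, and printing the column names reveals the actual schema:

```python
# Minimal sketch for browsing the dataset with the Hugging Face `datasets`
# library; the "train" split name is an assumption.
from datasets import load_dataset

ds = load_dataset("nvidia/Nemotron-Math-Proofs-v1", split="train")
print(ds.column_names)  # inspect the actual schema
print(ds[0])            # one record with problem, statement, and proof data
```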

Each natural language problem is formalized by [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) into
Lean 4 theorem statements. Reasoning traces and proofs are generated by
[Goedel-Prover-V2-32B](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B) and verified by the Lean 4 compiler.

This dataset was used as part of the SFT data for [Nemotron-Nano-v3](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3).

## Training Results

We compare a Qwen3-8B model fine-tuned on this dataset against Goedel-Prover-V2-8B on miniF2F:

| Model | pass@32 (no self-correction) | pass@32 (with self-correction) |
|-------|------------------------------|-------------------------------|
| Goedel-Prover-V2-8B | 84.6% | 86.7% |
| Qwen3-8B SFT on Nemotron-Math-Proofs-v1 | 85.3% | 90.2% |

Nemotron-Nano-v3 (which includes this dataset in its training) achieves the following on miniF2F:

| Model | pass@32 (no self-correction) | pass@32 (with self-correction) |
|-------|------------------------------|-------------------------------|
| Nemotron-Nano-v3 | 79.92% | 86.89% |
| gpt-oss-20b | 43.03% | 59.42% |
| Qwen3-30B-A3B-Thinking | 16.80% | 32.37% |

## How to Reproduce

Browse the sections below to see commands for autoformalization, theorem proving, training, and evaluation.

!!! note

These commands assume you have `/workspace` defined in your [cluster config](../../basics/cluster-configs.md).
Adjust paths and cluster settings according to your environment.

### Autoformalization

The autoformalization pipeline translates natural language theorems into Lean 4 formal statements using an iterative
refinement process with backtranslation verification. The input is natural language math problems—see
[OpenMathReasoning dataset construction](../openmathreasoning/dataset.md) for how to prepare these.
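
As an illustration of the expected input shape, each line of `problems.jsonl` holds one JSON object with the problem text; the `problem` key follows common nemo_skills conventions but is an assumption to verify against your version:

```python
# Hypothetical input record; the "problem" key is an assumption to verify
# against your nemo_skills version.
import json

record = {"problem": "Prove that the sum of two even integers is even."}
with open("/workspace/data/problems.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```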

=== "CLI"

```bash
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.autoformalize \
--model=openai/gpt-oss-120b \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--input_file=/workspace/data/problems.jsonl \
--output_dir=/workspace/data/autoformalize_output \
--with_sandbox \
--num_random_seeds=1 \
++prompt_config=lean4/autoformalization \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++adaptive_reasoning=True \
++refinement=True \
++judge_enabled=True \
++refinement_max_turns=8 \
++backtranslation_prompt_config=lean4/backtranslation \
++judge_prompt_config=lean4/judge-backtranslation \
++refine_consistent_error_prompt_config=lean4/refinement_consistent_error \
++refine_parsing_error_prompt_config=lean4/refinement_parsing_error \
++refine_code_error_prompt_config=lean4/refinement_code_error
```

=== "Python"

```python
from nemo_skills.pipeline.cli import generate, wrap_arguments

generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=120000 "
"++prompt_config=lean4/autoformalization "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
"++adaptive_reasoning=True "
"++refinement=True "
"++judge_enabled=True "
"++refinement_max_turns=8 "
"++backtranslation_prompt_config=lean4/backtranslation "
"++judge_prompt_config=lean4/judge-backtranslation "
"++refine_consistent_error_prompt_config=lean4/refinement_consistent_error "
"++refine_parsing_error_prompt_config=lean4/refinement_parsing_error "
"++refine_code_error_prompt_config=lean4/refinement_code_error "
),
generation_module="nemo_skills.inference.autoformalize",
cluster="slurm",
server_gpus="<NUM_GPUS>",
input_file="/workspace/data/problems.jsonl",
output_dir="/workspace/data/autoformalize_output",
server_type="vllm",
model="openai/gpt-oss-120b",
with_sandbox=True,
num_random_seeds=1,
)
```

The pipeline includes:

* **Initial Formalization**: LLM translates natural language to Lean 4 code
* **Compilation Check**: Lean 4 sandbox verifies syntactic correctness
* **Backtranslation & Verification**: Formal code is backtranslated and compared to the original
* **Iterative Refinement**: Up to 8 iterations to fix parsing, compilation, or semantic errors
* **Adaptive Reasoning**: Automatically reduces reasoning effort when hitting context limits
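
Putting these stages together, the following is a hedged pseudocode sketch of the control flow; every helper callable is a hypothetical stand-in, not the real nemo_skills API:

```python
# Hedged sketch of the refinement loop described above; the callables are
# hypothetical stand-ins, not the real nemo_skills API.
from typing import Callable, Optional

def autoformalize(
    problem: str,
    formalize: Callable[[str], str],      # LLM: NL problem -> Lean 4 statement
    compile_check: Callable[[str], str],  # sandbox: "" on success, else errors
    refine: Callable[[str, str], str],    # LLM: statement + feedback -> new statement
    backtranslate: Callable[[str], str],  # LLM: Lean 4 statement -> NL
    judge: Callable[[str, str], bool],    # LLM: does the NL match the original?
    max_turns: int = 8,
) -> Optional[str]:
    statement = formalize(problem)                    # initial formalization
    for _ in range(max_turns):
        errors = compile_check(statement)             # compilation check
        if errors:
            statement = refine(statement, errors)     # fix parsing/code errors
            continue
        if judge(problem, backtranslate(statement)):  # backtranslation verification
            return statement                          # accepted
        statement = refine(statement, "statement does not match the problem")
    return None                                       # unresolved after max_turns
```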

### Theorem Proving

The prover pipeline generates proofs for formalized statements with iterative error correction.
Input: formal statements from the autoformalization step.
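
As a hedged illustration, one input record might carry a single Lean 4 statement; the key name and Lean header below are assumptions to check against your autoformalization output:

```python
# Hypothetical shape of one prover input record; the key name and the Lean
# header are assumptions to check against your autoformalization output.
import json

record = {
    "formal_statement": (
        "import Mathlib\n"
        "theorem sum_of_evens (a b : ℤ) (ha : Even a) (hb : Even b) :\n"
        "    Even (a + b) := by sorry"
    )
}
with open("/workspace/data/formal_statements.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```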

=== "CLI"

```bash
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.prover \
--model=Goedel-LM/Goedel-Prover-V2-32B \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--server_args="--max-model-len 40960" \
--input_file=/workspace/data/formal_statements.jsonl \
--output_dir=/workspace/data/proofs_output \
--with_sandbox \
--num_random_seeds=1 \
++prompt_config=lean4/goedel-prover-v2 \
++inference.tokens_to_generate=38912 \
++inference.temperature=1.0 \
++inference.top_p=0.95 \
++refinement=True \
++refinement_max_turns=8 \
++remove_cot=True \
++n_pass=4 \
++refinement_prompt_config=lean4/goedel-prover-v2-refinement \
++delete_wrong_turns=True \
++max_concurrent_requests=512
```

=== "Python"

```python
from nemo_skills.pipeline.cli import generate, wrap_arguments

generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=38912 "
"++inference.temperature=1.0 "
"++inference.top_p=0.95 "
"++prompt_config=lean4/goedel-prover-v2 "
"++refinement=True "
"++refinement_max_turns=8 "
"++remove_cot=True "
"++n_pass=4 "
"++refinement_prompt_config=lean4/goedel-prover-v2-refinement "
"++delete_wrong_turns=True "
"++max_concurrent_requests=512 "
),
generation_module="nemo_skills.inference.prover",
cluster="slurm",
input_file="/workspace/data/formal_statements.jsonl",
output_dir="/workspace/data/proofs_output",
model="Goedel-LM/Goedel-Prover-V2-32B",
server_type="vllm",
server_gpus="<NUM_GPUS>",
server_args="--max-model-len 40960",
num_random_seeds=1,
with_sandbox=True,
)
```

The proving strategy includes:

* **Chain-of-thought removal**: Strips reasoning, keeping only formal proof code
* **Wrong turn deletion**: Discards failed attempts to prevent context pollution
* **Structured error feedback**: Compiler errors annotated with `<error></error>` tags
* **Pass@N with refinement**: Multiple independent attempts, each with iterative refinement
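
The following hedged pseudocode sketches how these pieces combine, with a pass@N outer loop and a refinement inner loop; the callables are hypothetical placeholders:

```python
# Hedged sketch of the proving strategy listed above; the callables are
# hypothetical placeholders, not the real nemo_skills prover internals.
from typing import Callable, Optional

def prove(
    statement: str,
    generate: Callable[[str], str],       # LLM call: prompt -> proof attempt
    strip_cot: Callable[[str], str],      # keep only the formal proof code
    compile_check: Callable[[str], str],  # Lean 4 sandbox: "" on success
    n_pass: int = 4,
    max_turns: int = 8,
) -> Optional[str]:
    for _ in range(n_pass):                      # independent attempts (pass@N)
        prompt = statement
        for _ in range(max_turns):               # iterative refinement
            proof = strip_cot(generate(prompt))  # chain-of-thought removal
            errors = compile_check(proof)
            if not errors:
                return proof                     # verified by the compiler
            # wrong turns are discarded: the next prompt keeps only the
            # statement plus the latest compiler feedback in <error> tags
            prompt = f"{statement}\n<error>{errors}</error>"
    return None
```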

### Model Training

The training step fine-tunes a model on the Nemotron-Math-Proofs dataset.
Input: processed SFT data from the theorem proving step.
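
The exact SFT schema comes from the data-processing step; purely as an illustration, a prompt/completion record could look like the following (field names are assumptions, not the verified `ns sft_nemo_rl` contract):

```python
# Illustrative only: the field names are assumptions, not the verified
# schema expected by `ns sft_nemo_rl`.
import json

record = {
    "input": "Complete the following Lean 4 proof: ...",    # prompt side
    "output": "<think>...</think> ... final Lean 4 proof",  # completion side
}
with open("/workspace/data/sft_data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```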

=== "CLI"

```bash
ns sft_nemo_rl \
--cluster=slurm \
--expname=qwen3-8b-lean-sft \
--output_dir=/workspace/training/qwen3-8b-lean-sft \
--hf_model=Qwen/Qwen3-8B \
--training_data=/workspace/data/sft_data.jsonl \
--num_nodes=<NUM_NODES> \
--num_gpus=<NUM_GPUS> \
--backend=megatron \
++checkpointing.save_period=250 \
++sft.max_num_epochs=2000 \
++sft.max_num_steps=1000 \
++policy.megatron_cfg.tensor_model_parallel_size=4 \
++policy.megatron_cfg.pipeline_model_parallel_size=1 \
++policy.megatron_cfg.context_parallel_size=4 \
++policy.train_global_batch_size=2048 \
++policy.max_total_sequence_length=49152 \
++policy.megatron_cfg.optimizer.lr=1e-4 \
++policy.megatron_cfg.optimizer.min_lr=1e-4 \
++policy.megatron_cfg.scheduler.lr_warmup_iters=0
```

=== "Python"

```python
from nemo_skills.pipeline.cli import sft_nemo_rl, wrap_arguments

sft_nemo_rl(
ctx=wrap_arguments(
"++checkpointing.save_period=250 "
"++sft.max_num_epochs=2000 "
"++sft.max_num_steps=1000 "
"++policy.megatron_cfg.tensor_model_parallel_size=4 "
"++policy.megatron_cfg.pipeline_model_parallel_size=1 "
"++policy.megatron_cfg.context_parallel_size=4 "
"++policy.train_global_batch_size=2048 "
"++policy.max_total_sequence_length=49152 "
"++policy.megatron_cfg.optimizer.lr=1e-4 "
"++policy.megatron_cfg.optimizer.min_lr=1e-4 "
"++policy.megatron_cfg.scheduler.lr_warmup_iters=0 "
),
cluster="slurm",
expname="qwen3-8b-lean-sft",
backend="megatron",
output_dir="/workspace/training/qwen3-8b-lean-sft",
hf_model="Qwen/Qwen3-8B",
training_data="/workspace/data/sft_data.jsonl",
num_gpus="<NUM_GPUS>",
num_nodes="<NUM_NODES>",
)
```

### Model Evaluation on miniF2F

To evaluate a model on the miniF2F benchmark with 32 samples per problem (without self-correction):

=== "CLI"

```bash
ns eval \
--cluster=slurm \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--output_dir=/workspace/evals/nemotron-nano-3-minif2f \
--benchmarks=minif2f:32 \
--with_sandbox \
++prompt_config=lean4/goedel-prover-v2-nemotron \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
--extra_eval_args="++eval_config.timeout=600"
```

=== "Python"

```python
from nemo_skills.pipeline.cli import eval, wrap_arguments

eval(
ctx=wrap_arguments(
"++prompt_config=lean4/goedel-prover-v2-nemotron "
"++inference.tokens_to_generate=120000 "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
),
cluster="slurm",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
server_type="vllm",
server_gpus="<NUM_GPUS>",
benchmarks="minif2f:32",
output_dir="/workspace/evals/nemotron-nano-3-minif2f",
extra_eval_args="++eval_config.timeout=600",
with_sandbox=True,
)
```

To evaluate with self-correction (iterative refinement based on compiler feedback):

=== "CLI"

```bash
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.prover \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--input_file=/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl \
--output_dir=/workspace/evals/nemotron-nano-3-minif2f-self-correction \
--with_sandbox \
--num_random_seeds=32 \
++prompt_config=lean4/goedel-prover-v2-nemotron \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++refinement=True \
++refinement_max_turns=8 \
++remove_cot=True \
++n_pass=1 \
++refinement_prompt_config=lean4/goedel-prover-v2-refinement \
++delete_wrong_turns=True \
++max_concurrent_requests=512
```

=== "Python"

```python
from nemo_skills.pipeline.cli import generate, wrap_arguments

generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=120000 "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
"++prompt_config=lean4/goedel-prover-v2-nemotron "
"++refinement=True "
"++refinement_max_turns=8 "
"++remove_cot=True "
"++n_pass=1 "
"++refinement_prompt_config=lean4/goedel-prover-v2-refinement "
"++delete_wrong_turns=True "
"++max_concurrent_requests=512 "
),
generation_module="nemo_skills.inference.prover",
cluster="slurm",
input_file="/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl",
output_dir="/workspace/evals/nemotron-nano-3-minif2f-self-correction",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
server_type="vllm",
server_gpus="<NUM_GPUS>",
num_random_seeds=32,
with_sandbox=True,
)
```

To summarize evaluation results:

```bash
ns summarize_results /workspace/evals/nemotron-nano-3-minif2f/eval-results/minif2f --cluster slurm
```
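
If you prefer to aggregate pass@32 yourself, the sketch below counts a problem as solved when any of the 32 seeds yields a verified proof; the glob pattern and field names are assumptions about the output layout:

```python
# Hedged sketch: pass@32 as the fraction of problems where any of the 32
# seeds produced a verified proof. The glob pattern and the "name" and
# "proof_status" fields are assumptions about the output schema.
import json
from collections import defaultdict
from glob import glob

solved: dict[str, bool] = defaultdict(bool)
files = glob(
    "/workspace/evals/nemotron-nano-3-minif2f-self-correction/**/output*.jsonl",
    recursive=True,
)
for path in files:  # one file per random seed
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            solved[rec["name"]] |= rec["proof_status"] == "completed"

if solved:
    print(f"pass@32: {sum(solved.values()) / len(solved):.2%}")
```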

## Known Limitations

* **Difficulty balance**: Problem difficulty is not explicitly normalized, and the pipeline stages introduce implicit selection biases
* **Compiler warnings**: Some solutions contain warnings (e.g., unused hypotheses), since verification only checks for compilation errors
* **Placeholder definitions**: Some formalizations use trivial placeholder definitions instead of mathlib (e.g., `def MyContinuous... Prop := True`)
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@ nav:
- tutorials/index.md
- Papers & Releases:
- releases/index.md
- Nemotron-Math-Proofs: releases/nemotronmathproofs/index.md
- OpenReasoning:
- releases/openreasoning/index.md
- Model Evaluation: releases/openreasoning/evaluation.md
Expand Down