-
Notifications
You must be signed in to change notification settings - Fork 164
Stepheng/nemotron math proofs docs #1111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,373 @@ | ||
| --- | ||
| date: 2025-12-15 | ||
| --- | ||
|
|
||
| # Nemotron-Math-Proofs | ||
|
|
||
| ## Dataset Overview | ||
|
|
||
| Using our pipelines we created [Nemotron-Math-Proofs-v1](https://huggingface.co/datasets/nvidia/Nemotron-Math-Proofs-v1), | ||
| a large-scale mathematical reasoning dataset containing approximately 580k natural language proof problems, | ||
| 550k formalizations into theorem statements in Lean 4, and 900k model-generated reasoning trajectories | ||
| culminating in Lean 4 proofs. | ||
|
|
||
| The dataset integrates human-authored problems with systematically generated formalizations and solution traces: | ||
|
|
||
| * **Natural Language Problems**: ~580k proof problems sourced from AoPS forums, Math StackExchange, and MathOverflow, | ||
| semantically deduplicated and decontaminated against popular benchmarks | ||
| * **Formal Statements**: ~550k Lean 4 theorem statements generated via autoformalization | ||
| * **Proof Trajectories**: ~900k verified reasoning traces and proofs | ||
|
|
||
| Each natural language problem is formalized by [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) into | ||
| Lean 4 theorem statements. Reasoning traces and proofs are generated by | ||
| [Goedel-Prover-V2-32B](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B) and verified by the Lean 4 compiler. | ||
|
|
||
| This dataset was used as part of the SFT data for [Nemotron-Nano-v3](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3). | ||
|
|
||
| ## Training Results | ||
|
|
||
| We compare a Qwen3-8B model fine-tuned on this dataset against Goedel-Prover-V2-8B on miniF2F: | ||
|
|
||
| | Model | pass@32 (no self-correction) | pass@32 (with self-correction) | | ||
| |-------|------------------------------|-------------------------------| | ||
| | Goedel-Prover-V2-8B | 84.6% | 86.7% | | ||
| | Qwen3-8B SFT on Nemotron-Math-Proofs-v1 | 85.3% | 90.2% | | ||
|
|
||
| Nemotron-Nano-v3 (which includes this dataset in its training) achieves the following on miniF2F: | ||
|
|
||
| | Model | pass@32 (no self-correction) | pass@32 (with self-correction) | | ||
| |-------|------------------------------|-------------------------------| | ||
| | Nemotron-Nano-v3 | 79.92% | 86.89% | | ||
| | gpt-oss-20b | 43.03% | 59.42% | | ||
| | Qwen3-30B-A3B-Thinking | 16.80% | 32.37% | | ||
|
|
||
| ## How to Reproduce | ||
|
|
||
| Browse the sections below to see commands for autoformalization, theorem proving, training, and evaluation. | ||
|
|
||
| !!! note | ||
|
|
||
| These commands assume you have `/workspace` defined in your [cluster config](../../basics/cluster-configs.md). | ||
| Adjust paths and cluster settings according to your environment. | ||
|
|
||
| ### Autoformalization | ||
|
|
||
| The autoformalization pipeline translates natural language theorems into Lean 4 formal statements using an iterative | ||
| refinement process with backtranslation verification. The input is natural language math problems—see | ||
| [OpenMathReasoning dataset construction](../openmathreasoning/dataset.md) for how to prepare these. | ||
|
|
||
| === "CLI" | ||
|
|
||
| ```bash | ||
| ns generate \ | ||
| --cluster=slurm \ | ||
| --generation_module=nemo_skills.inference.autoformalize \ | ||
| --model=openai/gpt-oss-120b \ | ||
| --server_type=vllm \ | ||
| --server_gpus=<NUM_GPUS> \ | ||
| --input_file=/workspace/data/problems.jsonl \ | ||
| --output_dir=/workspace/data/autoformalize_output \ | ||
| --with_sandbox \ | ||
| --num_random_seeds=1 \ | ||
| ++prompt_config=lean4/autoformalization \ | ||
| ++inference.tokens_to_generate=120000 \ | ||
| ++inference.temperature=1.0 \ | ||
| ++inference.top_p=1.0 \ | ||
| ++adaptive_reasoning=True \ | ||
| ++refinement=True \ | ||
| ++judge_enabled=True \ | ||
| ++refinement_max_turns=8 \ | ||
| ++backtranslation_prompt_config=lean4/backtranslation \ | ||
| ++judge_prompt_config=lean4/judge-backtranslation \ | ||
| ++refine_consistent_error_prompt_config=lean4/refinement_consistent_error \ | ||
| ++refine_parsing_error_prompt_config=lean4/refinement_parsing_error \ | ||
| ++refine_code_error_prompt_config=lean4/refinement_code_error | ||
| ``` | ||
|
|
||
| === "Python" | ||
|
|
||
| ```python | ||
| from nemo_skills.pipeline.cli import generate, wrap_arguments | ||
|
|
||
| generate( | ||
| ctx=wrap_arguments( | ||
| "++inference.tokens_to_generate=120000 " | ||
| "++prompt_config=lean4/autoformalization " | ||
| "++inference.temperature=1.0 " | ||
| "++inference.top_p=1.0 " | ||
| "++adaptive_reasoning=True " | ||
| "++refinement=True " | ||
| "++judge_enabled=True " | ||
| "++refinement_max_turns=8 " | ||
| "++backtranslation_prompt_config=lean4/backtranslation " | ||
| "++judge_prompt_config=lean4/judge-backtranslation " | ||
| "++refine_consistent_error_prompt_config=lean4/refinement_consistent_error " | ||
| "++refine_parsing_error_prompt_config=lean4/refinement_parsing_error " | ||
| "++refine_code_error_prompt_config=lean4/refinement_code_error " | ||
| ), | ||
| generation_module="nemo_skills.inference.autoformalize", | ||
| cluster="slurm", | ||
| server_gpus="<NUM_GPUS>", | ||
| input_file="/workspace/data/problems.jsonl", | ||
| output_dir="/workspace/data/autoformalize_output", | ||
| server_type="vllm", | ||
| model="openai/gpt-oss-120b", | ||
| with_sandbox=True, | ||
| num_random_seeds=1, | ||
| ) | ||
| ``` | ||
|
|
||
| The pipeline includes: | ||
|
|
||
| * **Initial Formalization**: LLM translates natural language to Lean 4 code | ||
| * **Compilation Check**: Lean 4 sandbox verifies syntactic correctness | ||
| * **Backtranslation & Verification**: Formal code is backtranslated and compared to the original | ||
| * **Iterative Refinement**: Up to 8 iterations to fix parsing, compilation, or semantic errors | ||
| * **Adaptive Reasoning**: Automatically reduces reasoning effort when hitting context limits | ||
|
|
||
| ### Theorem Proving | ||
|
|
||
| The prover pipeline generates proofs for formalized statements with iterative error correction. | ||
| Input: formal statements from the autoformalization step. | ||
|
|
||
| === "CLI" | ||
|
|
||
| ```bash | ||
| ns generate \ | ||
| --cluster=slurm \ | ||
| --generation_module=nemo_skills.inference.prover \ | ||
| --model=Goedel-LM/Goedel-Prover-V2-32B \ | ||
| --server_type=vllm \ | ||
| --server_gpus=<NUM_GPUS> \ | ||
| --server_args="--max-model-len 40960" \ | ||
| --input_file=/workspace/data/formal_statements.jsonl \ | ||
| --output_dir=/workspace/data/proofs_output \ | ||
| --with_sandbox \ | ||
| --num_random_seeds=1 \ | ||
| ++prompt_config=lean4/goedel-prover-v2 \ | ||
| ++inference.tokens_to_generate=38912 \ | ||
| ++inference.temperature=1.0 \ | ||
| ++inference.top_p=0.95 \ | ||
| ++refinement=True \ | ||
| ++refinement_max_turns=8 \ | ||
| ++remove_cot=True \ | ||
| ++n_pass=4 \ | ||
| ++refinement_prompt_config=lean4/goedel-prover-v2-refinement \ | ||
| ++delete_wrong_turns=True \ | ||
| ++max_concurrent_requests=512 | ||
| ``` | ||
|
|
||
| === "Python" | ||
|
|
||
| ```python | ||
| from nemo_skills.pipeline.cli import generate, wrap_arguments | ||
|
|
||
| generate( | ||
| ctx=wrap_arguments( | ||
| "++inference.tokens_to_generate=38912 " | ||
| "++inference.temperature=1.0 " | ||
| "++inference.top_p=0.95 " | ||
| "++prompt_config=lean4/goedel-prover-v2 " | ||
| "++refinement=True " | ||
| "++refinement_max_turns=8 " | ||
| "++remove_cot=True " | ||
| "++n_pass=4 " | ||
| "++refinement_prompt_config=lean4/goedel-prover-v2-refinement " | ||
| "++delete_wrong_turns=True " | ||
| "++max_concurrent_requests=512 " | ||
| ), | ||
| generation_module="nemo_skills.inference.prover", | ||
| cluster="slurm", | ||
| input_file="/workspace/data/formal_statements.jsonl", | ||
| output_dir="/workspace/data/proofs_output", | ||
| model="Goedel-LM/Goedel-Prover-V2-32B", | ||
| server_type="vllm", | ||
| server_gpus="<NUM_GPUS>", | ||
| server_args="--max-model-len 40960", | ||
| num_random_seeds=1, | ||
| with_sandbox=True, | ||
| ) | ||
| ``` | ||
|
|
||
| The proving strategy includes: | ||
|
|
||
| * **Chain-of-thought removal**: Strips reasoning, keeping only formal proof code | ||
| * **Wrong turn deletion**: Discards failed attempts to prevent context pollution | ||
| * **Structured error feedback**: Compiler errors annotated with `<error></error>` tags | ||
| * **Pass@N with refinement**: Multiple independent attempts, each with iterative refinement | ||
|
|
||
| ### Model Training | ||
|
|
||
| To fine-tune a model on the Nemotron-Math-Proofs dataset. | ||
| Input: processed SFT data from the theorem proving step. | ||
|
|
||
| === "CLI" | ||
|
|
||
| ```bash | ||
| ns sft_nemo_rl \ | ||
| --cluster=slurm \ | ||
| --expname=qwen3-8b-lean-sft \ | ||
| --output_dir=/workspace/training/qwen3-8b-lean-sft \ | ||
| --hf_model=Qwen/Qwen3-8B \ | ||
| --training_data=/workspace/data/sft_data.jsonl \ | ||
| --num_nodes=<NUM_NODES> \ | ||
| --num_gpus=<NUM_GPUS> \ | ||
| --backend=megatron \ | ||
| ++checkpointing.save_period=250 \ | ||
| ++sft.max_num_epochs=2000 \ | ||
| ++sft.max_num_steps=1000 \ | ||
| ++policy.megatron_cfg.tensor_model_parallel_size=4 \ | ||
| ++policy.megatron_cfg.pipeline_model_parallel_size=1 \ | ||
| ++policy.megatron_cfg.context_parallel_size=4 \ | ||
| ++policy.train_global_batch_size=2048 \ | ||
| ++policy.max_total_sequence_length=49152 \ | ||
| ++policy.megatron_cfg.optimizer.lr=1e-4 \ | ||
| ++policy.megatron_cfg.optimizer.min_lr=1e-4 \ | ||
| ++policy.megatron_cfg.scheduler.lr_warmup_iters=0 | ||
| ``` | ||
|
|
||
| === "Python" | ||
|
|
||
| ```python | ||
| from nemo_skills.pipeline.cli import sft_nemo_rl, wrap_arguments | ||
|
|
||
| sft_nemo_rl( | ||
| ctx=wrap_arguments( | ||
| "++checkpointing.save_period=250 " | ||
| "++sft.max_num_epochs=2000 " | ||
| "++sft.max_num_steps=1000 " | ||
| "++policy.megatron_cfg.tensor_model_parallel_size=4 " | ||
| "++policy.megatron_cfg.pipeline_model_parallel_size=1 " | ||
| "++policy.megatron_cfg.context_parallel_size=4 " | ||
| "++policy.train_global_batch_size=2048 " | ||
| "++policy.max_total_sequence_length=49152 " | ||
| "++policy.megatron_cfg.optimizer.lr=1e-4 " | ||
| "++policy.megatron_cfg.optimizer.min_lr=1e-4 " | ||
| "++policy.megatron_cfg.scheduler.lr_warmup_iters=0 " | ||
| ), | ||
| cluster="slurm", | ||
| expname="qwen3-8b-lean-sft", | ||
| backend="megatron", | ||
| output_dir="/workspace/training/qwen3-8b-lean-sft", | ||
| hf_model="Qwen/Qwen3-8B", | ||
| training_data="/workspace/data/sft_data.jsonl", | ||
| num_gpus="<NUM_GPUS>", | ||
| num_nodes="<NUM_NODES>", | ||
| ) | ||
| ``` | ||
|
|
||
| ### Model Evaluation on miniF2F | ||
|
|
||
| To evaluate a model on the miniF2F benchmark with 32 samples per problem (without self-correction): | ||
|
|
||
| === "CLI" | ||
|
|
||
| ```bash | ||
| ns eval \ | ||
| --cluster=slurm \ | ||
| --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ | ||
| --server_type=vllm \ | ||
| --server_gpus=<NUM_GPUS> \ | ||
| --output_dir=/workspace/evals/nemotron-nano-3-minif2f \ | ||
| --benchmarks=minif2f:32 \ | ||
| --with_sandbox \ | ||
| ++prompt_config=lean4/goedel-prover-v2-nemotron \ | ||
| ++inference.tokens_to_generate=120000 \ | ||
| ++inference.temperature=1.0 \ | ||
| ++inference.top_p=1.0 \ | ||
| --extra_eval_args="++eval_config.timeout=600" | ||
| ``` | ||
|
|
||
| === "Python" | ||
|
|
||
| ```python | ||
| from nemo_skills.pipeline.cli import eval, wrap_arguments | ||
|
|
||
| eval( | ||
| ctx=wrap_arguments( | ||
| "++prompt_config=lean4/goedel-prover-v2-nemotron " | ||
| "++inference.tokens_to_generate=120000 " | ||
| "++inference.temperature=1.0 " | ||
| "++inference.top_p=1.0 " | ||
| ), | ||
| cluster="slurm", | ||
| model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", | ||
| server_type="vllm", | ||
| server_gpus="<NUM_GPUS>", | ||
| benchmarks="minif2f:32", | ||
| output_dir="/workspace/evals/nemotron-nano-3-minif2f", | ||
| extra_eval_args="++eval_config.timeout=600", | ||
| with_sandbox=True, | ||
| ) | ||
| ``` | ||
|
|
||
| To evaluate with self-correction (iterative refinement based on compiler feedback): | ||
|
|
||
| === "CLI" | ||
|
|
||
| ```bash | ||
| ns generate \ | ||
| --cluster=slurm \ | ||
| --generation_module=nemo_skills.inference.prover \ | ||
| --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ | ||
| --server_type=vllm \ | ||
| --server_gpus=<NUM_GPUS> \ | ||
| --input_file=/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl \ | ||
| --output_dir=/workspace/evals/nemotron-nano-3-minif2f-self-correction \ | ||
| --with_sandbox \ | ||
| --num_random_seeds=32 \ | ||
| ++prompt_config=lean4/goedel-prover-v2-nemotron \ | ||
| ++inference.tokens_to_generate=120000 \ | ||
| ++inference.temperature=1.0 \ | ||
| ++inference.top_p=1.0 \ | ||
| ++refinement=True \ | ||
| ++refinement_max_turns=8 \ | ||
| ++remove_cot=True \ | ||
| ++n_pass=1 \ | ||
| ++refinement_prompt_config=lean4/goedel-prover-v2-refinement \ | ||
| ++delete_wrong_turns=True \ | ||
| ++max_concurrent_requests=512 | ||
| ``` | ||
|
|
||
| === "Python" | ||
|
|
||
| ```python | ||
| from nemo_skills.pipeline.cli import generate, wrap_arguments | ||
|
|
||
| generate( | ||
| ctx=wrap_arguments( | ||
| "++inference.tokens_to_generate=120000 " | ||
| "++inference.temperature=1.0 " | ||
| "++inference.top_p=1.0 " | ||
| "++prompt_config=lean4/goedel-prover-v2-nemotron " | ||
| "++refinement=True " | ||
| "++refinement_max_turns=8 " | ||
| "++remove_cot=True " | ||
| "++n_pass=1 " | ||
| "++refinement_prompt_config=lean4/goedel-prover-v2-refinement " | ||
| "++delete_wrong_turns=True " | ||
| "++max_concurrent_requests=512 " | ||
| ), | ||
| generation_module="nemo_skills.inference.prover", | ||
| cluster="slurm", | ||
| input_file="/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl", | ||
| output_dir="/workspace/evals/nemotron-nano-3-minif2f-self-correction", | ||
| model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", | ||
| server_type="vllm", | ||
| server_gpus="<NUM_GPUS>", | ||
| num_random_seeds=32, | ||
| with_sandbox=True, | ||
| ) | ||
| ``` | ||
|
|
||
| To summarize evaluation results: | ||
|
|
||
| ```bash | ||
| ns summarize_results /workspace/evals/nemotron-nano-3-minif2f/eval-results/minif2f --cluster slurm | ||
| ``` | ||
|
|
||
| ## Known Limitations | ||
|
|
||
| * **Difficulty balance**: No explicit normalization of problem difficulty; implicit selection biases from pipeline stages | ||
| * **Token/length normalization**: Some solutions contain warnings (e.g., unused hypotheses) since verification only checks for compilation errors | ||
| * **Placeholder definitions**: Some formalizations use trivial placeholder definitions instead of mathlib (e.g., `def MyContinuous... Prop := True`) | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
File renamed without changes.
File renamed without changes.
File renamed without changes.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if it's going to be better, but consider splitting into multiple pages and linking here