
Stepheng/prover cleanup #1078

Merged
gwarmstrong merged 11 commits into main from stepheng/prover-cleanup
Dec 11, 2025

Conversation

@stephencge
Collaborator

@stephencge stephencge commented Dec 5, 2025

Summary by CodeRabbit

Release Notes

  • New Features
    • Introduced Lean4 automated theorem proving capability with multi-stage refinement support
    • Added new CLI generate command to orchestrate and distribute proof generation tasks
    • Added Lean4 code utilities for proof parsing, error extraction, and optimization
    • Added multiple prompt configurations for diverse Lean4 proving workflows and refinement strategies


Signed-off-by: Stephen Ge <stepheng@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Dec 5, 2025

📝 Walkthrough


This PR introduces a comprehensive Lean 4 prover system, including new utilities for Lean4 text processing and error handling, a ProverTask class implementing multi-stage proof generation with refinement capabilities, a CLI command for distributed task execution, and associated prompt configurations.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| **Factory & Core Integration**<br>`nemo_skills/inference/factory.py` | Added a new enum member `GenerationType.prover = "prover"` and a corresponding module mapping in `GENERATION_MODULE_MAP` pointing to `nemo_skills.inference.prover`. |
| **Lean4 Utilities**<br>`nemo_skills/inference/lean4_utils.py` | New module providing nine utility functions for Lean4 text processing and error reporting: `remove_comments`, `move_imports_to_beginning`, `return_theorem_to_prove`, `return_theorem_to_replace`, `replace_statement_in_proof`, `refine_by_sorry`, `extract_code`, `parse_error`, and `get_error_str`. Includes a module-level logger and regex-based string extraction. |
| **Prover Task Implementation**<br>`nemo_skills/inference/prover.py` | New module defining `ProverConfig` (extends `GenerateSolutionsConfig` with Lean4-specific parameters) and `ProverTask` (subclass of `GenerationTask`) implementing a multi-stage proof-generation loop with adaptive reasoning, iterative refinement, SFT-friendly prompt management, sandbox integration, and Lean4 code execution. Exposes `GENERATION_TASK_CLASS`, a `generate()` Hydra entry point, and `HELP_MESSAGE`. |
| **Pipeline Integration**<br>`nemo_skills/pipeline/prover.py` | New CLI command `generate` with extensive options (cluster, input/output paths, generation type/module, model, server config, chunking, seeds, logging, dependencies, mounting, pre/post processing). Orchestrates distributed task execution via dynamic module loading, seed computation, chunking support, wandb logging, and experiment scheduling. |
| **Lean4 Prompt Configurations**<br>`nemo_skills/prompt/config/lean4/*-nemotron.yaml`<br>`nemo_skills/prompt/config/lean4/*-refinement*.yaml`<br>`nemo_skills/prompt/config/lean4/goedel-prover-v2.yaml` | Five new YAML configuration files: `formal-proof-deepseek-prover-v2-nemotron.yaml`, `goedel-prover-v2-nemotron.yaml`, `goedel-prover-v2-refinement-nemotron.yaml`, `goedel-prover-v2-refinement.yaml`, and `goedel-prover-v2.yaml`. They provide Lean4-specific prompting guidance for theorem completion with structured workflows (a proof plan followed by code generation, and error analysis for refinement). |
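For orientation, the kind of extraction `extract_code` performs can be sketched as follows. This is a simplified stand-in, not the actual implementation; the real utility also handles an import header and error reporting, and (per the review below) currently returns the string `"None"` rather than Python's `None` on failure.

```python
import re

def extract_lean4_code(text):
    """Return the contents of the first ```lean4 fenced block, or None.

    Simplified sketch of the extract_code behavior described above.
    """
    match = re.search(r"```lean4\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else None

generation = "Proof plan: trivial.\n```lean4\ntheorem t : 1 = 1 := rfl\n```"
print(extract_lean4_code(generation))  # → theorem t : 1 = 1 := rfl
```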

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as Pipeline CLI
    participant Task as ProverTask
    participant LLM
    participant Sandbox as Sandbox Executor
    participant Logger as Logger

    User->>CLI: invoke generate command
    CLI->>CLI: resolve generation module<br/>(ProverTask)
    CLI->>CLI: compute seeds & chunks
    CLI->>Task: create ProverTask instance<br/>with ProverConfig

    loop for each datapoint
        Task->>Task: _single_data_point_generate()
        
        loop for each pass (pass_at_N)
            Task->>LLM: get initial generation<br/>(with adaptive reasoning)
            LLM-->>Task: raw proof generation
            
            Task->>Task: extract & parse<br/>Lean4 code
            
            alt parsing succeeds
                Task->>Sandbox: execute proof code
                Sandbox-->>Task: execution result
                
                alt proof succeeds
                    Task->>Logger: log success
                    Note over Task: Save result
                else execution error/timeout
                    Task->>Logger: log error
                    Task->>Task: parse error message
                    
                    alt refinement enabled
                        Task->>LLM: generate refinement<br/>(with error context)
                        LLM-->>Task: refined proof
                        Task->>Task: replace proof in statement
                        Task->>Sandbox: re-execute refined proof
                        Sandbox-->>Task: execution result
                    end
                end
            else parsing fails
                Task->>Logger: log parse error
                alt refinement enabled
                    Task->>LLM: request correction<br/>(with parse error)
                    LLM-->>Task: corrected proof
                end
            end
        end
        
        Task->>Logger: finalize datapoint results
    end
    
    CLI->>CLI: run experiment via pipeline_utils
    CLI-->>User: return results/tasks

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–60 minutes

  • nemo_skills/inference/prover.py: Largest and most logic-dense file with multi-stage generation loop, adaptive reasoning, refinement flow, error handling, and SFT-specific prompt management. Requires careful review of control flow and interaction with sandbox/LLM.
  • nemo_skills/pipeline/prover.py: Extensive CLI with 20+ parameters, dynamic module loading, seed/chunk computation, and experiment orchestration. Complex interaction between multiple utilities and configuration resolution.
  • nemo_skills/inference/lean4_utils.py: Nine utility functions with regex-based parsing, string manipulation, and error formatting. Needs verification of edge cases in extraction and error message rendering.
  • Cross-module integration: ProverTask's use of lean4_utils functions, factory registration, and pipeline's dynamic class loading require understanding the full integration flow.
  • Configuration files: While simple, verify that prompt placeholders and workflows align with ProverTask's expectations.

Suggested reviewers

  • Kipok

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 17.39%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check ❓ Inconclusive: The title 'Stepheng/prover cleanup' is vague and reads like a branch name rather than describing the specific changes made. Provide a more descriptive title that clearly summarizes the main changes, such as 'Add Lean4 prover task with refinement capabilities' or 'Introduce ProverTask for iterative Lean4 proof generation'.
✅ Passed checks (1 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch stepheng/prover-cleanup



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (6)
nemo_skills/inference/lean4_utils.py (2)

101-117: Consider returning None instead of string "None".

The function returns the literal string "None" when no code block is found, which can be confusing and error-prone for callers who might check for Python's None. Consider returning None instead:

-    return "None"
+    return None

Callers would then use if result is None: instead of if result == "None":.

Also note: the hardcoded import_head may not suit all Lean4 use cases. Consider making it configurable if needed for different proof contexts.
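To illustrate why the string sentinel is error-prone, a minimal sketch (function names are hypothetical, not from the PR):

```python
def extract_sentinel(text):
    """Mimics the current behavior: returns the string "None" on failure."""
    return "None"

def extract_optional(text):
    """Suggested behavior: returns Python's None on failure."""
    return None

# The sentinel string is truthy, so a natural success check silently passes:
looks_found = bool(extract_sentinel("no code here"))          # True, despite failure
really_found = extract_optional("no code here") is not None   # False, as expected
print(looks_found, really_found)
```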


127-130: Hardcoded path /lean4/my_project/ may limit portability.

The error pattern assumes Lean4 errors come from a specific /lean4/my_project/ path. This works in a containerized environment but may fail in other setups. Consider making the path prefix configurable or using a more flexible pattern if this utility needs to work outside the expected container environment.
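The suggested change could look roughly like this. The parameter name and exact pattern are assumptions modeled on the `/lean4/my_project/` pattern described above, not the actual code:

```python
import re

def build_error_pattern(path_prefix="/lean4/my_project/"):
    """Compile the Lean error-extraction regex for a configurable project path.

    path_prefix is a hypothetical parameter; the default mirrors the
    hardcoded container path discussed above.
    """
    escaped = re.escape(path_prefix)
    return re.compile(
        rf"({escaped}.*?:\d+:\d+: error:.*?)(?=\n{escaped}|\Z)",
        re.DOTALL,
    )

log = "/lean4/my_project/Main.lean:3:10: error: unknown identifier 'foo'"
print(build_error_pattern().findall(log))  # matches the full error message
```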

nemo_skills/pipeline/prover.py (1)

253-262: Variable shadowing: chunk_ids reused in loop.

At line 254, chunk_ids from the outer scope (computed at line 198-200) is shadowed by iterating over remaining_jobs.items(). This works but could cause confusion. Consider renaming:

-        for seed_idx, (seed, chunk_ids) in enumerate(remaining_jobs.items()):
+        for seed_idx, (seed, seed_chunk_ids) in enumerate(remaining_jobs.items()):
             if wandb_parameters:
                 # no need for chunks as it will run after merging
                 wandb_parameters["samples_file"] = pipeline_utils.get_chunked_rs_filename(
                     output_dir,
                     random_seed=seed,
                     chunk_id=None,
                 )
-            for chunk_id in chunk_ids:
+            for chunk_id in seed_chunk_ids:
nemo_skills/inference/prover.py (3)

104-106: Add validation for tokenizer path.

If both cfg.tokenizer is None and cfg.server.get("model") returns None, AutoTokenizer.from_pretrained(None) will raise an unclear error. Consider adding explicit validation.

         # Initialize tokenizer for chat template application
         tokenizer_path = self.cfg.tokenizer or self.cfg.server.get("model")
+        if tokenizer_path is None:
+            raise ValueError("tokenizer or server.model must be specified for chat template application")
         self.hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

286-304: Dead code: unreachable else branch.

The else branch at lines 299-304 is unreachable because the preceding conditions (code == "None" and "**Error**" in full_code) are exhaustive for the outer if condition.

             if code == "None" or "**Error**" in full_code:
                 if code == "None":
                     execution_result = {
                         "process_status": "failed",
                         "stderr": "",
                         "stdout": "Parsing error. Cannot parse the code from output. Please try again and write the code in the format of ```lean4\n<code>\n```",
                     }
-                elif "**Error**" in full_code:
+                else:  # "**Error**" in full_code
                     execution_result = {
                         "process_status": "failed",
                         "stderr": "",
                         "stdout": full_code,
                     }
-                else:
-                    execution_result = {
-                        "process_status": "failed",
-                        "stderr": "",
-                        "stdout": "Unknown error when parsing code.",
-                    }

312-313: Consider making sandbox timeout configurable.

The timeout of 600.0 seconds is hardcoded. For long-running proofs or faster iteration during development, this could be a configuration parameter.

 @nested_dataclass(kw_only=True)
 class ProverConfig(GenerateSolutionsConfig):
     max_tokens: int = 40960  # model max tokens
     n_pass: int = 1  # number of passes to run the prover
+    sandbox_timeout: float = 600.0  # timeout for Lean4 code execution in seconds

Then use self.cfg.sandbox_timeout instead of the hardcoded value.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cc1dcc and b5d527b.

📒 Files selected for processing (9)
  • nemo_skills/inference/factory.py (1 hunks)
  • nemo_skills/inference/lean4_utils.py (1 hunks)
  • nemo_skills/inference/prover.py (1 hunks)
  • nemo_skills/pipeline/prover.py (1 hunks)
  • nemo_skills/prompt/config/lean4/formal-proof-deepseek-prover-v2-nemotron.yaml (1 hunks)
  • nemo_skills/prompt/config/lean4/goedel-prover-v2-nemotron.yaml (1 hunks)
  • nemo_skills/prompt/config/lean4/goedel-prover-v2-refinement-nemotron.yaml (1 hunks)
  • nemo_skills/prompt/config/lean4/goedel-prover-v2-refinement.yaml (1 hunks)
  • nemo_skills/prompt/config/lean4/goedel-prover-v2.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/inference/prover.py (5)
nemo_skills/inference/model/base.py (1)
  • EndpointType (38-41)
nemo_skills/prompt/utils.py (1)
  • get_prompt (410-443)
nemo_skills/utils.py (4)
  • get_help_message (341-361)
  • get_logger_name (39-43)
  • nested_dataclass (69-102)
  • parse_reasoning (49-66)
nemo_skills/pipeline/prover.py (1)
  • generate (39-328)
nemo_skills/inference/lean4_utils.py (5)
  • extract_code (101-117)
  • get_error_str (149-205)
  • parse_error (120-146)
  • refine_by_sorry (80-98)
  • replace_statement_in_proof (64-77)
nemo_skills/inference/lean4_utils.py (2)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
nemo_skills/inference/patch_litellm_logging.py (1)
  • start (37-38)
nemo_skills/pipeline/prover.py (10)
nemo_skills/dataset/utils.py (1)
  • import_from_path (60-67)
nemo_skills/inference/factory.py (1)
  • GenerationType (18-22)
nemo_skills/pipeline/app.py (1)
  • typer_unpacker (25-53)
nemo_skills/utils.py (2)
  • compute_chunk_ids (456-484)
  • str_ids_to_list (428-453)
nemo_skills/pipeline/utils/server.py (2)
  • should_get_random_port (62-63)
  • get_server_command (114-227)
nemo_skills/pipeline/utils/cluster.py (1)
  • get_cluster_config (314-368)
nemo_skills/pipeline/utils/mounts.py (1)
  • resolve_mount_paths (317-359)
nemo_skills/inference/generate.py (2)
  • get_generation_default_args (243-251)
  • get_server_command_fn (254-264)
nemo_skills/pipeline/utils/generation.py (1)
  • get_remaining_jobs (60-175)
nemo_skills/pipeline/utils/exp.py (3)
  • get_exp (753-769)
  • add_task (416-702)
  • run_exp (705-750)
🪛 Ruff (0.14.7)
nemo_skills/inference/prover.py

75-75: Avoid specifying long messages outside the exception class

(TRY003)


83-83: Avoid specifying long messages outside the exception class

(TRY003)


86-86: Avoid specifying long messages outside the exception class

(TRY003)


114-114: Unused method argument: data

(ARG002)


119-119: Avoid specifying long messages outside the exception class

(TRY003)


205-205: Unused method argument: data

(ARG002)


268-273: Consider iterable unpacking instead of concatenation

(RUF005)


384-384: Loop control variable i not used within loop body

(B007)

nemo_skills/pipeline/prover.py

58-58: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


69-69: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


110-112: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


160-162: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


191-191: Avoid specifying long messages outside the exception class

(TRY003)


221-221: Avoid specifying long messages outside the exception class

(TRY003)


230-233: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (16)
nemo_skills/inference/factory.py (1)

22-22: LGTM!

The new prover enum member and module mapping follow the established patterns for other generation types in this file.

Also applies to: 29-29

nemo_skills/prompt/config/lean4/goedel-prover-v2.yaml (1)

19-27: LGTM!

The prompt configuration is well-structured with a clear workflow: present the problem, request a proof plan, then generate the formal proof. The {problem} placeholder follows the expected templating convention.

nemo_skills/prompt/config/lean4/goedel-prover-v2-refinement.yaml (1)

19-24: Verify: Missing {proof_attempt} placeholder.

This refinement prompt includes {error_message} but not {proof_attempt}. Compare with goedel-prover-v2-refinement-nemotron.yaml which includes both placeholders. If the proof attempt is meant to be provided via conversation history rather than template substitution, this is fine. Otherwise, users won't see the original proof that failed.

nemo_skills/prompt/config/lean4/goedel-prover-v2-nemotron.yaml (1)

19-28: LGTM!

The nemotron variant provides clear two-phase instructions (proof plan followed by code block) with explicit output formatting requirements. This aligns well with the extract_code function in lean4_utils.py, which expects code fenced with ```lean4 markers.

nemo_skills/prompt/config/lean4/goedel-prover-v2-refinement-nemotron.yaml (1)

19-28: LGTM!

The refinement prompt correctly includes both the previous {proof_attempt} and {error_message} placeholders, giving the model full context for correction. The explicit output formatting aligns with the code extraction logic.

nemo_skills/prompt/config/lean4/formal-proof-deepseek-prover-v2-nemotron.yaml (1)

19-29: LGTM!

The deepseek-style prompt uses appropriate placeholders ({header}, {informal_prefix}, {formal_statement}) with a sorry placeholder indicating where the proof needs to be completed. The two-phase instruction structure is consistent with other nemotron variants.

nemo_skills/pipeline/prover.py (2)

222-223: Default to GenerationType.prover for this prover-specific command?

The fallback defaults to GenerationType.generate when neither generation_module nor generation_type is specified. Since this is prover.py, consider defaulting to GenerationType.prover instead:

     if generation_module is None:
-        generation_module = GENERATION_MODULE_MAP[generation_type or GenerationType.generate]
+        generation_module = GENERATION_MODULE_MAP[generation_type or GenerationType.prover]

Alternatively, if this is intentionally a generic generation pipeline, the file might be better named to reflect that generality.


37-163: Static analysis notes: B008 and TRY003 are acceptable here.

The Ruff B008 warnings about typer.Option calls in default arguments are false positives - this is the standard Typer pattern required for CLI option definitions. The TRY003 warnings about long exception messages are minor style preferences that don't affect functionality.

nemo_skills/inference/lean4_utils.py (1)

182-187: Potential issue with loop variable j scope.

At line 186, j is used after the for-loop to calculate leading_spaces, but if end_line <= start_line + 1, the loop body never executes and j is never bound (raising UnboundLocalError, a subclass of NameError). This edge case could occur when the error spans exactly two lines.

             else:
                 show_line = 6
+                j = start_line  # Initialize in case loop doesn't execute
                 for j in range(start_line + 1, min(end_line, start_line + show_line)):
                     error_code += f"{code_lines[j]}\n"
                 if end_line > start_line + show_line:
                     leading_spaces = len(code_lines[j]) - len(code_lines[j].lstrip(" "))

Likely an incorrect or invalid review comment.
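Whether or not the flagged path is actually reachable in the PR, the underlying Python semantics are easy to check in isolation. The names below are illustrative stand-ins for the pattern described above, not the actual code:

```python
def leading_spaces_after(lines, start_line, end_line, show_line=6):
    """Sketch of the flagged pattern: the loop variable j is read after the loop."""
    for j in range(start_line + 1, min(end_line, start_line + show_line)):
        _ = lines[j]  # stand-in for building error_code
    # If the range above was empty, j was never bound and this raises
    # UnboundLocalError (a subclass of NameError):
    return len(lines[j]) - len(lines[j].lstrip(" "))

print(leading_spaces_after(["  a", "    b", "  c"], 0, 3))  # → 2
try:
    leading_spaces_after(["  a", "  b"], 0, 1)  # range(1, 1) is empty
except NameError:
    print("j was never defined")
```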

nemo_skills/inference/prover.py (7)

15-51: Imports and module setup look appropriate.

The imports are well-organized and the reasoning_effort_list constant is appropriately scoped at module level for use in the adaptive reasoning logic.


114-123: Method overrides are appropriate.

The empty log_example_prompt and the data parameter are inherited interface requirements. The setup_llm correctly differentiates between generic code execution (rejected) and Lean4-specific sandbox usage.


150-188: Adaptive reasoning logic is well-structured.

The method correctly handles the inference config as both dataclass and dict, overrides endpoint type for pre-templated prompts, and implements a proper fallback loop for adaptive reasoning. The loop properly bounds at reasoning_effort_index > 0.


379-394: pass_at_N logic is correct.

The loop variable i is used after the loop to compute n_pass count. The results_dict correctly captures the last attempt's results. The static analysis warning about unused i is a false positive.


396-428: Entry points and exports are well-structured.

The process_single_datapoint properly wraps results, GENERATION_TASK_CLASS enables factory pattern registration, and the Hydra main provides proper CLI integration with help message support.


54-90: ProverConfig validation logic is sound.

The configuration class properly extends GenerateSolutionsConfig and validates Lean4-specific parameters. The _get_disallowed_params() method is correctly inherited from the parent class, and the validation in _post_init_validate_params ensures proper configuration state.


138-148: No action needed—tokenizer is not required for refinement prompts.

Both refine_prompt and nemotron_refine_prompt are only used via the fill() method, which operates correctly without a tokenizer by returning a list of message dictionaries. The main prompt uses tokenizer for string formatting via apply_chat_template, while refinement prompts use list-based message format. This is an intentional design pattern, not an oversight.

Comment on lines +205 to +212
async def _single_data_point_generate(self, data_point, data):
formal_statement = (
(data_point["header"].strip() + "\n")
+ data_point["informal_prefix"].strip()
+ ("\n" + data_point["formal_statement"].strip())
)
formal_statement = refine_by_sorry(formal_statement)
prompt_turn_list = self.prompt.fill({"problem": formal_statement.strip()})
Contributor


⚠️ Potential issue | 🔴 Critical

Potential AttributeError when prompt_format == "openai".

setup_prompt() returns None when prompt_format == "openai" (line 127), but _single_data_point_generate calls self.prompt.fill() unconditionally at line 212. This will cause an AttributeError.

Either guard the prompt access or ensure _post_init_validate_params enforces that prompt_format != "openai" when this code path is used. If OpenAI format is not intended for this prover workflow, consider failing earlier with a clear message.

🧰 Tools
🪛 Ruff (0.14.7)

205-205: Unused method argument: data

(ARG002)

🤖 Prompt for AI Agents
In nemo_skills/inference/prover.py around lines 205 to 212, self.prompt may be
None when setup_prompt() returned None for prompt_format == "openai", but
_single_data_point_generate calls self.prompt.fill() unconditionally which will
raise an AttributeError; fix by either (A) adding a guard at the start of
_single_data_point_generate that checks if self.prompt is None and raises a
clear, early ValueError/RuntimeError describing that this prover does not
support prompt_format "openai" (or falls back to an alternative prompt
creation), or (B) enforce during initialization in _post_init_validate_params
that prompt_format != "openai" (raise a clear exception at init time) so
self.prompt is guaranteed non-None when this method runs; implement one of these
two fixes and include a descriptive error message for easier debugging.

Comment on lines +308 to +314
feedback = self.refine_prompt.fill({"error_message": last_error_message})
results_dict["feedback"] = feedback[0]["content"]
else:
# execute_code returns (result_dict, session_id) tuple
execution_result, _ = await self.sandbox.execute_code(
full_code, language="lean4", timeout=600.0, max_output_characters=1000000
)
Contributor


⚠️ Potential issue | 🟠 Major

Missing null checks for refine_prompt and sandbox.

  1. self.refine_prompt is accessed at line 308, but it's only initialized when cfg.refinement is True (line 108-109). If refinement is disabled but this code path is reached (parsing error), this will raise AttributeError.

  2. self.sandbox.execute_code() is called at line 312, but self.sandbox can be None if cfg.sandbox is None (line 121).

Consider adding guards or ensuring these paths are only reachable when the required objects are initialized:

+            if not self.cfg.refinement or self.refine_prompt is None:
+                results_dict["feedback"] = None
+            else:
                 feedback = self.refine_prompt.fill({"error_message": last_error_message})
                 results_dict["feedback"] = feedback[0]["content"]

For the sandbox, consider failing early if sandbox is required but not configured.

🤖 Prompt for AI Agents
In nemo_skills/inference/prover.py around lines 308 to 314, add null checks
before using self.refine_prompt and self.sandbox: guard access to
self.refine_prompt before calling fill() (if refinement is disabled, either skip
setting feedback or raise a clear RuntimeError/ValueError indicating refinement
is required for this code path), and validate self.sandbox before calling
execute_code (if sandbox is None and execution is required, raise a clear
configuration error early with a descriptive message); ensure these guards
produce deterministic behavior (either skip refinement safely or fail fast) so
AttributeError/NoneType errors cannot occur.

Comment on lines +164 to +168
"""Generate LLM completions for a given input file.

Run `python -m nemo_skills.inference.generate --help` for other supported arguments
(need to be prefixed with ++, since we use Hydra for that script).
"""
Contributor


⚠️ Potential issue | 🟡 Minor

Docstring references incorrect module.

The docstring mentions nemo_skills.inference.generate but this is prover.py. Consider updating the help reference:

     """Generate LLM completions for a given input file.
 
-    Run `python -m nemo_skills.inference.generate --help` for other supported arguments
+    Run `python -m nemo_skills.inference.prover --help` for other supported arguments
     (need to be prefixed with ++, since we use Hydra for that script).
     """
🤖 Prompt for AI Agents
In nemo_skills/pipeline/prover.py around lines 164 to 168, the docstring
incorrectly references nemo_skills.inference.generate; update the help reference
to point to the correct module/script for this file (e.g.,
nemo_skills.pipeline.prover or the actual entrypoint that runs this prover) and
adjust the example command accordingly (preserve the note about ++-prefixed
Hydra args if applicable). Replace the outdated module name with the correct
module/CLI name and ensure the surrounding text remains accurate and concise.


@gwarmstrong gwarmstrong left a comment


@stephencge is it possible to rewrite this to use the ns generate pipeline and work via a generation module, rather than a standalone pipeline and inference script?

@gwarmstrong
Collaborator

rewrite this to use the ns generate pipeline

I see--this actually uses the generate pipeline already, but pipeline/prover.py is probably extraneous


@gwarmstrong gwarmstrong left a comment


Requesting changes for concurrency control, as well as a variety of housekeeping items.

GenerationType.generate: "nemo_skills.inference.generate",
GenerationType.math_judge: "nemo_skills.inference.llm_math_judge",
GenerationType.check_contamination: "nemo_skills.inference.check_contamination",
GenerationType.prover: "nemo_skills.inference.prover",
Collaborator


Do you think this warrants its own generation type? I worry that having this as an option may be confusing because (a) as far as I understand, this is really only for formal proofs, and (b) this is a fairly specific proving workflow that might not apply to all proving. If we give it a generic type name like "prover", people may turn it on when they are doing proof generation that doesn't necessarily need this specific workflow.

If we remove from here, you just have to call the generation module specifically, e.g.,

generate(
  ...
  generation_module="nemo_skills.inference.prover",
  ...
)

Collaborator


yeah, we don't need it in main script, let's just use as an explicit option when calling eval

Collaborator Author


removing from factory.py

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Collaborator


Can we move to a file like nemo_skills/code_execution/proof_utils.py?

@@ -0,0 +1,205 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Collaborator


Suggested change
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

Comment on lines +122 to +125
# warning_pattern = re.compile(
# r"(/lean4/my_project/.*?:\d+:\d+: warning:.*?)(?=\n/lean4/my_project|\Z)",
# re.DOTALL,
# )
Collaborator


Suggested change
# warning_pattern = re.compile(
# r"(/lean4/my_project/.*?:\d+:\d+: warning:.*?)(?=\n/lean4/my_project|\Z)",
# re.DOTALL,
# )

re.DOTALL,
)
# Find all warnings and errors
# warnings = warning_pattern.findall(log_string)
Collaborator


Suggested change
# warnings = warning_pattern.findall(log_string)

llm = get_model(**self.cfg.server, tokenizer=self.tokenizer)
return llm

def setup_prompt(self):
Collaborator


to me, this looks the same as the super method, can you confirm and delete if redundant?

generation_params["endpoint_type"] = EndpointType.text
for key, value in kwargs.items():
generation_params[key] = value
generation = await self.llm.generate_async(**generation_params)
Collaborator

It looks like we're currently missing any kind of concurrency control. I think if you're working with large datasets, this could put huge memory and network pressure on the nemo-skills process as well as the server.

We have built-in functionality in GenerationTask to handle this, called generate_with_semaphore:

async def generate_with_semaphore(self, **generation_params):
    """Generate with semaphore control.

    Ensures no more than max_concurrent_requests LLM calls can be run at the same time.
    Should work even if process_single_datapoint is doing multiple requests in parallel
    as long as those requests also use this function.
    """
    async with self.semaphore:
        return await self.llm.generate_async(**generation_params)

I fear that that implementation might not be totally right for you though, since you make multiple LLM calls per sample. You would end up round-robining across all samples, rather than running some samples to completion before starting the next one... so it's probably better to roll your own concurrency control. I would take a similar approach to generate_with_semaphore, but acquire the semaphore in process_single_datapoint -- this is a small quality-of-life improvement in case there are any failures during generation, since you will have more completed generations saved.
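For illustration, a minimal self-contained sketch of that suggestion (the class, `fake_llm_call`, and the counters are hypothetical stand-ins for the real GenerationTask API): the semaphore is held around the whole per-sample coroutine, so each sample runs all of its LLM calls to completion before releasing its slot.

```python
import asyncio

MAX_CONCURRENT = 2

async def fake_llm_call(x):
    # Hypothetical stand-in for self.llm.generate_async(...)
    await asyncio.sleep(0.01)
    return x * 2

class ProverSketch:
    def __init__(self, max_concurrent):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.peak = 0        # highest number of samples in flight at once
        self._active = 0

    async def process_single_datapoint(self, data_point):
        # Hold the semaphore for the full lifetime of this sample, even
        # though it makes multiple LLM calls internally.
        async with self.semaphore:
            self._active += 1
            self.peak = max(self.peak, self._active)
            try:
                first = await fake_llm_call(data_point)
                return await fake_llm_call(first)
            finally:
                self._active -= 1

async def main():
    prover = ProverSketch(MAX_CONCURRENT)
    results = await asyncio.gather(*(prover.process_single_datapoint(i) for i in range(5)))
    return results, prover.peak

results, peak = asyncio.run(main())
print(results)  # [0, 4, 8, 12, 16]
```

With `asyncio.Semaphore(2)`, at most two samples are ever in flight, while `gather` still returns results in input order.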

Collaborator

Actually--after seeing that pass_at_N is independent, you probably want to do it around _single_data_point_generate.


new_results_dict = {"success": False}
for i in range(N):
results_dict = await self._single_data_point_generate(data_point, data)
Collaborator

These are independent, but they are run serially. Use a loop like this to run them concurrently:

tasks = []
for data_point in remaining_data_points:
    task = asyncio.create_task(self._generate_and_save_datapoint(data_point, data, fout, pbar))
    tasks.append(task)
# Wait for all tasks to complete
if tasks:
    await asyncio.gather(*tasks)
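As a self-contained illustration of that loop, here is a runnable sketch where `_generate_and_save_datapoint` is a hypothetical stand-in that appends to a shared list instead of writing to `fout`:

```python
import asyncio

async def _generate_and_save_datapoint(data_point, results):
    # Hypothetical stand-in: "generate" a result and "save" it immediately,
    # so partial progress survives even if a later task fails.
    await asyncio.sleep(0.01)
    results.append(data_point + 1)

async def main():
    remaining_data_points = [0, 1, 2, 3]
    results = []
    tasks = []
    for data_point in remaining_data_points:
        # Schedule every data point up front instead of awaiting serially.
        tasks.append(asyncio.create_task(_generate_and_save_datapoint(data_point, results)))
    # Wait for all tasks to complete
    if tasks:
        await asyncio.gather(*tasks)
    return results

results = asyncio.run(main())
print(sorted(results))  # [1, 2, 3, 4]
```

Because the tasks are created before any are awaited, they overlap; results arrive in completion order, which is why the "save" happens inside each task rather than after the gather.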

}
)

async def _single_data_point_generate(self, data_point, data):
Collaborator

Per this comment, wrap this whole method body in `async with self.semaphore:`: https://github.com/NVIDIA-NeMo/Skills/pull/1078/files#r2593972374

You can make a helper to not have to indent this whole block if you want.
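One hedged sketch of such a helper (all names here are hypothetical): keep the real per-sample logic in an unindented `_impl` method, and make `_single_data_point_generate` a thin throttled wrapper.

```python
import asyncio

class TaskSketch:
    def __init__(self, max_concurrent):
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def _single_data_point_generate_impl(self, data_point):
        # The real per-sample logic lives here, without the extra indent.
        await asyncio.sleep(0.01)
        return {"success": True, "data_point": data_point}

    async def _single_data_point_generate(self, data_point):
        # Thin wrapper: the whole method body runs under the semaphore.
        async with self.semaphore:
            return await self._single_data_point_generate_impl(data_point)

async def main():
    task = TaskSketch(max_concurrent=2)
    return await asyncio.gather(*(task._single_data_point_generate(i) for i in range(3)))

results = asyncio.run(main())
print([r["data_point"] for r in results])  # [0, 1, 2]
```

Callers only ever see the wrapper, so throttling stays in one place while the implementation body keeps its normal indentation.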


def _transform_for_nemotron_refinement(self, proof_attempt: str, error_message: str) -> list[dict]:
"""Transform multi-turn refinement into single-turn nemotron-style prompt."""
return self.nemotron_refine_prompt.fill(
Collaborator

Consider whether there are any config args you want to pass through here (or for the other prompts, e.g., turning on thinking in the chat template):

filled_prompt = self.prompt.fill(
    data_point,
    start_assistant_response_key=self.cfg.start_assistant_response_key,
    chat_template_kwargs=self.cfg.chat_template_kwargs,
    format_as_string=(self.cfg.inference.endpoint_type == EndpointType.text),
)

I realize it could get tricky here since there are multiple prompts--might have to give each of them separate args.

Collaborator

@Kipok Kipok left a comment

If it takes too long to fix the issues, we can also merge an intermediate version early next week and refine later, as long as it doesn't break any other functionality. But this is a little time-sensitive to get in.

GenerationType.generate: "nemo_skills.inference.generate",
GenerationType.math_judge: "nemo_skills.inference.llm_math_judge",
GenerationType.check_contamination: "nemo_skills.inference.check_contamination",
GenerationType.prover: "nemo_skills.inference.prover",
Collaborator

Yeah, we don't need it in the main script; let's just use it as an explicit option when calling eval.

Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Collaborator

@gwarmstrong gwarmstrong left a comment

I have a few unresolved comments about concurrency behavior. They don't need to be addressed now; there is an issue to track them in #1083.

@gwarmstrong gwarmstrong enabled auto-merge (squash) December 11, 2025 01:42
@gwarmstrong gwarmstrong merged commit 0bbba89 into main Dec 11, 2025
5 checks passed
@gwarmstrong gwarmstrong deleted the stepheng/prover-cleanup branch December 11, 2025 01:59
gwarmstrong added a commit that referenced this pull request Dec 11, 2025
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 12, 2025
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>