
Stepheng/prover cleanup #1078

Merged
gwarmstrong merged 11 commits into main from stepheng/prover-cleanup
Dec 11, 2025

Conversation

@stephencge
Collaborator

@stephencge stephencge commented Dec 5, 2025

Summary by CodeRabbit

Release Notes

  • New Features
    • Introduced Lean4 automated theorem proving capability with multi-stage refinement support
    • Added new CLI generate command to orchestrate and distribute proof generation tasks
    • Added Lean4 code utilities for proof parsing, error extraction, and optimization
    • Added multiple prompt configurations for diverse Lean4 proving workflows and refinement strategies


Signed-off-by: Stephen Ge <stepheng@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Dec 5, 2025

📝 Walkthrough


This PR introduces a comprehensive Lean 4 prover system, including new utilities for Lean4 text processing and error handling, a ProverTask class implementing multi-stage proof generation with refinement capabilities, a CLI command for distributed task execution, and associated prompt configurations.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| **Factory & Core Integration**<br>`nemo_skills/inference/factory.py` | Added a new enum member `GenerationType.prover = "prover"` and a corresponding module mapping in `GENERATION_MODULE_MAP` pointing to `nemo_skills.inference.prover`. |
| **Lean4 Utilities**<br>`nemo_skills/inference/lean4_utils.py` | New module providing nine utility functions for Lean4 text processing and error reporting: `remove_comments`, `move_imports_to_beginning`, `return_theorem_to_prove`, `return_theorem_to_replace`, `replace_statement_in_proof`, `refine_by_sorry`, `extract_code`, `parse_error`, and `get_error_str`. Includes a module-level logger and regex-based string extraction. |
| **Prover Task Implementation**<br>`nemo_skills/inference/prover.py` | New module defining `ProverConfig` (extends `GenerateSolutionsConfig` with Lean4-specific parameters) and `ProverTask` (subclass of `GenerationTask`) implementing a multi-stage proof-generation loop with adaptive reasoning, iterative refinement, SFT-friendly prompt management, sandbox integration, and Lean4 code execution. Exposes `GENERATION_TASK_CLASS`, a `generate()` Hydra entry point, and `HELP_MESSAGE`. |
| **Pipeline Integration**<br>`nemo_skills/pipeline/prover.py` | New CLI command `generate` with extensive options (cluster, input/output paths, generation type/module, model, server config, chunking, seeds, logging, dependencies, mounting, pre/post processing). Orchestrates distributed task execution via dynamic module loading, seed computation, chunking support, wandb logging, and experiment scheduling. |
| **Lean4 Prompt Configurations**<br>`nemo_skills/prompt/config/lean4/*-nemotron.yaml`<br>`nemo_skills/prompt/config/lean4/*-refinement*.yaml`<br>`nemo_skills/prompt/config/lean4/goedel-prover-v2.yaml` | Five new YAML configuration files: `formal-proof-deepseek-prover-v2-nemotron.yaml`, `goedel-prover-v2-nemotron.yaml`, `goedel-prover-v2-refinement-nemotron.yaml`, `goedel-prover-v2-refinement.yaml`, and `goedel-prover-v2.yaml`. They provide Lean4-specific prompting guidance for theorem completion with structured workflows (a proof plan followed by code generation, and error analysis for refinement). |
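For orientation, the kind of extraction `extract_code` performs can be sketched as follows. This is a simplified stand-in, not the actual implementation; the real utility also handles an import header and error reporting, and (per the review below) currently returns the string `"None"` rather than Python's `None` on failure.

```python
import re

def extract_lean4_code(text):
    """Return the contents of the first ```lean4 fenced block, or None.

    Simplified sketch of the extract_code behavior described above.
    """
    match = re.search(r"```lean4\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else None

generation = "Proof plan: trivial.\n```lean4\ntheorem t : 1 = 1 := rfl\n```"
print(extract_lean4_code(generation))  # → theorem t : 1 = 1 := rfl
```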

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as Pipeline CLI
    participant Task as ProverTask
    participant LLM
    participant Sandbox as Sandbox Executor
    participant Logger as Logger

    User->>CLI: invoke generate command
    CLI->>CLI: resolve generation module<br/>(ProverTask)
    CLI->>CLI: compute seeds & chunks
    CLI->>Task: create ProverTask instance<br/>with ProverConfig

    loop for each datapoint
        Task->>Task: _single_data_point_generate()
        
        loop for each pass (pass_at_N)
            Task->>LLM: get initial generation<br/>(with adaptive reasoning)
            LLM-->>Task: raw proof generation
            
            Task->>Task: extract & parse<br/>Lean4 code
            
            alt parsing succeeds
                Task->>Sandbox: execute proof code
                Sandbox-->>Task: execution result
                
                alt proof succeeds
                    Task->>Logger: log success
                    Note over Task: Save result
                else execution error/timeout
                    Task->>Logger: log error
                    Task->>Task: parse error message
                    
                    alt refinement enabled
                        Task->>LLM: generate refinement<br/>(with error context)
                        LLM-->>Task: refined proof
                        Task->>Task: replace proof in statement
                        Task->>Sandbox: re-execute refined proof
                        Sandbox-->>Task: execution result
                    end
                end
            else parsing fails
                Task->>Logger: log parse error
                alt refinement enabled
                    Task->>LLM: request correction<br/>(with parse error)
                    LLM-->>Task: corrected proof
                end
            end
        end
        
        Task->>Logger: finalize datapoint results
    end
    
    CLI->>CLI: run experiment via pipeline_utils
    CLI-->>User: return results/tasks

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–60 minutes

  • nemo_skills/inference/prover.py: Largest and most logic-dense file with multi-stage generation loop, adaptive reasoning, refinement flow, error handling, and SFT-specific prompt management. Requires careful review of control flow and interaction with sandbox/LLM.
  • nemo_skills/pipeline/prover.py: Extensive CLI with 20+ parameters, dynamic module loading, seed/chunk computation, and experiment orchestration. Complex interaction between multiple utilities and configuration resolution.
  • nemo_skills/inference/lean4_utils.py: Nine utility functions with regex-based parsing, string manipulation, and error formatting. Needs verification of edge cases in extraction and error message rendering.
  • Cross-module integration: ProverTask's use of lean4_utils functions, factory registration, and pipeline's dynamic class loading require understanding the full integration flow.
  • Configuration files: While simple, verify that prompt placeholders and workflows align with ProverTask's expectations.

Suggested reviewers

  • Kipok

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 17.39%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check ❓ Inconclusive: The title 'Stepheng/prover cleanup' is vague and reads like a branch name rather than describing the specific changes made. Provide a more descriptive title that clearly summarizes the main changes, such as 'Add Lean4 prover task with refinement capabilities' or 'Introduce ProverTask for iterative Lean4 proof generation'.
✅ Passed checks (1 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch stepheng/prover-cleanup



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (6)
nemo_skills/inference/lean4_utils.py (2)

101-117: Consider returning None instead of string "None".

The function returns the literal string "None" when no code block is found, which can be confusing and error-prone for callers who might check for Python's None. Consider returning None instead:

-    return "None"
+    return None

Callers would then use if result is None: instead of if result == "None":.

Also note: the hardcoded import_head may not suit all Lean4 use cases. Consider making it configurable if needed for different proof contexts.
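To illustrate why the string sentinel is error-prone, a minimal sketch (function names are hypothetical, not from the PR):

```python
def extract_sentinel(text):
    """Mimics the current behavior: returns the string "None" on failure."""
    return "None"

def extract_optional(text):
    """Suggested behavior: returns Python's None on failure."""
    return None

# The sentinel string is truthy, so a natural success check silently passes:
looks_found = bool(extract_sentinel("no code here"))          # True, despite failure
really_found = extract_optional("no code here") is not None   # False, as expected
print(looks_found, really_found)
```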


127-130: Hardcoded path /lean4/my_project/ may limit portability.

The error pattern assumes Lean4 errors come from a specific /lean4/my_project/ path. This works in a containerized environment but may fail in other setups. Consider making the path prefix configurable or using a more flexible pattern if this utility needs to work outside the expected container environment.
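The suggested change could look roughly like this. The parameter name and exact pattern are assumptions modeled on the `/lean4/my_project/` pattern described above, not the actual code:

```python
import re

def build_error_pattern(path_prefix="/lean4/my_project/"):
    """Compile the Lean error-extraction regex for a configurable project path.

    path_prefix is a hypothetical parameter; the default mirrors the
    hardcoded container path discussed above.
    """
    escaped = re.escape(path_prefix)
    return re.compile(
        rf"({escaped}.*?:\d+:\d+: error:.*?)(?=\n{escaped}|\Z)",
        re.DOTALL,
    )

log = "/lean4/my_project/Main.lean:3:10: error: unknown identifier 'foo'"
print(build_error_pattern().findall(log))  # matches the full error message
```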

nemo_skills/pipeline/prover.py (1)

253-262: Variable shadowing: chunk_ids reused in loop.

At line 254, chunk_ids from the outer scope (computed at line 198-200) is shadowed by iterating over remaining_jobs.items(). This works but could cause confusion. Consider renaming:

-        for seed_idx, (seed, chunk_ids) in enumerate(remaining_jobs.items()):
+        for seed_idx, (seed, seed_chunk_ids) in enumerate(remaining_jobs.items()):
             if wandb_parameters:
                 # no need for chunks as it will run after merging
                 wandb_parameters["samples_file"] = pipeline_utils.get_chunked_rs_filename(
                     output_dir,
                     random_seed=seed,
                     chunk_id=None,
                 )
-            for chunk_id in chunk_ids:
+            for chunk_id in seed_chunk_ids:
nemo_skills/inference/prover.py (3)

104-106: Add validation for tokenizer path.

If both cfg.tokenizer is None and cfg.server.get("model") returns None, AutoTokenizer.from_pretrained(None) will raise an unclear error. Consider adding explicit validation.

         # Initialize tokenizer for chat template application
         tokenizer_path = self.cfg.tokenizer or self.cfg.server.get("model")
+        if tokenizer_path is None:
+            raise ValueError("tokenizer or server.model must be specified for chat template application")
         self.hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

286-304: Dead code: unreachable else branch.

The else branch at lines 299-304 is unreachable because the preceding conditions (code == "None" and "**Error**" in full_code) are exhaustive for the outer if condition.

             if code == "None" or "**Error**" in full_code:
                 if code == "None":
                     execution_result = {
                         "process_status": "failed",
                         "stderr": "",
                         "stdout": "Parsing error. Cannot parse the code from output. Please try again and write the code in the format of ```lean4\n<code>\n```",
                     }
-                elif "**Error**" in full_code:
+                else:  # "**Error**" in full_code
                     execution_result = {
                         "process_status": "failed",
                         "stderr": "",
                         "stdout": full_code,
                     }
-                else:
-                    execution_result = {
-                        "process_status": "failed",
-                        "stderr": "",
-                        "stdout": "Unknown error when parsing code.",
-                    }

312-313: Consider making sandbox timeout configurable.

The timeout of 600.0 seconds is hardcoded. For long-running proofs or faster iteration during development, this could be a configuration parameter.

 @nested_dataclass(kw_only=True)
 class ProverConfig(GenerateSolutionsConfig):
     max_tokens: int = 40960  # model max tokens
     n_pass: int = 1  # number of passes to run the prover
+    sandbox_timeout: float = 600.0  # timeout for Lean4 code execution in seconds

Then use self.cfg.sandbox_timeout instead of the hardcoded value.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cc1dcc and b5d527b.

📒 Files selected for processing (9)
  • nemo_skills/inference/factory.py (1 hunks)
  • nemo_skills/inference/lean4_utils.py (1 hunks)
  • nemo_skills/inference/prover.py (1 hunks)
  • nemo_skills/pipeline/prover.py (1 hunks)
  • nemo_skills/prompt/config/lean4/formal-proof-deepseek-prover-v2-nemotron.yaml (1 hunks)
  • nemo_skills/prompt/config/lean4/goedel-prover-v2-nemotron.yaml (1 hunks)
  • nemo_skills/prompt/config/lean4/goedel-prover-v2-refinement-nemotron.yaml (1 hunks)
  • nemo_skills/prompt/config/lean4/goedel-prover-v2-refinement.yaml (1 hunks)
  • nemo_skills/prompt/config/lean4/goedel-prover-v2.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/inference/prover.py (5)
nemo_skills/inference/model/base.py (1)
  • EndpointType (38-41)
nemo_skills/prompt/utils.py (1)
  • get_prompt (410-443)
nemo_skills/utils.py (4)
  • get_help_message (341-361)
  • get_logger_name (39-43)
  • nested_dataclass (69-102)
  • parse_reasoning (49-66)
nemo_skills/pipeline/prover.py (1)
  • generate (39-328)
nemo_skills/inference/lean4_utils.py (5)
  • extract_code (101-117)
  • get_error_str (149-205)
  • parse_error (120-146)
  • refine_by_sorry (80-98)
  • replace_statement_in_proof (64-77)
nemo_skills/inference/lean4_utils.py (2)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
nemo_skills/inference/patch_litellm_logging.py (1)
  • start (37-38)
nemo_skills/pipeline/prover.py (10)
nemo_skills/dataset/utils.py (1)
  • import_from_path (60-67)
nemo_skills/inference/factory.py (1)
  • GenerationType (18-22)
nemo_skills/pipeline/app.py (1)
  • typer_unpacker (25-53)
nemo_skills/utils.py (2)
  • compute_chunk_ids (456-484)
  • str_ids_to_list (428-453)
nemo_skills/pipeline/utils/server.py (2)
  • should_get_random_port (62-63)
  • get_server_command (114-227)
nemo_skills/pipeline/utils/cluster.py (1)
  • get_cluster_config (314-368)
nemo_skills/pipeline/utils/mounts.py (1)
  • resolve_mount_paths (317-359)
nemo_skills/inference/generate.py (2)
  • get_generation_default_args (243-251)
  • get_server_command_fn (254-264)
nemo_skills/pipeline/utils/generation.py (1)
  • get_remaining_jobs (60-175)
nemo_skills/pipeline/utils/exp.py (3)
  • get_exp (753-769)
  • add_task (416-702)
  • run_exp (705-750)
🪛 Ruff (0.14.7)
nemo_skills/inference/prover.py

75-75: Avoid specifying long messages outside the exception class

(TRY003)


83-83: Avoid specifying long messages outside the exception class

(TRY003)


86-86: Avoid specifying long messages outside the exception class

(TRY003)


114-114: Unused method argument: data

(ARG002)


119-119: Avoid specifying long messages outside the exception class

(TRY003)


205-205: Unused method argument: data

(ARG002)


268-273: Consider iterable unpacking instead of concatenation

(RUF005)


384-384: Loop control variable i not used within loop body

(B007)

nemo_skills/pipeline/prover.py

58-58: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


69-69: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


110-112: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


160-162: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


191-191: Avoid specifying long messages outside the exception class

(TRY003)


221-221: Avoid specifying long messages outside the exception class

(TRY003)


230-233: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (16)
nemo_skills/inference/factory.py (1)

22-22: LGTM!

The new prover enum member and module mapping follow the established patterns for other generation types in this file.

Also applies to: 29-29

nemo_skills/prompt/config/lean4/goedel-prover-v2.yaml (1)

19-27: LGTM!

The prompt configuration is well-structured with a clear workflow: present the problem, request a proof plan, then generate the formal proof. The {problem} placeholder follows the expected templating convention.

nemo_skills/prompt/config/lean4/goedel-prover-v2-refinement.yaml (1)

19-24: Verify: Missing {proof_attempt} placeholder.

This refinement prompt includes {error_message} but not {proof_attempt}. Compare with goedel-prover-v2-refinement-nemotron.yaml which includes both placeholders. If the proof attempt is meant to be provided via conversation history rather than template substitution, this is fine. Otherwise, users won't see the original proof that failed.

nemo_skills/prompt/config/lean4/goedel-prover-v2-nemotron.yaml (1)

19-28: LGTM!

The nemotron variant provides clear two-phase instructions (proof plan followed by code block) with explicit output formatting requirements. This aligns well with the extract_code function in lean4_utils.py, which expects code fenced with ```lean4 markers.

nemo_skills/prompt/config/lean4/goedel-prover-v2-refinement-nemotron.yaml (1)

19-28: LGTM!

The refinement prompt correctly includes both the previous {proof_attempt} and {error_message} placeholders, giving the model full context for correction. The explicit output formatting aligns with the code extraction logic.

nemo_skills/prompt/config/lean4/formal-proof-deepseek-prover-v2-nemotron.yaml (1)

19-29: LGTM!

The deepseek-style prompt uses appropriate placeholders ({header}, {informal_prefix}, {formal_statement}) with a sorry placeholder indicating where the proof needs to be completed. The two-phase instruction structure is consistent with other nemotron variants.

nemo_skills/pipeline/prover.py (2)

222-223: Default to GenerationType.prover for this prover-specific command?

The fallback defaults to GenerationType.generate when neither generation_module nor generation_type is specified. Since this is prover.py, consider defaulting to GenerationType.prover instead:

     if generation_module is None:
-        generation_module = GENERATION_MODULE_MAP[generation_type or GenerationType.generate]
+        generation_module = GENERATION_MODULE_MAP[generation_type or GenerationType.prover]

Alternatively, if this is intentionally a generic generation pipeline, the file might be better named to reflect that generality.


37-163: Static analysis notes: B008 and TRY003 are acceptable here.

The Ruff B008 warnings about typer.Option calls in default arguments are false positives - this is the standard Typer pattern required for CLI option definitions. The TRY003 warnings about long exception messages are minor style preferences that don't affect functionality.

nemo_skills/inference/lean4_utils.py (1)

182-187: Potential issue with loop variable j scope.

At line 186, j is used after the for-loop to calculate leading_spaces, but if end_line <= start_line + 1, the loop body never executes and j is never bound (raising UnboundLocalError, a subclass of NameError). This edge case could occur when the error spans exactly two lines.

             else:
                 show_line = 6
+                j = start_line  # Initialize in case loop doesn't execute
                 for j in range(start_line + 1, min(end_line, start_line + show_line)):
                     error_code += f"{code_lines[j]}\n"
                 if end_line > start_line + show_line:
                     leading_spaces = len(code_lines[j]) - len(code_lines[j].lstrip(" "))

Likely an incorrect or invalid review comment.
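Whether or not the flagged path is actually reachable in the PR, the underlying Python semantics are easy to check in isolation. The names below are illustrative stand-ins for the pattern described above, not the actual code:

```python
def leading_spaces_after(lines, start_line, end_line, show_line=6):
    """Sketch of the flagged pattern: the loop variable j is read after the loop."""
    for j in range(start_line + 1, min(end_line, start_line + show_line)):
        _ = lines[j]  # stand-in for building error_code
    # If the range above was empty, j was never bound and this raises
    # UnboundLocalError (a subclass of NameError):
    return len(lines[j]) - len(lines[j].lstrip(" "))

print(leading_spaces_after(["  a", "    b", "  c"], 0, 3))  # → 2
try:
    leading_spaces_after(["  a", "  b"], 0, 1)  # range(1, 1) is empty
except NameError:
    print("j was never defined")
```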

nemo_skills/inference/prover.py (7)

15-51: Imports and module setup look appropriate.

The imports are well-organized and the reasoning_effort_list constant is appropriately scoped at module level for use in the adaptive reasoning logic.


114-123: Method overrides are appropriate.

The empty log_example_prompt and the data parameter are inherited interface requirements. The setup_llm correctly differentiates between generic code execution (rejected) and Lean4-specific sandbox usage.


150-188: Adaptive reasoning logic is well-structured.

The method correctly handles the inference config as both dataclass and dict, overrides endpoint type for pre-templated prompts, and implements a proper fallback loop for adaptive reasoning. The loop properly bounds at reasoning_effort_index > 0.


379-394: pass_at_N logic is correct.

The loop variable i is used after the loop to compute n_pass count. The results_dict correctly captures the last attempt's results. The static analysis warning about unused i is a false positive.


396-428: Entry points and exports are well-structured.

The process_single_datapoint properly wraps results, GENERATION_TASK_CLASS enables factory pattern registration, and the Hydra main provides proper CLI integration with help message support.


54-90: ProverConfig validation logic is sound.

The configuration class properly extends GenerateSolutionsConfig and validates Lean4-specific parameters. The _get_disallowed_params() method is correctly inherited from the parent class, and the validation in _post_init_validate_params ensures proper configuration state.


138-148: No action needed—tokenizer is not required for refinement prompts.

Both refine_prompt and nemotron_refine_prompt are only used via the fill() method, which operates correctly without a tokenizer by returning a list of message dictionaries. The main prompt uses tokenizer for string formatting via apply_chat_template, while refinement prompts use list-based message format. This is an intentional design pattern, not an oversight.

Comment on lines +205 to +212
async def _single_data_point_generate(self, data_point, data):
formal_statement = (
(data_point["header"].strip() + "\n")
+ data_point["informal_prefix"].strip()
+ ("\n" + data_point["formal_statement"].strip())
)
formal_statement = refine_by_sorry(formal_statement)
prompt_turn_list = self.prompt.fill({"problem": formal_statement.strip()})
Contributor


⚠️ Potential issue | 🔴 Critical

Potential AttributeError when prompt_format == "openai".

setup_prompt() returns None when prompt_format == "openai" (line 127), but _single_data_point_generate calls self.prompt.fill() unconditionally at line 212. This will cause an AttributeError.

Either guard the prompt access or ensure _post_init_validate_params enforces that prompt_format != "openai" when this code path is used. If OpenAI format is not intended for this prover workflow, consider failing earlier with a clear message.

🧰 Tools
🪛 Ruff (0.14.7)

205-205: Unused method argument: data

(ARG002)

🤖 Prompt for AI Agents
In nemo_skills/inference/prover.py around lines 205 to 212, self.prompt may be
None when setup_prompt() returned None for prompt_format == "openai", but
_single_data_point_generate calls self.prompt.fill() unconditionally which will
raise an AttributeError; fix by either (A) adding a guard at the start of
_single_data_point_generate that checks if self.prompt is None and raises a
clear, early ValueError/RuntimeError describing that this prover does not
support prompt_format "openai" (or falls back to an alternative prompt
creation), or (B) enforce during initialization in _post_init_validate_params
that prompt_format != "openai" (raise a clear exception at init time) so
self.prompt is guaranteed non-None when this method runs; implement one of these
two fixes and include a descriptive error message for easier debugging.

Comment on lines +308 to +314
feedback = self.refine_prompt.fill({"error_message": last_error_message})
results_dict["feedback"] = feedback[0]["content"]
else:
# execute_code returns (result_dict, session_id) tuple
execution_result, _ = await self.sandbox.execute_code(
full_code, language="lean4", timeout=600.0, max_output_characters=1000000
)
Contributor


⚠️ Potential issue | 🟠 Major

Missing null checks for refine_prompt and sandbox.

  1. self.refine_prompt is accessed at line 308, but it's only initialized when cfg.refinement is True (line 108-109). If refinement is disabled but this code path is reached (parsing error), this will raise AttributeError.

  2. self.sandbox.execute_code() is called at line 312, but self.sandbox can be None if cfg.sandbox is None (line 121).

Consider adding guards or ensuring these paths are only reachable when the required objects are initialized:

+            if not self.cfg.refinement or self.refine_prompt is None:
+                results_dict["feedback"] = None
+            else:
                 feedback = self.refine_prompt.fill({"error_message": last_error_message})
                 results_dict["feedback"] = feedback[0]["content"]

For the sandbox, consider failing early if sandbox is required but not configured.

🤖 Prompt for AI Agents
In nemo_skills/inference/prover.py around lines 308 to 314, add null checks
before using self.refine_prompt and self.sandbox: guard access to
self.refine_prompt before calling fill() (if refinement is disabled, either skip
setting feedback or raise a clear RuntimeError/ValueError indicating refinement
is required for this code path), and validate self.sandbox before calling
execute_code (if sandbox is None and execution is required, raise a clear
configuration error early with a descriptive message); ensure these guards
produce deterministic behavior (either skip refinement safely or fail fast) so
AttributeError/NoneType errors cannot occur.

Comment on lines +164 to +168
"""Generate LLM completions for a given input file.

Run `python -m nemo_skills.inference.generate --help` for other supported arguments
(need to be prefixed with ++, since we use Hydra for that script).
"""
Contributor


⚠️ Potential issue | 🟡 Minor

Docstring references incorrect module.

The docstring mentions nemo_skills.inference.generate but this is prover.py. Consider updating the help reference:

     """Generate LLM completions for a given input file.
 
-    Run `python -m nemo_skills.inference.generate --help` for other supported arguments
+    Run `python -m nemo_skills.inference.prover --help` for other supported arguments
     (need to be prefixed with ++, since we use Hydra for that script).
     """
🤖 Prompt for AI Agents
In nemo_skills/pipeline/prover.py around lines 164 to 168, the docstring
incorrectly references nemo_skills.inference.generate; update the help reference
to point to the correct module/script for this file (e.g.,
nemo_skills.pipeline.prover or the actual entrypoint that runs this prover) and
adjust the example command accordingly (preserve the note about ++-prefixed
Hydra args if applicable). Replace the outdated module name with the correct
module/CLI name and ensure the surrounding text remains accurate and concise.


@gwarmstrong gwarmstrong left a comment


@stephencge is it possible to rewrite this to use the ns generate pipeline and work via a generation module, rather than a standalone pipeline and inference script?

@gwarmstrong
Collaborator

rewrite this to use the ns generate pipeline

I see--this actually uses the generate pipeline already, but pipeline/prover.py is probably extraneous


@gwarmstrong gwarmstrong left a comment


Requesting changes for concurrency control, as well as a variety of housekeeping items.

GenerationType.generate: "nemo_skills.inference.generate",
GenerationType.math_judge: "nemo_skills.inference.llm_math_judge",
GenerationType.check_contamination: "nemo_skills.inference.check_contamination",
GenerationType.prover: "nemo_skills.inference.prover",
Collaborator


Do you think this warrants its own generation type? I worry that having this as an option may be confusing because (a) as far as I understand, this is really only for formal proofs, and (b) this is a fairly specific proving workflow that might not apply to all proving. If we give it a generic type name like "prover", people may turn it on when they are doing proof generation that doesn't necessarily need this specific workflow.

If we remove from here, you just have to call the generation module specifically, e.g.,

generate(
  ...
  generation_module="nemo_skills.inference.prover",
  ...
)

Collaborator


yeah, we don't need it in main script, let's just use as an explicit option when calling eval

Collaborator Author


removing from factory.py

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Collaborator


Can we move to a file like nemo_skills/code_execution/proof_utils.py?

@@ -0,0 +1,205 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Collaborator


Suggested change
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

Comment on lines +122 to +125
# warning_pattern = re.compile(
# r"(/lean4/my_project/.*?:\d+:\d+: warning:.*?)(?=\n/lean4/my_project|\Z)",
# re.DOTALL,
# )
Collaborator


Suggested change
# warning_pattern = re.compile(
# r"(/lean4/my_project/.*?:\d+:\d+: warning:.*?)(?=\n/lean4/my_project|\Z)",
# re.DOTALL,
# )

re.DOTALL,
)
# Find all warnings and errors
# warnings = warning_pattern.findall(log_string)
Collaborator


Suggested change
# warnings = warning_pattern.findall(log_string)

llm = get_model(**self.cfg.server, tokenizer=self.tokenizer)
return llm

def setup_prompt(self):
Collaborator


to me, this looks the same as the super method, can you confirm and delete if redundant?

generation_params["endpoint_type"] = EndpointType.text
for key, value in kwargs.items():
generation_params[key] = value
generation = await self.llm.generate_async(**generation_params)
Collaborator

It looks like we're currently missing any kind of concurrency control. I think if you're working with large datasets, this could put huge memory and network pressure on the nemo-skills process as well as the server.

We have built-in functionality in GenerationTask to handle this, called generate_with_semaphore:

async def generate_with_semaphore(self, **generation_params):
    """Generate with semaphore control.

    Ensures no more than max_concurrent_requests LLM calls can be run at the same time.
    Should work even if process_single_datapoint is doing multiple requests in parallel
    as long as those requests also use this function.
    """
    async with self.semaphore:
        return await self.llm.generate_async(**generation_params)

I fear that that implementation might not be totally right for you though, since you make multiple LLM calls per sample. You would end up round-robining across all samples, rather than running some samples to completion before starting the next one... so it's probably better to roll your own concurrency control. I would take a similar approach to generate_with_semaphore, but acquire the semaphore in process_single_datapoint -- this is a small quality-of-life improvement in case there are any failures during generation, since you will have more completed generations saved.
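For illustration, a minimal self-contained sketch of that suggestion (the class, `fake_llm_call`, and the counters are hypothetical stand-ins for the real GenerationTask API): the semaphore is held around the whole per-sample coroutine, so each sample runs all of its LLM calls to completion before releasing its slot.

```python
import asyncio

MAX_CONCURRENT = 2

async def fake_llm_call(x):
    # Hypothetical stand-in for self.llm.generate_async(...)
    await asyncio.sleep(0.01)
    return x * 2

class ProverSketch:
    def __init__(self, max_concurrent):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.peak = 0        # highest number of samples in flight at once
        self._active = 0

    async def process_single_datapoint(self, data_point):
        # Hold the semaphore for the full lifetime of this sample, even
        # though it makes multiple LLM calls internally.
        async with self.semaphore:
            self._active += 1
            self.peak = max(self.peak, self._active)
            try:
                first = await fake_llm_call(data_point)
                return await fake_llm_call(first)
            finally:
                self._active -= 1

async def main():
    prover = ProverSketch(MAX_CONCURRENT)
    results = await asyncio.gather(*(prover.process_single_datapoint(i) for i in range(5)))
    return results, prover.peak

results, peak = asyncio.run(main())
print(results)  # [0, 4, 8, 12, 16]
```

With `asyncio.Semaphore(2)`, at most two samples are ever in flight, while `gather` still returns results in input order.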

Collaborator

Actually--after seeing that pass_at_N is independent, you probably want to do it around _single_data_point_generate.


new_results_dict = {"success": False}
for i in range(N):
results_dict = await self._single_data_point_generate(data_point, data)
Collaborator

These are independent, but they are run serially. Use a loop like this to run them concurrently:

tasks = []
for data_point in remaining_data_points:
    task = asyncio.create_task(self._generate_and_save_datapoint(data_point, data, fout, pbar))
    tasks.append(task)
# Wait for all tasks to complete
if tasks:
    await asyncio.gather(*tasks)
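As a self-contained illustration of that loop, here is a runnable sketch where `_generate_and_save_datapoint` is a hypothetical stand-in that appends to a shared list instead of writing to `fout`:

```python
import asyncio

async def _generate_and_save_datapoint(data_point, results):
    # Hypothetical stand-in: "generate" a result and "save" it immediately,
    # so partial progress survives even if a later task fails.
    await asyncio.sleep(0.01)
    results.append(data_point + 1)

async def main():
    remaining_data_points = [0, 1, 2, 3]
    results = []
    tasks = []
    for data_point in remaining_data_points:
        # Schedule every data point up front instead of awaiting serially.
        tasks.append(asyncio.create_task(_generate_and_save_datapoint(data_point, results)))
    # Wait for all tasks to complete
    if tasks:
        await asyncio.gather(*tasks)
    return results

results = asyncio.run(main())
print(sorted(results))  # [1, 2, 3, 4]
```

Because the tasks are created before any are awaited, they overlap; results arrive in completion order, which is why the "save" happens inside each task rather than after the gather.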

}
)

async def _single_data_point_generate(self, data_point, data):
Collaborator

Per this comment, wrap this whole method body in `async with self.semaphore:`: https://github.com/NVIDIA-NeMo/Skills/pull/1078/files#r2593972374

You can make a helper to not have to indent this whole block if you want.
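One hedged sketch of such a helper (all names here are hypothetical): keep the real per-sample logic in an unindented `_impl` method, and make `_single_data_point_generate` a thin throttled wrapper.

```python
import asyncio

class TaskSketch:
    def __init__(self, max_concurrent):
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def _single_data_point_generate_impl(self, data_point):
        # The real per-sample logic lives here, without the extra indent.
        await asyncio.sleep(0.01)
        return {"success": True, "data_point": data_point}

    async def _single_data_point_generate(self, data_point):
        # Thin wrapper: the whole method body runs under the semaphore.
        async with self.semaphore:
            return await self._single_data_point_generate_impl(data_point)

async def main():
    task = TaskSketch(max_concurrent=2)
    return await asyncio.gather(*(task._single_data_point_generate(i) for i in range(3)))

results = asyncio.run(main())
print([r["data_point"] for r in results])  # [0, 1, 2]
```

Callers only ever see the wrapper, so throttling stays in one place while the implementation body keeps its normal indentation.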


def _transform_for_nemotron_refinement(self, proof_attempt: str, error_message: str) -> list[dict]:
"""Transform multi-turn refinement into single-turn nemotron-style prompt."""
return self.nemotron_refine_prompt.fill(
Collaborator

Consider whether there are any config args you want to pass through here (or for the other prompts, e.g., turning on thinking in the chat template):

filled_prompt = self.prompt.fill(
    data_point,
    start_assistant_response_key=self.cfg.start_assistant_response_key,
    chat_template_kwargs=self.cfg.chat_template_kwargs,
    format_as_string=(self.cfg.inference.endpoint_type == EndpointType.text),
)

I realize it could get tricky here since there are multiple prompts--might have to give each of them separate args.

Collaborator

@Kipok Kipok left a comment

If it takes too long to fix the issues, we can also merge an intermediate version early next week and refine later, as long as it doesn't break any other functionality. But this is a little time-sensitive to get in.

GenerationType.generate: "nemo_skills.inference.generate",
GenerationType.math_judge: "nemo_skills.inference.llm_math_judge",
GenerationType.check_contamination: "nemo_skills.inference.check_contamination",
GenerationType.prover: "nemo_skills.inference.prover",
Collaborator

Yeah, we don't need it in the main script; let's just use it as an explicit option when calling eval.

Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Collaborator

@gwarmstrong gwarmstrong left a comment

I have a few unresolved comments about concurrency behavior. They don't need to be addressed now; there is an issue to track them in #1083.

@gwarmstrong gwarmstrong enabled auto-merge (squash) December 11, 2025 01:42
@gwarmstrong gwarmstrong merged commit 0bbba89 into main Dec 11, 2025
5 checks passed
@gwarmstrong gwarmstrong deleted the stepheng/prover-cleanup branch December 11, 2025 01:59
gwarmstrong added a commit that referenced this pull request Dec 11, 2025
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 12, 2025
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>