Skip to content

Fix run.Script refactor#1133

Merged
gwarmstrong merged 4 commits intomainfrom
georgea/fix-run-script-refactor
Jan 6, 2026
Merged

Fix run.Script refactor#1133
gwarmstrong merged 4 commits intomainfrom
georgea/fix-run-script-refactor

Conversation

@gwarmstrong
Copy link
Collaborator

@gwarmstrong gwarmstrong commented Dec 18, 2025

Re-applies #1052 with fixes for multi-node

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for running generation tasks with multiple models simultaneously, enabling coordinated multi-model workflows through a single entry point.
  • Improvements

    • Enhanced GPU test reliability with heartbeat monitoring to track long-running test progress.
  • Bug Fixes

    • Updated dataset evaluation to use correct dataset identifier in test suite.

✏️ Tip: You can customize this high-level summary in your review settings.

This reverts commit 1c0722a.

FIX multi-node pipeline creation

Signed-off-by: George Armstrong <georgea@nvidia.com>

remove hosntame ref change

Signed-off-by: George Armstrong <georgea@nvidia.com>

make param span_group_nodes

Signed-off-by: George Armstrong <georgea@nvidia.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 18, 2025

📝 Walkthrough

Walkthrough

This PR introduces a comprehensive refactor of the NeMo Skills pipeline architecture, replacing string-based command handling with Script objects (BaseJobScript, ServerScript, SandboxScript, GenerationClientScript, EvaluatorClientScript) and adding multi-model generation support. The generate pipeline is restructured via _create_job_unified to handle per-model server configurations and cross-model references. The evaluator pipeline adopts the new Script-based serving and client orchestration. Declarative job planning is updated to work with Script objects and support heterogeneous multi-group execution. GPU test workflow is enhanced with a heartbeat process, and dataset exclusions are updated.

Changes

Cohort / File(s) Summary
GPU Tests Infrastructure
.github/workflows/gpu_tests.yml
GPU test runner now launches a background heartbeat process during test execution, captures its PID, and ensures cleanup and proper exit code propagation.
Script Base Classes
nemo_skills/pipeline/utils/scripts.py
New module introducing Script wrapper hierarchy: BaseJobScript (heterogeneous SLURM support, hostname resolution, installation command wrapping), ServerScript (port allocation, server command building, address exposition), SandboxScript (sandbox command/environment wrapping), GenerationClientScript (lazy runtime command building with cross-component references).
Multi-Model Generation Refactor
nemo_skills/pipeline/generate.py
Replaces _create_commandgroup_from_config with _create_job_unified supporting multi-model setups via list-type parameters (model, server_address, server_type, server_gpus, server_nodes, server_args, server_entrypoint, server_container). Adds per-model server/sandbox/client component composition, parameter normalization, and returns CommandGroup list.
Script-Based Evaluator Pipeline
nemo_skills/pipeline/nemo_evaluator.py
Replaces vLLM server command logic with ServerScript-backed serving. Introduces EvaluatorClientScript for dynamic runtime URL resolution. Updates internal functions to accept plain dicts instead of OmegaConf DictConfig for serialization compatibility.
Script-Based Declarative Job Planning
nemo_skills/pipeline/utils/declarative.py
Replaces Command.command (string/callable) with Command.script (Script object). Updates prepare_for_execution to return Script objects. Adds local path rewriting for executor="none". Introduces per-group job naming and HardwareConfig.num_tasks field. Refactors job planning to operate on Script objects throughout.
Generation Utilities
nemo_skills/pipeline/utils/generation.py
Adds normalize_models_config and normalize_parameter functions for multi-model parameter handling. Extends get_generation_cmd with server_addresses, model_names, server_types parameters for multi-model Hydra overrides.
Pipeline Utils Exports
nemo_skills/pipeline/utils/__init__.py
Exports new public functions: normalize_models_config and normalize_parameter.
Dataset Test Update
tests/gpu-tests/test_eval.py
Replaces "aalcr" with "mrcr" in EXCLUDED_DATASETS.
Test Updates
tests/test_declarative_pipeline.py, tests/test_generation.py, tests/test_nemo_evaluator_pipeline.py
Introduce DummyScript and update test expectations to verify Script instances and attributes instead of string commands. Replace assertions on metadata/commands with Script type and field checks (num_tasks, num_gpus, port, log_prefix). Update calls to _create_job_unified. Validate cross-component script references.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Script inheritance hierarchy and runtime resolution: BaseJobScript's hostname_ref and port allocation mechanism across heterogeneous jobs; tight interdependencies between ServerScript, SandboxScript, and GenerationClientScript for cross-component reference resolution
  • Multi-model parameter normalization and grouping: normalize_models_config and normalize_parameter logic; _create_job_unified's per-model server/client composition and CommandGroup aggregation
  • Declarative job planning refactor: Conversion from string commands to Script objects; path rewriting for executor="none"; two-pass planning loop and per-group job naming; HardwareConfig.num_tasks integration
  • Test assertion updates: Multiple test files heavily rewritten with new Script-based expectations; verify proper migration of all test scenarios
  • Cross-file integration: Changes span generate.py, nemo_evaluator.py, and declarative.py with cascading structural dependencies

Possibly related PRs

  • Use run.Script for generate pipeline #1052: Directly related—both PRs introduce the same Script-based architecture (ServerScript, SandboxScript, GenerationClientScript/EvaluatorClientScript), refactor generate/nemo_evaluator to multi-model support, and update declarative Command to use run.Script
  • Revert "Use run.Script for generate pipeline (#1052)" #1125: Directly related—reverts this PR's core Script-based refactor and multi-model changes, undoing the run.Script architecture and restoring string-based command handling
  • Evaluation on OJBench #848: Related at code level—modifies nemo_skills/pipeline/generate.py to add keep_mounts_for_sandbox parameter, which overlaps with this PR's generate.py refactoring

Suggested reviewers

  • activatedgeek
  • Kipok

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 72.60% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'Fix run.Script refactor' is vague and generic, using a non-descriptive term 'Fix' without clarifying what specific issue is being addressed or what the main change involves. Provide a more specific title that describes the actual changes or problem being fixed, such as 'Refactor Command and Script handling for multi-model pipeline support' or 'Update Script-based command execution for declarative pipeline.'
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch georgea/fix-run-script-refactor

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (5)
nemo_skills/pipeline/utils/scripts.py (1)

283-293: Consider validating env_overrides format.

The parsing at line 288-289 assumes each override contains =. If a malformed entry is passed (e.g., "INVALID"), split("=", 1) returns a single-element list, causing key, value = ... to raise ValueError.

🔎 Proposed defensive handling
             if self.env_overrides:
                 for override in self.env_overrides:
-                    key, value = override.split("=", 1)
-                    env[key] = value
+                    if "=" not in override:
+                        raise ValueError(f"Invalid env_override format '{override}': expected KEY=VALUE")
+                    key, value = override.split("=", 1)
+                    env[key] = value
tests/test_nemo_evaluator_pipeline.py (1)

20-27: Minor: Consolidate imports from the same module.

The imports from nemo_skills.pipeline.nemo_evaluator are split across two separate import statements unnecessarily.

🔎 Suggested consolidation
-from nemo_skills.pipeline.nemo_evaluator import (
-    EvaluatorClientScript,
-)
-from nemo_skills.pipeline.nemo_evaluator import (
-    nemo_evaluator as nemo_evaluator_fn,
-)
+from nemo_skills.pipeline.nemo_evaluator import (
+    EvaluatorClientScript,
+    nemo_evaluator as nemo_evaluator_fn,
+)
nemo_skills/pipeline/generate.py (2)

95-95: Unused loop variable model_path.

The variable model_path is defined in the loop but not used. Since the model path is obtained from server_config["model_path"] on line 111, this outer variable is redundant.

🔎 Suggested fix
-    for model_idx, (model_path, server_config) in enumerate(zip(models, server_configs)):
+    for model_idx, (_, server_config) in enumerate(zip(models, server_configs, strict=True)):

91-94: Consider adding strict=True to zip() for safety.

If models and server_configs have mismatched lengths, zip() will silently truncate to the shorter list, potentially causing confusing behavior. Adding strict=True will raise a ValueError for length mismatches.

🔎 Suggested fix
-    for model_idx, (model_path, server_config) in enumerate(zip(models, server_configs)):
+    for model_idx, (_, server_config) in enumerate(zip(models, server_configs, strict=True)):
nemo_skills/pipeline/utils/declarative.py (1)

223-256: Unused cluster_config parameter.

The cluster_config parameter is declared but never used in the method body. This appears to be a remnant from the refactoring. Consider removing it if not needed, or adding a comment if reserved for future use.

🔎 Suggested fix
-    def prepare_for_execution(self, cluster_config: Dict) -> Tuple[run.Script, Dict]:
-        """Prepare script for execution.
-
-        This method:
-        1. Evaluates lazy commands (if script.inline is callable)
-        2. Builds execution config from Script fields
-
-        Returns:
-            Tuple of (Script_object, execution_config)
-        """
+    def prepare_for_execution(self) -> Tuple[run.Script, Dict]:
+        """Prepare script for execution.
+
+        This method:
+        1. Evaluates lazy commands (if script.inline is callable)
+        2. Builds execution config from Script fields
+
+        Returns:
+            Tuple of (Script_object, execution_config)
+        """

If removing the parameter, also update the call site at line 501:

-                script, exec_config = self._prepare_command(command, cluster_config)
+                script, exec_config = self._prepare_command(command)

And _prepare_command at line 495:

-    def _prepare_command(self, command, cluster_config: Dict) -> Tuple[run.Script, Dict]:
+    def _prepare_command(self, command) -> Tuple[run.Script, Dict]:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c01c0a7 and 4459850.

📒 Files selected for processing (11)
  • .github/workflows/gpu_tests.yml (1 hunks)
  • nemo_skills/pipeline/generate.py (9 hunks)
  • nemo_skills/pipeline/nemo_evaluator.py (8 hunks)
  • nemo_skills/pipeline/utils/__init__.py (1 hunks)
  • nemo_skills/pipeline/utils/declarative.py (11 hunks)
  • nemo_skills/pipeline/utils/generation.py (5 hunks)
  • nemo_skills/pipeline/utils/scripts.py (1 hunks)
  • tests/gpu-tests/test_eval.py (1 hunks)
  • tests/test_declarative_pipeline.py (25 hunks)
  • tests/test_generation.py (2 hunks)
  • tests/test_nemo_evaluator_pipeline.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/test_declarative_pipeline.py (2)
nemo_skills/pipeline/utils/scripts.py (2)
  • set_inline (104-106)
  • hostname_ref (108-122)
nemo_skills/pipeline/utils/declarative.py (4)
  • Command (211-259)
  • prepare_for_execution (223-256)
  • get_name (258-259)
  • run (356-493)
nemo_skills/pipeline/utils/__init__.py (1)
nemo_skills/pipeline/utils/generation.py (2)
  • normalize_models_config (30-59)
  • normalize_parameter (62-102)
nemo_skills/pipeline/utils/scripts.py (6)
nemo_skills/pipeline/utils/commands.py (1)
  • sandbox_command (77-111)
nemo_skills/pipeline/utils/exp.py (1)
  • install_packages_wrap (368-408)
nemo_skills/pipeline/utils/generation.py (1)
  • get_generation_cmd (360-494)
nemo_skills/pipeline/utils/server.py (2)
  • get_free_port (43-59)
  • get_server_command (114-227)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
tests/test_declarative_pipeline.py (2)
  • set_inline (39-40)
  • hostname_ref (42-45)
nemo_skills/pipeline/generate.py (4)
nemo_skills/pipeline/utils/scripts.py (2)
  • GenerationClientScript (297-428)
  • ServerScript (126-223)
nemo_skills/pipeline/utils/declarative.py (3)
  • CommandGroup (273-286)
  • Command (211-259)
  • HardwareConfig (263-270)
nemo_skills/pipeline/utils/server.py (2)
  • SupportedServers (32-40)
  • should_get_random_port (62-63)
nemo_skills/pipeline/utils/generation.py (3)
  • normalize_models_config (30-59)
  • normalize_parameter (62-102)
  • configure_client (519-579)
tests/test_generation.py (2)
nemo_skills/pipeline/generate.py (2)
  • generate (208-645)
  • _create_job_unified (50-203)
nemo_skills/pipeline/utils/scripts.py (1)
  • ServerScript (126-223)
tests/test_nemo_evaluator_pipeline.py (3)
nemo_skills/pipeline/nemo_evaluator.py (2)
  • nemo_evaluator (113-421)
  • EvaluatorClientScript (732-781)
nemo_skills/pipeline/utils/declarative.py (2)
  • Command (211-259)
  • CommandGroup (273-286)
nemo_skills/pipeline/utils/scripts.py (1)
  • ServerScript (126-223)
🪛 Ruff (0.14.8)
tests/test_declarative_pipeline.py

587-587: Probable insecure usage of temporary file or directory: "/tmp/logs"

(S108)


590-590: Probable insecure usage of temporary file or directory: "/tmp/logs"

(S108)


676-676: Probable insecure usage of temporary file or directory: "/tmp/logs"

(S108)

nemo_skills/pipeline/utils/generation.py

50-50: Avoid specifying long messages outside the exception class

(TRY003)


58-58: Avoid specifying long messages outside the exception class

(TRY003)


99-102: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/pipeline/utils/declarative.py

223-223: Unused method argument: cluster_config

(ARG002)

nemo_skills/pipeline/generate.py

95-95: Loop control variable model_path not used within loop body

Rename unused model_path to _model_path

(B007)


95-95: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


234-238: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


239-243: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


244-248: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


249-253: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


254-258: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


259-263: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


264-268: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


422-424: Avoid specifying long messages outside the exception class

(TRY003)

tests/test_generation.py

174-174: Probable insecure usage of temporary file or directory: "/tmp/out"

(S108)


186-186: Probable insecure usage of temporary file or directory: "/tmp/logs"

(S108)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: gpu-tests-qwen
  • GitHub Check: unit-tests
🔇 Additional comments (31)
tests/gpu-tests/test_eval.py (1)

46-46: LGTM!

Dataset exclusion updated from "aalcr" to "mrcr". The change aligns with the stated update to dataset exclusions.

.github/workflows/gpu_tests.yml (1)

55-63: LGTM!

The heartbeat mechanism is a good addition for long-running GPU tests to prevent CI timeouts due to lack of output. The implementation correctly:

  • Captures the heartbeat PID for cleanup
  • Preserves the test script's exit code
  • Gracefully handles heartbeat cleanup with || true
  • Propagates the correct exit status
nemo_skills/pipeline/utils/scripts.py (2)

62-123: LGTM!

BaseJobScript provides a clean abstraction for heterogeneous job support with:

  • Proper handling of both callable and string inline commands in __post_init__
  • Correct use of object.__setattr__ for frozen dataclass mutation
  • Well-documented hostname_ref() with appropriate localhost fallback

125-223: LGTM!

ServerScript correctly encapsulates server configuration with automatic port allocation and proper command building. The get_address() method provides clean cross-component communication support.

nemo_skills/pipeline/utils/__init__.py (1)

52-53: LGTM!

New normalization utilities normalize_models_config and normalize_parameter are correctly exported, providing convenient access to multi-model configuration helpers through the utils package namespace.

tests/test_generation.py (2)

23-24: LGTM!

Imports correctly updated to use the new _create_job_unified API and ServerScript class.


156-193: LGTM!

Test correctly updated to use the new _create_job_unified API with proper validation of:

  • ServerScript instance type
  • num_tasks and num_gpus propagation
  • Hardware configuration alignment

The static analysis hints about /tmp/out and /tmp/logs are false positives for test code where these are placeholder paths for the test configuration, not actual security-sensitive operations.

tests/test_nemo_evaluator_pipeline.py (2)

96-147: Test coverage for external URLs path looks good.

The test properly verifies the Pipeline structure when using external URLs without hosted servers, including checking that the client command uses EvaluatorClientScript.


149-204: Main server hosted test correctly validates ServerScript properties.

The test properly verifies:

  • ServerScript instance type
  • num_gpus, log_prefix, and port attributes
  • Client uses EvaluatorClientScript with callable inline for cross-component refs
tests/test_declarative_pipeline.py (3)

30-46: DummyScript implementation is appropriate for testing.

The DummyScript class properly mirrors the BaseJobScript interface with:

  • set_inline method for command updates
  • hostname_ref method for cross-component communication
  • het_group_index attribute for heterogeneous job support

This provides good test isolation without requiring the full script infrastructure.


48-51: Helper function make_command improves test readability.

Good addition that reduces boilerplate and ensures consistent Command construction with DummyScript instances across tests.


891-922: Environment variable capture test validates sandbox/client integration.

The test properly verifies that:

  • Sandbox receives LISTEN_PORT and NGINX_PORT
  • Client receives NEMO_SKILLS_SANDBOX_PORT
  • Ports match between sandbox and client

This ensures the cross-component environment propagation works correctly with the new Script-based architecture.

nemo_skills/pipeline/utils/generation.py (2)

30-59: normalize_models_config correctly handles model input normalization.

The function properly:

  • Raises ValueError for None or empty inputs
  • Converts scalar strings to single-element lists
  • Passes through existing lists

62-102: normalize_parameter broadcast logic is correct.

The function implements standard broadcast semantics:

  • Scalar → broadcast to all models
  • Single-element list → broadcast to all models
  • Multi-element list → must match num_models exactly

The error message clearly explains the expected values.

nemo_skills/pipeline/nemo_evaluator.py (4)

292-293: Good: Dict conversion for serialization compatibility.

Converting launcher_run_cfg and task_cfg to plain dicts via OmegaConf.to_container() ensures compatibility with nemo-run/fiddle serialization, which cannot handle OmegaConf objects directly.


471-495: ServerScript integration in _create_serving_command_obj looks correct.

The function properly:

  • Creates ServerScript with all required parameters
  • Sets log_prefix to "judge-server" for judge servers
  • Returns a Command wrapping the script

731-781: EvaluatorClientScript correctly implements lazy command building.

The build_command closure:

  1. Resolves server URLs using hostname_ref() (returns shell variable for runtime resolution)
  2. Adds wait commands for server health endpoints
  3. Builds the task command with URL overrides
  4. Returns (command, {"environment": env_vars}) tuple for prepare_for_execution

This pattern enables cross-component communication in heterogeneous jobs where hostnames aren't known until runtime.


692-694: Dict-to-DictConfig conversion in _build_task_cmd is correct.

Since launcher_run_cfg and task_cfg are stored as plain dicts for serialization, converting back to DictConfig here enables OmegaConf operations (OmegaConf.update) as needed.

nemo_skills/pipeline/generate.py (4)

136-183: Client and sandbox correctly placed in group 0 only.

The logic ensures:

  • Only model_idx == 0 group contains the client and optional sandbox
  • Client's GenerationClientScript receives references to all server scripts via servers=server_scripts
  • This enables cross-component hostname resolution for multi-model generation

419-424: Good validation for multi-model requirements.

Requiring generation_type or generation_module for multi-model generation ensures the correct inference module is used to handle multiple model inputs.


611-619: Correct job spec structure for single vs multi-group.

The logic properly uses:

  • "groups" key for multi-group jobs (heterogeneous execution)
  • "group" key for single-group jobs (standard execution)

This aligns with the Pipeline API that expects either key based on job type.


368-383: Updated docstring correctly documents multi-model behavior.

The docstring clearly explains:

  • Parameter type hints use List[T] for Typer CLI compatibility
  • Both scalars and lists work when calling from Python
  • Single values broadcast to all models

This helps users understand the flexible input handling.

nemo_skills/pipeline/utils/declarative.py (9)

15-41: LGTM!

Imports are well-organized and appropriate for the Script-based declarative pipeline system.


43-205: LGTM!

Excellent module documentation with comprehensive examples demonstrating the Script-based API, including single jobs, heterogeneous SLURM jobs, and custom Script classes.


262-286: LGTM!

Clean dataclass definitions for HardwareConfig and CommandGroup with sensible defaults.


289-493: LGTM!

The Pipeline class is well-structured with:

  • Early validation in _validate() for fail-fast behavior
  • Comprehensive dependency resolution separating internal and external dependencies
  • Proper code reuse caching for experiment efficiency

568-576: LGTM!

Good addition of span_group_nodes support. This correctly allows server scripts to span all nodes while keeping client/sandbox on the master node in multi-node setups with --overlap.


627-632: LGTM!

Good design decision to pre-assign het_group_index values before evaluating any commands. This ensures cross-component references (e.g., hostname_ref) see the correct indices regardless of processing order.


633-666: LGTM!

The two-pass pattern (prepare first, then build executors) is a clean approach that ensures all scripts have resolved cross-references before executor instantiation.


712-721: LGTM!

Packager sharing logic is correct—for single-group jobs, the first component's packager is reused by subsequent components to avoid redundant packaging operations.


792-832: LGTM!

Clean separation with _add_single_group_job and _add_multi_group_job delegating to the unified _plan_and_add_job method with appropriate heterogeneous flag.

Comment on lines +419 to +433

# Add multi-model configuration if provided
if server_addresses is not None and model_names is not None:
num_models = len(model_names)
if num_models > 1:
# Multi-model: pass server configuration as lists
model_names_arg = ",".join(model_names)
cmd += f"++server.model=[{model_names_arg}] "

server_types_arg = ",".join(server_types)
cmd += f"++server.server_type=[{server_types_arg}] "

server_addresses_arg = ",".join(server_addresses)
cmd += f"++server.base_url=[{server_addresses_arg}] "

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /tmp && git clone https://github.com/NVIDIA/NeMo-Skills.git 2>/dev/null || cd /tmp/NeMo-Skills
# Find the file in question
find . -name "generation.py" -path "*/pipeline/utils/*" 2>/dev/null | head -1

Repository: NVIDIA-NeMo/Skills

Length of output: 116


🏁 Script executed:

# Read the function to understand the context
file_path=$(find . -name "generation.py" -path "*/pipeline/utils/*" 2>/dev/null | head -1)
if [ -n "$file_path" ]; then
  # Get the function signature and context around lines 419-433
  sed -n '380,440p' "$file_path"
fi

Repository: NVIDIA-NeMo/Skills

Length of output: 2863


🏁 Script executed:

# Also search for where server_types is defined in this function
file_path=$(find . -name "generation.py" -path "*/pipeline/utils/*" 2>/dev/null | head -1)
if [ -n "$file_path" ]; then
  # Look for the function definition starting from line 1
  head -420 "$file_path" | tail -100
fi

Repository: NVIDIA-NeMo/Skills

Length of output: 3771


🏁 Script executed:

# Search for function signature to see parameters
file_path=$(find . -name "generation.py" -path "*/pipeline/utils/*" 2>/dev/null | head -1)
if [ -n "$file_path" ]; then
  # Find get_generation_cmd function definition
  rg -n "def get_generation_cmd" "$file_path" -A 30
fi

Repository: NVIDIA-NeMo/Skills

Length of output: 1369


🏁 Script executed:

# Search for all calls to get_generation_cmd with server_addresses parameter
file_path=$(find . -name "generation.py" -path "*/pipeline/utils/*" 2>/dev/null | head -1)
base_dir=$(dirname "$file_path")
rg -r -n "get_generation_cmd" --type=py "$base_dir/.." | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 575


🏁 Script executed:

# Search for calls passing server_addresses to understand usage pattern
rg -n "server_addresses\s*=" --type=py -B 2 -A 2 | head -50

Repository: NVIDIA-NeMo/Skills

Length of output: 1115


🏁 Script executed:

# Look for any calls with server_addresses argument specifically
rg -n "server_addresses=" --type=py -B 5 -A 5

Repository: NVIDIA-NeMo/Skills

Length of output: 1011


🏁 Script executed:

# Get more context from scripts.py around the call
rg -n "server_addresses = None" --type=py -A 50 nemo_skills/pipeline/utils/scripts.py | head -80

Repository: NVIDIA-NeMo/Skills

Length of output: 1953


🏁 Script executed:

# Check what self.model_names and self.server_types are
rg -n "self.model_names|self.server_types" --type=py nemo_skills/pipeline/utils/scripts.py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 310


🏁 Script executed:

# Look at the class definition to understand when these are set
rg -n "class.*Script|model_names|server_types" --type=py nemo_skills/pipeline/utils/scripts.py | head -40

Repository: NVIDIA-NeMo/Skills

Length of output: 859


🏁 Script executed:

# Check if there's any validation that ensures server_types is set when model_names is set
rg -n "server_types" --type=py nemo_skills/pipeline/utils/scripts.py -B 3 -A 3 | head -60

Repository: NVIDIA-NeMo/Skills

Length of output: 1846


🏁 Script executed:

# Look at the full GenerationClientScript class definition
sed -n '297,380p' nemo_skills/pipeline/utils/scripts.py

Repository: NVIDIA-NeMo/Skills

Length of output: 3856


🏁 Script executed:

# Check if there's any validation elsewhere that enforces these parameters together
rg -n "__post_init__|validate|check" --type=py nemo_skills/pipeline/utils/scripts.py -A 5 | grep -A 5 "GenerationClientScript" | head -40

Repository: NVIDIA-NeMo/Skills

Length of output: 285


🏁 Script executed:

# Also check generation.py for any validation after the guard check
sed -n '419,450p' nemo_skills/pipeline/utils/generation.py

Repository: NVIDIA-NeMo/Skills

Length of output: 1304


Add server_types to the guard condition to prevent TypeError.

Line 419 checks server_addresses is not None and model_names is not None, but line 428 unconditionally calls ",".join(server_types). If server_types is None (which is allowed by the function signature), this will raise TypeError: 'NoneType' object is not iterable.

    # Add multi-model configuration if provided
-    if server_addresses is not None and model_names is not None:
+    if server_addresses is not None and model_names is not None and server_types is not None:
        num_models = len(model_names)
        if num_models > 1:
            # Multi-model: pass server configuration as lists
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Add multi-model configuration if provided
if server_addresses is not None and model_names is not None:
num_models = len(model_names)
if num_models > 1:
# Multi-model: pass server configuration as lists
model_names_arg = ",".join(model_names)
cmd += f"++server.model=[{model_names_arg}] "
server_types_arg = ",".join(server_types)
cmd += f"++server.server_type=[{server_types_arg}] "
server_addresses_arg = ",".join(server_addresses)
cmd += f"++server.base_url=[{server_addresses_arg}] "
# Add multi-model configuration if provided
if server_addresses is not None and model_names is not None and server_types is not None:
num_models = len(model_names)
if num_models > 1:
# Multi-model: pass server configuration as lists
model_names_arg = ",".join(model_names)
cmd += f"++server.model=[{model_names_arg}] "
server_types_arg = ",".join(server_types)
cmd += f"++server.server_type=[{server_types_arg}] "
server_addresses_arg = ",".join(server_addresses)
cmd += f"++server.base_url=[{server_addresses_arg}] "
🤖 Prompt for AI Agents
In nemo_skills/pipeline/utils/generation.py around lines 419 to 433, the
multi-model branch joins server_types without checking it, which can raise
TypeError if server_types is None; update the guard to require server_types is
not None (e.g., if server_addresses is not None and model_names is not None and
server_types is not None) and then validate that len(server_types) ==
len(model_names) (or coerce/supply defaults) before building server_types_arg to
prevent mismatched lists and runtime errors.

Comment on lines +392 to +401
if self.servers is not None:
server_addresses = []
for server_idx, server_script in enumerate(self.servers):
if server_script is not None:
# Self-hosted: construct address from hostname and port refs
addr = f"{server_script.hostname_ref()}:{server_script.port}"
else:
# Pre-hosted: use the address from server_addresses_prehosted
addr = self.server_addresses_prehosted[server_idx]
server_addresses.append(addr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential IndexError/TypeError if server_addresses_prehosted is misaligned.

When servers[server_idx] is None (pre-hosted), line 400 accesses self.server_addresses_prehosted[server_idx]. If server_addresses_prehosted is None or shorter than servers, this raises an error at runtime.

🔎 Proposed defensive check
             if self.servers is not None:
                 server_addresses = []
+                prehosted = self.server_addresses_prehosted or []
                 for server_idx, server_script in enumerate(self.servers):
                     if server_script is not None:
                         # Self-hosted: construct address from hostname and port refs
                         addr = f"{server_script.hostname_ref()}:{server_script.port}"
                     else:
                         # Pre-hosted: use the address from server_addresses_prehosted
-                        addr = self.server_addresses_prehosted[server_idx]
+                        if server_idx >= len(prehosted) or not prehosted[server_idx]:
+                            raise ValueError(
+                                f"Server {server_idx} is pre-hosted but no address provided in server_addresses_prehosted"
+                            )
+                        addr = prehosted[server_idx]
                     server_addresses.append(addr)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if self.servers is not None:
server_addresses = []
for server_idx, server_script in enumerate(self.servers):
if server_script is not None:
# Self-hosted: construct address from hostname and port refs
addr = f"{server_script.hostname_ref()}:{server_script.port}"
else:
# Pre-hosted: use the address from server_addresses_prehosted
addr = self.server_addresses_prehosted[server_idx]
server_addresses.append(addr)
if self.servers is not None:
server_addresses = []
prehosted = self.server_addresses_prehosted or []
for server_idx, server_script in enumerate(self.servers):
if server_script is not None:
# Self-hosted: construct address from hostname and port refs
addr = f"{server_script.hostname_ref()}:{server_script.port}"
else:
# Pre-hosted: use the address from server_addresses_prehosted
if server_idx >= len(prehosted) or not prehosted[server_idx]:
raise ValueError(
f"Server {server_idx} is pre-hosted but no address provided in server_addresses_prehosted"
)
addr = prehosted[server_idx]
server_addresses.append(addr)
🤖 Prompt for AI Agents
In nemo_skills/pipeline/utils/scripts.py around lines 392 to 401, the loop
assumes self.server_addresses_prehosted exists and has an entry for every index
where self.servers[i] is None; add a defensive check when resolving pre-hosted
addresses to ensure self.server_addresses_prehosted is not None and
len(self.server_addresses_prehosted) > server_idx (or use a safe lookup), and if
the check fails raise a clear ValueError (or skip/continue based on desired
behavior) with a descriptive message; alternatively validate and normalize
self.server_addresses_prehosted length earlier (before the loop) so the loop can
safely index it.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 6, 2026

Greptile Summary

This PR successfully refactors the generation pipeline to use run.Script dataclasses, fixing multi-node issues from #1052. The refactoring introduces three key Script classes (ServerScript, SandboxScript, GenerationClientScript) that encapsulate command building logic with explicit fields and cross-component references.

Key improvements:

  • Lazy command evaluation ensures het_group_index is assigned before hostname resolution
  • span_group_nodes flag properly controls node allocation (servers span all nodes, clients/sandbox run on master only)
  • Multi-model support with heterogeneous SLURM jobs works correctly
  • Cross-component references (server hostnames, ports) resolve at runtime via environment variables
  • Resource requirements (num_gpus, num_nodes, num_tasks) centralized in HardwareConfig and Script objects

Changes validated:

  • Comprehensive test coverage updated for Script-based architecture
  • GPU test reliability improved with heartbeat monitoring
  • OmegaConf serialization issue fixed in nemo_evaluator (converts to plain dicts)

The refactoring maintains backward compatibility while making the pipeline more maintainable and extensible.

Confidence Score: 5/5

  • Safe to merge - well-tested refactoring with comprehensive test coverage and improved architecture
  • This is a large but well-structured refactoring that addresses the multi-node issues from Use run.Script for generate pipeline #1052. The code introduces clear abstractions (Script dataclasses), maintains test coverage, and includes proper safeguards for multi-node execution. The lazy command evaluation pattern ensures correct hostname resolution, and the span_group_nodes flag prevents incorrect node allocation. All tests have been updated to reflect the new architecture.
  • No files require special attention

Important Files Changed

Filename Overview
nemo_skills/pipeline/utils/scripts.py New file introducing Script dataclasses (ServerScript, SandboxScript, GenerationClientScript) with lazy command building, cross-component references, and multi-node support via span_group_nodes
nemo_skills/pipeline/utils/declarative.py Refactored to use Script objects instead of raw commands, with het_group_index assignment before command evaluation and span_group_nodes support for multi-node servers
nemo_skills/pipeline/generate.py Refactored _create_job_unified to use Script objects, creates CommandGroups with cross-component references for multi-model generation
nemo_skills/pipeline/utils/generation.py Added multi-model support with normalize_models_config and normalize_parameter functions, updated get_generation_cmd for multi-model server lists
nemo_skills/pipeline/nemo_evaluator.py Refactored to use ServerScript and EvaluatorClientScript, converts OmegaConf to plain dicts for nemo_run serialization compatibility
tests/test_declarative_pipeline.py Updated tests to use Script-based Command interface, added tests for het_group_index assignment and hostname_ref resolution

Sequence Diagram

sequenceDiagram
    participant User
    participant Pipeline
    participant Generate as _create_job_unified
    participant ServerScript
    participant SandboxScript
    participant ClientScript
    participant CommandGroup
    participant Executor

    User->>Pipeline: generate(model=[m1, m2], ...)
    Pipeline->>Generate: _create_job_unified(models, server_configs, ...)
    
    loop For each model
        Generate->>ServerScript: new ServerScript(server_type, model_path, num_gpus, num_nodes, ...)
        ServerScript->>ServerScript: Allocate port if needed
        ServerScript->>ServerScript: Build server command
        Note over ServerScript: span_group_nodes=True<br/>(server uses all nodes)
    end
    
    alt With Sandbox
        Generate->>SandboxScript: new SandboxScript(cluster_config, keep_mounts, ...)
        SandboxScript->>SandboxScript: Allocate port
        SandboxScript->>SandboxScript: Build sandbox command with env vars
        Note over SandboxScript: span_group_nodes=False<br/>(runs on master node only)
    end
    
    Generate->>ClientScript: new ClientScript(output_dir, servers=[server1, server2, ...], sandbox, ...)
    Note over ClientScript: Cross-component references:<br/>servers list, sandbox ref
    ClientScript->>ClientScript: Set inline to lazy builder (callable)
    Note over ClientScript: span_group_nodes=False<br/>(runs on master node only)
    
    loop For each model group
        Generate->>CommandGroup: Create CommandGroup(commands, hardware)
        Note over CommandGroup: Group 0: server1 + sandbox + client<br/>Group 1: server2 (if multi-model)
    end
    
    Generate-->>Pipeline: Return list of CommandGroups
    
    Pipeline->>Pipeline: Assign het_group_index to all scripts
    Note over Pipeline: BEFORE evaluating any commands
    
    Pipeline->>ClientScript: Evaluate inline callable
    ClientScript->>ServerScript: hostname_ref() → ${SLURM_MASTER_NODE_HET_GROUP_N}
    ClientScript->>ServerScript: Get port
    ClientScript->>SandboxScript: Get port
    ClientScript->>ClientScript: Build generation command with server addresses
    ClientScript-->>Pipeline: Command string + environment vars
    
    Pipeline->>Executor: Create executors with hardware config
    Note over Executor: num_nodes from span_group_nodes:<br/>ServerScript: group.num_nodes<br/>ClientScript/Sandbox: 1
    
    Pipeline->>Executor: exp.add(scripts, executors, dependencies)
    Executor->>Executor: Submit heterogeneous SLURM job
    Note over Executor: Environment exports:<br/>SLURM_MASTER_NODE_HET_GROUP_0,<br/>SLURM_MASTER_NODE_HET_GROUP_1, ...
Loading

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 6, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

@gwarmstrong gwarmstrong merged commit 0f94f5e into main Jan 6, 2026
7 checks passed
@gwarmstrong gwarmstrong deleted the georgea/fix-run-script-refactor branch January 6, 2026 22:23
blahblahasdf pushed a commit to blahblahasdf/Skills that referenced this pull request Jan 8, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dlord <dlord@nvidia.com>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
This was referenced Mar 5, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant