Conversation
This reverts commit 1c0722a.

FIX multi-node pipeline creation
Signed-off-by: George Armstrong <georgea@nvidia.com>

remove hosntame ref change
Signed-off-by: George Armstrong <georgea@nvidia.com>

make param span_group_nodes
Signed-off-by: George Armstrong <georgea@nvidia.com>
📝 Walkthrough

This PR introduces a comprehensive refactor of the NeMo Skills pipeline architecture, replacing string-based command handling with Script objects (`BaseJobScript`, `ServerScript`, `SandboxScript`, `GenerationClientScript`, `EvaluatorClientScript`) and adding multi-model generation support. The generate pipeline is restructured via `_create_job_unified`.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🧹 Nitpick comments (5)
nemo_skills/pipeline/utils/scripts.py (1)
283-293: Consider validating `env_overrides` format.

The parsing at lines 288-289 assumes each override contains `=`. If a malformed entry is passed (e.g., `"INVALID"`), `split("=", 1)` returns a single-element list, causing `key, value = ...` to raise `ValueError`.

🔎 Proposed defensive handling

```diff
 if self.env_overrides:
     for override in self.env_overrides:
-        key, value = override.split("=", 1)
-        env[key] = value
+        if "=" not in override:
+            raise ValueError(f"Invalid env_override format '{override}': expected KEY=VALUE")
+        key, value = override.split("=", 1)
+        env[key] = value
```

tests/test_nemo_evaluator_pipeline.py (1)
20-27: Minor: consolidate imports from the same module.

The imports from `nemo_skills.pipeline.nemo_evaluator` are split across two separate import statements unnecessarily.

🔎 Suggested consolidation

```diff
-from nemo_skills.pipeline.nemo_evaluator import (
-    EvaluatorClientScript,
-)
-from nemo_skills.pipeline.nemo_evaluator import (
-    nemo_evaluator as nemo_evaluator_fn,
-)
+from nemo_skills.pipeline.nemo_evaluator import (
+    EvaluatorClientScript,
+    nemo_evaluator as nemo_evaluator_fn,
+)
```

nemo_skills/pipeline/generate.py (2)
95-95: Unused loop variable `model_path`.

The variable `model_path` is defined in the loop but not used. Since the model path is obtained from `server_config["model_path"]` on line 111, this outer variable is redundant.

🔎 Suggested fix

```diff
-    for model_idx, (model_path, server_config) in enumerate(zip(models, server_configs)):
+    for model_idx, (_, server_config) in enumerate(zip(models, server_configs, strict=True)):
```
91-94: Consider adding `strict=True` to `zip()` for safety.

If `models` and `server_configs` have mismatched lengths, `zip()` will silently truncate to the shorter list, potentially causing confusing behavior. Adding `strict=True` will raise a `ValueError` for length mismatches.

🔎 Suggested fix

```diff
-    for model_idx, (model_path, server_config) in enumerate(zip(models, server_configs)):
+    for model_idx, (_, server_config) in enumerate(zip(models, server_configs, strict=True)):
```

nemo_skills/pipeline/utils/declarative.py (1)
223-256: Unused `cluster_config` parameter.

The `cluster_config` parameter is declared but never used in the method body. This appears to be a remnant from the refactoring. Consider removing it if not needed, or adding a comment if reserved for future use.

🔎 Suggested fix

```diff
-    def prepare_for_execution(self, cluster_config: Dict) -> Tuple[run.Script, Dict]:
+    def prepare_for_execution(self) -> Tuple[run.Script, Dict]:
         """Prepare script for execution.

         This method:
         1. Evaluates lazy commands (if script.inline is callable)
         2. Builds execution config from Script fields

         Returns:
             Tuple of (Script_object, execution_config)
         """
```

If removing the parameter, also update the call site at line 501:

```diff
-        script, exec_config = self._prepare_command(command, cluster_config)
+        script, exec_config = self._prepare_command(command)
```

And `_prepare_command` at line 495:

```diff
-    def _prepare_command(self, command, cluster_config: Dict) -> Tuple[run.Script, Dict]:
+    def _prepare_command(self, command) -> Tuple[run.Script, Dict]:
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- `.github/workflows/gpu_tests.yml` (1 hunks)
- `nemo_skills/pipeline/generate.py` (9 hunks)
- `nemo_skills/pipeline/nemo_evaluator.py` (8 hunks)
- `nemo_skills/pipeline/utils/__init__.py` (1 hunks)
- `nemo_skills/pipeline/utils/declarative.py` (11 hunks)
- `nemo_skills/pipeline/utils/generation.py` (5 hunks)
- `nemo_skills/pipeline/utils/scripts.py` (1 hunks)
- `tests/gpu-tests/test_eval.py` (1 hunks)
- `tests/test_declarative_pipeline.py` (25 hunks)
- `tests/test_generation.py` (2 hunks)
- `tests/test_nemo_evaluator_pipeline.py` (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/test_declarative_pipeline.py (2)
- nemo_skills/pipeline/utils/scripts.py (2)
  - `set_inline` (104-106)
  - `hostname_ref` (108-122)
- nemo_skills/pipeline/utils/declarative.py (4)
  - `Command` (211-259)
  - `prepare_for_execution` (223-256)
  - `get_name` (258-259)
  - `run` (356-493)

nemo_skills/pipeline/utils/__init__.py (1)
- nemo_skills/pipeline/utils/generation.py (2)
  - `normalize_models_config` (30-59)
  - `normalize_parameter` (62-102)

nemo_skills/pipeline/utils/scripts.py (6)
- nemo_skills/pipeline/utils/commands.py (1)
  - `sandbox_command` (77-111)
- nemo_skills/pipeline/utils/exp.py (1)
  - `install_packages_wrap` (368-408)
- nemo_skills/pipeline/utils/generation.py (1)
  - `get_generation_cmd` (360-494)
- nemo_skills/pipeline/utils/server.py (2)
  - `get_free_port` (43-59)
  - `get_server_command` (114-227)
- nemo_skills/utils.py (1)
  - `get_logger_name` (39-43)
- tests/test_declarative_pipeline.py (2)
  - `set_inline` (39-40)
  - `hostname_ref` (42-45)

nemo_skills/pipeline/generate.py (4)
- nemo_skills/pipeline/utils/scripts.py (2)
  - `GenerationClientScript` (297-428)
  - `ServerScript` (126-223)
- nemo_skills/pipeline/utils/declarative.py (3)
  - `CommandGroup` (273-286)
  - `Command` (211-259)
  - `HardwareConfig` (263-270)
- nemo_skills/pipeline/utils/server.py (2)
  - `SupportedServers` (32-40)
  - `should_get_random_port` (62-63)
- nemo_skills/pipeline/utils/generation.py (3)
  - `normalize_models_config` (30-59)
  - `normalize_parameter` (62-102)
  - `configure_client` (519-579)

tests/test_generation.py (2)
- nemo_skills/pipeline/generate.py (2)
  - `generate` (208-645)
  - `_create_job_unified` (50-203)
- nemo_skills/pipeline/utils/scripts.py (1)
  - `ServerScript` (126-223)

tests/test_nemo_evaluator_pipeline.py (3)
- nemo_skills/pipeline/nemo_evaluator.py (2)
  - `nemo_evaluator` (113-421)
  - `EvaluatorClientScript` (732-781)
- nemo_skills/pipeline/utils/declarative.py (2)
  - `Command` (211-259)
  - `CommandGroup` (273-286)
- nemo_skills/pipeline/utils/scripts.py (1)
  - `ServerScript` (126-223)
🪛 Ruff (0.14.8)
tests/test_declarative_pipeline.py
587-587: Probable insecure usage of temporary file or directory: "/tmp/logs"
(S108)
590-590: Probable insecure usage of temporary file or directory: "/tmp/logs"
(S108)
676-676: Probable insecure usage of temporary file or directory: "/tmp/logs"
(S108)
nemo_skills/pipeline/utils/generation.py
50-50: Avoid specifying long messages outside the exception class
(TRY003)
58-58: Avoid specifying long messages outside the exception class
(TRY003)
99-102: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/pipeline/utils/declarative.py
223-223: Unused method argument: cluster_config
(ARG002)
nemo_skills/pipeline/generate.py
95-95: Loop control variable model_path not used within loop body
Rename unused model_path to _model_path
(B007)
95-95: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
234-238: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
239-243: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
244-248: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
249-253: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
254-258: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
259-263: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
264-268: Do not perform function call typer.Option in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
422-424: Avoid specifying long messages outside the exception class
(TRY003)
tests/test_generation.py
174-174: Probable insecure usage of temporary file or directory: "/tmp/out"
(S108)
186-186: Probable insecure usage of temporary file or directory: "/tmp/logs"
(S108)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: gpu-tests-qwen
- GitHub Check: unit-tests
🔇 Additional comments (31)
tests/gpu-tests/test_eval.py (1)
46-46: LGTM! Dataset exclusion updated from "aalcr" to "mrcr". The change aligns with the stated update to dataset exclusions.
.github/workflows/gpu_tests.yml (1)
55-63: LGTM! The heartbeat mechanism is a good addition for long-running GPU tests to prevent CI timeouts due to lack of output. The implementation correctly:
- Captures the heartbeat PID for cleanup
- Preserves the test script's exit code
- Gracefully handles heartbeat cleanup with `|| true`
- Propagates the correct exit status
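The four properties above can be sketched in a few lines of shell (a simplified stand-in for the workflow step, not the actual `gpu_tests.yml` contents; `sleep 3` is a placeholder for the real test command and the heartbeat interval is shortened for illustration):

```shell
# Background heartbeat so CI sees periodic output during a quiet step.
( while true; do echo "heartbeat: still running"; sleep 1; done ) &
HEARTBEAT_PID=$!

# "sleep 3" stands in for the real long-running test command.
sleep 3
test_status=$?

# Stop the heartbeat; "|| true" keeps cleanup from masking the test result.
kill "$HEARTBEAT_PID" 2>/dev/null || true
echo "test exit code: $test_status"
```

Capturing `$?` immediately after the test command is what preserves the exit code across the cleanup steps.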
nemo_skills/pipeline/utils/scripts.py (2)

62-123: LGTM! `BaseJobScript` provides a clean abstraction for heterogeneous job support with:

- Proper handling of both callable and string `inline` commands in `__post_init__`
- Correct use of `object.__setattr__` for frozen dataclass mutation
- Well-documented `hostname_ref()` with appropriate localhost fallback
125-223: LGTM! `ServerScript` correctly encapsulates server configuration with automatic port allocation and proper command building. The `get_address()` method provides clean cross-component communication support.

nemo_skills/pipeline/utils/__init__.py (1)
52-53: LGTM! New normalization utilities `normalize_models_config` and `normalize_parameter` are correctly exported, providing convenient access to multi-model configuration helpers through the `utils` package namespace.

tests/test_generation.py (2)
23-24: LGTM! Imports correctly updated to use the new `_create_job_unified` API and `ServerScript` class.
156-193: LGTM! Test correctly updated to use the new `_create_job_unified` API with proper validation of:

- `ServerScript` instance type
- `num_tasks` and `num_gpus` propagation
- Hardware configuration alignment

The static analysis hints about `/tmp/out` and `/tmp/logs` are false positives for test code, where these are placeholder paths for the test configuration, not actual security-sensitive operations.

tests/test_nemo_evaluator_pipeline.py (2)
96-147: Test coverage for the external URLs path looks good. The test properly verifies the Pipeline structure when using external URLs without hosted servers, including checking that the client command uses `EvaluatorClientScript`.
149-204: Main server hosted test correctly validates `ServerScript` properties. The test properly verifies:

- `ServerScript` instance type
- `num_gpus`, `log_prefix`, and `port` attributes
- Client uses `EvaluatorClientScript` with callable inline for cross-component refs

tests/test_declarative_pipeline.py (3)
30-46: `DummyScript` implementation is appropriate for testing. The `DummyScript` class properly mirrors the `BaseJobScript` interface with:

- `set_inline` method for command updates
- `hostname_ref` method for cross-component communication
- `het_group_index` attribute for heterogeneous job support

This provides good test isolation without requiring the full script infrastructure.
48-51: Helper function `make_command` improves test readability. Good addition that reduces boilerplate and ensures consistent `Command` construction with `DummyScript` instances across tests.
891-922: Environment variable capture test validates sandbox/client integration. The test properly verifies that:

- Sandbox receives `LISTEN_PORT` and `NGINX_PORT`
- Client receives `NEMO_SKILLS_SANDBOX_PORT`
- Ports match between sandbox and client

This ensures the cross-component environment propagation works correctly with the new Script-based architecture.

nemo_skills/pipeline/utils/generation.py (2)
nemo_skills/pipeline/utils/generation.py (2)
30-59: `normalize_models_config` correctly handles model input normalization. The function properly:

- Raises `ValueError` for `None` or empty inputs
- Converts scalar strings to single-element lists
- Passes through existing lists
62-102: `normalize_parameter` broadcast logic is correct. The function implements standard broadcast semantics:

- Scalar → broadcast to all models
- Single-element list → broadcast to all models
- Multi-element list → must match `num_models` exactly

The error message clearly explains the expected values.
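These broadcast semantics can be sketched in a few lines (a simplified stand-in, not the actual `normalize_parameter` implementation; the name `broadcast_param` is hypothetical):

```python
def broadcast_param(value, num_models, name="param"):
    """Broadcast a scalar or single-element list to num_models entries."""
    if not isinstance(value, list):
        return [value] * num_models   # scalar -> broadcast to all models
    if len(value) == 1:
        return value * num_models     # single-element list -> broadcast
    if len(value) != num_models:
        raise ValueError(f"{name}: got {len(value)} values for {num_models} models")
    return value                      # multi-element list, already aligned
```

For example, `broadcast_param(8, 3)` yields `[8, 8, 8]`, while `broadcast_param([1, 2], 3)` raises `ValueError`.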
nemo_skills/pipeline/nemo_evaluator.py (4)
292-293: Good: dict conversion for serialization compatibility. Converting `launcher_run_cfg` and `task_cfg` to plain dicts via `OmegaConf.to_container()` ensures compatibility with nemo-run/fiddle serialization, which cannot handle OmegaConf objects directly.
471-495: `ServerScript` integration in `_create_serving_command_obj` looks correct. The function properly:

- Creates `ServerScript` with all required parameters
- Sets `log_prefix` to `"judge-server"` for judge servers
- Returns a `Command` wrapping the script
731-781: `EvaluatorClientScript` correctly implements lazy command building. The `build_command` closure:

- Resolves server URLs using `hostname_ref()` (returns a shell variable for runtime resolution)
- Adds wait commands for server health endpoints
- Builds the task command with URL overrides
- Returns a `(command, {"environment": env_vars})` tuple for `prepare_for_execution`

This pattern enables cross-component communication in heterogeneous jobs where hostnames aren't known until runtime.
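The lazy-builder pattern can be illustrated with a small standalone sketch (`FakeServerScript`, `wait_for_url`, and `run_client` are placeholders, not the repository's actual classes or commands):

```python
class FakeServerScript:
    """Stand-in for a ServerScript: exposes hostname_ref() and port."""

    def __init__(self, het_group_index, port):
        self.het_group_index = het_group_index
        self.port = port

    def hostname_ref(self):
        # A shell variable that SLURM expands at runtime on the cluster,
        # so the command can be built before hostnames are known.
        return f"${{SLURM_MASTER_NODE_HET_GROUP_{self.het_group_index}}}"


def make_lazy_client_command(server):
    """Return a zero-argument builder, evaluated only after het-group indices are assigned."""

    def build_command():
        url = f"http://{server.hostname_ref()}:{server.port}"
        cmd = f"wait_for_url {url}/health && run_client --base-url {url}"
        return cmd, {"environment": {"SERVER_URL": url}}

    return build_command


builder = make_lazy_client_command(FakeServerScript(het_group_index=1, port=5000))
cmd, exec_cfg = builder()  # deferred until prepare_for_execution time
```

The key point is that `builder` captures the server object, not a resolved hostname, so reordering or reindexing het groups before evaluation still produces correct references.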
692-694: Dict-to-DictConfig conversion in `_build_task_cmd` is correct. Since `launcher_run_cfg` and `task_cfg` are stored as plain dicts for serialization, converting back to `DictConfig` here enables OmegaConf operations (`OmegaConf.update`) as needed.

nemo_skills/pipeline/generate.py (4)
136-183: Client and sandbox correctly placed in group 0 only. The logic ensures:

- Only the `model_idx == 0` group contains the client and optional sandbox
- The client's `GenerationClientScript` receives references to all server scripts via `servers=server_scripts`
- This enables cross-component hostname resolution for multi-model generation
419-424: Good validation for multi-model requirements. Requiring `generation_type` or `generation_module` for multi-model generation ensures the correct inference module is used to handle multiple model inputs.
611-619: Correct job spec structure for single vs multi-group. The logic properly uses:

- The `"groups"` key for multi-group jobs (heterogeneous execution)
- The `"group"` key for single-group jobs (standard execution)

This aligns with the `Pipeline` API, which expects either key based on job type.
368-383: Updated docstring correctly documents multi-model behavior. The docstring clearly explains:

- Parameter type hints use `List[T]` for Typer CLI compatibility
- Both scalars and lists work when calling from Python
- Single values broadcast to all models

This helps users understand the flexible input handling.
nemo_skills/pipeline/utils/declarative.py (9)
15-41: LGTM! Imports are well-organized and appropriate for the Script-based declarative pipeline system.

43-205: LGTM! Excellent module documentation with comprehensive examples demonstrating the Script-based API, including single jobs, heterogeneous SLURM jobs, and custom Script classes.

262-286: LGTM! Clean dataclass definitions for `HardwareConfig` and `CommandGroup` with sensible defaults.

289-493: LGTM! The `Pipeline` class is well-structured with:

- Early validation in `_validate()` for fail-fast behavior
- Comprehensive dependency resolution separating internal and external dependencies
- Proper code-reuse caching for experiment efficiency
568-576: LGTM! Good addition of `span_group_nodes` support. This correctly allows server scripts to span all nodes while keeping client/sandbox on the master node in multi-node setups with `--overlap`.

627-632: LGTM! Good design decision to pre-assign `het_group_index` values before evaluating any commands. This ensures cross-component references (e.g., `hostname_ref`) see the correct indices regardless of processing order.
633-666: LGTM! The two-pass pattern (prepare first, then build executors) is a clean approach that ensures all scripts have resolved cross-references before executor instantiation.

712-721: LGTM! Packager sharing logic is correct: for single-group jobs, the first component's packager is reused by subsequent components to avoid redundant packaging operations.

792-832: LGTM! Clean separation, with `_add_single_group_job` and `_add_multi_group_job` delegating to the unified `_plan_and_add_job` method with the appropriate `heterogeneous` flag.
```python
# Add multi-model configuration if provided
if server_addresses is not None and model_names is not None:
    num_models = len(model_names)
    if num_models > 1:
        # Multi-model: pass server configuration as lists
        model_names_arg = ",".join(model_names)
        cmd += f"++server.model=[{model_names_arg}] "

        server_types_arg = ",".join(server_types)
        cmd += f"++server.server_type=[{server_types_arg}] "

        server_addresses_arg = ",".join(server_addresses)
        cmd += f"++server.base_url=[{server_addresses_arg}] "
```
🧩 Analysis chain
🏁 Script executed:
cd /tmp && git clone https://github.com/NVIDIA/NeMo-Skills.git 2>/dev/null || cd /tmp/NeMo-Skills
# Find the file in question
find . -name "generation.py" -path "*/pipeline/utils/*" 2>/dev/null | head -1

Repository: NVIDIA-NeMo/Skills
Length of output: 116
🏁 Script executed:
# Read the function to understand the context
file_path=$(find . -name "generation.py" -path "*/pipeline/utils/*" 2>/dev/null | head -1)
if [ -n "$file_path" ]; then
# Get the function signature and context around lines 419-433
sed -n '380,440p' "$file_path"
fi

Repository: NVIDIA-NeMo/Skills
Length of output: 2863
🏁 Script executed:
# Also search for where server_types is defined in this function
file_path=$(find . -name "generation.py" -path "*/pipeline/utils/*" 2>/dev/null | head -1)
if [ -n "$file_path" ]; then
# Look for the function definition starting from line 1
head -420 "$file_path" | tail -100
fi

Repository: NVIDIA-NeMo/Skills
Length of output: 3771
🏁 Script executed:
# Search for function signature to see parameters
file_path=$(find . -name "generation.py" -path "*/pipeline/utils/*" 2>/dev/null | head -1)
if [ -n "$file_path" ]; then
# Find get_generation_cmd function definition
rg -n "def get_generation_cmd" "$file_path" -A 30
fi

Repository: NVIDIA-NeMo/Skills
Length of output: 1369
🏁 Script executed:
# Search for all calls to get_generation_cmd with server_addresses parameter
file_path=$(find . -name "generation.py" -path "*/pipeline/utils/*" 2>/dev/null | head -1)
base_dir=$(dirname "$file_path")
rg -r -n "get_generation_cmd" --type=py "$base_dir/.." | head -20

Repository: NVIDIA-NeMo/Skills
Length of output: 575
🏁 Script executed:
# Search for calls passing server_addresses to understand usage pattern
rg -n "server_addresses\s*=" --type=py -B 2 -A 2 | head -50

Repository: NVIDIA-NeMo/Skills
Length of output: 1115
🏁 Script executed:
# Look for any calls with server_addresses argument specifically
rg -n "server_addresses=" --type=py -B 5 -A 5

Repository: NVIDIA-NeMo/Skills
Length of output: 1011
🏁 Script executed:
# Get more context from scripts.py around the call
rg -n "server_addresses = None" --type=py -A 50 nemo_skills/pipeline/utils/scripts.py | head -80

Repository: NVIDIA-NeMo/Skills
Length of output: 1953
🏁 Script executed:
# Check what self.model_names and self.server_types are
rg -n "self.model_names|self.server_types" --type=py nemo_skills/pipeline/utils/scripts.py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills
Length of output: 310
🏁 Script executed:
# Look at the class definition to understand when these are set
rg -n "class.*Script|model_names|server_types" --type=py nemo_skills/pipeline/utils/scripts.py | head -40

Repository: NVIDIA-NeMo/Skills
Length of output: 859
🏁 Script executed:
# Check if there's any validation that ensures server_types is set when model_names is set
rg -n "server_types" --type=py nemo_skills/pipeline/utils/scripts.py -B 3 -A 3 | head -60

Repository: NVIDIA-NeMo/Skills
Length of output: 1846
🏁 Script executed:
# Look at the full GenerationClientScript class definition
sed -n '297,380p' nemo_skills/pipeline/utils/scripts.py

Repository: NVIDIA-NeMo/Skills
Length of output: 3856
🏁 Script executed:
# Check if there's any validation elsewhere that enforces these parameters together
rg -n "__post_init__|validate|check" --type=py nemo_skills/pipeline/utils/scripts.py -A 5 | grep -A 5 "GenerationClientScript" | head -40

Repository: NVIDIA-NeMo/Skills
Length of output: 285
🏁 Script executed:
# Also check generation.py for any validation after the guard check
sed -n '419,450p' nemo_skills/pipeline/utils/generation.py

Repository: NVIDIA-NeMo/Skills
Length of output: 1304
Add server_types to the guard condition to prevent TypeError.
Line 419 checks `server_addresses is not None and model_names is not None`, but line 428 unconditionally calls `",".join(server_types)`. If `server_types` is `None` (which is allowed by the function signature), this will raise `TypeError: 'NoneType' object is not iterable`.
```diff
 # Add multi-model configuration if provided
-if server_addresses is not None and model_names is not None:
+if server_addresses is not None and model_names is not None and server_types is not None:
     num_models = len(model_names)
     if num_models > 1:
         # Multi-model: pass server configuration as lists
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement:

```python
# Add multi-model configuration if provided
if server_addresses is not None and model_names is not None and server_types is not None:
    num_models = len(model_names)
    if num_models > 1:
        # Multi-model: pass server configuration as lists
        model_names_arg = ",".join(model_names)
        cmd += f"++server.model=[{model_names_arg}] "
        server_types_arg = ",".join(server_types)
        cmd += f"++server.server_type=[{server_types_arg}] "
        server_addresses_arg = ",".join(server_addresses)
        cmd += f"++server.base_url=[{server_addresses_arg}] "
```
🤖 Prompt for AI Agents
In nemo_skills/pipeline/utils/generation.py around lines 419 to 433, the
multi-model branch joins server_types without checking it, which can raise
TypeError if server_types is None; update the guard to require server_types is
not None (e.g., if server_addresses is not None and model_names is not None and
server_types is not None) and then validate that len(server_types) ==
len(model_names) (or coerce/supply defaults) before building server_types_arg to
prevent mismatched lists and runtime errors.
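A minimal standalone sketch of the guarded behavior this prompt describes (the helper name `build_multi_model_flags` is hypothetical, not the repository's actual function):

```python
def build_multi_model_flags(cmd, model_names, server_addresses, server_types):
    """Append multi-model overrides only when all three lists are present and aligned."""
    if model_names is None or server_addresses is None or server_types is None:
        return cmd  # nothing to add; avoids ",".join(None) raising TypeError
    if not (len(model_names) == len(server_addresses) == len(server_types)):
        raise ValueError("model_names, server_addresses and server_types must be the same length")
    if len(model_names) > 1:
        cmd += f"++server.model=[{','.join(model_names)}] "
        cmd += f"++server.server_type=[{','.join(server_types)}] "
        cmd += f"++server.base_url=[{','.join(server_addresses)}] "
    return cmd


out = build_multi_model_flags("", ["a", "b"], ["h1:80", "h2:80"], ["vllm", "trtllm"])
```

The length check covers the second failure mode the prompt mentions: mismatched lists that would otherwise silently pair the wrong server type with a model.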
```python
if self.servers is not None:
    server_addresses = []
    for server_idx, server_script in enumerate(self.servers):
        if server_script is not None:
            # Self-hosted: construct address from hostname and port refs
            addr = f"{server_script.hostname_ref()}:{server_script.port}"
        else:
            # Pre-hosted: use the address from server_addresses_prehosted
            addr = self.server_addresses_prehosted[server_idx]
        server_addresses.append(addr)
```
Potential IndexError/TypeError if server_addresses_prehosted is misaligned.
When servers[server_idx] is None (pre-hosted), line 400 accesses self.server_addresses_prehosted[server_idx]. If server_addresses_prehosted is None or shorter than servers, this raises an error at runtime.
🔎 Proposed defensive check

```diff
 if self.servers is not None:
     server_addresses = []
+    prehosted = self.server_addresses_prehosted or []
     for server_idx, server_script in enumerate(self.servers):
         if server_script is not None:
             # Self-hosted: construct address from hostname and port refs
             addr = f"{server_script.hostname_ref()}:{server_script.port}"
         else:
             # Pre-hosted: use the address from server_addresses_prehosted
-            addr = self.server_addresses_prehosted[server_idx]
+            if server_idx >= len(prehosted) or not prehosted[server_idx]:
+                raise ValueError(
+                    f"Server {server_idx} is pre-hosted but no address provided in server_addresses_prehosted"
+                )
+            addr = prehosted[server_idx]
         server_addresses.append(addr)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement:

```python
if self.servers is not None:
    server_addresses = []
    prehosted = self.server_addresses_prehosted or []
    for server_idx, server_script in enumerate(self.servers):
        if server_script is not None:
            # Self-hosted: construct address from hostname and port refs
            addr = f"{server_script.hostname_ref()}:{server_script.port}"
        else:
            # Pre-hosted: use the address from server_addresses_prehosted
            if server_idx >= len(prehosted) or not prehosted[server_idx]:
                raise ValueError(
                    f"Server {server_idx} is pre-hosted but no address provided in server_addresses_prehosted"
                )
            addr = prehosted[server_idx]
        server_addresses.append(addr)
```
🤖 Prompt for AI Agents
In nemo_skills/pipeline/utils/scripts.py around lines 392 to 401, the loop
assumes self.server_addresses_prehosted exists and has an entry for every index
where self.servers[i] is None; add a defensive check when resolving pre-hosted
addresses to ensure self.server_addresses_prehosted is not None and
len(self.server_addresses_prehosted) > server_idx (or use a safe lookup), and if
the check fails raise a clear ValueError (or skip/continue based on desired
behavior) with a descriptive message; alternatively validate and normalize
self.server_addresses_prehosted length earlier (before the loop) so the loop can
safely index it.
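A standalone sketch of the normalized lookup described above (both `FakeServer` and `resolve_server_addresses` are illustrative stand-ins, not the repository's classes):

```python
class FakeServer:
    """Minimal stand-in exposing the hostname_ref()/port interface."""

    def __init__(self, host, port):
        self._host, self.port = host, port

    def hostname_ref(self):
        return self._host


def resolve_server_addresses(servers, prehosted=None):
    """Resolve each server slot to host:port; None slots fall back to pre-hosted addresses."""
    prehosted = prehosted or []  # normalize None before the loop so indexing is safe
    addresses = []
    for idx, server in enumerate(servers):
        if server is not None:
            addresses.append(f"{server.hostname_ref()}:{server.port}")
        elif idx < len(prehosted) and prehosted[idx]:
            addresses.append(prehosted[idx])
        else:
            raise ValueError(f"Server {idx} is pre-hosted but no address was provided")
    return addresses


addrs = resolve_server_addresses(
    [FakeServer("node0", 5000), None],
    prehosted=["", "api.example.com:443"],
)
```

Normalizing `prehosted` once before the loop keeps the per-iteration logic to a bounds check plus a truthiness check, matching the "validate and normalize earlier" alternative in the prompt.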
Greptile Summary

This PR successfully refactors the generation pipeline to use Script objects.

Key improvements:
Changes validated:
The refactoring maintains backward compatibility while making the pipeline more maintainable and extensible.

Confidence Score: 5/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant User
participant Pipeline
participant Generate as _create_job_unified
participant ServerScript
participant SandboxScript
participant ClientScript
participant CommandGroup
participant Executor
User->>Pipeline: generate(model=[m1, m2], ...)
Pipeline->>Generate: _create_job_unified(models, server_configs, ...)
loop For each model
Generate->>ServerScript: new ServerScript(server_type, model_path, num_gpus, num_nodes, ...)
ServerScript->>ServerScript: Allocate port if needed
ServerScript->>ServerScript: Build server command
Note over ServerScript: span_group_nodes=True<br/>(server uses all nodes)
end
alt With Sandbox
Generate->>SandboxScript: new SandboxScript(cluster_config, keep_mounts, ...)
SandboxScript->>SandboxScript: Allocate port
SandboxScript->>SandboxScript: Build sandbox command with env vars
Note over SandboxScript: span_group_nodes=False<br/>(runs on master node only)
end
Generate->>ClientScript: new ClientScript(output_dir, servers=[server1, server2, ...], sandbox, ...)
Note over ClientScript: Cross-component references:<br/>servers list, sandbox ref
ClientScript->>ClientScript: Set inline to lazy builder (callable)
Note over ClientScript: span_group_nodes=False<br/>(runs on master node only)
loop For each model group
Generate->>CommandGroup: Create CommandGroup(commands, hardware)
Note over CommandGroup: Group 0: server1 + sandbox + client<br/>Group 1: server2 (if multi-model)
end
Generate-->>Pipeline: Return list of CommandGroups
Pipeline->>Pipeline: Assign het_group_index to all scripts
Note over Pipeline: BEFORE evaluating any commands
Pipeline->>ClientScript: Evaluate inline callable
ClientScript->>ServerScript: hostname_ref() → ${SLURM_MASTER_NODE_HET_GROUP_N}
ClientScript->>ServerScript: Get port
ClientScript->>SandboxScript: Get port
ClientScript->>ClientScript: Build generation command with server addresses
ClientScript-->>Pipeline: Command string + environment vars
Pipeline->>Executor: Create executors with hardware config
Note over Executor: num_nodes from span_group_nodes:<br/>ServerScript: group.num_nodes<br/>ClientScript/Sandbox: 1
Pipeline->>Executor: exp.add(scripts, executors, dependencies)
Executor->>Executor: Submit heterogeneous SLURM job
Note over Executor: Environment exports:<br/>SLURM_MASTER_NODE_HET_GROUP_0,<br/>SLURM_MASTER_NODE_HET_GROUP_1, ...
Greptile found no issues! From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Re-applies #1052 with fixes for multi-node
Summary by CodeRabbit
Release Notes
New Features
Improvements
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.