Add multi-instance pipeline support (gpus_per_node)#1247
📝 Walkthrough

This PR introduces multi-GPU per node support to the evaluation pipeline.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant EvalPipeline as eval.py
    participant Scheduler as Pipeline Scheduler
    participant JobPrep as prepare_eval_commands
    participant SlurmExec as Slurm Executor
    participant ServerSetup as Server Setup
    participant ClientSetup as Client Config
    participant SlurmTask as SLURM Task<br/>(per GPU instance)
    User->>EvalPipeline: eval(gpus_per_node=N)
    EvalPipeline->>JobPrep: prepare_eval_commands(gpus_per_node=N)
    JobPrep->>JobPrep: Validate num_chunks % N == 0
    JobPrep->>JobPrep: Derive base_chunks<br/>from benchmark_chunk_ids
    EvalPipeline->>Scheduler: Schedule tasks<br/>num_tasks=N per job
    Scheduler->>SlurmExec: Configure job<br/>tasks_per_node > 1
    SlurmExec->>SlurmExec: Set srun_wait=3600s<br/>for multi-instance
    Scheduler->>ServerSetup: get_server_command(gpus_per_node=N)
    ServerSetup->>ServerSetup: Build multi-instance startup<br/>per SLURM task
    ServerSetup->>SlurmTask: CUDA_VISIBLE_DEVICES=$SLURM_LOCALID<br/>port=$base_port+$SLURM_LOCALID
    Scheduler->>ClientSetup: configure_client(gpus_per_node=N)
    ClientSetup->>ClientSetup: Compute port at runtime<br/>=$base_port+$SLURM_LOCALID
    JobPrep->>JobPrep: Compute effective_chunk_id<br/>using $SLURM_LOCALID
    SlurmTask->>SlurmTask: Execute generation cmd<br/>with runtime chunk indexing
```
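The chunk-to-task mapping the diagram describes can be illustrated in plain Python (the num_chunks=8 / gpus_per_node=4 values below are just an example, not taken from the PR):

```python
def chunk_for_task(base_chunk: int, slurm_localid: int) -> int:
    # Mirrors the runtime shell expression $((base_chunk + SLURM_LOCALID)):
    # each SLURM task offsets its job's base chunk by its local rank.
    return base_chunk + slurm_localid

# With num_chunks=8 and gpus_per_node=4, two jobs are scheduled with base
# chunks 0 and 4, and the 4 tasks of each job cover consecutive chunks.
assignments = [chunk_for_task(base, lid) for base in (0, 4) for lid in range(4)]
print(assignments)  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```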
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)
620-632: ⚠️ Potential issue | 🟡 Minor

Silent misconfiguration risk: `gpus_per_node > 1` with a non-SLURM executor.

When `executor != "slurm"`, `SLURM_LOCALID` is unset and the shell expression `$(({base_chunk} + $SLURM_LOCALID))` evaluates to `{base_chunk}` for every task (all tasks process the same chunk). Consider adding an early guard:

🛡️ Suggested guard in `prepare_eval_commands` (`utils/eval.py`)

```diff
 if gpus_per_node > 1:
     if num_chunks is None:
         raise ValueError("gpus_per_node > 1 requires num_chunks to be specified")
     if num_chunks % gpus_per_node != 0:
         raise ValueError(f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})")
+    if cluster_config.get("executor") != "slurm":
+        raise ValueError("gpus_per_node > 1 is only supported for slurm executor")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/eval.py` around lines 620 - 632, In prepare_eval_commands (utils/eval.py) add a guard that prevents using multiple GPUs per node with non-SLURM executors: detect when job_gpus_per_node > 1 and executor != "slurm" and either raise a clear error or force job_gpus_per_node = 1 (and log) to avoid the incorrect reuse of the same chunk due to SLURM_LOCALID being unset; reference the symbols job_gpus_per_node, executor, SLURM_LOCALID and base_chunk so the check is placed before commands that compute $(({base_chunk} + $SLURM_LOCALID)) or before setting num_tasks in add_task.
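The guard above can be tried out as a standalone function; the name validate_multi_instance_config and its signature are illustrative, while the error messages follow the suggested diff:

```python
def validate_multi_instance_config(gpus_per_node: int, num_chunks, executor: str) -> None:
    """Illustrative sketch of the suggested guard, not the real nemo-skills code."""
    if gpus_per_node <= 1:
        return  # single-instance mode needs no extra checks
    if num_chunks is None:
        raise ValueError("gpus_per_node > 1 requires num_chunks to be specified")
    if num_chunks % gpus_per_node != 0:
        raise ValueError(
            f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})"
        )
    if executor != "slurm":
        raise ValueError("gpus_per_node > 1 is only supported for slurm executor")
```

Failing fast here replaces the silent wrong-chunk behavior with an immediate, explainable error.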
🧹 Nitpick comments (3)
nemo_skills/pipeline/utils/generation.py (2)
527-531: Dead `output_file` reassignment — leftover from refactoring.

`output_file` is reassigned at Line 527 but nothing downstream in the `if chunk_id is not None:` block reads it (`job_end_cmd` now uses `donefiles[chunk_id]` directly). This assignment can be removed.

♻️ Proposed cleanup

```diff
 else:
-    output_file = get_chunked_rs_filename(output_dir, random_seed=random_seed, chunk_id=chunk_id)
-    if job_end_cmd:
+    if job_end_cmd:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/generation.py` around lines 527 - 531, Remove the dead reassignment of output_file: the call to get_chunked_rs_filename that sets output_file (in the block handling chunk_id) is unused later, so delete that assignment; keep the existing logic that appends or sets job_end_cmd using donefiles[chunk_id] and ensure no other code in the surrounding function (e.g., where job_end_cmd, donefiles, chunk_id are used) depends on output_file so removing the get_chunked_rs_filename call is safe.
650-659: `$SLURM_LOCALID` used without a default — style inconsistency with `server.py`.

`server.py` (Line 222) uses `${SLURM_LOCALID:-0}` for the same arithmetic, making the default explicit. Inside `$(( ))` bash expands an unset variable to `0`, so this is not a runtime bug, but aligning both sites aids readability.

♻️ Proposed consistency fix

```diff
-    f'"++server.port=$(({server_port} + $SLURM_LOCALID))" ++server.model={model} {extra_arguments}'
+    f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/generation.py` around lines 650 - 659, The code builds extra_arguments in the gpus_per_node > 1 branch using "$SLURM_LOCALID" without a default; update the f-string in the multi-instance branch (where gpus_per_node and extra_arguments are set and server_port/model are used) to use "${SLURM_LOCALID:-0}" inside the arithmetic expression so it matches the explicit default style used in server.py (i.e. replace "$(({server_port} + $SLURM_LOCALID))" with "$(({server_port} + ${SLURM_LOCALID:-0}))").

nemo_skills/pipeline/utils/eval.py (1)
277-282: Multi-instance validation logic is correct; consider shorter inline messages (Ruff TRY003).

Static analysis flags Lines 279 and 281 for long messages defined outside a custom exception class. The logic itself is correct; the TRY003 note is minor.

♻️ Optional: extract messages to constants or a custom error

```diff
+class MultiInstanceConfigError(ValueError):
+    pass
+
 if gpus_per_node > 1:
     if num_chunks is None:
-        raise ValueError("gpus_per_node > 1 requires num_chunks to be specified")
+        raise MultiInstanceConfigError("gpus_per_node > 1 requires num_chunks to be specified")
     if num_chunks % gpus_per_node != 0:
-        raise ValueError(f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})")
+        raise MultiInstanceConfigError(
+            f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})"
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 277 - 282, The multi-instance validation is fine but the long inline error/log messages trigger TRY003; shorten or extract them: in the block that checks gpus_per_node and num_chunks (references: gpus_per_node, num_chunks, LOG), replace the long f-strings with concise messages or move the full messages into module-level constants or a custom exception class and raise/use those constants (e.g., change raise ValueError(...) and LOG.info(...) to use shorter strings or NAME_ERROR_MSG constants) so the messages are not defined inline inside the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cluster_configs/eos_example.yaml`:
- Around line 47-48: The YAML sets mail_type: FAIL while mail_user is null which
lets exp.py (which only adds cluster_config.get("mail_user") when not None) emit
--mail-type=FAIL without --mail-user; update the cluster config by supplying a
non-null placeholder string for mail_user (e.g., your email or a clear
placeholder) so the SLURM sbatch invocation includes --mail-user and the intent
is explicit; ensure the mail_user field in this file is not empty to avoid
silent suppression by SLURM.
```python
if is_shell_expr:
    # For shell expression, compute the donefile path at runtime
    # Get the base pattern with _chunk_0 and replace with shell expression
    base_donefile = donefiles[0]  # e.g., /path/output_chunk_0.jsonl.done
    # Replace "_chunk_0.jsonl" with "_chunk_$((expr)).jsonl" where expr is expanded by shell
    # Extract the expression part (e.g., "0 + $SLURM_LOCALID" from "$((0 + $SLURM_LOCALID))")
    donefile_pattern = base_donefile.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl")
    if job_end_cmd:
        job_end_cmd += f' && touch "{donefile_pattern}" '
    else:
        job_end_cmd = f'touch "{donefile_pattern}" '
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Find and examine get_chunked_filename function
rg -n "def get_chunked_filename" --type py -A 10
# Also search for where donefiles is constructed
rg -n "donefiles\s*=" --type py | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 792

🏁 Script executed:

```bash
#!/bin/bash
# Read the specific file and lines mentioned in the review
cat -n nemo_skills/pipeline/utils/generation.py | sed -n '490,540p'
```

Repository: NVIDIA-NeMo/Skills
Length of output: 2945

🏁 Script executed:

```bash
#!/bin/bash
# Check how donefiles is built and what patterns are used
rg -n "donefiles" nemo_skills/pipeline/utils/generation.py -B 2 -A 2 | head -60
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1663

🏁 Script executed:

```bash
#!/bin/bash
# Find get_chunked_rs_filename function
rg -n "def get_chunked_rs_filename" --type py -A 8
```

Repository: NVIDIA-NeMo/Skills
Length of output: 704

🏁 Script executed:

```bash
#!/bin/bash
# Also check imports in generation.py to see where functions come from
head -50 nemo_skills/pipeline/utils/generation.py | grep -E "^import|^from"
```

Repository: NVIDIA-NeMo/Skills
Length of output: 502

🏁 Script executed:

```bash
#!/bin/bash
# Read the full get_chunked_rs_filename implementation
cat -n nemo_skills/pipeline/utils/generation.py | sed -n '152,175p'
```

Repository: NVIDIA-NeMo/Skills
Length of output: 947

🏁 Script executed:

```bash
#!/bin/bash
# Verify the exact pattern by checking what donefiles[0] would look like
# with different random_seed values
cat -n nemo_skills/pipeline/utils/generation.py | sed -n '510,525p'
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1159

🏁 Script executed:

```bash
#!/bin/bash
# Check if there are any tests or usages that show actual donefile paths
rg "donefile" --type py -A 2 -B 2 | head -40
```

Repository: NVIDIA-NeMo/Skills
Length of output: 3276

🏁 Script executed:

```bash
#!/bin/bash
# Let me trace through the exact pattern by simulating what get_chunked_filename produces
cat -n nemo_skills/utils.py | sed -n '383,386p'
```

Repository: NVIDIA-NeMo/Skills
Length of output: 244

🏁 Script executed:

```bash
#!/bin/bash
# Verify if there could be any edge cases where _chunk_0.jsonl appears elsewhere
rg "_chunk_0" --type py
```

Repository: NVIDIA-NeMo/Skills
Length of output: 694
Fragile string replacement for shell-expression donefile path — no error if pattern is absent.
`base_donefile.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl")` silently returns the original string when the pattern isn't found (e.g., a naming convention change) and could replace in a directory component if the output path itself contains `_chunk_0.jsonl`. The `touch` would then create a file at the wrong path and the subsequent `merge_chunks` would stall waiting for a done file that never appears.
Limit the replacement to the basename only and validate the pattern was found:
🛡️ Proposed fix

```diff
-base_donefile = donefiles[0]  # e.g., /path/output_chunk_0.jsonl.done
-# Replace "_chunk_0.jsonl" with "_chunk_$((expr)).jsonl" where expr is expanded by shell
-# Extract the expression part (e.g., "0 + $SLURM_LOCALID" from "$((0 + $SLURM_LOCALID))")
-donefile_pattern = base_donefile.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl")
+import os as _os
+base_donefile = donefiles[0]
+_dir, _fname = _os.path.split(base_donefile)
+_new_fname = _fname.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl", 1)
+if _new_fname == _fname:
+    raise RuntimeError(
+        f"Could not build shell-expression donefile from {base_donefile!r}; "
+        "'_chunk_0.jsonl' pattern not found in filename."
+    )
+donefile_pattern = _os.path.join(_dir, _new_fname)
```
+ donefile_pattern = _os.path.join(_dir, _new_fname)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/utils/generation.py` around lines 515 - 525, The current
logic in the is_shell_expr branch does a blind string replace on base_donefile
which can silently fail or modify directory components; change it to operate on
the filename only: extract the directory and basename from donefiles[0] (use the
basename for the replacement of "_chunk_0.jsonl" -> f"_chunk_{chunk_id}.jsonl"),
verify the basename contains the expected "_chunk_0.jsonl" pattern and raise or
log an error if not found, then rejoin dir + modified basename to form
donefile_pattern, and finally append that safe donefile_pattern to job_end_cmd
(symbols: is_shell_expr, base_donefile, donefiles, donefile_pattern, chunk_id,
job_end_cmd).
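A standalone sketch of the basename-only replacement the comment asks for (the helper name is hypothetical; re.subn makes the "pattern actually found" check explicit):

```python
import os
import re

def build_runtime_donefile(base_donefile: str, chunk_expr) -> str:
    """Replace the _chunk_0 marker in the filename only, erroring if it is absent.

    chunk_expr may be a literal chunk id or a shell expression such as
    "$((0 + $SLURM_LOCALID))" that the shell expands later at runtime.
    """
    dirname, fname = os.path.split(base_donefile)
    new_fname, n_found = re.subn(
        r"_chunk_0\.jsonl", f"_chunk_{chunk_expr}.jsonl", fname, count=1
    )
    if n_found == 0:
        raise RuntimeError(f"'_chunk_0.jsonl' pattern not found in filename {fname!r}")
    return os.path.join(dirname, new_fname)
```

Because only the basename is touched, a directory that happens to contain `_chunk_0.jsonl` in its name is left alone, and a naming-convention change fails loudly instead of producing a donefile at the wrong path.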
Force-pushed from 2db65dc to 7cd1e66
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Stop setting srun --wait by default; allow opt-in via cluster_config.srun_wait_seconds. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Add a large srun --wait for multi-instance runs to override nemo_run's default --wait=60, preventing premature termination when some ranks finish earlier. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Use a 1-hour default srun --wait for multi-instance runs to avoid premature task termination when chunk runtimes differ. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Force-pushed from 7cd1e66 to fbfdff9
Actionable comments posted: 2
🧹 Nitpick comments (2)
nemo_skills/pipeline/utils/server.py (1)
216-216: Shell quoting in echo is fragile but functional.

The echo statement interleaves single-quoted literals with unquoted variable expansions. While it works, it would break if `SLURM_LOCALID` or `SLURM_PROCID` contained spaces or special characters (unlikely for SLURM env vars, but worth noting).

♻️ Cleaner alternative

```diff
-    f"echo 'SLURM_LOCALID='$SLURM_LOCALID' SLURM_PROCID='$SLURM_PROCID && "
+    f'echo "SLURM_LOCALID=$SLURM_LOCALID SLURM_PROCID=$SLURM_PROCID" && '
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/server.py` at line 216, The echo fragment currently interleaves single-quoted literals and unquoted expansions (the part that prints SLURM_LOCALID and SLURM_PROCID), which is fragile for values with spaces/special characters; update the command to use a safely-quoted form (e.g., wrap the whole message in double quotes so the expansions are quoted, or use printf with '%s\n') so SLURM_LOCALID and SLURM_PROCID are emitted correctly even if they contain spaces or special chars.

nemo_skills/pipeline/utils/eval.py (1)
503-515: Consider using a dataclass or namedtuple for the job batch tuple.

The 9-element tuple is getting unwieldy and fragile — any reordering or addition requires synchronized changes in both the producer (here) and consumer (`eval.py` lines 606-616). The inline comment `# TODO: move to a dataclass` at line 502 acknowledges this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 503 - 515, Replace the fragile 9-tuple appended into job_batches with a dedicated dataclass (e.g., JobBatch) containing fields for job_cmds, job_benchmarks, job_needs_sandbox, job_needs_sandbox_to_keep_mounts, job_server_config, job_server_address, server_command_fn (from generation_task.get_server_command_fn()), job_sandbox_env_overrides, and gpus_per_node; update the producer (where job_batches.append(...) is called) to construct and append JobBatch instances and update the consumer code that currently unpacks the tuple (the block that iterates over job_batches and indexes those 9 elements) to access the named attributes (job_batch.job_cmds, job_batch.server_command_fn, etc.), and add any necessary import/definition of the dataclass near the top of eval.py so future reordering or additions are type-safe and readable.
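A sketch of what the suggested JobBatch dataclass could look like; the field names come from the prompt above, and the sample values are placeholders rather than real pipeline commands:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class JobBatch:
    """Named replacement for the 9-element job batch tuple (illustrative)."""
    job_cmds: list
    job_benchmarks: list
    job_needs_sandbox: bool
    job_needs_sandbox_to_keep_mounts: bool
    job_server_config: Optional[dict]
    job_server_address: Optional[str]
    server_command_fn: Optional[Callable]
    job_sandbox_env_overrides: dict = field(default_factory=dict)
    gpus_per_node: int = 1

batch = JobBatch(
    job_cmds=["<generation command>"],
    job_benchmarks=["<benchmark>"],
    job_needs_sandbox=False,
    job_needs_sandbox_to_keep_mounts=False,
    job_server_config=None,
    job_server_address="127.0.0.1:5000",
    server_command_fn=None,
    gpus_per_node=4,
)
```

The consumer then reads `batch.gpus_per_node` instead of unpacking position 8, so adding a tenth field cannot silently shift the meaning of the others.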
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 213-235: Add a validation in get_server_command to detect
unsupported multi-GPU-per-node configurations: if gpus_per_node > 1 and
server_type is not "generic" (i.e., for server_type values like "vllm",
"sglang", "trtllm", "megatron"), raise a clear ValueError explaining that
multi-instance mode is only supported for the "generic" server_type; place this
check near the top of get_server_command (after the existing parameter parsing
around line ~125) so the function fails fast rather than silently ignoring
gpus_per_node.
---
Duplicate comments:
In `@nemo_skills/pipeline/utils/generation.py`:
- Around line 515-525: The current fragile replacement using
base_donefile.replace("_chunk_0.jsonl", ...) can silently fail or alter
directory names; instead split base_donefile into dirname and basename, then
match/replace only the basename using a strict suffix or regex (e.g.,
r"_chunk_\d+\.jsonl$") to produce new_basename = re.sub(...,
f"_chunk_{chunk_id}.jsonl", basename); reconstruct donefile_pattern =
os.path.join(dirname, new_basename) and if the pattern isn't found, log or raise
an error rather than leaving the path unchanged; update the code that sets
donefile_pattern and job_end_cmd (references: is_shell_expr, base_donefile,
donefile_pattern, job_end_cmd, chunk_id) accordingly.
```python
if gpus_per_node > 1:
    # Multi-instance mode: port is computed at runtime based on SLURM_LOCALID
    extra_arguments = (
        f"++server.host=127.0.0.1 "
        f'"++server.port=$(({server_port} + $SLURM_LOCALID))" ++server.model={model} {extra_arguments}'
    )
```
Inconsistent SLURM_LOCALID fallback between server and client port computation.
server.py (line 222) uses ${SLURM_LOCALID:-0} (with fallback to 0), but here the client uses bare $SLURM_LOCALID without a default. If SLURM_LOCALID is unset and set -u (or equivalent) is active, the client command will fail while the server succeeds.
🛡️ Proposed fix — use consistent fallback

```diff
-        f'"++server.port=$(({server_port} + $SLURM_LOCALID))" ++server.model={model} {extra_arguments}'
+        f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'
```
+ f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if gpus_per_node > 1:
    # Multi-instance mode: port is computed at runtime based on SLURM_LOCALID
    extra_arguments = (
        f"++server.host=127.0.0.1 "
        f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'
    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/utils/generation.py` around lines 650 - 655, The
client-side port math in generation.py uses bare $SLURM_LOCALID which can fail
under set -u; update the string building that produces extra_arguments (the
branch where gpus_per_node > 1) to use the same fallback syntax as server.py,
replacing $SLURM_LOCALID with ${SLURM_LOCALID:-0} in the client port expression
(the reference symbols are extra_arguments, server_port and model) so the
computed port is robust when SLURM_LOCALID is unset.
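The behavioral difference between the two expansions is easy to reproduce without SLURM; this check (using a hypothetical base port of 5000) runs both expressions under bash's `set -u` with `SLURM_LOCALID` deliberately unset:

```python
import subprocess

def run_port_expr(expr: str) -> subprocess.CompletedProcess:
    # `bash -u` enables set -u; SLURM_LOCALID is unset, as on a non-SLURM executor.
    return subprocess.run(
        ["bash", "-uc", f"unset SLURM_LOCALID; echo $(({expr}))"],
        capture_output=True, text=True,
    )

safe = run_port_expr("5000 + ${SLURM_LOCALID:-0}")  # fallback: prints 5000
unsafe = run_port_expr("5000 + $SLURM_LOCALID")     # fails: unbound variable
```

With the fallback the command succeeds and yields the base port; the bare expansion aborts with an "unbound variable" error, which is exactly the asymmetry the review flags between server and client.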
🧹 Nitpick comments (2)
nemo_skills/pipeline/utils/eval.py (2)
460-463: Use `${SLURM_LOCALID:-0}` for consistency with server-side fallback.

The server side (`server.py` line 222) uses `${SLURM_LOCALID:-0}`, but here the client uses bare `$SLURM_LOCALID`. While `SLURM_LOCALID` is always set by `srun`, using a consistent fallback avoids arithmetic errors if this code path is ever exercised outside a SLURM context.

🛡️ Proposed fix

```diff
 effective_chunk_id = chunk_id
 if gpus_per_node > 1:
-    effective_chunk_id = f"$(({chunk_id} + $SLURM_LOCALID))"
+    effective_chunk_id = f"$(({chunk_id} + ${{SLURM_LOCALID:-0}}))"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 460 - 463, The shell arithmetic expression for multi-instance chunk IDs should use a safe SLURM fallback like the server does: update the logic in eval.py where effective_chunk_id is set (referencing variables chunk_id and gpus_per_node) so that when gpus_per_node > 1 you produce the expression using ${SLURM_LOCALID:-0} (e.g., f"$(({chunk_id} + ${SLURM_LOCALID:-0}))") instead of the bare $SLURM_LOCALID to avoid arithmetic failures outside SLURM.
503-514: Job batch tuple correctly extended with `gpus_per_node`.

The 9-element tuple structure here matches the destructuring in `eval.py` (lines 606-616). Consider migrating this to a dataclass as the TODO on line 502 suggests — this would prevent positional mismatch bugs as the tuple grows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 503 - 514, The job batch tuple appended to job_batches now includes gpus_per_node and risks positional-mismatch bugs as it grows; define a dataclass (e.g., JobBatch) with explicit fields for job_cmds, job_benchmarks, job_needs_sandbox, job_needs_sandbox_to_keep_mounts, job_server_config, job_server_address, server_command_fn, job_sandbox_env_overrides, and gpus_per_node, then replace the tuple construction in the job_batches.append(...) call with JobBatch(...) (referencing generation_task.get_server_command_fn() for server_command_fn) and update all sites that unpack the tuple (the destructuring around eval.py where job_batches is iterated/unpacked) to use attribute access (e.g., batch.job_cmds) instead of positional indices.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/pipeline/utils/generation.py`:
- Around line 515-525: The replacement of "_chunk_0.jsonl" is fragile because it
operates on the full path (variable base_donefile) — change the logic in the
is_shell_expr branch to extract dirname and basename (use donefiles[0] to derive
base_donefile), perform the "_chunk_0.jsonl" replacement only on the basename,
validate that the basename contained the expected pattern and handle the failure
(raise/log/error out) if not found, then rejoin dirname and the new basename
into donefile_pattern and append that safe full path to job_end_cmd (references:
is_shell_expr, base_donefile/donefiles, chunk_id, donefile_pattern, job_end_cmd,
merge_chunks).
- Around line 650-655: The client-side multi-instance port computation in
nemo_skills/pipeline/utils/generation.py uses bare $SLURM_LOCALID which can fail
if SLURM_LOCALID is unset; update the extra_arguments construction used when
gpus_per_node > 1 to use the same safe fallback syntax as the server (e.g.,
${SLURM_LOCALID:-0}) so the computed port string becomes robust and consistent
with server.py's handling of SLURM_LOCALID.
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 213-235: The function builds a multi-instance SLURM server only
for the "generic" server type but does not validate that gpus_per_node > 1 is
unsupported for other server types, causing client/server port mismatches; add a
fail-fast check near the top of the function that validates if gpus_per_node > 1
and server_type is not "generic" then raise a clear exception (e.g., ValueError)
mentioning server_type, gpus_per_node and that multi-instance mode is only
supported for "generic" (so callers in generation.py that offset ports per
SLURM_LOCALID won't silently fail); include this check before any
server_start_cmd logic that references server_entrypoint or SLURM_LOCALID.
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 220-242: Add a fast-fail validation immediately after the
gpus_per_node > 1 and server_type != "generic" guard to ensure gpus_per_node
does not exceed num_gpus: check the variables gpus_per_node and num_gpus and
raise a ValueError with a clear message like "gpus_per_node (X) cannot exceed
num_gpus (Y)" if gpus_per_node > num_gpus, so you avoid launching extra SLURM
tasks that bind to non-existent GPUs (this relates to the multi-instance path
that builds server_start_cmd using server_entrypoint and CUDA_VISIBLE_DEVICES).
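The requested fast-fail check can be sketched as a standalone function (argument names mirror the prompt; the real get_server_command signature may differ):

```python
def check_multi_instance_gpus(gpus_per_node: int, num_gpus: int, server_type: str) -> None:
    """Illustrative validation for multi-instance server startup."""
    if gpus_per_node <= 1:
        return  # single-instance mode: nothing to validate here
    if server_type != "generic":
        raise ValueError(
            f"gpus_per_node > 1 is only supported for server_type='generic', "
            f"got {server_type!r}"
        )
    if gpus_per_node > num_gpus:
        raise ValueError(
            f"gpus_per_node ({gpus_per_node}) cannot exceed num_gpus ({num_gpus})"
        )
```

Placed before the multi-instance `server_start_cmd` is built, this fails the job at submission time instead of binding SLURM tasks to GPUs that do not exist.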
```python
if gpus_per_node > 1:
    # Multi-instance mode: each SLURM task gets its own GPU and port
    server_start_cmd = (
        f"echo 'SLURM_LOCALID='$SLURM_LOCALID' SLURM_PROCID='$SLURM_PROCID && "
        f"export CUDA_VISIBLE_DEVICES=${{SLURM_LOCALID:-0}} && "
        f"{server_entrypoint} "
        f" --model {model_path} "
        f" --num_gpus 1 "
        f" --num_nodes 1 "
        f" --port $(({server_port} + ${{SLURM_LOCALID:-0}})) "
        f" {server_args} "
    )
    num_tasks = gpus_per_node
else:
    server_start_cmd = (
        f"{server_entrypoint} "
        f" --model {model_path} "
        f" --num_gpus {num_gpus} "
        f" --num_nodes {num_nodes} "
        f" --port {server_port} "
        f" {server_args} "
    )
    num_tasks = 1
```
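The per-task port arithmetic in the multi-instance branch can be sketched as follows; `base_port=5000` is an illustrative assumption, not the pipeline's actual default:

```python
import os

def resolve_task_port(base_port: int = 5000) -> int:
    # SLURM_LOCALID is the task's local rank on its node; defaulting to 0
    # keeps the computation valid when run outside a SLURM allocation,
    # mirroring the ${SLURM_LOCALID:-0} shell default above.
    local_id = int(os.environ.get("SLURM_LOCALID", "0"))
    return base_port + local_id

os.environ["SLURM_LOCALID"] = "3"
print(resolve_task_port())  # 5003
del os.environ["SLURM_LOCALID"]
print(resolve_task_port())  # 5000
```

Each of the `gpus_per_node` tasks thus binds its single visible GPU to a distinct port, so clients can address instances by offset.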
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the current file structure and the changes
cd /tmp/repo 2>/dev/null || cd .
# 1. Find the server.py file and verify it exists
find . -name "server.py" -path "*/pipeline/utils/*" 2>/dev/null | head -5
# 2. Check git diff if in a repo
if git rev-parse --git-dir > /dev/null 2>&1; then
echo "=== Git status ==="
git status --short
echo "=== Checking git diff for server.py ==="
git diff HEAD -- nemo_skills/pipeline/utils/server.py 2>/dev/null | head -150
else
echo "Not in a git repo"
fi
Repository: NVIDIA-NeMo/Skills
Length of output: 172
🏁 Script executed:
#!/bin/bash
# 3. Search for gpus_per_node in the codebase to see if it's validated
echo "=== Searching for gpus_per_node usage ==="
rg "gpus_per_node" --type py -B 3 -A 3
# 4. Search for any assertions or validations related to GPU counts
echo "=== Searching for GPU count validations ==="
rg "(gpus_per_node|num_gpus)" --type py | rg -E "(assert|raise|if.*<=|if.*>=|if.*<|if.*>)" -B 1 -A 1
Repository: NVIDIA-NeMo/Skills
Length of output: 21948
🏁 Script executed:
#!/bin/bash
# 5. Check the server.py file around lines 220-242 to see current state
echo "=== Reading server.py around lines 220-242 ==="
if [ -f "nemo_skills/pipeline/utils/server.py" ]; then
sed -n '210,250p' nemo_skills/pipeline/utils/server.py | cat -n
else
echo "File not found, searching..."
find . -name "server.py" -path "*/pipeline/utils/*" -exec sed -n '210,250p' {} \; 2>/dev/null
fi
Repository: NVIDIA-NeMo/Skills
Length of output: 2001
🏁 Script executed:
#!/bin/bash
# 6. Check how num_tasks is used after being set
echo "=== Checking num_tasks usage in server.py ==="
rg "num_tasks" nemo_skills/pipeline/utils/server.py -B 2 -A 2
# 7. Check exp.py for num_tasks usage
echo "=== Checking num_tasks in exp.py ==="
rg "num_tasks" nemo_skills/pipeline/utils/exp.py -B 2 -A 2 2>/dev/null || echo "exp.py patterns not found"
# 8. Search broader for num_tasks usage
echo "=== Broader num_tasks search ==="
rg "num_tasks" --type py -l 2>/dev/null | head -10
Repository: NVIDIA-NeMo/Skills
Length of output: 2518
🏁 Script executed:
#!/bin/bash
# 9. Look at the full function context around lines 220-242 to understand what this is
echo "=== Full context of the function ==="
if [ -f "nemo_skills/pipeline/utils/server.py" ]; then
sed -n '100,260p' nemo_skills/pipeline/utils/server.py | cat -n
fi
Repository: NVIDIA-NeMo/Skills
Length of output: 7162
Add validation for gpus_per_node <= num_gpus in multi-instance mode.
The num_tasks = 1 assignment for single-instance generic (line 143) is intentional and consistent with vllm, sglang, and trtllm implementations—no issue here.
However, the multi-instance path (lines 121–133) lacks validation that gpus_per_node ≤ num_gpus. If a user passes gpus_per_node=8 but num_gpus=4, the code launches 8 SLURM tasks with CUDA_VISIBLE_DEVICES=${SLURM_LOCALID:-0}, where SLURM_LOCALID ranges 0–7. Tasks with IDs 4–7 will fail at runtime trying to access non-existent GPUs. Add an explicit guard to fail fast with a clear error message:
if gpus_per_node > num_gpus:
raise ValueError(
f"gpus_per_node ({gpus_per_node}) cannot exceed num_gpus ({num_gpus})"
    )

Place this check immediately after the gpus_per_node > 1 and server_type != "generic" validation block.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/utils/server.py` around lines 220-242, add a fast-fail
validation immediately after the gpus_per_node > 1 and server_type != "generic"
guard to ensure gpus_per_node does not exceed num_gpus: check the variables
gpus_per_node and num_gpus and raise a ValueError with a clear message like
"gpus_per_node (X) cannot exceed num_gpus (Y)" if gpus_per_node > num_gpus, so
you avoid launching extra SLURM tasks that bind to non-existent GPUs (this
relates to the multi-instance path that builds server_start_cmd using
server_entrypoint and CUDA_VISIBLE_DEVICES).
Summary
- `gpus_per_node` multi-instance mode for running multiple model instances per node
- `SLURM_LOCALID` for multi-instance jobs
- `srun --wait` configuration to avoid premature killing of multi-instance tasks

Files changed

- `nemo_skills/pipeline/eval.py` — pass `gpus_per_node` through eval pipeline
- `nemo_skills/pipeline/utils/eval.py` — multi-instance chunking logic
- `nemo_skills/pipeline/utils/exp.py` — srun --wait for multi-instance jobs
- `nemo_skills/pipeline/utils/generation.py` — SLURM_LOCALID-based port in configure_client
- `nemo_skills/pipeline/utils/server.py` — multi-instance server config
- `cluster_configs/eos_example.yaml` — EOS cluster config example

Test plan

- `python -m pytest tests/` to verify no regressions
- Run with `gpus_per_node > 1` on a Slurm cluster

🤖 Generated with Claude Code
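The SLURM_LOCALID-based chunk indexing this PR introduces could be sketched like this; the helper name and signature are hypothetical, and only the emitted shell expression mirrors what the review discusses:

```python
# Hypothetical helper illustrating the multi-instance chunk-id expression;
# the function itself is not part of the actual nemo_skills code.
def effective_chunk_expr(chunk_id: int, gpus_per_node: int) -> str:
    # With multiple instances per node, each SLURM task offsets the base
    # chunk id by its local rank; the ${SLURM_LOCALID:-0} default keeps the
    # shell arithmetic valid when run outside a SLURM allocation.
    if gpus_per_node > 1:
        return f"$(({chunk_id} + ${{SLURM_LOCALID:-0}}))"
    return str(chunk_id)

print(effective_chunk_expr(4, 2))  # $((4 + ${SLURM_LOCALID:-0}))
print(effective_chunk_expr(4, 1))  # 4
```

In single-instance mode the expression stays a plain literal, so existing commands are unchanged.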
Summary by CodeRabbit
New Features
Improvements