
Add multi-instance pipeline support (gpus_per_node)#1247

Open
vmendelev wants to merge 7 commits into main from pr1/multi-instance-pipeline

Conversation

@vmendelev (Collaborator) commented Feb 18, 2026

Summary

  • Add gpus_per_node multi-instance mode for running multiple model instances per node
  • Dynamic port computation based on SLURM_LOCALID for multi-instance jobs
  • Add srun --wait configuration to avoid premature killing of multi-instance tasks
  • Set default multi-instance srun wait to 1 hour
  • Add EOS cluster config example

Files changed

  • nemo_skills/pipeline/eval.py — pass gpus_per_node through eval pipeline
  • nemo_skills/pipeline/utils/eval.py — multi-instance chunking logic
  • nemo_skills/pipeline/utils/exp.py — srun --wait for multi-instance jobs
  • nemo_skills/pipeline/utils/generation.py — SLURM_LOCALID-based port in configure_client
  • nemo_skills/pipeline/utils/server.py — multi-instance server config
  • cluster_configs/eos_example.yaml — EOS cluster config example

Test plan

  • Run python -m pytest tests/ to verify no regressions
  • Test multi-instance mode with gpus_per_node > 1 on a Slurm cluster
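As background for the test plan, multi-instance mode leans on per-task shell arithmetic: each of the N SLURM tasks on a node sees a distinct SLURM_LOCALID (0..N-1) and derives its own server port and chunk index from it. A minimal illustration (variable names are illustrative, not the exact commands the pipeline generates):

```shell
# Simulate task 2 of a 4-instance node; under srun this is set automatically.
SLURM_LOCALID=2
base_port=5000
base_chunk=8

# Per-instance server port and chunk index, computed by shell arithmetic
port=$((base_port + SLURM_LOCALID))
chunk=$((base_chunk + SLURM_LOCALID))
echo "port=$port chunk=$chunk"
```

Each task thus talks to its own single-GPU server instance and processes its own chunk, without any coordination beyond the SLURM-provided rank.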

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added multi-GPU per node server mode support with automatic validation and configuration
    • Implemented runtime port allocation for multi-instance deployments
    • Added dynamic chunk processing with shell expression support
  • Improvements

    • Optimized Slurm task scheduling with dynamic wait time configuration
    • Updated Slurm log file naming patterns for better compatibility

coderabbitai bot (Contributor) commented Feb 18, 2026

📝 Walkthrough

This PR introduces multi-GPU per node support to the evaluation pipeline. It adds a gpus_per_node parameter that propagates through task scheduling, validation, and command generation. Multi-instance mode adjusts chunk aggregation, modifies Slurm wait behavior, enables shell-based runtime port computation, and configures per-GPU server instances with isolated CUDA devices.

Changes

Cohort / File(s) — Summary

  • Evaluation Entry Point (nemo_skills/pipeline/eval.py): Added a gpus_per_node CLI parameter (default 1) and threaded it into prepare_eval_commands. During scheduling, extracts job_gpus_per_node from job args, sets num_tasks=job_gpus_per_node in task creation, and conditionally injects gpus_per_node into the server config when > 1.
  • Evaluation Command Preparation (nemo_skills/pipeline/utils/eval.py): Added a gpus_per_node parameter with multi-instance validation (num_chunks must be provided and be a multiple of gpus_per_node). Adjusted total_evals counting via base-chunk aggregation. Introduced base_chunks_to_run derived from benchmark_chunk_ids and computed effective_chunk_id using SLURM_LOCALID for shell-based indexing in multi-instance mode.
  • Slurm Executor Configuration (nemo_skills/pipeline/utils/exp.py): Updated srun log path patterns to include the task-ID placeholder (%t). Replaced the hardcoded --wait=10 with dynamic srun_wait_seconds logic: defaults to 3600 seconds when tasks_per_node > 1, otherwise uses the provided value. Appends --wait=<seconds> only when set.
  • Generation & Client Configuration (nemo_skills/pipeline/utils/generation.py): Added a gpus_per_node parameter to configure_client. Implemented shell-expression detection in get_generation_cmd: wraps chunk_id in quotes and uses SLURM_LOCALID-based indexing for runtime expansion. In multi-instance mode, computes the server port at runtime as base_port + SLURM_LOCALID, with shell quoting for dynamic expansion.
  • Server Command Configuration (nemo_skills/pipeline/utils/server.py): Added a gpus_per_node parameter with validation restricting multi-instance mode to the "generic" server type. In multi-instance mode, builds a per-SLURM-task startup that sets CUDA_VISIBLE_DEVICES to SLURM_LOCALID, runs single-GPU server instances, and computes the port as server_port + SLURM_LOCALID. Sets num_tasks=gpus_per_node for multi-instance, 1 otherwise.
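The validation and base-chunk aggregation described for nemo_skills/pipeline/utils/eval.py can be sketched as follows. This is a hypothetical helper: the function name, the `chunk_ids[::gpus_per_node]` selection, and the returned shell expressions are inferred from the summary above, not taken from the actual code.

```python
def plan_multi_instance_chunks(num_chunks, gpus_per_node, chunk_ids=None):
    """Sketch of the described validation and base-chunk derivation."""
    if gpus_per_node > 1:
        if num_chunks is None:
            raise ValueError("gpus_per_node > 1 requires num_chunks to be specified")
        if num_chunks % gpus_per_node != 0:
            raise ValueError(
                f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})"
            )
    chunk_ids = list(range(num_chunks)) if chunk_ids is None else chunk_ids
    # Each job covers gpus_per_node consecutive chunks; only the "base" chunk of
    # each group is scheduled, and SLURM_LOCALID selects the offset at runtime.
    base_chunks = chunk_ids[::gpus_per_node]
    effective_chunk_exprs = [f"$(({base} + $SLURM_LOCALID))" for base in base_chunks]
    return base_chunks, effective_chunk_exprs

bases, exprs = plan_multi_instance_chunks(num_chunks=8, gpus_per_node=4)
print(bases)     # [0, 4]
print(exprs[0])  # $((0 + $SLURM_LOCALID))
```

So a run with 8 chunks on 4-GPU nodes schedules only 2 jobs, each expanding into 4 runtime chunk IDs.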

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant EvalPipeline as eval.py
    participant Scheduler as Pipeline Scheduler
    participant JobPrep as prepare_eval_commands
    participant SlurmExec as Slurm Executor
    participant ServerSetup as Server Setup
    participant ClientSetup as Client Config
    participant SlurmTask as SLURM Task<br/>(per GPU instance)

    User->>EvalPipeline: eval(gpus_per_node=N)
    EvalPipeline->>JobPrep: prepare_eval_commands(gpus_per_node=N)
    JobPrep->>JobPrep: Validate num_chunks % N == 0
    JobPrep->>JobPrep: Derive base_chunks<br/>from benchmark_chunk_ids
    
    EvalPipeline->>Scheduler: Schedule tasks<br/>num_tasks=N per job
    Scheduler->>SlurmExec: Configure job<br/>tasks_per_node > 1
    SlurmExec->>SlurmExec: Set srun_wait=3600s<br/>for multi-instance
    
    Scheduler->>ServerSetup: get_server_command(gpus_per_node=N)
    ServerSetup->>ServerSetup: Build multi-instance startup<br/>per SLURM task
    ServerSetup->>SlurmTask: CUDA_VISIBLE_DEVICES=$SLURM_LOCALID<br/>port=$base_port+$SLURM_LOCALID
    
    Scheduler->>ClientSetup: configure_client(gpus_per_node=N)
    ClientSetup->>ClientSetup: Compute port at runtime<br/>=$base_port+$SLURM_LOCALID
    
    JobPrep->>JobPrep: Compute effective_chunk_id<br/>using $SLURM_LOCALID
    
    SlurmTask->>SlurmTask: Execute generation cmd<br/>with runtime chunk indexing
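The dynamic srun --wait selection described for nemo_skills/pipeline/utils/exp.py can be sketched as a small helper. The function name and return shape are illustrative assumptions mirroring the described defaults, not the actual code:

```python
def srun_wait_args(tasks_per_node, srun_wait_seconds=None):
    """Sketch of the described --wait logic: default to a 1-hour wait for
    multi-instance jobs, otherwise use the provided value (or nothing)."""
    if srun_wait_seconds is None and tasks_per_node > 1:
        # Multi-instance ranks can finish at very different times; a generous
        # wait prevents srun from killing slower ranks once the first exits.
        srun_wait_seconds = 3600
    return [f"--wait={srun_wait_seconds}"] if srun_wait_seconds is not None else []

print(srun_wait_args(tasks_per_node=4))          # ['--wait=3600']
print(srun_wait_args(tasks_per_node=1))          # []
print(srun_wait_args(tasks_per_node=4, srun_wait_seconds=60))  # ['--wait=60']
```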

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #906: Modifies the same CLI entrypoints (eval()) to add a new option and propagate it through task preparation and creation, indicating parallel feature integration patterns.
  • PR #1052: Directly related at the code level, modifying generation and server command paths with multi-instance and multi-model runtime wiring.
  • PR #1133: Refactors the same pipeline generation and server utilities (generation.py, server.py), so multi-instance changes touch areas previously refactored.

Suggested reviewers

  • gwarmstrong
  • activatedgeek
  • Kipok
🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 50.00%, below the required 80.00% threshold. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check: skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: the title 'Add multi-instance pipeline support (gpus_per_node)' directly and clearly summarizes the main change: a new multi-instance mode controlled by the gpus_per_node parameter across the evaluation pipeline.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)

620-632: ⚠️ Potential issue | 🟡 Minor

Silent misconfiguration risk: gpus_per_node > 1 with a non-SLURM executor.

When executor != "slurm", SLURM_LOCALID is unset and the shell expression $(({base_chunk} + $SLURM_LOCALID)) evaluates to {base_chunk} for every task (all tasks process the same chunk). Consider adding an early guard:

🛡️ Suggested guard in prepare_eval_commands (utils/eval.py)
 if gpus_per_node > 1:
     if num_chunks is None:
         raise ValueError("gpus_per_node > 1 requires num_chunks to be specified")
     if num_chunks % gpus_per_node != 0:
         raise ValueError(f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})")
+    if cluster_config.get("executor") != "slurm":
+        raise ValueError("gpus_per_node > 1 is only supported for slurm executor")
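To see why the guard matters, here is a small standalone sketch of how the chunk expression behaves with and without SLURM_LOCALID in the environment. It uses the ${SLURM_LOCALID:-0} fallback form so the expression stays valid when the variable is unset; the helper name is illustrative:

```python
import os
import subprocess

def eval_chunk_expr(base_chunk, localid=None):
    """Evaluate the runtime chunk expression the way a SLURM task's shell would."""
    env = {k: v for k, v in os.environ.items() if k != "SLURM_LOCALID"}
    if localid is not None:
        env["SLURM_LOCALID"] = str(localid)
    expr = f"echo $(({base_chunk} + ${{SLURM_LOCALID:-0}}))"
    out = subprocess.run(["bash", "-c", expr], env=env, capture_output=True, text=True)
    return int(out.stdout.strip())

# Under SLURM each task has a distinct SLURM_LOCALID, so chunks differ:
print([eval_chunk_expr(4, i) for i in range(3)])  # [4, 5, 6]

# Outside SLURM the variable is unset and every "task" gets the same chunk:
print([eval_chunk_expr(4) for _ in range(3)])     # [4, 4, 4]
```

The second line is exactly the silent misconfiguration the guard above would turn into a loud error.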
🧹 Nitpick comments (3)
nemo_skills/pipeline/utils/generation.py (2)

527-531: Dead output_file reassignment — leftover from refactoring.

output_file is reassigned at Line 527 but nothing downstream in the if chunk_id is not None: block reads it (job_end_cmd now uses donefiles[chunk_id] directly). This assignment can be removed.

♻️ Proposed cleanup
     else:
-        output_file = get_chunked_rs_filename(output_dir, random_seed=random_seed, chunk_id=chunk_id)
-        if job_end_cmd:
+        if job_end_cmd:

650-659: $SLURM_LOCALID used without a default — style inconsistency with server.py.

server.py (Line 222) uses ${SLURM_LOCALID:-0} for the same arithmetic, making the default explicit. Inside $(( )) bash expands an unset variable to 0, so this is not a runtime bug, but aligning both sites aids readability.

♻️ Proposed consistency fix
-            f'"++server.port=$(({server_port} + $SLURM_LOCALID))" ++server.model={model} {extra_arguments}'
+            f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'
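The behavioral difference between the two forms only surfaces under `set -u`. A quick standalone check (the helper name is illustrative):

```python
import os
import subprocess

def run_bash_strict(snippet):
    # Run under `set -u` (bash -u) with SLURM_LOCALID guaranteed unset.
    env = {k: v for k, v in os.environ.items() if k != "SLURM_LOCALID"}
    return subprocess.run(["bash", "-uc", snippet], env=env, capture_output=True, text=True)

# Bare $SLURM_LOCALID aborts under `set -u` when the variable is unset:
r1 = run_bash_strict("echo $((5000 + $SLURM_LOCALID))")
print(r1.returncode != 0)  # True

# The ${SLURM_LOCALID:-0} form falls back cleanly to the base port:
r2 = run_bash_strict("echo $((5000 + ${SLURM_LOCALID:-0}))")
print(r2.stdout.strip())  # 5000
```

Without `set -u`, bare-name arithmetic treats an unset variable as 0, which is why this is a style/robustness point rather than a runtime bug in the current code paths.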
nemo_skills/pipeline/utils/eval.py (1)

277-282: Multi-instance validation logic is correct; consider shorter inline messages (Ruff TRY003).

Static analysis flags Lines 279 and 281 for long messages defined outside a custom exception class. The logic itself is correct; the TRY003 note is minor.

♻️ Optional: extract messages to constants or a custom error
+class MultiInstanceConfigError(ValueError):
+    pass
+
 if gpus_per_node > 1:
     if num_chunks is None:
-        raise ValueError("gpus_per_node > 1 requires num_chunks to be specified")
+        raise MultiInstanceConfigError("gpus_per_node > 1 requires num_chunks to be specified")
     if num_chunks % gpus_per_node != 0:
-        raise ValueError(f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})")
+        raise MultiInstanceConfigError(
+            f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})"
+        )
🤖 Additional finding (from the aggregated review prompts)

In cluster_configs/eos_example.yaml (around lines 47-48): the config sets mail_type: FAIL while mail_user is null. Since exp.py only adds --mail-user when cluster_config.get("mail_user") is not None, the SLURM sbatch invocation can emit --mail-type=FAIL without --mail-user, and SLURM silently suppresses the notification. Supply a non-null mail_user (an email address or a clear placeholder) so --mail-user is included and the intent is explicit.

Comment on lines +515 to +525

    if is_shell_expr:
        # For shell expression, compute the donefile path at runtime
        # Get the base pattern with _chunk_0 and replace with shell expression
        base_donefile = donefiles[0]  # e.g., /path/output_chunk_0.jsonl.done
        # Replace "_chunk_0.jsonl" with "_chunk_$((expr)).jsonl" where expr is expanded by shell
        # Extract the expression part (e.g., "0 + $SLURM_LOCALID" from "$((0 + $SLURM_LOCALID))")
        donefile_pattern = base_donefile.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl")
        if job_end_cmd:
            job_end_cmd += f' && touch "{donefile_pattern}" '
        else:
            job_end_cmd = f'touch "{donefile_pattern}" '

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (outputs collapsed; repository: NVIDIA-NeMo/Skills):

#!/bin/bash
# Locate the chunked-filename helpers and donefile construction
rg -n "def get_chunked_filename" --type py -A 10
rg -n "def get_chunked_rs_filename" --type py -A 8
rg -n "donefiles\s*=" --type py | head -20
rg -n "donefiles" nemo_skills/pipeline/utils/generation.py -B 2 -A 2 | head -60
rg "donefile" --type py -A 2 -B 2 | head -40

# Inspect the relevant code regions and imports
cat -n nemo_skills/pipeline/utils/generation.py | sed -n '490,540p'
cat -n nemo_skills/pipeline/utils/generation.py | sed -n '152,175p'
cat -n nemo_skills/pipeline/utils/generation.py | sed -n '510,525p'
cat -n nemo_skills/utils.py | sed -n '383,386p'
head -50 nemo_skills/pipeline/utils/generation.py | grep -E "^import|^from"

# Check for edge cases where _chunk_0 appears elsewhere
rg "_chunk_0" --type py


Fragile string replacement for shell-expression donefile path — no error if pattern is absent.

base_donefile.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl") silently returns the original string when the pattern isn't found (e.g., naming convention change) and could replace in a directory component if the output path itself contains _chunk_0.jsonl. The touch would then create a file at the wrong path and the subsequent merge_chunks would stall waiting for a done file that never appears.

Limit the replacement to the basename only and validate the pattern was found:

🛡️ Proposed fix
-        base_donefile = donefiles[0]  # e.g., /path/output_chunk_0.jsonl.done
-        # Replace "_chunk_0.jsonl" with "_chunk_$((expr)).jsonl" where expr is expanded by shell
-        # Extract the expression part (e.g., "0 + $SLURM_LOCALID" from "$((0 + $SLURM_LOCALID))")
-        donefile_pattern = base_donefile.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl")
+        import os as _os
+        base_donefile = donefiles[0]
+        _dir, _fname = _os.path.split(base_donefile)
+        _new_fname = _fname.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl", 1)
+        if _new_fname == _fname:
+            raise RuntimeError(
+                f"Could not build shell-expression donefile from {base_donefile!r}; "
+                "'_chunk_0.jsonl' pattern not found in filename."
+            )
+        donefile_pattern = _os.path.join(_dir, _new_fname)
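A small standalone sketch of the failure mode and the basename-restricted fix (the helper name and the example paths are illustrative):

```python
import os

def safe_chunk_donefile(base_donefile, chunk_expr):
    """Basename-restricted replacement, per the proposed fix above."""
    dirname, fname = os.path.split(base_donefile)
    new_fname = fname.replace("_chunk_0.jsonl", f"_chunk_{chunk_expr}.jsonl", 1)
    if new_fname == fname:
        raise RuntimeError(f"'_chunk_0.jsonl' pattern not found in {fname!r}")
    return os.path.join(dirname, new_fname)

# str.replace silently returns the input when the pattern is absent:
print("output.jsonl.done".replace("_chunk_0.jsonl", "_chunk_X.jsonl"))  # output.jsonl.done

# And it can also rewrite a directory component that happens to match:
bad = "/runs/_chunk_0.jsonl_cache/output_chunk_0.jsonl.done"
print(bad.replace("_chunk_0.jsonl", "_chunk_X.jsonl").count("_chunk_X"))  # 2

# The basename-restricted version touches only the filename and fails loudly:
print(safe_chunk_donefile(bad, "$((0 + $SLURM_LOCALID))"))
```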

@rfejgin rfejgin force-pushed the pr1/multi-instance-pipeline branch from 2db65dc to 7cd1e66 on February 18, 2026 at 21:54
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Stop setting srun --wait by default; allow opt-in via cluster_config.srun_wait_seconds.

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Add a large srun --wait for multi-instance runs to override nemo_run's default --wait=60, preventing premature termination when some ranks finish earlier.

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Use a 1-hour default srun --wait for multi-instance runs to avoid premature task termination when chunk runtimes differ.

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
@rfejgin rfejgin force-pushed the pr1/multi-instance-pipeline branch from 7cd1e66 to fbfdff9 on February 18, 2026 at 21:58
coderabbitai bot (Contributor) left a comment
Actionable comments posted: 2

🧹 Nitpick comments (2)
nemo_skills/pipeline/utils/server.py (1)

216-216: Shell quoting in echo is fragile but functional.

The echo statement interleaves single-quoted literals with unquoted variable expansions. While it works, it would break if SLURM_LOCALID or SLURM_PROCID contained spaces or special characters (unlikely for SLURM env vars, but worth noting).

♻️ Cleaner alternative
-                f"echo 'SLURM_LOCALID='$SLURM_LOCALID' SLURM_PROCID='$SLURM_PROCID && "
+                f'echo "SLURM_LOCALID=$SLURM_LOCALID SLURM_PROCID=$SLURM_PROCID" && '
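For typical SLURM values both quoting styles print identical text; the difference is purely robustness, since the single-quote form leaves the expansions subject to word splitting and globbing. A quick standalone comparison (the environment values are illustrative):

```python
import subprocess

# Fragile original: single-quoted literals interleaved with unquoted expansions
fragile = "echo 'SLURM_LOCALID='$SLURM_LOCALID' SLURM_PROCID='$SLURM_PROCID"
# Cleaner alternative: one double-quoted string, expansions stay quoted
clean = 'echo "SLURM_LOCALID=$SLURM_LOCALID SLURM_PROCID=$SLURM_PROCID"'

env = {"SLURM_LOCALID": "3", "SLURM_PROCID": "7", "PATH": "/usr/bin:/bin"}
for cmd in (fragile, clean):
    out = subprocess.run(["bash", "-c", cmd], env=env, capture_output=True, text=True)
    print(out.stdout.strip())  # SLURM_LOCALID=3 SLURM_PROCID=7 (both forms)
```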
nemo_skills/pipeline/utils/eval.py (1)

503-515: Consider using a dataclass or namedtuple for the job batch tuple.

The 9-element tuple is getting unwieldy and fragile — any reordering or addition requires synchronized changes in both the producer (here) and consumer (eval.py lines 606-616). The inline comment # TODO: move to a dataclass at line 502 acknowledges this.
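A possible shape for such a dataclass, with field names inferred from this review comment rather than taken from the actual code:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

@dataclass
class JobBatch:
    """Hypothetical named replacement for the 9-element job batch tuple."""
    job_cmds: list
    job_benchmarks: set
    job_needs_sandbox: bool
    job_needs_sandbox_to_keep_mounts: bool
    job_server_config: Optional[dict]
    job_server_address: Optional[str]
    server_command_fn: Optional[Callable[..., Any]]
    job_sandbox_env_overrides: dict = field(default_factory=dict)
    gpus_per_node: int = 1

batch = JobBatch(
    job_cmds=["python -m nemo_skills.inference.generate ..."],
    job_benchmarks={"gsm8k"},
    job_needs_sandbox=False,
    job_needs_sandbox_to_keep_mounts=False,
    job_server_config=None,
    job_server_address="127.0.0.1:5000",
    server_command_fn=None,
    gpus_per_node=8,
)
print(batch.gpus_per_node)  # 8
```

Consumers then read `batch.gpus_per_node` instead of unpacking position 8, so adding a tenth field cannot silently shift the others.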

🤖 Additional finding (from the aggregated review prompts)

In nemo_skills/pipeline/utils/server.py (around lines 213-235): get_server_command should fail fast on unsupported multi-GPU-per-node configurations. If gpus_per_node > 1 and server_type is not "generic" (i.e., for server_type values like "vllm", "sglang", "trtllm", "megatron"), raise a clear ValueError explaining that multi-instance mode is only supported for the "generic" server type. Place the check near the top of get_server_command (after the existing parameter parsing around line ~125) so the function fails fast rather than silently ignoring gpus_per_node.

Comment on lines +650 to +655

    if gpus_per_node > 1:
        # Multi-instance mode: port is computed at runtime based on SLURM_LOCALID
        extra_arguments = (
            f"++server.host=127.0.0.1 "
            f'"++server.port=$(({server_port} + $SLURM_LOCALID))" ++server.model={model} {extra_arguments}'
        )

⚠️ Potential issue | 🟡 Minor

Inconsistent SLURM_LOCALID fallback between server and client port computation.

server.py (line 222) uses ${SLURM_LOCALID:-0} (with fallback to 0), but here the client uses bare $SLURM_LOCALID without a default. If SLURM_LOCALID is unset and set -u (or equivalent) is active, the client command will fail while the server succeeds.

🛡️ Proposed fix — use consistent fallback
-                f'"++server.port=$(({server_port} + $SLURM_LOCALID))" ++server.model={model} {extra_arguments}'
+                f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (before):

    if gpus_per_node > 1:
        # Multi-instance mode: port is computed at runtime based on SLURM_LOCALID
        extra_arguments = (
            f"++server.host=127.0.0.1 "
            f'"++server.port=$(({server_port} + $SLURM_LOCALID))" ++server.model={model} {extra_arguments}'
        )

Suggested change (after):

    if gpus_per_node > 1:
        # Multi-instance mode: port is computed at runtime based on SLURM_LOCALID
        extra_arguments = (
            f"++server.host=127.0.0.1 "
            f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'
        )

coderabbitai bot (Contributor) left a comment

🧹 Nitpick comments (2)
nemo_skills/pipeline/utils/eval.py (2)

460-463: Use ${SLURM_LOCALID:-0} for consistency with server-side fallback.

The server side (server.py line 222) uses ${SLURM_LOCALID:-0}, but here the client uses bare $SLURM_LOCALID. While SLURM_LOCALID is always set by srun, using a consistent fallback avoids arithmetic errors if this code path is ever exercised outside a SLURM context.

🛡️ Proposed fix
                 effective_chunk_id = chunk_id
                 if gpus_per_node > 1:
-                    effective_chunk_id = f"$(({chunk_id} + $SLURM_LOCALID))"
+                    effective_chunk_id = f"$(({chunk_id} + ${{SLURM_LOCALID:-0}}))"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/eval.py` around lines 460 - 463, The shell
arithmetic expression for multi-instance chunk IDs should use a safe SLURM
fallback like the server does: update the logic in eval.py where
effective_chunk_id is set (referencing variables chunk_id and gpus_per_node) so
that when gpus_per_node > 1 you produce the expression using ${SLURM_LOCALID:-0}
(e.g., f"$(({chunk_id} + ${{SLURM_LOCALID:-0}}))", with doubled braces so the f-string emits literal braces) instead of the bare
$SLURM_LOCALID to avoid arithmetic failures outside SLURM.

503-514: Job batch tuple correctly extended with gpus_per_node.

The 9-element tuple structure here matches the destructuring in eval.py (lines 606-616). Consider migrating this to a dataclass as the TODO on line 502 suggests — this would prevent positional mismatch bugs as the tuple grows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/eval.py` around lines 503 - 514, The job batch
tuple appended to job_batches now includes gpus_per_node and risks
positional-mismatch bugs as it grows; define a dataclass (e.g., JobBatch) with
explicit fields for job_cmds, job_benchmarks, job_needs_sandbox,
job_needs_sandbox_to_keep_mounts, job_server_config, job_server_address,
server_command_fn, job_sandbox_env_overrides, and gpus_per_node, then replace
the tuple construction in the job_batches.append(...) call with JobBatch(...)
(referencing generation_task.get_server_command_fn() for server_command_fn) and
update all sites that unpack the tuple (the destructuring around eval.py where
job_batches is iterated/unpacked) to use attribute access (e.g., batch.job_cmds)
instead of positional indices.
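A sketch of the dataclass migration suggested above; the field names mirror the tuple elements listed in the comment, but the exact types are assumptions, not the pipeline's real signatures:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JobBatch:
    """Named replacement for the 9-element job batch tuple (sketch only)."""
    job_cmds: list
    job_benchmarks: list
    job_needs_sandbox: bool
    job_needs_sandbox_to_keep_mounts: bool
    job_server_config: dict
    job_server_address: str
    server_command_fn: Callable
    job_sandbox_env_overrides: dict
    gpus_per_node: int = 1  # new field; defaults keep old call sites working

batch = JobBatch(
    job_cmds=["run_eval"], job_benchmarks=["aime24"],
    job_needs_sandbox=False, job_needs_sandbox_to_keep_mounts=False,
    job_server_config={}, job_server_address="127.0.0.1:5000",
    server_command_fn=lambda: "serve", job_sandbox_env_overrides={},
    gpus_per_node=8,
)
# Attribute access replaces fragile positional indexing:
print(batch.gpus_per_node)  # 8
```

Adding fields to a dataclass with defaults is then backward compatible, whereas growing a bare tuple silently shifts every later positional unpack.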
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@nemo_skills/pipeline/utils/generation.py`:
- Around line 515-525: The replacement of "_chunk_0.jsonl" is fragile because it
operates on the full path (variable base_donefile) — change the logic in the
is_shell_expr branch to extract dirname and basename (use donefiles[0] to derive
base_donefile), perform the "_chunk_0.jsonl" replacement only on the basename,
validate that the basename contained the expected pattern and handle the failure
(raise/log/error out) if not found, then rejoin dirname and the new basename
into donefile_pattern and append that safe full path to job_end_cmd (references:
is_shell_expr, base_donefile/donefiles, chunk_id, donefile_pattern, job_end_cmd,
merge_chunks).
- Around line 650-655: The client-side multi-instance port computation in
nemo_skills/pipeline/utils/generation.py uses bare $SLURM_LOCALID which can fail
if SLURM_LOCALID is unset; update the extra_arguments construction used when
gpus_per_node > 1 to use the same safe fallback syntax as the server (e.g.,
${SLURM_LOCALID:-0}) so the computed port string becomes robust and consistent
with server.py's handling of SLURM_LOCALID.

In `@nemo_skills/pipeline/utils/server.py`:
- Around line 213-235: The function builds a multi-instance SLURM server only
for the "generic" server type but does not validate that gpus_per_node > 1 is
unsupported for other server types, causing client/server port mismatches; add a
fail-fast check near the top of the function that validates if gpus_per_node > 1
and server_type is not "generic" then raise a clear exception (e.g., ValueError)
mentioning server_type, gpus_per_node and that multi-instance mode is only
supported for "generic" (so callers in generation.py that offset ports per
SLURM_LOCALID won't silently fail); include this check before any
server_start_cmd logic that references server_entrypoint or SLURM_LOCALID.

---

Nitpick comments:
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 460-463: The shell arithmetic expression for multi-instance chunk
IDs should use a safe SLURM fallback like the server does: update the logic in
eval.py where effective_chunk_id is set (referencing variables chunk_id and
gpus_per_node) so that when gpus_per_node > 1 you produce the expression using
${SLURM_LOCALID:-0} (e.g., f"$(({chunk_id} + ${{SLURM_LOCALID:-0}}))", with doubled braces so the f-string emits literal braces) instead of
the bare $SLURM_LOCALID to avoid arithmetic failures outside SLURM.
- Around line 503-514: The job batch tuple appended to job_batches now includes
gpus_per_node and risks positional-mismatch bugs as it grows; define a dataclass
(e.g., JobBatch) with explicit fields for job_cmds, job_benchmarks,
job_needs_sandbox, job_needs_sandbox_to_keep_mounts, job_server_config,
job_server_address, server_command_fn, job_sandbox_env_overrides, and
gpus_per_node, then replace the tuple construction in the
job_batches.append(...) call with JobBatch(...) (referencing
generation_task.get_server_command_fn() for server_command_fn) and update all
sites that unpack the tuple (the destructuring around eval.py where job_batches
is iterated/unpacked) to use attribute access (e.g., batch.job_cmds) instead of
positional indices.

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 220-242: Add a fast-fail validation immediately after the
gpus_per_node > 1 and server_type != "generic" guard to ensure gpus_per_node
does not exceed num_gpus: check the variables gpus_per_node and num_gpus and
raise a ValueError with a clear message like "gpus_per_node (X) cannot exceed
num_gpus (Y)" if gpus_per_node > num_gpus, so you avoid launching extra SLURM
tasks that bind to non-existent GPUs (this relates to the multi-instance path
that builds server_start_cmd using server_entrypoint and CUDA_VISIBLE_DEVICES).

Comment on lines +220 to +242
    if gpus_per_node > 1:
        # Multi-instance mode: each SLURM task gets its own GPU and port
        server_start_cmd = (
            f"echo 'SLURM_LOCALID='$SLURM_LOCALID' SLURM_PROCID='$SLURM_PROCID && "
            f"export CUDA_VISIBLE_DEVICES=${{SLURM_LOCALID:-0}} && "
            f"{server_entrypoint} "
            f" --model {model_path} "
            f" --num_gpus 1 "
            f" --num_nodes 1 "
            f" --port $(({server_port} + ${{SLURM_LOCALID:-0}})) "
            f" {server_args} "
        )
        num_tasks = gpus_per_node
    else:
        server_start_cmd = (
            f"{server_entrypoint} "
            f" --model {model_path} "
            f" --num_gpus {num_gpus} "
            f" --num_nodes {num_nodes} "
            f" --port {server_port} "
            f" {server_args} "
        )
        num_tasks = 1
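To see what the multi-instance branch yields per task, a small sketch (illustrative values, not the pipeline's API) emulating the per-rank GPU and port assignment:

```python
# Sketch: each SLURM task (local rank) binds one GPU and one unique port,
# mirroring the CUDA_VISIBLE_DEVICES / port arithmetic in the branch above.
server_port = 5000   # illustrative base port
gpus_per_node = 4    # illustrative instance count

assignments = [
    {"local_id": lid, "cuda_visible_devices": str(lid), "port": server_port + lid}
    for lid in range((gpus_per_node))
]
for a in assignments:
    print(a)  # one isolated GPU and port per task, no collisions
```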
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current file structure and the changes
cd /tmp/repo 2>/dev/null || cd . 

# 1. Find the server.py file and verify it exists
find . -name "server.py" -path "*/pipeline/utils/*" 2>/dev/null | head -5

# 2. Check git diff if in a repo
if git rev-parse --git-dir > /dev/null 2>&1; then
  echo "=== Git status ==="
  git status --short
  echo "=== Checking git diff for server.py ==="
  git diff HEAD -- nemo_skills/pipeline/utils/server.py 2>/dev/null | head -150
else
  echo "Not in a git repo"
fi

Repository: NVIDIA-NeMo/Skills

Length of output: 172


🏁 Script executed:

#!/bin/bash
# 3. Search for gpus_per_node in the codebase to see if it's validated
echo "=== Searching for gpus_per_node usage ==="
rg "gpus_per_node" --type py -B 3 -A 3

# 4. Search for any assertions or validations related to GPU counts
echo "=== Searching for GPU count validations ==="
rg "(gpus_per_node|num_gpus)" --type py | rg -E "(assert|raise|if.*<=|if.*>=|if.*<|if.*>)" -B 1 -A 1

Repository: NVIDIA-NeMo/Skills

Length of output: 21948


🏁 Script executed:

#!/bin/bash
# 5. Check the server.py file around lines 220-242 to see current state
echo "=== Reading server.py around lines 220-242 ==="
if [ -f "nemo_skills/pipeline/utils/server.py" ]; then
  sed -n '210,250p' nemo_skills/pipeline/utils/server.py | cat -n
else
  echo "File not found, searching..."
  find . -name "server.py" -path "*/pipeline/utils/*" -exec sed -n '210,250p' {} \; 2>/dev/null
fi

Repository: NVIDIA-NeMo/Skills

Length of output: 2001


🏁 Script executed:

#!/bin/bash
# 6. Check how num_tasks is used after being set
echo "=== Checking num_tasks usage in server.py ==="
rg "num_tasks" nemo_skills/pipeline/utils/server.py -B 2 -A 2

# 7. Check exp.py for num_tasks usage
echo "=== Checking num_tasks in exp.py ==="
rg "num_tasks" nemo_skills/pipeline/utils/exp.py -B 2 -A 2 2>/dev/null || echo "exp.py patterns not found"

# 8. Search broader for num_tasks usage
echo "=== Broader num_tasks search ==="
rg "num_tasks" --type py -l 2>/dev/null | head -10

Repository: NVIDIA-NeMo/Skills

Length of output: 2518


🏁 Script executed:

#!/bin/bash
# 9. Look at the full function context around lines 220-242 to understand what this is
echo "=== Full context of the function ==="
if [ -f "nemo_skills/pipeline/utils/server.py" ]; then
  sed -n '100,260p' nemo_skills/pipeline/utils/server.py | cat -n
fi

Repository: NVIDIA-NeMo/Skills

Length of output: 7162


Add validation for gpus_per_node <= num_gpus in multi-instance mode.

The num_tasks = 1 assignment for single-instance generic (line 143) is intentional and consistent with vllm, sglang, and trtllm implementations—no issue here.

However, the multi-instance path (lines 121–133) lacks validation that gpus_per_node ≤ num_gpus. If a user passes gpus_per_node=8 but num_gpus=4, the code launches 8 SLURM tasks with CUDA_VISIBLE_DEVICES=${SLURM_LOCALID:-0}, where SLURM_LOCALID ranges 0–7. Tasks with IDs 4–7 will fail at runtime trying to access non-existent GPUs. Add an explicit guard to fail fast with a clear error message:

if gpus_per_node > num_gpus:
    raise ValueError(
        f"gpus_per_node ({gpus_per_node}) cannot exceed num_gpus ({num_gpus})"
    )

Place this check immediately after the gpus_per_node > 1 and server_type != "generic" validation block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/server.py` around lines 220 - 242, Add a fast-fail
validation immediately after the gpus_per_node > 1 and server_type != "generic"
guard to ensure gpus_per_node does not exceed num_gpus: check the variables
gpus_per_node and num_gpus and raise a ValueError with a clear message like
"gpus_per_node (X) cannot exceed num_gpus (Y)" if gpus_per_node > num_gpus, so
you avoid launching extra SLURM tasks that bind to non-existent GPUs (this
relates to the multi-instance path that builds server_start_cmd using
server_entrypoint and CUDA_VISIBLE_DEVICES).
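Taken together with the earlier server-type guard, the two fail-fast checks could look like this sketch (`validate_multi_instance` is an illustrative name, not the pipeline's actual API):

```python
def validate_multi_instance(server_type: str, gpus_per_node: int, num_gpus: int) -> None:
    """Sketch of the fail-fast checks proposed in the review."""
    # Guard 1: multi-instance mode only makes sense for the generic server,
    # since only that path offsets ports by SLURM_LOCALID.
    if gpus_per_node > 1 and server_type != "generic":
        raise ValueError(
            f"multi-instance mode (gpus_per_node={gpus_per_node}) is only "
            f"supported for server_type='generic', got {server_type!r}"
        )
    # Guard 2: never launch more tasks than there are GPUs to bind.
    if gpus_per_node > num_gpus:
        raise ValueError(
            f"gpus_per_node ({gpus_per_node}) cannot exceed num_gpus ({num_gpus})"
        )

validate_multi_instance("generic", 4, 8)  # valid configuration passes silently
```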
