Add multi-instance pipeline support (gpus_per_node)#1247
📝 Walkthrough

This PR introduces multi-GPU per node support to the evaluation pipeline.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant EvalPipeline as eval.py
    participant Scheduler as Pipeline Scheduler
    participant JobPrep as prepare_eval_commands
    participant SlurmExec as Slurm Executor
    participant ServerSetup as Server Setup
    participant ClientSetup as Client Config
    participant SlurmTask as SLURM Task<br/>(per GPU instance)
    User->>EvalPipeline: eval(gpus_per_node=N)
    EvalPipeline->>JobPrep: prepare_eval_commands(gpus_per_node=N)
    JobPrep->>JobPrep: Validate num_chunks % N == 0
    JobPrep->>JobPrep: Derive base_chunks<br/>from benchmark_chunk_ids
    EvalPipeline->>Scheduler: Schedule tasks<br/>num_tasks=N per job
    Scheduler->>SlurmExec: Configure job<br/>tasks_per_node > 1
    SlurmExec->>SlurmExec: Set srun_wait=3600s<br/>for multi-instance
    Scheduler->>ServerSetup: get_server_command(gpus_per_node=N)
    ServerSetup->>ServerSetup: Build multi-instance startup<br/>per SLURM task
    ServerSetup->>SlurmTask: CUDA_VISIBLE_DEVICES=$SLURM_LOCALID<br/>port=$base_port+$SLURM_LOCALID
    Scheduler->>ClientSetup: configure_client(gpus_per_node=N)
    ClientSetup->>ClientSetup: Compute port at runtime<br/>=$base_port+$SLURM_LOCALID
    JobPrep->>JobPrep: Compute effective_chunk_id<br/>using $SLURM_LOCALID
    SlurmTask->>SlurmTask: Execute generation cmd<br/>with runtime chunk indexing
```
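The chunk-to-task mapping the diagram describes can be illustrated in plain Python (the num_chunks=8 / gpus_per_node=4 values below are just an example, not taken from the PR):

```python
def chunk_for_task(base_chunk: int, slurm_localid: int) -> int:
    # Mirrors the runtime shell expression $((base_chunk + SLURM_LOCALID)):
    # each SLURM task offsets its job's base chunk by its local rank.
    return base_chunk + slurm_localid

# With num_chunks=8 and gpus_per_node=4, two jobs are scheduled with base
# chunks 0 and 4, and the 4 tasks of each job cover consecutive chunks.
assignments = [chunk_for_task(base, lid) for base in (0, 4) for lid in range(4)]
print(assignments)  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```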
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)
620-632: ⚠️ Potential issue | 🟡 Minor

Silent misconfiguration risk: `gpus_per_node > 1` with a non-SLURM executor.

When `executor != "slurm"`, `SLURM_LOCALID` is unset and the shell expression `$(({base_chunk} + $SLURM_LOCALID))` evaluates to `{base_chunk}` for every task (all tasks process the same chunk). Consider adding an early guard:

🛡️ Suggested guard in `prepare_eval_commands` (`utils/eval.py`)

```diff
 if gpus_per_node > 1:
     if num_chunks is None:
         raise ValueError("gpus_per_node > 1 requires num_chunks to be specified")
     if num_chunks % gpus_per_node != 0:
         raise ValueError(f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})")
+    if cluster_config.get("executor") != "slurm":
+        raise ValueError("gpus_per_node > 1 is only supported for slurm executor")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/eval.py` around lines 620 - 632, In prepare_eval_commands (utils/eval.py) add a guard that prevents using multiple GPUs per node with non-SLURM executors: detect when job_gpus_per_node > 1 and executor != "slurm" and either raise a clear error or force job_gpus_per_node = 1 (and log) to avoid the incorrect reuse of the same chunk due to SLURM_LOCALID being unset; reference the symbols job_gpus_per_node, executor, SLURM_LOCALID and base_chunk so the check is placed before commands that compute $(({base_chunk} + $SLURM_LOCALID)) or before setting num_tasks in add_task.
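The guard above can be tried out as a standalone function; the name validate_multi_instance_config and its signature are illustrative, while the error messages follow the suggested diff:

```python
def validate_multi_instance_config(gpus_per_node: int, num_chunks, executor: str) -> None:
    """Illustrative sketch of the suggested guard, not the real nemo-skills code."""
    if gpus_per_node <= 1:
        return  # single-instance mode needs no extra checks
    if num_chunks is None:
        raise ValueError("gpus_per_node > 1 requires num_chunks to be specified")
    if num_chunks % gpus_per_node != 0:
        raise ValueError(
            f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})"
        )
    if executor != "slurm":
        raise ValueError("gpus_per_node > 1 is only supported for slurm executor")
```

Failing fast here replaces the silent wrong-chunk behavior with an immediate, explainable error.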
🧹 Nitpick comments (3)
nemo_skills/pipeline/utils/generation.py (2)
527-531: Dead `output_file` reassignment — leftover from refactoring.

`output_file` is reassigned at Line 527 but nothing downstream in the `if chunk_id is not None:` block reads it (`job_end_cmd` now uses `donefiles[chunk_id]` directly). This assignment can be removed.

♻️ Proposed cleanup

```diff
 else:
-    output_file = get_chunked_rs_filename(output_dir, random_seed=random_seed, chunk_id=chunk_id)
-    if job_end_cmd:
+    if job_end_cmd:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/generation.py` around lines 527 - 531, Remove the dead reassignment of output_file: the call to get_chunked_rs_filename that sets output_file (in the block handling chunk_id) is unused later, so delete that assignment; keep the existing logic that appends or sets job_end_cmd using donefiles[chunk_id] and ensure no other code in the surrounding function (e.g., where job_end_cmd, donefiles, chunk_id are used) depends on output_file so removing the get_chunked_rs_filename call is safe.
650-659: `$SLURM_LOCALID` used without a default — style inconsistency with `server.py`.

`server.py` (Line 222) uses `${SLURM_LOCALID:-0}` for the same arithmetic, making the default explicit. Inside `$(( ))` bash expands an unset variable to `0`, so this is not a runtime bug, but aligning both sites aids readability.

♻️ Proposed consistency fix

```diff
-    f'"++server.port=$(({server_port} + $SLURM_LOCALID))" ++server.model={model} {extra_arguments}'
+    f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/generation.py` around lines 650 - 659, The code builds extra_arguments in the gpus_per_node > 1 branch using "$SLURM_LOCALID" without a default; update the f-string in the multi-instance branch (where gpus_per_node and extra_arguments are set and server_port/model are used) to use "${SLURM_LOCALID:-0}" inside the arithmetic expression so it matches the explicit default style used in server.py (i.e. replace "$(({server_port} + $SLURM_LOCALID))" with "$(({server_port} + ${SLURM_LOCALID:-0}))").

nemo_skills/pipeline/utils/eval.py (1)
277-282: Multi-instance validation logic is correct; consider shorter inline messages (Ruff TRY003).

Static analysis flags Lines 279 and 281 for long messages defined outside a custom exception class. The logic itself is correct; the TRY003 note is minor.

♻️ Optional: extract messages to constants or a custom error

```diff
+class MultiInstanceConfigError(ValueError):
+    pass
+
 if gpus_per_node > 1:
     if num_chunks is None:
-        raise ValueError("gpus_per_node > 1 requires num_chunks to be specified")
+        raise MultiInstanceConfigError("gpus_per_node > 1 requires num_chunks to be specified")
     if num_chunks % gpus_per_node != 0:
-        raise ValueError(f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})")
+        raise MultiInstanceConfigError(
+            f"num_chunks ({num_chunks}) must be a multiple of gpus_per_node ({gpus_per_node})"
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 277 - 282, The multi-instance validation is fine but the long inline error/log messages trigger TRY003; shorten or extract them: in the block that checks gpus_per_node and num_chunks (references: gpus_per_node, num_chunks, LOG), replace the long f-strings with concise messages or move the full messages into module-level constants or a custom exception class and raise/use those constants (e.g., change raise ValueError(...) and LOG.info(...) to use shorter strings or NAME_ERROR_MSG constants) so the messages are not defined inline inside the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cluster_configs/eos_example.yaml`:
- Around line 47-48: The YAML sets mail_type: FAIL while mail_user is null which
lets exp.py (which only adds cluster_config.get("mail_user") when not None) emit
--mail-type=FAIL without --mail-user; update the cluster config by supplying a
non-null placeholder string for mail_user (e.g., your email or a clear
placeholder) so the SLURM sbatch invocation includes --mail-user and the intent
is explicit; ensure the mail_user field in this file is not empty to avoid
silent suppression by SLURM.
```python
if is_shell_expr:
    # For shell expression, compute the donefile path at runtime
    # Get the base pattern with _chunk_0 and replace with shell expression
    base_donefile = donefiles[0]  # e.g., /path/output_chunk_0.jsonl.done
    # Replace "_chunk_0.jsonl" with "_chunk_$((expr)).jsonl" where expr is expanded by shell
    # Extract the expression part (e.g., "0 + $SLURM_LOCALID" from "$((0 + $SLURM_LOCALID))")
    donefile_pattern = base_donefile.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl")
    if job_end_cmd:
        job_end_cmd += f' && touch "{donefile_pattern}" '
    else:
        job_end_cmd = f'touch "{donefile_pattern}" '
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Find and examine get_chunked_filename function
rg -n "def get_chunked_filename" --type py -A 10
# Also search for where donefiles is constructed
rg -n "donefiles\s*=" --type py | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 792

🏁 Script executed:

```bash
#!/bin/bash
# Read the specific file and lines mentioned in the review
cat -n nemo_skills/pipeline/utils/generation.py | sed -n '490,540p'
```

Repository: NVIDIA-NeMo/Skills
Length of output: 2945

🏁 Script executed:

```bash
#!/bin/bash
# Check how donefiles is built and what patterns are used
rg -n "donefiles" nemo_skills/pipeline/utils/generation.py -B 2 -A 2 | head -60
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1663

🏁 Script executed:

```bash
#!/bin/bash
# Find get_chunked_rs_filename function
rg -n "def get_chunked_rs_filename" --type py -A 8
```

Repository: NVIDIA-NeMo/Skills
Length of output: 704

🏁 Script executed:

```bash
#!/bin/bash
# Also check imports in generation.py to see where functions come from
head -50 nemo_skills/pipeline/utils/generation.py | grep -E "^import|^from"
```

Repository: NVIDIA-NeMo/Skills
Length of output: 502

🏁 Script executed:

```bash
#!/bin/bash
# Read the full get_chunked_rs_filename implementation
cat -n nemo_skills/pipeline/utils/generation.py | sed -n '152,175p'
```

Repository: NVIDIA-NeMo/Skills
Length of output: 947

🏁 Script executed:

```bash
#!/bin/bash
# Verify the exact pattern by checking what donefiles[0] would look like
# with different random_seed values
cat -n nemo_skills/pipeline/utils/generation.py | sed -n '510,525p'
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1159

🏁 Script executed:

```bash
#!/bin/bash
# Check if there are any tests or usages that show actual donefile paths
rg "donefile" --type py -A 2 -B 2 | head -40
```

Repository: NVIDIA-NeMo/Skills
Length of output: 3276

🏁 Script executed:

```bash
#!/bin/bash
# Let me trace through the exact pattern by simulating what get_chunked_filename produces
cat -n nemo_skills/utils.py | sed -n '383,386p'
```

Repository: NVIDIA-NeMo/Skills
Length of output: 244

🏁 Script executed:

```bash
#!/bin/bash
# Verify if there could be any edge cases where _chunk_0.jsonl appears elsewhere
rg "_chunk_0" --type py
```

Repository: NVIDIA-NeMo/Skills
Length of output: 694
Fragile string replacement for shell-expression donefile path — no error if pattern is absent.
`base_donefile.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl")` silently returns the original string when the pattern isn't found (e.g., a naming convention change) and could replace in a directory component if the output path itself contains `_chunk_0.jsonl`. The `touch` would then create a file at the wrong path and the subsequent `merge_chunks` would stall waiting for a done file that never appears.
Limit the replacement to the basename only and validate the pattern was found:
🛡️ Proposed fix

```diff
-base_donefile = donefiles[0]  # e.g., /path/output_chunk_0.jsonl.done
-# Replace "_chunk_0.jsonl" with "_chunk_$((expr)).jsonl" where expr is expanded by shell
-# Extract the expression part (e.g., "0 + $SLURM_LOCALID" from "$((0 + $SLURM_LOCALID))")
-donefile_pattern = base_donefile.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl")
+import os as _os
+base_donefile = donefiles[0]
+_dir, _fname = _os.path.split(base_donefile)
+_new_fname = _fname.replace("_chunk_0.jsonl", f"_chunk_{chunk_id}.jsonl", 1)
+if _new_fname == _fname:
+    raise RuntimeError(
+        f"Could not build shell-expression donefile from {base_donefile!r}; "
+        "'_chunk_0.jsonl' pattern not found in filename."
+    )
+donefile_pattern = _os.path.join(_dir, _new_fname)
```
+ donefile_pattern = _os.path.join(_dir, _new_fname)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/utils/generation.py` around lines 515 - 525, The current
logic in the is_shell_expr branch does a blind string replace on base_donefile
which can silently fail or modify directory components; change it to operate on
the filename only: extract the directory and basename from donefiles[0] (use the
basename for the replacement of "_chunk_0.jsonl" -> f"_chunk_{chunk_id}.jsonl"),
verify the basename contains the expected "_chunk_0.jsonl" pattern and raise or
log an error if not found, then rejoin dir + modified basename to form
donefile_pattern, and finally append that safe donefile_pattern to job_end_cmd
(symbols: is_shell_expr, base_donefile, donefiles, donefile_pattern, chunk_id,
job_end_cmd).
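A standalone sketch of the basename-only replacement the comment asks for (the helper name is hypothetical; re.subn makes the "pattern actually found" check explicit):

```python
import os
import re

def build_runtime_donefile(base_donefile: str, chunk_expr) -> str:
    """Replace the _chunk_0 marker in the filename only, erroring if it is absent.

    chunk_expr may be a literal chunk id or a shell expression such as
    "$((0 + $SLURM_LOCALID))" that the shell expands later at runtime.
    """
    dirname, fname = os.path.split(base_donefile)
    new_fname, n_found = re.subn(
        r"_chunk_0\.jsonl", f"_chunk_{chunk_expr}.jsonl", fname, count=1
    )
    if n_found == 0:
        raise RuntimeError(f"'_chunk_0.jsonl' pattern not found in filename {fname!r}")
    return os.path.join(dirname, new_fname)
```

Because only the basename is touched, a directory that happens to contain `_chunk_0.jsonl` in its name is left alone, and a naming-convention change fails loudly instead of producing a donefile at the wrong path.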
Force-pushed from 2db65dc to 7cd1e66
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Stop setting srun --wait by default; allow opt-in via cluster_config.srun_wait_seconds. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Add a large srun --wait for multi-instance runs to override nemo_run's default --wait=60, preventing premature termination when some ranks finish earlier. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Use a 1-hour default srun --wait for multi-instance runs to avoid premature task termination when chunk runtimes differ. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Force-pushed from 7cd1e66 to fbfdff9
Actionable comments posted: 2
🧹 Nitpick comments (2)
nemo_skills/pipeline/utils/server.py (1)
216-216: Shell quoting in echo is fragile but functional.

The echo statement interleaves single-quoted literals with unquoted variable expansions. While it works, it would break if `SLURM_LOCALID` or `SLURM_PROCID` contained spaces or special characters (unlikely for SLURM env vars, but worth noting).

♻️ Cleaner alternative

```diff
-    f"echo 'SLURM_LOCALID='$SLURM_LOCALID' SLURM_PROCID='$SLURM_PROCID && "
+    f'echo "SLURM_LOCALID=$SLURM_LOCALID SLURM_PROCID=$SLURM_PROCID" && '
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/server.py` at line 216, The echo fragment currently interleaves single-quoted literals and unquoted expansions (the part that prints SLURM_LOCALID and SLURM_PROCID), which is fragile for values with spaces/special characters; update the command to use a safely-quoted form (e.g., wrap the whole message in double quotes so the expansions are quoted, or use printf with '%s\n') so SLURM_LOCALID and SLURM_PROCID are emitted correctly even if they contain spaces or special chars.

nemo_skills/pipeline/utils/eval.py (1)
503-515: Consider using a dataclass or namedtuple for the job batch tuple.

The 9-element tuple is getting unwieldy and fragile — any reordering or addition requires synchronized changes in both the producer (here) and consumer (`eval.py` lines 606-616). The inline comment `# TODO: move to a dataclass` at line 502 acknowledges this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 503 - 515, Replace the fragile 9-tuple appended into job_batches with a dedicated dataclass (e.g., JobBatch) containing fields for job_cmds, job_benchmarks, job_needs_sandbox, job_needs_sandbox_to_keep_mounts, job_server_config, job_server_address, server_command_fn (from generation_task.get_server_command_fn()), job_sandbox_env_overrides, and gpus_per_node; update the producer (where job_batches.append(...) is called) to construct and append JobBatch instances and update the consumer code that currently unpacks the tuple (the block that iterates over job_batches and indexes those 9 elements) to access the named attributes (job_batch.job_cmds, job_batch.server_command_fn, etc.), and add any necessary import/definition of the dataclass near the top of eval.py so future reordering or additions are type-safe and readable.
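A sketch of what the suggested JobBatch dataclass could look like; the field names come from the prompt above, and the sample values are placeholders rather than real pipeline commands:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class JobBatch:
    """Named replacement for the 9-element job batch tuple (illustrative)."""
    job_cmds: list
    job_benchmarks: list
    job_needs_sandbox: bool
    job_needs_sandbox_to_keep_mounts: bool
    job_server_config: Optional[dict]
    job_server_address: Optional[str]
    server_command_fn: Optional[Callable]
    job_sandbox_env_overrides: dict = field(default_factory=dict)
    gpus_per_node: int = 1

batch = JobBatch(
    job_cmds=["<generation command>"],
    job_benchmarks=["<benchmark>"],
    job_needs_sandbox=False,
    job_needs_sandbox_to_keep_mounts=False,
    job_server_config=None,
    job_server_address="127.0.0.1:5000",
    server_command_fn=None,
    gpus_per_node=4,
)
```

The consumer then reads `batch.gpus_per_node` instead of unpacking position 8, so adding a tenth field cannot silently shift the meaning of the others.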
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 213-235: Add a validation in get_server_command to detect
unsupported multi-GPU-per-node configurations: if gpus_per_node > 1 and
server_type is not "generic" (i.e., for server_type values like "vllm",
"sglang", "trtllm", "megatron"), raise a clear ValueError explaining that
multi-instance mode is only supported for the "generic" server_type; place this
check near the top of get_server_command (after the existing parameter parsing
around line ~125) so the function fails fast rather than silently ignoring
gpus_per_node.
---
Duplicate comments:
In `@nemo_skills/pipeline/utils/generation.py`:
- Around line 515-525: The current fragile replacement using
base_donefile.replace("_chunk_0.jsonl", ...) can silently fail or alter
directory names; instead split base_donefile into dirname and basename, then
match/replace only the basename using a strict suffix or regex (e.g.,
r"_chunk_\d+\.jsonl$") to produce new_basename = re.sub(...,
f"_chunk_{chunk_id}.jsonl", basename); reconstruct donefile_pattern =
os.path.join(dirname, new_basename) and if the pattern isn't found, log or raise
an error rather than leaving the path unchanged; update the code that sets
donefile_pattern and job_end_cmd (references: is_shell_expr, base_donefile,
donefile_pattern, job_end_cmd, chunk_id) accordingly.
```python
if gpus_per_node > 1:
    # Multi-instance mode: port is computed at runtime based on SLURM_LOCALID
    extra_arguments = (
        f"++server.host=127.0.0.1 "
        f'"++server.port=$(({server_port} + $SLURM_LOCALID))" ++server.model={model} {extra_arguments}'
    )
```
Inconsistent SLURM_LOCALID fallback between server and client port computation.
server.py (line 222) uses ${SLURM_LOCALID:-0} (with fallback to 0), but here the client uses bare $SLURM_LOCALID without a default. If SLURM_LOCALID is unset and set -u (or equivalent) is active, the client command will fail while the server succeeds.
🛡️ Proposed fix — use consistent fallback

```diff
-        f'"++server.port=$(({server_port} + $SLURM_LOCALID))" ++server.model={model} {extra_arguments}'
+        f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'
```
+ f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if gpus_per_node > 1:
    # Multi-instance mode: port is computed at runtime based on SLURM_LOCALID
    extra_arguments = (
        f"++server.host=127.0.0.1 "
        f'"++server.port=$(({server_port} + ${{SLURM_LOCALID:-0}}))" ++server.model={model} {extra_arguments}'
    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/utils/generation.py` around lines 650 - 655, The
client-side port math in generation.py uses bare $SLURM_LOCALID which can fail
under set -u; update the string building that produces extra_arguments (the
branch where gpus_per_node > 1) to use the same fallback syntax as server.py,
replacing $SLURM_LOCALID with ${SLURM_LOCALID:-0} in the client port expression
(the reference symbols are extra_arguments, server_port and model) so the
computed port is robust when SLURM_LOCALID is unset.
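The behavioral difference between the two expansions is easy to reproduce without SLURM; this check (using a hypothetical base port of 5000) runs both expressions under bash's `set -u` with `SLURM_LOCALID` deliberately unset:

```python
import subprocess

def run_port_expr(expr: str) -> subprocess.CompletedProcess:
    # `bash -u` enables set -u; SLURM_LOCALID is unset, as on a non-SLURM executor.
    return subprocess.run(
        ["bash", "-uc", f"unset SLURM_LOCALID; echo $(({expr}))"],
        capture_output=True, text=True,
    )

safe = run_port_expr("5000 + ${SLURM_LOCALID:-0}")  # fallback: prints 5000
unsafe = run_port_expr("5000 + $SLURM_LOCALID")     # fails: unbound variable
```

With the fallback the command succeeds and yields the base port; the bare expansion aborts with an "unbound variable" error, which is exactly the asymmetry the review flags between server and client.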
🧹 Nitpick comments (2)
nemo_skills/pipeline/utils/eval.py (2)
460-463: Use `${SLURM_LOCALID:-0}` for consistency with server-side fallback.

The server side (`server.py` line 222) uses `${SLURM_LOCALID:-0}`, but here the client uses bare `$SLURM_LOCALID`. While `SLURM_LOCALID` is always set by `srun`, using a consistent fallback avoids arithmetic errors if this code path is ever exercised outside a SLURM context.

🛡️ Proposed fix

```diff
 effective_chunk_id = chunk_id
 if gpus_per_node > 1:
-    effective_chunk_id = f"$(({chunk_id} + $SLURM_LOCALID))"
+    effective_chunk_id = f"$(({chunk_id} + ${{SLURM_LOCALID:-0}}))"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 460 - 463, The shell arithmetic expression for multi-instance chunk IDs should use a safe SLURM fallback like the server does: update the logic in eval.py where effective_chunk_id is set (referencing variables chunk_id and gpus_per_node) so that when gpus_per_node > 1 you produce the expression using ${SLURM_LOCALID:-0} (e.g., f"$(({chunk_id} + ${SLURM_LOCALID:-0}))") instead of the bare $SLURM_LOCALID to avoid arithmetic failures outside SLURM.
503-514: Job batch tuple correctly extended with `gpus_per_node`.

The 9-element tuple structure here matches the destructuring in `eval.py` (lines 606-616). Consider migrating this to a dataclass as the TODO on line 502 suggests — this would prevent positional mismatch bugs as the tuple grows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 503 - 514, The job batch tuple appended to job_batches now includes gpus_per_node and risks positional-mismatch bugs as it grows; define a dataclass (e.g., JobBatch) with explicit fields for job_cmds, job_benchmarks, job_needs_sandbox, job_needs_sandbox_to_keep_mounts, job_server_config, job_server_address, server_command_fn, job_sandbox_env_overrides, and gpus_per_node, then replace the tuple construction in the job_batches.append(...) call with JobBatch(...) (referencing generation_task.get_server_command_fn() for server_command_fn) and update all sites that unpack the tuple (the destructuring around eval.py where job_batches is iterated/unpacked) to use attribute access (e.g., batch.job_cmds) instead of positional indices.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/pipeline/utils/generation.py`:
- Around line 515-525: The replacement of "_chunk_0.jsonl" is fragile because it
operates on the full path (variable base_donefile) — change the logic in the
is_shell_expr branch to extract dirname and basename (use donefiles[0] to derive
base_donefile), perform the "_chunk_0.jsonl" replacement only on the basename,
validate that the basename contained the expected pattern and handle the failure
(raise/log/error out) if not found, then rejoin dirname and the new basename
into donefile_pattern and append that safe full path to job_end_cmd (references:
is_shell_expr, base_donefile/donefiles, chunk_id, donefile_pattern, job_end_cmd,
merge_chunks).
- Around line 650-655: The client-side multi-instance port computation in
nemo_skills/pipeline/utils/generation.py uses bare $SLURM_LOCALID which can fail
if SLURM_LOCALID is unset; update the extra_arguments construction used when
gpus_per_node > 1 to use the same safe fallback syntax as the server (e.g.,
${SLURM_LOCALID:-0}) so the computed port string becomes robust and consistent
with server.py's handling of SLURM_LOCALID.
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 213-235: The function builds a multi-instance SLURM server only
for the "generic" server type but does not validate that gpus_per_node > 1 is
unsupported for other server types, causing client/server port mismatches; add a
fail-fast check near the top of the function that validates if gpus_per_node > 1
and server_type is not "generic" then raise a clear exception (e.g., ValueError)
mentioning server_type, gpus_per_node and that multi-instance mode is only
supported for "generic" (so callers in generation.py that offset ports per
SLURM_LOCALID won't silently fail); include this check before any
server_start_cmd logic that references server_entrypoint or SLURM_LOCALID.
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 220-242: Add a fast-fail validation immediately after the
gpus_per_node > 1 and server_type != "generic" guard to ensure gpus_per_node
does not exceed num_gpus: check the variables gpus_per_node and num_gpus and
raise a ValueError with a clear message like "gpus_per_node (X) cannot exceed
num_gpus (Y)" if gpus_per_node > num_gpus, so you avoid launching extra SLURM
tasks that bind to non-existent GPUs (this relates to the multi-instance path
that builds server_start_cmd using server_entrypoint and CUDA_VISIBLE_DEVICES).
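The requested fast-fail check can be sketched as a standalone function (argument names mirror the prompt; the real get_server_command signature may differ):

```python
def check_multi_instance_gpus(gpus_per_node: int, num_gpus: int, server_type: str) -> None:
    """Illustrative validation for multi-instance server startup."""
    if gpus_per_node <= 1:
        return  # single-instance mode: nothing to validate here
    if server_type != "generic":
        raise ValueError(
            f"gpus_per_node > 1 is only supported for server_type='generic', "
            f"got {server_type!r}"
        )
    if gpus_per_node > num_gpus:
        raise ValueError(
            f"gpus_per_node ({gpus_per_node}) cannot exceed num_gpus ({num_gpus})"
        )
```

Placed before the multi-instance `server_start_cmd` is built, this fails the job at submission time instead of binding SLURM tasks to GPUs that do not exist.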
```python
if gpus_per_node > 1:
    # Multi-instance mode: each SLURM task gets its own GPU and port
    server_start_cmd = (
        f"echo 'SLURM_LOCALID='$SLURM_LOCALID' SLURM_PROCID='$SLURM_PROCID && "
        f"export CUDA_VISIBLE_DEVICES=${{SLURM_LOCALID:-0}} && "
        f"{server_entrypoint} "
        f" --model {model_path} "
        f" --num_gpus 1 "
        f" --num_nodes 1 "
        f" --port $(({server_port} + ${{SLURM_LOCALID:-0}})) "
        f" {server_args} "
    )
    num_tasks = gpus_per_node
else:
    server_start_cmd = (
        f"{server_entrypoint} "
        f" --model {model_path} "
        f" --num_gpus {num_gpus} "
        f" --num_nodes {num_nodes} "
        f" --port {server_port} "
        f" {server_args} "
    )
    num_tasks = 1
```
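The per-task port arithmetic in the multi-instance branch can be sketched as follows; `base_port=5000` is an illustrative assumption, not the pipeline's actual default:

```python
import os

def resolve_task_port(base_port: int = 5000) -> int:
    # SLURM_LOCALID is the task's local rank on its node; defaulting to 0
    # keeps the computation valid when run outside a SLURM allocation,
    # mirroring the ${SLURM_LOCALID:-0} shell default above.
    local_id = int(os.environ.get("SLURM_LOCALID", "0"))
    return base_port + local_id

os.environ["SLURM_LOCALID"] = "3"
print(resolve_task_port())  # 5003
del os.environ["SLURM_LOCALID"]
print(resolve_task_port())  # 5000
```

Each of the `gpus_per_node` tasks thus binds its single visible GPU to a distinct port, so clients can address instances by offset.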
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the current file structure and the changes
cd /tmp/repo 2>/dev/null || cd .
# 1. Find the server.py file and verify it exists
find . -name "server.py" -path "*/pipeline/utils/*" 2>/dev/null | head -5
# 2. Check git diff if in a repo
if git rev-parse --git-dir > /dev/null 2>&1; then
echo "=== Git status ==="
git status --short
echo "=== Checking git diff for server.py ==="
git diff HEAD -- nemo_skills/pipeline/utils/server.py 2>/dev/null | head -150
else
echo "Not in a git repo"
fi
Repository: NVIDIA-NeMo/Skills
Length of output: 172
🏁 Script executed:
#!/bin/bash
# 3. Search for gpus_per_node in the codebase to see if it's validated
echo "=== Searching for gpus_per_node usage ==="
rg "gpus_per_node" --type py -B 3 -A 3
# 4. Search for any assertions or validations related to GPU counts
echo "=== Searching for GPU count validations ==="
rg "(gpus_per_node|num_gpus)" --type py | rg -E "(assert|raise|if.*<=|if.*>=|if.*<|if.*>)" -B 1 -A 1
Repository: NVIDIA-NeMo/Skills
Length of output: 21948
🏁 Script executed:
#!/bin/bash
# 5. Check the server.py file around lines 220-242 to see current state
echo "=== Reading server.py around lines 220-242 ==="
if [ -f "nemo_skills/pipeline/utils/server.py" ]; then
sed -n '210,250p' nemo_skills/pipeline/utils/server.py | cat -n
else
echo "File not found, searching..."
find . -name "server.py" -path "*/pipeline/utils/*" -exec sed -n '210,250p' {} \; 2>/dev/null
fi
Repository: NVIDIA-NeMo/Skills
Length of output: 2001
🏁 Script executed:
#!/bin/bash
# 6. Check how num_tasks is used after being set
echo "=== Checking num_tasks usage in server.py ==="
rg "num_tasks" nemo_skills/pipeline/utils/server.py -B 2 -A 2
# 7. Check exp.py for num_tasks usage
echo "=== Checking num_tasks in exp.py ==="
rg "num_tasks" nemo_skills/pipeline/utils/exp.py -B 2 -A 2 2>/dev/null || echo "exp.py patterns not found"
# 8. Search broader for num_tasks usage
echo "=== Broader num_tasks search ==="
rg "num_tasks" --type py -l 2>/dev/null | head -10
Repository: NVIDIA-NeMo/Skills
Length of output: 2518
🏁 Script executed:
#!/bin/bash
# 9. Look at the full function context around lines 220-242 to understand what this is
echo "=== Full context of the function ==="
if [ -f "nemo_skills/pipeline/utils/server.py" ]; then
sed -n '100,260p' nemo_skills/pipeline/utils/server.py | cat -n
fi
Repository: NVIDIA-NeMo/Skills
Length of output: 7162
Add validation for gpus_per_node <= num_gpus in multi-instance mode.
The num_tasks = 1 assignment for single-instance generic (line 143) is intentional and consistent with vllm, sglang, and trtllm implementations—no issue here.
However, the multi-instance path (lines 121–133) lacks validation that gpus_per_node ≤ num_gpus. If a user passes gpus_per_node=8 but num_gpus=4, the code launches 8 SLURM tasks with CUDA_VISIBLE_DEVICES=${SLURM_LOCALID:-0}, where SLURM_LOCALID ranges 0–7. Tasks with IDs 4–7 will fail at runtime trying to access non-existent GPUs. Add an explicit guard to fail fast with a clear error message:
if gpus_per_node > num_gpus:
raise ValueError(
f"gpus_per_node ({gpus_per_node}) cannot exceed num_gpus ({num_gpus})"
    )

Place this check immediately after the gpus_per_node > 1 and server_type != "generic" validation block.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/utils/server.py` around lines 220-242, add a fast-fail
validation immediately after the gpus_per_node > 1 and server_type != "generic"
guard to ensure gpus_per_node does not exceed num_gpus: check the variables
gpus_per_node and num_gpus and raise a ValueError with a clear message like
"gpus_per_node (X) cannot exceed num_gpus (Y)" if gpus_per_node > num_gpus, so
you avoid launching extra SLURM tasks that bind to non-existent GPUs (this
relates to the multi-instance path that builds server_start_cmd using
server_entrypoint and CUDA_VISIBLE_DEVICES).
Summary
- `gpus_per_node` multi-instance mode for running multiple model instances per node
- `SLURM_LOCALID` for multi-instance jobs
- `srun --wait` configuration to avoid premature killing of multi-instance tasks

Files changed

- `nemo_skills/pipeline/eval.py` — pass `gpus_per_node` through eval pipeline
- `nemo_skills/pipeline/utils/eval.py` — multi-instance chunking logic
- `nemo_skills/pipeline/utils/exp.py` — srun --wait for multi-instance jobs
- `nemo_skills/pipeline/utils/generation.py` — SLURM_LOCALID-based port in configure_client
- `nemo_skills/pipeline/utils/server.py` — multi-instance server config
- `cluster_configs/eos_example.yaml` — EOS cluster config example

Test plan

- `python -m pytest tests/` to verify no regressions
- Run with `gpus_per_node > 1` on a Slurm cluster

🤖 Generated with Claude Code
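The SLURM_LOCALID-based chunk indexing this PR introduces could be sketched like this; the helper name and signature are hypothetical, and only the emitted shell expression mirrors what the review discusses:

```python
# Hypothetical helper illustrating the multi-instance chunk-id expression;
# the function itself is not part of the actual nemo_skills code.
def effective_chunk_expr(chunk_id: int, gpus_per_node: int) -> str:
    # With multiple instances per node, each SLURM task offsets the base
    # chunk id by its local rank; the ${SLURM_LOCALID:-0} default keeps the
    # shell arithmetic valid when run outside a SLURM allocation.
    if gpus_per_node > 1:
        return f"$(({chunk_id} + ${{SLURM_LOCALID:-0}}))"
    return str(chunk_id)

print(effective_chunk_expr(4, 2))  # $((4 + ${SLURM_LOCALID:-0}))
print(effective_chunk_expr(4, 1))  # 4
```

In single-instance mode the expression stays a plain literal, so existing commands are unchanged.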
Summary by CodeRabbit
New Features
Improvements