Conversation
Signed-off-by: oliver könig <okoenig@nvidia.com>
chtruong814 left a comment:
Minor comment about copyright years. Please start using 2026 for new files.
```
@@ -0,0 +1,69 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
Please start using 2026 for the copyright year on new files.
📝 Walkthrough

This PR introduces an evaluation pipeline infrastructure for Megatron-Bridge. It adds CLI utilities, shell scripts, and Python modules to handle deployment configuration, evaluation job orchestration across SLURM and Kubernetes environments, and executor setup with appropriate cluster settings.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant Pipeline as launch_evaluation_pipeline.py
    participant Executor as Executor (SLURM/KubeRay)
    participant RayJob as RayJob Orchestrator
    participant Deploy as deploy.sh
    participant Eval as eval.sh
    participant Service as Ray/Megatron Service
    participant Evaluator as nemo_evaluator
    User->>Pipeline: Execute with config args
    Pipeline->>Pipeline: parse_cli_args()
    alt dgxc_cluster set
        Pipeline->>Executor: kuberay_executor(config)
    else
        Pipeline->>Executor: slurm_executor(config)
    end
    Executor-->>Pipeline: Configured Executor
    Pipeline->>RayJob: Create RayJob with composite bash command
    Pipeline->>RayJob: Start job (deploy.sh + eval.sh)
    RayJob->>Deploy: Execute deploy.sh
    Deploy->>Service: Launch Ray + Megatron model
    Service-->>Deploy: Service ready
    RayJob->>Eval: Execute eval.sh
    Eval->>Eval: check_endpoint() readiness
    Eval->>Evaluator: Build ApiEndpoint, EvaluationTarget, ConfigParams
    Evaluator->>Service: Send evaluation requests
    Service-->>Evaluator: Model responses
    Evaluator-->>Eval: Evaluation results
    Eval->>Eval: Write results.yml
    Pipeline->>RayJob: Monitor job status (RUNNING)
    RayJob-->>Pipeline: Job complete, results available
    Pipeline->>Pipeline: Read results.yml
    alt WANDB configured
        Pipeline->>Pipeline: Log results to WANDB
    end
    Pipeline->>RayJob: Stop job + cleanup
    Pipeline-->>User: Return results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 12
🤖 Fix all issues with AI agents
In `@examples/evaluation/argument_parser.py`:
- Around line 168-173: The help text for the argument added via
logging_args.add_argument("--wandb_entity_name", ...) is incorrect (it currently
says "wandb project name"); update the help string to accurately describe the
entity (e.g., "wandb entity name" or "WandB entity/username for the project") so
the flag --wandb_entity_name clearly documents its purpose in the argument
parser.
In `@examples/evaluation/deploy.sh`:
- Around line 6-19: Wrap the positional variables in quotes to prevent word
splitting and glob expansion: update the python invocation in deploy.sh to use
"$MEGATRON_CHECKPOINT", "$NUM_REPLICAS", and "$NUM_GPUS" wherever they are
interpolated (the --megatron_checkpoint, --num_gpus, and --num_replicas flags)
so paths or values with spaces or glob characters are passed safely to
deploy_ray_inframework.py.
- Around line 1-4: The shell script is missing a shebang so its interpreter is
ambiguous; add a shebang as the very first line (for example "#!/usr/bin/env
bash") to guarantee POSIX/Bash behavior for the for-loop unsets (the lines using
env | grep ^SLURM_/PMI_/PMIX_ and unset -v), and optionally make the script
executable (chmod +x) after updating the file.
In `@examples/evaluation/eval.sh`:
- Around line 1-4: Add a POSIX bash shebang as the first line of the script
(e.g. #!/usr/bin/env bash) so the interpreter is explicit and portable; insert
it above the existing for loops that unset SLURM_/PMI_/PMIX_ (the lines
beginning with for i in $(env | grep ^SLURM_, for i in $(env | grep ^PMI_, and
for i in $(env | grep ^PMIX_)) and ensure the file remains executable (chmod +x)
after the change.
- Around line 30-35: The heredoc directly interpolates $PARALLELISM and
$OUTPUT_DIR into generated Python config, creating injection risk and path
issues; instead, stop injecting raw vars into the Python block—pass PARALLELISM
and OUTPUT_DIR as environment variables or CLI args and have the Python runtime
read them (e.g., os.environ.get("PARALLELISM") for numeric validation and
os.environ.get("OUTPUT_DIR") for path), validate/cast PARALLELISM to an int,
strip any leading slashes from OUTPUT_DIR or use os.path.join to construct
output_dir (avoid prepending a literal "/" in the heredoc), and ensure any
user-provided values are sanitized before use in functions that consume them.
In `@examples/evaluation/launch_evaluation_pipeline.py`:
- Around line 146-155: The code assumes run_id exists after checking runs; if
runs is empty this causes a NameError when calling wandb.init. Fix by explicitly
setting run_id = runs[0].id if runs else None (or log and exit) and pass the id
to wandb.init only when run_id is not None (e.g., build kwargs = {"project":
args.wandb_project_name, "entity": args.wandb_entity_name, "resume": "allow"};
if run_id: kwargs["id"] = run_id; wandb.init(**kwargs)). Update the block around
runs, run_id, and wandb.init accordingly and add a log message when no matching
run is found.
- Around line 1-14: Move the shebang '#!/usr/bin/env python3' to be the very
first line of the file (no preceding blank lines or comments) so it precedes the
copyright header; ensure the rest of the header and file contents remain
unchanged and that the shebang is exactly '#!/usr/bin/env python3'.
- Around line 133-134: The code opens results.yml directly which can raise
FileNotFoundError if the evaluation didn't produce it; update the block that
reads os.path.join(args.output_dir, "results", "results.yml") to first check
existence (os.path.exists) or wrap the open/yaml.safe_load call in a try/except
FileNotFoundError, then handle the missing file by logging a clear error via the
existing logger, exiting gracefully, or setting a sensible default for results;
ensure you reference args.output_dir, the "results.yml" path, and the variable
results/yaml.safe_load when implementing the check/exception handling.
In `@examples/evaluation/utils/executors.py`:
- Around line 78-79: The parameters hf_token and custom_env_vars in the function
signature are typed as str and Dict[...] but default to None; change their types
to Optional[str] and Optional[Dict[str, str]] and import typing.Optional (or use
from typing import Optional) so the annotations match the None default; update
the signature(s) that include hf_token and custom_env_vars (same pattern as the
earlier slurm_executor fix) and adjust any callers or type checks if needed to
handle Optional values.
- Line 100: Replace the hardcoded developer path for "HF_HOME":
"/nemo-workspace/pagaray/hf_cache" with a parameterized or environment-driven
value; update the code that builds the environment dict (where "HF_HOME" is set)
to read from an optional function/class parameter or os.environ.get("HF_HOME",
"/nemo-workspace/.cache/huggingface") so callers can override it while using
"/nemo-workspace/.cache/huggingface" as the sensible default.
- Around line 23-33: The slurm_executor function uses mutable default arguments
(custom_mounts: List[str] = [] and custom_env_vars: Dict[str, str] = {}) and an
implicit Optional for hf_token; change the defaults to None (e.g.,
custom_mounts: Optional[List[str]] = None, custom_env_vars:
Optional[Dict[str,str]] = None, hf_token: Optional[str] = None) and then inside
slurm_executor immediately initialize them with: custom_mounts = custom_mounts
or [] and custom_env_vars = custom_env_vars or {} so callers behave the same
while avoiding mutable default pitfalls and making the Optional intent explicit.
- Around line 132-135: The image_pull_secrets value is hardcoded; change the
function that builds spec_kwargs (referenced by spec_kwargs and
image_pull_secrets in examples/evaluation/utils/executors.py) to accept a new
parameter (e.g., image_pull_secret or image_pull_secrets) and use that parameter
inside spec_kwargs instead of the literal
"dockerregistry-dockerregistry-pagaray-ngc"; make the parameter optional with a
sensible default (None or empty list), convert a single string input into the
expected list format if necessary, and ensure callers of the function are
updated to pass the appropriate secret name where needed.
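To make the `results.yml` item above concrete, a guarded read could look like this sketch (the helper name and the empty-dict fallback are assumptions, not code from the PR):

```python
import logging
import os

import yaml

logger = logging.getLogger(__name__)


def read_results(output_dir: str) -> dict:
    """Load results/results.yml if the evaluation produced it; return {} otherwise."""
    results_path = os.path.join(output_dir, "results", "results.yml")
    if not os.path.exists(results_path):
        # Log clearly instead of letting open() raise FileNotFoundError
        logger.error("No results file at %s; evaluation may have failed", results_path)
        return {}
    with open(results_path) as f:
        return yaml.safe_load(f) or {}
```
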
🧹 Nitpick comments (3)

examples/evaluation/argument_builder.py (1)

20-22: Duplicate function definition.
`list_of_strings` is defined identically in `examples/evaluation/argument_parser.py` (lines 17-19) and `scripts/performance/argument_parser.py` (lines 29-31). Consider extracting it to a shared utility module to avoid DRY violations.

examples/evaluation/argument_parser.py (1)

17-24: Duplicate utility functions.
Both `list_of_strings` and `to_dict` are identical to implementations in `scripts/performance/argument_parser.py` (lines 29-31 and 46-48). Consider consolidating them into a shared utility module.

examples/evaluation/launch_evaluation_pipeline.py (1)

113-116: Long command string reduces readability.
The command string spans multiple shell operations with pipes and background processes. Consider breaking it into a multi-line string or a separate script for maintainability.
Proposed refactor

```python
deploy_cmd = f"bash /opt/Megatron-Bridge/examples/evaluation/deploy.sh {args.megatron_checkpoint} {args.num_replicas} {args.num_gpus}"
eval_cmd = f"bash /opt/Megatron-Bridge/examples/evaluation/eval.sh {args.output_dir} {args.parallelism}"
command = (
    f"{deploy_cmd} | tee -a deploy.log & "
    f"sleep 120; "
    f"{eval_cmd} | tee -a eval.log"
)
job.start(command=command, workdir=None)
```
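For the duplication nitpicks above, a shared helper module could be as small as the following sketch (the module path and exact parsing semantics are assumed from the helper names, not copied from the repository):

```python
# Hypothetical shared module, e.g. examples/common/cli_utils.py, to hold the
# argparse helpers currently duplicated across the argument_parser.py files.
from typing import Dict, List


def list_of_strings(arg: str) -> List[str]:
    """Parse a comma-separated CLI value like 'a,b,c' into ['a', 'b', 'c']."""
    return arg.split(",")


def to_dict(arg: str) -> Dict[str, str]:
    """Parse 'K1=v1,K2=v2' style CLI values into {'K1': 'v1', 'K2': 'v2'}."""
    return dict(item.split("=", 1) for item in arg.split(","))
```
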
```python
logging_args.add_argument(
    "--wandb_entity_name",
    type=str,
    help="wandb project name",
    required=False,
)
```
Copy-paste error in help text.
The help text for --wandb_entity_name says "wandb project name" but should describe the entity name.
Proposed fix

```diff
 logging_args.add_argument(
     "--wandb_entity_name",
     type=str,
-    help="wandb project name",
+    help="wandb entity name",
     required=False,
 )
```
```shell
# Unset SLURM/PMI/PMIX env vars to prevent MPI initialization issues
for i in $(env | grep ^SLURM_ | cut -d"=" -f 1); do unset -v $i; done
for i in $(env | grep ^PMI_ | cut -d"=" -f 1); do unset -v $i; done
for i in $(env | grep ^PMIX_ | cut -d"=" -f 1); do unset -v $i; done
```
🛠️ Refactor suggestion | 🟠 Major
Add a shebang line to specify the shell interpreter.
The script lacks a shebang, which can cause portability issues and unexpected behavior. Static analysis (SC2148) correctly flags this.
Proposed fix
```diff
+#!/bin/bash
+
 # Unset SLURM/PMI/PMIX env vars to prevent MPI initialization issues
 for i in $(env | grep ^SLURM_ | cut -d"=" -f 1); do unset -v $i; done
 for i in $(env | grep ^PMI_ | cut -d"=" -f 1); do unset -v $i; done
 for i in $(env | grep ^PMIX_ | cut -d"=" -f 1); do unset -v $i; done
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
examples/evaluation/deploy.sh (outdated)
```shell
MEGATRON_CHECKPOINT=$1
NUM_REPLICAS=$2
NUM_GPUS=$3
python \
    /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint $MEGATRON_CHECKPOINT \
    --model_id megatron_model \
    --host 0.0.0.0 \
    --port 8000 \
    --num_gpus $NUM_GPUS \
    --num_replicas $NUM_REPLICAS \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --context_parallel_size 1
```
(no newline at end of file)
Quote variables to prevent word splitting and glob expansion.
The variables $MEGATRON_CHECKPOINT, $NUM_REPLICAS, and $NUM_GPUS should be quoted to handle paths with spaces or special characters safely.
Proposed fix
```diff
 MEGATRON_CHECKPOINT=$1
 NUM_REPLICAS=$2
 NUM_GPUS=$3
 python \
     /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
-    --megatron_checkpoint $MEGATRON_CHECKPOINT \
+    --megatron_checkpoint "$MEGATRON_CHECKPOINT" \
     --model_id megatron_model \
     --host 0.0.0.0 \
     --port 8000 \
-    --num_gpus $NUM_GPUS \
-    --num_replicas $NUM_REPLICAS \
+    --num_gpus "$NUM_GPUS" \
+    --num_replicas "$NUM_REPLICAS" \
     --tensor_model_parallel_size 1 \
     --pipeline_model_parallel_size 1 \
-    --context_parallel_size 1
+    --context_parallel_size 1
```
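The hazard this fix prevents is easy to demonstrate with a toy script (the checkpoint path here is invented):

```shell
#!/usr/bin/env bash
# Count how many arguments the shell actually passes to a command.
count_args() { echo "$#"; }

CKPT="/models/my checkpoint"   # hypothetical path containing a space

count_args $CKPT      # unquoted: word splitting yields 2 arguments
count_args "$CKPT"    # quoted: the path stays one argument
```
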
```shell
# Unset SLURM/PMI/PMIX env vars to prevent MPI initialization issues
for i in $(env | grep ^SLURM_ | cut -d"=" -f 1); do unset -v $i; done
for i in $(env | grep ^PMI_ | cut -d"=" -f 1); do unset -v $i; done
for i in $(env | grep ^PMIX_ | cut -d"=" -f 1); do unset -v $i; done
```
🛠️ Refactor suggestion | 🟠 Major
Add a shebang line to specify the shell interpreter.
Same as deploy.sh, this script needs a shebang for portability and to ensure the correct shell is used.
Proposed fix
```diff
+#!/bin/bash
+
 # Unset SLURM/PMI/PMIX env vars to prevent MPI initialization issues
 for i in $(env | grep ^SLURM_ | cut -d"=" -f 1); do unset -v $i; done
 for i in $(env | grep ^PMI_ | cut -d"=" -f 1); do unset -v $i; done
 for i in $(env | grep ^PMIX_ | cut -d"=" -f 1); do unset -v $i; done
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
```
parallelism = $PARALLELISM
request_timeout = 1000
temperature = None
top_p = None
top_k = None
output_dir = "/$OUTPUT_DIR/results/"
```
Shell variable interpolation in heredoc could cause issues.

- `$PARALLELISM` and `$OUTPUT_DIR` are interpolated directly into Python code. If these come from untrusted sources, this could be a code injection vector.
- Line 35: The path `"/$OUTPUT_DIR/results/"` prepends `/`, which will cause double slashes if `OUTPUT_DIR` is already an absolute path (e.g., `/workspace` becomes `//workspace/results/`).
Proposed fix for path construction

```diff
-output_dir = "/$OUTPUT_DIR/results/"
+output_dir = "${OUTPUT_DIR}/results/"
```

Or in Python, use `os.path.join` after passing the path as an environment variable instead of heredoc interpolation.
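A sketch of the environment-variable alternative (the function name and defaults are assumptions; only the `PARALLELISM` and `OUTPUT_DIR` names come from the script):

```python
import os


def load_eval_settings() -> tuple:
    """Read PARALLELISM and OUTPUT_DIR from the environment instead of
    interpolating them into generated Python source."""
    # int() raises ValueError on non-numeric input, rejecting injected code
    parallelism = int(os.environ.get("PARALLELISM", "1"))
    output_dir = os.environ.get("OUTPUT_DIR", ".")
    # os.path.join avoids the double-slash problem with absolute paths
    results_dir = os.path.join(output_dir, "results")
    return parallelism, results_dir
```
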
```python
if runs:
    run_id = runs[0].id
    print(f"Found run with ID: {run_id}")

wandb_run = wandb.init(
    project=args.wandb_project_name,
    entity=args.wandb_entity_name,
    id=run_id,
    resume="allow",
)
```
Potential NameError if no matching wandb run is found.
If runs is empty (no matching run found), run_id will be undefined when used on line 153, causing a NameError. The code should handle this case.
Proposed fix
```diff
 if runs:
     run_id = runs[0].id
     print(f"Found run with ID: {run_id}")
+else:
+    run_id = None
+    print("No existing run found, creating new run")
 wandb_run = wandb.init(
     project=args.wandb_project_name,
     entity=args.wandb_entity_name,
-    id=run_id,
+    id=run_id,  # Will create new run if None
     resume="allow",
 )
```
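The conditional-kwargs pattern described in the review can be isolated into a small helper (the helper name is hypothetical; wandb itself is not imported — only the kwargs construction is shown):

```python
from typing import Any, Dict, Optional


def build_init_kwargs(project: str, entity: str, run_id: Optional[str]) -> Dict[str, Any]:
    """Build wandb.init kwargs, including `id` only when a matching run was
    found, so a missing run falls back to creating a new one."""
    kwargs: Dict[str, Any] = {"project": project, "entity": entity, "resume": "allow"}
    if run_id:
        kwargs["id"] = run_id
    return kwargs
```
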
```python
def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    num_gpus_per_node: int,
    time_limit: str = "00:30:00",
    container_image: str = "nvcr.io/nvidia/nemo:dev",
    custom_mounts: List[str] = [],
    custom_env_vars: Dict[str, str] = {},
    hf_token: str = None,
) -> run.SlurmExecutor:
```
Mutable default arguments and implicit Optional types.
Lines 30-31 use mutable defaults ([] and {}), which is a classic Python pitfall. Line 32 uses implicit Optional. Static analysis (B006, RUF013) correctly flags these.
Proposed fix
```diff
 def slurm_executor(
     account: str,
     partition: str,
     nodes: int,
     num_gpus_per_node: int,
     time_limit: str = "00:30:00",
     container_image: str = "nvcr.io/nvidia/nemo:dev",
-    custom_mounts: List[str] = [],
-    custom_env_vars: Dict[str, str] = {},
-    hf_token: str = None,
+    custom_mounts: List[str] | None = None,
+    custom_env_vars: Dict[str, str] | None = None,
+    hf_token: str | None = None,
 ) -> run.SlurmExecutor:
```

Then initialize inside the function:

```python
custom_mounts = custom_mounts or []
custom_env_vars = custom_env_vars or {}
```
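The pitfall itself is easy to demonstrate: a mutable default is created once at function definition time and shared across calls, whereas the `None` sentinel gives each call a fresh list.

```python
def buggy(mounts=[]):
    # the same list object is reused on every call
    mounts.append("/data")
    return mounts


def fixed(mounts=None):
    # a new list is created whenever no argument is passed
    mounts = mounts or []
    mounts.append("/data")
    return mounts
```
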
🧰 Tools
🪛 Ruff (0.14.13)
30-30: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
31-31: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
32-32: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
```python
    hf_token: str = None,
    custom_env_vars: Dict[str, str] = None,
```
Same implicit Optional issue.
Lines 78-79 have the same implicit Optional problem as the slurm_executor function.
Proposed fix
```diff
-    hf_token: str = None,
-    custom_env_vars: Dict[str, str] = None,
+    hf_token: str | None = None,
+    custom_env_vars: Dict[str, str] | None = None,
```
🧰 Tools
🪛 Ruff (0.14.13)
78-78: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
79-79: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
| "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True", | ||
| "TOKENIZERS_PARALLELISM": "False", | ||
| "TRANSFORMERS_OFFLINE": "1", | ||
| "HF_HOME": "/nemo-workspace/pagaray/hf_cache", |
Hardcoded developer-specific path.
HF_HOME is set to /nemo-workspace/pagaray/hf_cache, which appears to be a developer-specific path. This should be parameterized or use a more generic default like /nemo-workspace/.cache/huggingface.
Proposed fix
- "HF_HOME": "/nemo-workspace/pagaray/hf_cache",
+ "HF_HOME": "/nemo-workspace/.cache/huggingface",Or make it a parameter with a sensible default.
```python
spec_kwargs={
    "schedulerName": "runai-scheduler",
    "image_pull_secrets": ["dockerregistry-dockerregistry-pagaray-ngc"],
},  # e.g. Run:ai
```
Hardcoded developer-specific image pull secret.
image_pull_secrets contains "dockerregistry-dockerregistry-pagaray-ngc", which includes a developer name. This should be parameterized.
Proposed fix
Add a parameter to the function:
```diff
 def kuberay_executor(
     nodes: int,
     num_gpus_per_node: int,
     dgxc_pvc_mount_path: str,
     dgxc_pvc_claim_name: str,
     namespace: str = "default",
     ray_version: str = "2.43.0",
     container_image: str = "",
     head_cpu: str = "8",
     head_memory: str = "32Gi",
     hf_token: str | None = None,
     custom_env_vars: Dict[str, str] | None = None,
+    image_pull_secrets: List[str] | None = None,
 ):
```

Then use it in spec_kwargs:

```diff
 spec_kwargs={
     "schedulerName": "runai-scheduler",
-    "image_pull_secrets": ["dockerregistry-dockerregistry-pagaray-ngc"],
+    "image_pull_secrets": image_pull_secrets or [],
 },
```
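If callers may pass either a single secret name or a list of names, a small normalizer keeps `spec_kwargs` uniform (the helper is hypothetical, not part of the PR):

```python
from typing import List, Optional, Union


def normalize_pull_secrets(value: Optional[Union[str, List[str]]]) -> List[str]:
    """Accept None, a single secret name, or a list of names; always return a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)
```
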
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes