Conversation
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Thanks!
Signed-off-by: malay-nagda <malayn@nvidia.com>
ko3n1g left a comment

Can we rather fail if neither num-GPUs nor GPU is recognized?
📝 Walkthrough

Changes introduce a GPU-per-node mapping system based on GPU type, replacing the hardcoded default of 8 with dynamic inference via a lookup table. The configuration default is changed to None, with fallback logic added to scripts to resolve the value before passing to executors.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/performance/utils/utils.py (1)
29-35: Centralize GPU-per-node resolution in this module to avoid duplicated fallback logic.

Lines 29-35 add the map, but inference is reimplemented in multiple entrypoints. A shared resolver here will keep behavior consistent.
♻️ Proposed refactor
```diff
 NUM_GPUS_PER_NODE_MAP = {
     "h100": 8,
     "b200": 8,
     "b300": 8,
     "gb200": 4,
     "gb300": 4,
 }
+
+
+def resolve_gpus_per_node(gpu: str, gpus_per_node: int | None) -> int:
+    """Resolve effective GPUs-per-node from CLI value and GPU type."""
+    if gpus_per_node is not None:
+        return gpus_per_node
+    return NUM_GPUS_PER_NODE_MAP.get(gpu, 8)
```

Then replace the duplicated blocks in `scripts/performance/setup_experiment.py` and `examples/evaluation/launch_evaluation_pipeline.py` with `resolve_gpus_per_node(...)`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/utils/utils.py` around lines 29 - 35, Add a centralized GPU-per-node resolver in this module: keep the existing NUM_GPUS_PER_NODE_MAP and implement a function named resolve_gpus_per_node(node_type: str, default: int | None = None) that looks up node_type in NUM_GPUS_PER_NODE_MAP and returns the mapped int or the provided default (or raises a clear ValueError if no default). Update callers to use resolve_gpus_per_node instead of duplicating fallback logic—specifically replace the GPU-resolution blocks currently duplicated in scripts/performance/setup_experiment.py and examples/evaluation/launch_evaluation_pipeline.py with calls to resolve_gpus_per_node(node_type, default).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/performance/argument_parser.py`:
- Around line 383-388: The --gpus_per_node add_argument currently allows
0/negative values; add a small validator and use it as the argument type so
parsing fails early. Create a helper like positive_int(value) that converts to
int and raises argparse.ArgumentTypeError if value <= 0, then replace the
existing add_argument(..., "--gpus_per_node", type=int, ...) to use
type=positive_int in the argument_parser (the add_argument call for
"--gpus_per_node" and any parse_args or parse function that references it). This
ensures invalid values are rejected at parse time with a clear error.
---
Nitpick comments:
In `@scripts/performance/utils/utils.py`:
- Around line 29-35: Add a centralized GPU-per-node resolver in this module:
keep the existing NUM_GPUS_PER_NODE_MAP and implement a function named
resolve_gpus_per_node(node_type: str, default: int | None = None) that looks up
node_type in NUM_GPUS_PER_NODE_MAP and returns the mapped int or the provided
default (or raises a clear ValueError if no default). Update callers to use
resolve_gpus_per_node instead of duplicating fallback logic—specifically replace
the GPU-resolution blocks currently duplicated in
scripts/performance/setup_experiment.py and
examples/evaluation/launch_evaluation_pipeline.py with calls to
resolve_gpus_per_node(node_type, default).
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/evaluation/launch_evaluation_pipeline.py
- scripts/performance/README.md
- scripts/performance/argument_parser.py
- scripts/performance/setup_experiment.py
- scripts/performance/utils/utils.py
| "-gn", | ||
| "--gpus_per_node", | ||
| type=int, | ||
| help="Number of gpus per node. Defaults to 8", | ||
| default=8, | ||
| help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.", | ||
| default=None, | ||
| ) |
Validate --gpus_per_node as a positive integer at parse time.
Line 385 currently accepts 0/negative values, which can break downstream node calculation (num_gpus // gpus_per_node) and produce runtime failures.
🛠️ Proposed fix
```diff
 def list_of_ints(arg):
@@
     return result
+
+
+def positive_int(arg: str) -> int:
+    """Parse a strictly positive integer CLI value."""
+    value = int(arg)
+    if value < 1:
+        raise argparse.ArgumentTypeError("value must be >= 1")
+    return value
@@
 slurm_args.add_argument(
     "-gn",
     "--gpus_per_node",
-    type=int,
+    type=positive_int,
     help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
     default=None,
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "-gn", | |
| "--gpus_per_node", | |
| type=int, | |
| help="Number of gpus per node. Defaults to 8", | |
| default=8, | |
| help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.", | |
| default=None, | |
| ) | |
| "-gn", | |
| "--gpus_per_node", | |
| type=positive_int, | |
| help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.", | |
| default=None, | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/argument_parser.py` around lines 383 - 388, The
--gpus_per_node add_argument currently allows 0/negative values; add a small
validator and use it as the argument type so parsing fails early. Create a
helper like positive_int(value) that converts to int and raises
argparse.ArgumentTypeError if value <= 0, then replace the existing
add_argument(..., "--gpus_per_node", type=int, ...) to use type=positive_int in
the argument_parser (the add_argument call for "--gpus_per_node" and any
parse_args or parse function that references it). This ensures invalid values
are rejected at parse time with a clear error.
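For reference, a minimal standalone version of the validator pattern this comment proposes can be run as follows; the parser here is a hypothetical stand-in, not the repository's actual `argument_parser.py`:

```python
import argparse


def positive_int(arg: str) -> int:
    """argparse type callable that rejects 0 and negative values at parse time."""
    value = int(arg)
    if value < 1:
        raise argparse.ArgumentTypeError("value must be >= 1")
    return value


# Hypothetical minimal parser mirroring the --gpus_per_node argument.
parser = argparse.ArgumentParser()
parser.add_argument(
    "-gn",
    "--gpus_per_node",
    type=positive_int,
    default=None,
    help="Number of gpus per node. Inferred from the GPU type when omitted.",
)

args = parser.parse_args(["--gpus_per_node", "4"])
print(args.gpus_per_node)  # 4
# parser.parse_args(["--gpus_per_node", "0"]) would exit at parse time with a
# clear "value must be >= 1" error instead of failing later in node math.
```

Because argparse applies the `type` callable before storing the value, a zero or negative input never reaches the downstream `num_gpus // gpus_per_node` calculation.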
📝 Walkthrough

The PR introduces a GPU type-to-count mapping (NUM_GPUS_PER_NODE_MAP), replacing the hardcoded default of 8 with inference from the GPU type.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/evaluation/launch_evaluation_pipeline.py`:
- Around line 85-90: The block that tries to infer gpus_per_node is dead/unsafe
because the argparser already defaults --gpus_per_node to 8 and there is no
--gpu argument; update either the parser or the logic: either remove the entire
conditional that references args.gpu and use args.gpus_per_node directly, or
modify the argparse setup to add a --gpu option and change the --gpus_per_node
default to None so the inference path can run; if you keep the inference, guard
access to args.gpu (e.g., check hasattr(args, "gpu")) before using
NUM_GPUS_PER_NODE_MAP to avoid AttributeError. Use the symbols gpus_per_node,
args.gpus_per_node, args.gpu, and NUM_GPUS_PER_NODE_MAP to locate and change the
code.
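The guarded inference path this comment asks for can be sketched as follows; the map mirrors NUM_GPUS_PER_NODE_MAP from scripts/performance/utils/utils.py, while the helper name and the use of getattr guards are illustrative assumptions, not the file's current code:

```python
from types import SimpleNamespace  # stdlib; stands in for parsed argparse args

# Mirrors the map added in scripts/performance/utils/utils.py.
NUM_GPUS_PER_NODE_MAP = {"h100": 8, "b200": 8, "b300": 8, "gb200": 4, "gb300": 4}


def infer_gpus_per_node(args) -> int:
    """Hypothetical guarded version of the inference block."""
    if getattr(args, "gpus_per_node", None) is not None:
        return args.gpus_per_node  # explicit CLI value wins
    # Guard: this parser may not define --gpu, so never touch args.gpu directly.
    gpu = getattr(args, "gpu", None)
    if gpu is not None:
        return NUM_GPUS_PER_NODE_MAP.get(gpu, 8)
    return 8  # neither value available: keep the old hardcoded default


print(infer_gpus_per_node(SimpleNamespace(gpus_per_node=None, gpu="gb200")))  # 4
print(infer_gpus_per_node(SimpleNamespace(gpus_per_node=None)))               # 8
```

With `getattr` guards, the function degrades to the old default of 8 instead of raising AttributeError on a parser that lacks `--gpu`.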
In `@scripts/performance/argument_parser.py`:
- Around line 383-387: The --gpus_per_node argparse argument currently allows
non-positive integers; add validation so gpus_per_node must be either None or a
positive int (>0). Implement a small validator (e.g., a custom type function
like positive_int or a check in the argument parsing flow) and use it for the
"--gpus_per_node" ("-gn", "--gpus_per_node") argument to raise
argparse.ArgumentTypeError for 0 or negative values, keeping default=None and
leaving None handling for downstream inference unchanged.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/evaluation/launch_evaluation_pipeline.py
- scripts/performance/README.md
- scripts/performance/argument_parser.py
- scripts/performance/setup_experiment.py
- scripts/performance/utils/utils.py
| "-gn", | ||
| "--gpus_per_node", | ||
| type=int, | ||
| help="Number of gpus per node. Defaults to 8", | ||
| default=8, | ||
| help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.", | ||
| default=None, |
Validate --gpus_per_node as a positive integer.
Line 385 currently accepts any integer. 0 can trigger division by zero in downstream node calculation, and negative values can produce invalid node counts.
🛠️ Proposed fix
```diff
+def positive_int(arg: str) -> int:
+    value = int(arg)
+    if value <= 0:
+        raise argparse.ArgumentTypeError("Value must be a positive integer")
+    return value
+
 ...
 slurm_args.add_argument(
     "-gn",
     "--gpus_per_node",
-    type=int,
+    type=positive_int,
     help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
     default=None,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/argument_parser.py` around lines 383 - 387, The
--gpus_per_node argparse argument currently allows non-positive integers; add
validation so gpus_per_node must be either None or a positive int (>0).
Implement a small validator (e.g., a custom type function like positive_int or a
check in the argument parsing flow) and use it for the "--gpus_per_node" ("-gn",
"--gpus_per_node") argument to raise argparse.ArgumentTypeError for 0 or
negative values, keeping default=None and leaving None handling for downstream
inference unchanged.
What does this PR do?

Auto-select the number of GPUs per node so that the user does not have to input the arg in most cases. Set to 8 if `args.gpu` is not recognized.

Changelog
GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
New Features
Documentation