
select_num_devices_per_node #2123

Merged
malay-nagda merged 7 commits into main from malay/select_num_devices_per_node
Feb 26, 2026

Conversation

@malay-nagda
Contributor

@malay-nagda malay-nagda commented Jan 29, 2026

What does this PR do ?

Auto-select the number of GPUs per node so that the user does not have to pass the argument in most cases. Falls back to 8 if args.gpu is not recognized.

Changelog

NUM_GPUS_PER_NODE_MAP = {
    "h100": 8,
    "b200": 8,
    "b300": 8,
    "gb200": 4,
    "gb300": 4,
}
gpus_per_node = args.gpus_per_node
if gpus_per_node is None:
    if args.gpu in NUM_GPUS_PER_NODE_MAP:
        gpus_per_node = NUM_GPUS_PER_NODE_MAP[args.gpu]
    else:
        gpus_per_node = 8
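The same fallback can be sketched more compactly with `dict.get`; a minimal self-contained version (the `resolve_gpus_per_node` helper name is illustrative, not part of the PR):

```python
# Same mapping as in the PR; unknown GPU types fall back to 8.
NUM_GPUS_PER_NODE_MAP = {
    "h100": 8,
    "b200": 8,
    "b300": 8,
    "gb200": 4,
    "gb300": 4,
}

def resolve_gpus_per_node(gpu: str, gpus_per_node=None) -> int:
    """Return the explicit value if given, else infer from the GPU type."""
    if gpus_per_node is not None:
        return gpus_per_node
    return NUM_GPUS_PER_NODE_MAP.get(gpu, 8)

print(resolve_gpus_per_node("gb200"))     # 4 (inferred)
print(resolve_gpus_per_node("unknown"))   # 8 (fallback)
print(resolve_gpus_per_node("gb300", 2))  # 2 (explicit value wins)
```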

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Automatic GPU-per-node configuration—the system now intelligently infers the optimal number of GPUs per node based on the selected GPU type rather than requiring manual specification.
  • Documentation

    • Updated help text and documentation to reflect that GPU-per-node values will be automatically inferred from GPU type when not explicitly provided.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 29, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda malay-nagda requested a review from ko3n1g January 29, 2026 10:35
ko3n1g
ko3n1g previously approved these changes Jan 29, 2026
@ko3n1g
Contributor

ko3n1g commented Jan 29, 2026

Thanks!

Signed-off-by: malay-nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda malay-nagda marked this pull request as ready for review February 25, 2026 12:02
@malay-nagda malay-nagda requested review from a team and erhoo82 as code owners February 25, 2026 12:02
@malay-nagda malay-nagda requested a review from ko3n1g February 25, 2026 12:02
Contributor

@ko3n1g ko3n1g left a comment


Can we rather fail if neither num-GPUs nor GPU is recognized?

@malay-nagda
Contributor Author

Can we rather fail if neither num-GPUs nor GPU is recognized?

@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

Changes introduce a GPU-per-node mapping system based on GPU type, replacing the hardcoded default of 8 with dynamic inference via a lookup table. The configuration default is changed to None, with fallback logic added to scripts to resolve the value before passing to executors.

Changes

Cohort / File(s): Summary

  • GPU-per-node mapping definition (scripts/performance/utils/utils.py): Added new constant NUM_GPUS_PER_NODE_MAP mapping GPU types (h100, b200, b300 to 8; gb200, gb300 to 4) to per-node GPU counts.
  • Argument configuration and documentation (scripts/performance/argument_parser.py, scripts/performance/README.md): Changed the --gpus_per_node default from 8 to None, with updated help text indicating dynamic inference from GPU type when not provided.
  • Execution logic with fallback resolution (examples/evaluation/launch_evaluation_pipeline.py, scripts/performance/setup_experiment.py): Added fallback logic to resolve gpus_per_node: if None, look it up in NUM_GPUS_PER_NODE_MAP using the GPU type, defaulting to 8 if not found. Updated executor invocations to use the resolved value instead of the potentially None argument.
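The flow described above (default of None at parse time, fallback resolution before the executor) can be sketched end to end; the parser below is a stripped-down stand-in for the real argument_parser.py, with only the flags relevant here:

```python
import argparse

NUM_GPUS_PER_NODE_MAP = {"h100": 8, "b200": 8, "b300": 8, "gb200": 4, "gb300": 4}

parser = argparse.ArgumentParser()
parser.add_argument("--gpu", default="h100")
# default=None signals "infer from GPU type later"
parser.add_argument("-gn", "--gpus_per_node", type=int, default=None)

args = parser.parse_args(["--gpu", "gb200"])

# Fallback resolution before handing the value to an executor.
gpus_per_node = args.gpus_per_node
if gpus_per_node is None:
    gpus_per_node = NUM_GPUS_PER_NODE_MAP.get(args.gpu, 8)

print(gpus_per_node)  # 4
```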

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related PRs

  • chore: Add evaluation pipeline #1876: Modifies the same files (examples/evaluation/launch_evaluation_pipeline.py) and implements similar GPU-per-node executor handling logic with imports.

Suggested reviewers

  • suiyoubi
  • erhoo82
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes: ⚠️ Warning. The PR introduces a major GPU allocation feature but lacks test results, regression testing, and performance validation documentation. Resolution: add documentation of tests run, regression testing results, GPU mapping validation, fallback logic testing, and performance benchmarks before merging.

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'select_num_devices_per_node' directly describes the PR's main objective of automatically selecting GPUs per node based on GPU type, which aligns with the changeset's core functionality.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/performance/utils/utils.py (1)

29-35: Centralize GPU-per-node resolution in this module to avoid duplicated fallback logic.

Lines 29-35 add the map, but inference is reimplemented in multiple entrypoints. A shared resolver here will keep behavior consistent.

♻️ Proposed refactor
 NUM_GPUS_PER_NODE_MAP = {
     "h100": 8,
     "b200": 8,
     "b300": 8,
     "gb200": 4,
     "gb300": 4,
 }
+
+
+def resolve_gpus_per_node(gpu: str, gpus_per_node: int | None) -> int:
+    """Resolve effective GPUs-per-node from CLI value and GPU type."""
+    if gpus_per_node is not None:
+        return gpus_per_node
+    return NUM_GPUS_PER_NODE_MAP.get(gpu, 8)

Then replace duplicated blocks in scripts/performance/setup_experiment.py and
examples/evaluation/launch_evaluation_pipeline.py with resolve_gpus_per_node(...).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/utils.py` around lines 29 - 35, Add a centralized
GPU-per-node resolver in this module: keep the existing NUM_GPUS_PER_NODE_MAP
and implement a function named resolve_gpus_per_node(node_type: str, default:
int | None = None) that looks up node_type in NUM_GPUS_PER_NODE_MAP and returns
the mapped int or the provided default (or raises a clear ValueError if no
default). Update callers to use resolve_gpus_per_node instead of duplicating
fallback logic—specifically replace the GPU-resolution blocks currently
duplicated in scripts/performance/setup_experiment.py and
examples/evaluation/launch_evaluation_pipeline.py with calls to
resolve_gpus_per_node(node_type, default).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/performance/argument_parser.py`:
- Around line 383-388: The --gpus_per_node add_argument currently allows
0/negative values; add a small validator and use it as the argument type so
parsing fails early. Create a helper like positive_int(value) that converts to
int and raises argparse.ArgumentTypeError if value <= 0, then replace the
existing add_argument(..., "--gpus_per_node", type=int, ...) to use
type=positive_int in the argument_parser (the add_argument call for
"--gpus_per_node" and any parse_args or parse function that references it). This
ensures invalid values are rejected at parse time with a clear error.

---

Nitpick comments:
In `@scripts/performance/utils/utils.py`:
- Around line 29-35: Add a centralized GPU-per-node resolver in this module:
keep the existing NUM_GPUS_PER_NODE_MAP and implement a function named
resolve_gpus_per_node(node_type: str, default: int | None = None) that looks up
node_type in NUM_GPUS_PER_NODE_MAP and returns the mapped int or the provided
default (or raises a clear ValueError if no default). Update callers to use
resolve_gpus_per_node instead of duplicating fallback logic—specifically replace
the GPU-resolution blocks currently duplicated in
scripts/performance/setup_experiment.py and
examples/evaluation/launch_evaluation_pipeline.py with calls to
resolve_gpus_per_node(node_type, default).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b058b66 and 03b58ba.

📒 Files selected for processing (5)
  • examples/evaluation/launch_evaluation_pipeline.py
  • scripts/performance/README.md
  • scripts/performance/argument_parser.py
  • scripts/performance/setup_experiment.py
  • scripts/performance/utils/utils.py

Comment on lines 383 to 388

     "-gn",
     "--gpus_per_node",
     type=int,
-    help="Number of gpus per node. Defaults to 8",
-    default=8,
+    help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
+    default=None,
 )
Contributor


⚠️ Potential issue | 🟠 Major

Validate --gpus_per_node as a positive integer at parse time.

Line 385 currently accepts 0/negative values, which can break downstream node calculation (num_gpus // gpus_per_node) and produce runtime failures.

🛠️ Proposed fix
 def list_of_ints(arg):
@@
     return result
+
+
+def positive_int(arg: str) -> int:
+    """Parse a strictly positive integer CLI value."""
+    value = int(arg)
+    if value < 1:
+        raise argparse.ArgumentTypeError("value must be >= 1")
+    return value
@@
     slurm_args.add_argument(
         "-gn",
         "--gpus_per_node",
-        type=int,
+        type=positive_int,
         help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
         default=None,
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 "-gn",
 "--gpus_per_node",
-type=int,
+type=positive_int,
 help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
 default=None,
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/argument_parser.py` around lines 383 - 388, The
--gpus_per_node add_argument currently allows 0/negative values; add a small
validator and use it as the argument type so parsing fails early. Create a
helper like positive_int(value) that converts to int and raises
argparse.ArgumentTypeError if value <= 0, then replace the existing
add_argument(..., "--gpus_per_node", type=int, ...) to use type=positive_int in
the argument_parser (the add_argument call for "--gpus_per_node" and any
parse_args or parse function that references it). This ensures invalid values
are rejected at parse time with a clear error.

@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

The PR introduces a GPU type-to-count mapping (NUM_GPUS_PER_NODE_MAP) and updates multiple scripts to dynamically infer gpus_per_node from GPU type when not explicitly provided, replacing the previous hardcoded default of 8.

Changes

Cohort / File(s): Summary

  • GPU Mapping Definition (scripts/performance/utils/utils.py): Adds a new constant NUM_GPUS_PER_NODE_MAP mapping GPU type strings (h100, b200, b300, gb200, gb300) to their respective GPU counts per node.
  • Argument Configuration (scripts/performance/argument_parser.py, scripts/performance/README.md): Updates the default value for the --gpus_per_node argument from 8 to None and documents that the value will be inferred from GPU type if not provided.
  • GPU Inference Logic (scripts/performance/setup_experiment.py, examples/evaluation/launch_evaluation_pipeline.py): Implements dynamic GPU-per-node computation using NUM_GPUS_PER_NODE_MAP when gpus_per_node is not explicitly provided; falls back to 8 if the GPU type is unknown. Updates both Slurm and KubeRay execution paths to use the computed value.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • chore: Add evaluation pipeline #1876: Main PR that updates GPU-per-node resolution in evaluation and performance scripts, introducing the same NUM_GPUS_PER_NODE_MAP constant and dynamic inference logic across related execution paths.

Suggested reviewers

  • suiyoubi
  • erhoo82
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Test Results For Major Changes: ⚠️ Warning. The PR description lacks test results, validation evidence, or benchmarks for the GPU auto-selection feature and related code changes. Resolution: document comprehensive testing, including GPU type mapping validation, fallback behavior verification, and integration testing with evaluation pipelines.
  • Title check: ❓ Inconclusive. The title 'select_num_devices_per_node' is vague and does not clearly convey the main purpose of the change, which is to auto-select GPUs per node based on GPU type when not explicitly provided. Resolution: use a more descriptive title that clearly explains the feature, such as 'Auto-infer GPUs per node from GPU type' or 'Add GPU-type-based gpus_per_node inference'.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/evaluation/launch_evaluation_pipeline.py`:
- Around line 85-90: The block that tries to infer gpus_per_node is dead/unsafe
because the argparser already defaults --gpus_per_node to 8 and there is no
--gpu argument; update either the parser or the logic: either remove the entire
conditional that references args.gpu and use args.gpus_per_node directly, or
modify the argparse setup to add a --gpu option and change the --gpus_per_node
default to None so the inference path can run; if you keep the inference, guard
access to args.gpu (e.g., check hasattr(args, "gpu")) before using
NUM_GPUS_PER_NODE_MAP to avoid AttributeError. Use the symbols gpus_per_node,
args.gpus_per_node, args.gpu, and NUM_GPUS_PER_NODE_MAP to locate and change the
code.

In `@scripts/performance/argument_parser.py`:
- Around line 383-387: The --gpus_per_node argparse argument currently allows
non-positive integers; add validation so gpus_per_node must be either None or a
positive int (>0). Implement a small validator (e.g., a custom type function
like positive_int or a check in the argument parsing flow) and use it for the
"--gpus_per_node" ("-gn", "--gpus_per_node") argument to raise
argparse.ArgumentTypeError for 0 or negative values, keeping default=None and
leaving None handling for downstream inference unchanged.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b058b66 and 03b58ba.

📒 Files selected for processing (5)
  • examples/evaluation/launch_evaluation_pipeline.py
  • scripts/performance/README.md
  • scripts/performance/argument_parser.py
  • scripts/performance/setup_experiment.py
  • scripts/performance/utils/utils.py

Comment on lines 383 to 387

     "-gn",
     "--gpus_per_node",
     type=int,
-    help="Number of gpus per node. Defaults to 8",
-    default=8,
+    help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
+    default=None,
Contributor


⚠️ Potential issue | 🟠 Major

Validate --gpus_per_node as a positive integer.

Line 385 currently accepts any integer. 0 can trigger division by zero in downstream node calculation, and negative values can produce invalid node counts.

🛠️ Proposed fix
+def positive_int(arg: str) -> int:
+    value = int(arg)
+    if value <= 0:
+        raise argparse.ArgumentTypeError("Value must be a positive integer")
+    return value
+
 ...
     slurm_args.add_argument(
         "-gn",
         "--gpus_per_node",
-        type=int,
+        type=positive_int,
         help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
         default=None,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/argument_parser.py` around lines 383 - 387, The
--gpus_per_node argparse argument currently allows non-positive integers; add
validation so gpus_per_node must be either None or a positive int (>0).
Implement a small validator (e.g., a custom type function like positive_int or a
check in the argument parsing flow) and use it for the "--gpus_per_node" ("-gn",
"--gpus_per_node") argument to raise argparse.ArgumentTypeError for 0 or
negative values, keeping default=None and leaving None handling for downstream
inference unchanged.
