
select_num_devices_per_node #2123

Merged
malay-nagda merged 7 commits into main from malay/select_num_devices_per_node
Feb 26, 2026

Conversation

@malay-nagda
Contributor

@malay-nagda malay-nagda commented Jan 29, 2026

What does this PR do ?

Auto-select the number of GPUs per node so that the user does not have to pass the argument in most cases. Falls back to 8 if args.gpu is not recognized.

Changelog

NUM_GPUS_PER_NODE_MAP = {
    "h100": 8,
    "b200": 8,
    "b300": 8,
    "gb200": 4,
    "gb300": 4,
}
gpus_per_node = args.gpus_per_node
if gpus_per_node is None:
    if args.gpu in NUM_GPUS_PER_NODE_MAP:
        gpus_per_node = NUM_GPUS_PER_NODE_MAP[args.gpu]
    else:
        gpus_per_node = 8
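The same fallback can be sketched more compactly with `dict.get`; a minimal self-contained version (the `resolve_gpus_per_node` helper name is illustrative, not part of the PR):

```python
# Same mapping as in the PR; unknown GPU types fall back to 8.
NUM_GPUS_PER_NODE_MAP = {
    "h100": 8,
    "b200": 8,
    "b300": 8,
    "gb200": 4,
    "gb300": 4,
}

def resolve_gpus_per_node(gpu: str, gpus_per_node=None) -> int:
    """Return the explicit value if given, else infer from the GPU type."""
    if gpus_per_node is not None:
        return gpus_per_node
    return NUM_GPUS_PER_NODE_MAP.get(gpu, 8)

print(resolve_gpus_per_node("gb200"))     # 4 (inferred)
print(resolve_gpus_per_node("unknown"))   # 8 (fallback)
print(resolve_gpus_per_node("gb300", 2))  # 2 (explicit value wins)
```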

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Automatic GPU-per-node configuration—the system now intelligently infers the optimal number of GPUs per node based on the selected GPU type rather than requiring manual specification.
  • Documentation

    • Updated help text and documentation to reflect that GPU-per-node values will be automatically inferred from GPU type when not explicitly provided.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 29, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda malay-nagda requested a review from ko3n1g January 29, 2026 10:35
ko3n1g
ko3n1g previously approved these changes Jan 29, 2026
@ko3n1g
Contributor

ko3n1g commented Jan 29, 2026

Thanks!

Signed-off-by: malay-nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda malay-nagda marked this pull request as ready for review February 25, 2026 12:02
@malay-nagda malay-nagda requested review from a team and erhoo82 as code owners February 25, 2026 12:02
@malay-nagda malay-nagda requested a review from ko3n1g February 25, 2026 12:02
Contributor

@ko3n1g ko3n1g left a comment


Can we rather fail if neither num-GPUs nor GPU is recognized?

@malay-nagda
Contributor Author

Can we rather fail if neither num-GPUs nor GPU is recognized?

@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

Changes introduce a GPU-per-node mapping system based on GPU type, replacing the hardcoded default of 8 with dynamic inference via a lookup table. The configuration default is changed to None, with fallback logic added to scripts to resolve the value before passing to executors.

Changes

Cohort / File(s): Summary

  • GPU-per-node mapping definition (scripts/performance/utils/utils.py): Added new constant NUM_GPUS_PER_NODE_MAP mapping GPU types (h100, b200, b300 to 8; gb200, gb300 to 4) to per-node GPU counts.
  • Argument configuration and documentation (scripts/performance/argument_parser.py, scripts/performance/README.md): Changed the --gpus_per_node default from 8 to None, with updated help text indicating dynamic inference from GPU type when not provided.
  • Execution logic with fallback resolution (examples/evaluation/launch_evaluation_pipeline.py, scripts/performance/setup_experiment.py): Added fallback logic to resolve gpus_per_node: if None, look it up in NUM_GPUS_PER_NODE_MAP using the GPU type, defaulting to 8 if not found. Updated executor invocations to use the resolved value instead of the potentially None argument.
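The flow described above (default of None at parse time, fallback resolution before the executor) can be sketched end to end; the parser below is a stripped-down stand-in for the real argument_parser.py, with only the flags relevant here:

```python
import argparse

NUM_GPUS_PER_NODE_MAP = {"h100": 8, "b200": 8, "b300": 8, "gb200": 4, "gb300": 4}

parser = argparse.ArgumentParser()
parser.add_argument("--gpu", default="h100")
# default=None signals "infer from GPU type later"
parser.add_argument("-gn", "--gpus_per_node", type=int, default=None)

args = parser.parse_args(["--gpu", "gb200"])

# Fallback resolution before handing the value to an executor.
gpus_per_node = args.gpus_per_node
if gpus_per_node is None:
    gpus_per_node = NUM_GPUS_PER_NODE_MAP.get(args.gpu, 8)

print(gpus_per_node)  # 4
```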

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related PRs

  • chore: Add evaluation pipeline #1876: Modifies the same files (examples/evaluation/launch_evaluation_pipeline.py) and implements similar GPU-per-node executor handling logic with imports.

Suggested reviewers

  • suiyoubi
  • erhoo82
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes: ⚠️ Warning. The PR introduces a major GPU allocation feature but lacks test results, regression testing, and performance validation documentation. Resolution: add documentation of tests run, regression testing results, GPU mapping validation, fallback logic testing, and performance benchmarks before merging.

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'select_num_devices_per_node' directly describes the PR's main objective of automatically selecting GPUs per node based on GPU type, which aligns with the changeset's core functionality.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/performance/utils/utils.py (1)

29-35: Centralize GPU-per-node resolution in this module to avoid duplicated fallback logic.

Lines 29-35 add the map, but inference is reimplemented in multiple entrypoints. A shared resolver here will keep behavior consistent.

♻️ Proposed refactor
 NUM_GPUS_PER_NODE_MAP = {
     "h100": 8,
     "b200": 8,
     "b300": 8,
     "gb200": 4,
     "gb300": 4,
 }
+
+
+def resolve_gpus_per_node(gpu: str, gpus_per_node: int | None) -> int:
+    """Resolve effective GPUs-per-node from CLI value and GPU type."""
+    if gpus_per_node is not None:
+        return gpus_per_node
+    return NUM_GPUS_PER_NODE_MAP.get(gpu, 8)

Then replace duplicated blocks in scripts/performance/setup_experiment.py and
examples/evaluation/launch_evaluation_pipeline.py with resolve_gpus_per_node(...).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/utils.py` around lines 29 - 35, Add a centralized
GPU-per-node resolver in this module: keep the existing NUM_GPUS_PER_NODE_MAP
and implement a function named resolve_gpus_per_node(node_type: str, default:
int | None = None) that looks up node_type in NUM_GPUS_PER_NODE_MAP and returns
the mapped int or the provided default (or raises a clear ValueError if no
default). Update callers to use resolve_gpus_per_node instead of duplicating
fallback logic—specifically replace the GPU-resolution blocks currently
duplicated in scripts/performance/setup_experiment.py and
examples/evaluation/launch_evaluation_pipeline.py with calls to
resolve_gpus_per_node(node_type, default).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/performance/argument_parser.py`:
- Around line 383-388: The --gpus_per_node add_argument currently allows
0/negative values; add a small validator and use it as the argument type so
parsing fails early. Create a helper like positive_int(value) that converts to
int and raises argparse.ArgumentTypeError if value <= 0, then replace the
existing add_argument(..., "--gpus_per_node", type=int, ...) to use
type=positive_int in the argument_parser (the add_argument call for
"--gpus_per_node" and any parse_args or parse function that references it). This
ensures invalid values are rejected at parse time with a clear error.

---

Nitpick comments:
In `@scripts/performance/utils/utils.py`:
- Around line 29-35: Add a centralized GPU-per-node resolver in this module:
keep the existing NUM_GPUS_PER_NODE_MAP and implement a function named
resolve_gpus_per_node(node_type: str, default: int | None = None) that looks up
node_type in NUM_GPUS_PER_NODE_MAP and returns the mapped int or the provided
default (or raises a clear ValueError if no default). Update callers to use
resolve_gpus_per_node instead of duplicating fallback logic—specifically replace
the GPU-resolution blocks currently duplicated in
scripts/performance/setup_experiment.py and
examples/evaluation/launch_evaluation_pipeline.py with calls to
resolve_gpus_per_node(node_type, default).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b058b66 and 03b58ba.

📒 Files selected for processing (5)
  • examples/evaluation/launch_evaluation_pipeline.py
  • scripts/performance/README.md
  • scripts/performance/argument_parser.py
  • scripts/performance/setup_experiment.py
  • scripts/performance/utils/utils.py

Comment on lines 383 to 388

     "-gn",
     "--gpus_per_node",
     type=int,
-    help="Number of gpus per node. Defaults to 8",
-    default=8,
+    help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
+    default=None,
 )
Contributor


⚠️ Potential issue | 🟠 Major

Validate --gpus_per_node as a positive integer at parse time.

Line 385 currently accepts 0/negative values, which can break downstream node calculation (num_gpus // gpus_per_node) and produce runtime failures.

🛠️ Proposed fix
 def list_of_ints(arg):
@@
     return result
+
+
+def positive_int(arg: str) -> int:
+    """Parse a strictly positive integer CLI value."""
+    value = int(arg)
+    if value < 1:
+        raise argparse.ArgumentTypeError("value must be >= 1")
+    return value
@@
     slurm_args.add_argument(
         "-gn",
         "--gpus_per_node",
-        type=int,
+        type=positive_int,
         help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
         default=None,
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 "-gn",
 "--gpus_per_node",
-type=int,
+type=positive_int,
 help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
 default=None,
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/argument_parser.py` around lines 383 - 388, The
--gpus_per_node add_argument currently allows 0/negative values; add a small
validator and use it as the argument type so parsing fails early. Create a
helper like positive_int(value) that converts to int and raises
argparse.ArgumentTypeError if value <= 0, then replace the existing
add_argument(..., "--gpus_per_node", type=int, ...) to use type=positive_int in
the argument_parser (the add_argument call for "--gpus_per_node" and any
parse_args or parse function that references it). This ensures invalid values
are rejected at parse time with a clear error.

@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

The PR introduces a GPU type-to-count mapping (NUM_GPUS_PER_NODE_MAP) and updates multiple scripts to dynamically infer gpus_per_node from GPU type when not explicitly provided, replacing the previous hardcoded default of 8.

Changes

Cohort / File(s): Summary

  • GPU Mapping Definition (scripts/performance/utils/utils.py): Adds a new constant NUM_GPUS_PER_NODE_MAP mapping GPU type strings (h100, b200, b300, gb200, gb300) to their respective GPU counts per node.
  • Argument Configuration (scripts/performance/argument_parser.py, scripts/performance/README.md): Updates the default value for the --gpus_per_node argument from 8 to None and documents that the value will be inferred from GPU type if not provided.
  • GPU Inference Logic (scripts/performance/setup_experiment.py, examples/evaluation/launch_evaluation_pipeline.py): Implements dynamic GPU-per-node computation using NUM_GPUS_PER_NODE_MAP when gpus_per_node is not explicitly provided; falls back to 8 if the GPU type is unknown. Updates both Slurm and KubeRay execution paths to use the computed value.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • chore: Add evaluation pipeline #1876: Main PR that updates GPU-per-node resolution in evaluation and performance scripts, introducing the same NUM_GPUS_PER_NODE_MAP constant and dynamic inference logic across related execution paths.

Suggested reviewers

  • suiyoubi
  • erhoo82
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Test Results For Major Changes: ⚠️ Warning. The PR description lacks test results, validation evidence, or benchmarks for the GPU auto-selection feature and related code changes. Resolution: document comprehensive testing, including GPU type mapping validation, fallback behavior verification, and integration testing with evaluation pipelines.
  • Title check: ❓ Inconclusive. The title 'select_num_devices_per_node' is vague and does not clearly convey the main purpose of the change, which is to auto-select GPUs per node based on GPU type when not explicitly provided. Resolution: use a more descriptive title that clearly explains the feature, such as 'Auto-infer GPUs per node from GPU type' or 'Add GPU-type-based gpus_per_node inference'.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/evaluation/launch_evaluation_pipeline.py`:
- Around line 85-90: The block that tries to infer gpus_per_node is dead/unsafe
because the argparser already defaults --gpus_per_node to 8 and there is no
--gpu argument; update either the parser or the logic: either remove the entire
conditional that references args.gpu and use args.gpus_per_node directly, or
modify the argparse setup to add a --gpu option and change the --gpus_per_node
default to None so the inference path can run; if you keep the inference, guard
access to args.gpu (e.g., check hasattr(args, "gpu")) before using
NUM_GPUS_PER_NODE_MAP to avoid AttributeError. Use the symbols gpus_per_node,
args.gpus_per_node, args.gpu, and NUM_GPUS_PER_NODE_MAP to locate and change the
code.

In `@scripts/performance/argument_parser.py`:
- Around line 383-387: The --gpus_per_node argparse argument currently allows
non-positive integers; add validation so gpus_per_node must be either None or a
positive int (>0). Implement a small validator (e.g., a custom type function
like positive_int or a check in the argument parsing flow) and use it for the
"--gpus_per_node" ("-gn", "--gpus_per_node") argument to raise
argparse.ArgumentTypeError for 0 or negative values, keeping default=None and
leaving None handling for downstream inference unchanged.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b058b66 and 03b58ba.

📒 Files selected for processing (5)
  • examples/evaluation/launch_evaluation_pipeline.py
  • scripts/performance/README.md
  • scripts/performance/argument_parser.py
  • scripts/performance/setup_experiment.py
  • scripts/performance/utils/utils.py

Comment on lines 383 to 387

     "-gn",
     "--gpus_per_node",
     type=int,
-    help="Number of gpus per node. Defaults to 8",
-    default=8,
+    help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
+    default=None,
Contributor


⚠️ Potential issue | 🟠 Major

Validate --gpus_per_node as a positive integer.

Line 385 currently accepts any integer. 0 can trigger division by zero in downstream node calculation, and negative values can produce invalid node counts.

🛠️ Proposed fix
+def positive_int(arg: str) -> int:
+    value = int(arg)
+    if value <= 0:
+        raise argparse.ArgumentTypeError("Value must be a positive integer")
+    return value
+
 ...
     slurm_args.add_argument(
         "-gn",
         "--gpus_per_node",
-        type=int,
+        type=positive_int,
         help="Number of gpus per node. Defaults to None. If not provided, will be inferred from the GPU type.",
         default=None,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/argument_parser.py` around lines 383 - 387, The
--gpus_per_node argparse argument currently allows non-positive integers; add
validation so gpus_per_node must be either None or a positive int (>0).
Implement a small validator (e.g., a custom type function like positive_int or a
check in the argument parsing flow) and use it for the "--gpus_per_node" ("-gn",
"--gpus_per_node") argument to raise argparse.ArgumentTypeError for 0 or
negative values, keeping default=None and leaving None handling for downstream
inference unchanged.
