chore(fix): Deployment parallelism by ko3n1g · Pull Request #2189 · NVIDIA-NeMo/Megatron-Bridge

ko3n1g · 2026-02-03T10:40:55Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Related to # (issue)

Summary by CodeRabbit

New Features
- Evaluation pipeline now supports configurable model parallelization parameters for enhanced flexibility. Users can customize tensor model parallel size, pipeline model parallel size, and context parallel size when launching evaluations, enabling optimized evaluation runs tailored to specific deployment scenarios, hardware configurations, and resource requirements.

Signed-off-by: oliver könig <okoenig@nvidia.com>

coderabbitai · 2026-02-03T10:46:24Z

Walkthrough

The changes introduce configurable model-parallel size parameters to the evaluation deployment workflow. `deploy.sh` now accepts the tensor, pipeline, and context parallel sizes as positional arguments instead of using hard-coded values, and `launch_evaluation_pipeline.py` is reformatted to pass these parameters when invoking the deployment script.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Model Parallel Configuration `examples/evaluation/deploy.sh` | Introduces positional parameters TP, PP, CP for tensor, pipeline, and context model parallel sizes, replacing hard-coded flags with parametrized values in the deploy_ray_inframework.py call. |
| Evaluation Pipeline `examples/evaluation/launch_evaluation_pipeline.py` | Reformats evaluation job command to invoke deploy.sh with tensor, pipeline, and context parallel size arguments, restructuring from single-line to multi-line script format. |
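As a rough illustration of the change summarized above, the launcher side could assemble the `deploy.sh` invocation along these lines. This is a minimal sketch, not the code from the PR: the helper name `build_deploy_command`, its defaults, and the example checkpoint path are hypothetical. Only the argument order (MEGATRON_CHECKPOINT, NUM_REPLICAS, NUM_GPUS, TP, PP, CP) is taken from the review comments on `deploy.sh`.

```python
import shlex
from typing import List


def build_deploy_command(
    checkpoint: str,
    num_replicas: int,
    num_gpus: int,
    tp: int = 1,
    pp: int = 1,
    cp: int = 1,
    script: str = "examples/evaluation/deploy.sh",
) -> List[str]:
    """Return an argv list for deploy.sh (hypothetical helper).

    The positional order mirrors how the script reads $1..$6:
    MEGATRON_CHECKPOINT NUM_REPLICAS NUM_GPUS TP PP CP.
    """
    return [
        "bash",
        script,
        checkpoint,
        str(num_replicas),
        str(num_gpus),
        str(tp),
        str(pp),
        str(cp),
    ]


# Example values are placeholders, not settings used in the PR.
cmd = build_deploy_command("/path/to/checkpoint", num_replicas=1, num_gpus=8, tp=2, pp=2, cp=2)
print(shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` keeps each positional value a separate, correctly quoted token, which matters once paths or sizes come from user input.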

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Possibly related PRs

chore: Add evaluation pipeline #1876: Directly related as it updates the evaluation deployment to accept and pass through tensor/pipeline/context parallel-size positional arguments to deploy.sh, modifying the same deployment command flow.

Suggested reviewers

suiyoubi
erhoo82
malay-nagda

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title 'chore(fix): Deployment parallelism' is vague and generic, using the broad term 'Deployment parallelism' without clearly specifying what deployment changes were made. | Clarify the title to specifically describe the changes, such as 'Add configurable model parallelism parameters to deployment scripts' or 'Make tensor/pipeline/context parallel sizes configurable in deployment'. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Test Results For Major Changes | ✅ Passed | Changes are minor configuration parameterization for deployment parallelism settings without affecting core logic or test documentation requirements. |


coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

examples/evaluation/deploy.sh (2)
1-4: 🛠️ Refactor suggestion | 🟠 Major

Missing NVIDIA copyright header and shebang line.

As per coding guidelines, shell scripts must include the NVIDIA copyright header at the top. Additionally, per Google Shell Style Guide, a shebang line should be present.
Proposed fix
+#!/bin/bash
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # Unset SLURM/PMI/PMIX env vars to prevent MPI initialization issues
12-22: 🛠️ Refactor suggestion | 🟠 Major

Use uv run instead of bare python.

As per coding guidelines, scripts matching {**/*.sh,examples/**/*.py} should use uv run to execute scripts instead of calling python directly.

Also, there's trailing whitespace on line 22 after "$CP".
Proposed fix
-python \
+uv run python \
   /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
   --megatron_checkpoint "$MEGATRON_CHECKPOINT" \
   --model_id megatron_model \
   --host 0.0.0.0 \
   --port 8000 \
   --num_gpus "$NUM_GPUS" \
   --num_replicas "$NUM_REPLICAS" \
   --tensor_model_parallel_size "$TP" \
   --pipeline_model_parallel_size "$PP" \
-  --context_parallel_size "$CP" 
+  --context_parallel_size "$CP"
examples/evaluation/launch_evaluation_pipeline.py (1)
158-165: ⚠️ Potential issue | 🟠 Major

Potential UnboundLocalError if no runs are found.

If runs is empty (line 158 condition is falsy), the variable run_id will not be defined, causing an UnboundLocalError on line 165 when passed to wandb.init().
Proposed fix
         if runs:
             run_id = runs[0].id
             print(f"Found run with ID: {run_id}")
+        else:
+            run_id = None
+            print("No existing run found, creating new run")

         wandb_run = wandb.init(
             project=args.wandb_project_name,
             entity=args.wandb_entity_name,
-            id=run_id,
+            id=run_id,  # None will create a new run
             resume="allow",
         )
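The failure mode described in this comment can be reproduced and avoided without W&B at all. The sketch below is illustrative only: `FakeRun` and `resolve_run_id` are hypothetical names standing in for the run objects and the lookup logic in `launch_evaluation_pipeline.py`. It shows the shape of the proposed fix, where every branch assigns a value before it is used.

```python
from typing import Optional, Sequence


class FakeRun:
    """Stand-in for a run object; only the `id` attribute matters here."""

    def __init__(self, run_id: str) -> None:
        self.id = run_id


def resolve_run_id(runs: Sequence[FakeRun]) -> Optional[str]:
    """Return the first run's ID, or None when the lookup found nothing.

    Returning None explicitly avoids the UnboundLocalError that occurs
    when a variable is assigned only inside an `if runs:` branch.
    """
    if runs:
        return runs[0].id
    return None


print(resolve_run_id([FakeRun("abc123")]))  # ID of the existing run
print(resolve_run_id([]))                   # None: caller starts a new run
```

Per the review comment, the resulting value would then be passed as `id=` to `wandb.init()`, with `None` leading to a new run.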

🧹 Nitpick comments (2)

examples/evaluation/deploy.sh (1)
6-11: Consider adding input validation for required parameters.

The script now requires 6 positional arguments but has no validation. Missing arguments will silently pass empty strings to the Python script, causing cryptic errors.
Proposed fix
+if [[ $# -lt 6 ]]; then
+  echo "Usage: $0 MEGATRON_CHECKPOINT NUM_REPLICAS NUM_GPUS TP PP CP" >&2
+  exit 1
+fi
+
 MEGATRON_CHECKPOINT=$1
 NUM_REPLICAS=$2
 NUM_GPUS=$3
 TP=$4
 PP=$5
 CP=$6
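The same guard could also be applied on the Python side before the script is launched. The following is a sketch under stated assumptions: `parse_deploy_args` is a hypothetical helper, and the positive-integer check on the numeric arguments is an extra assumption that goes beyond the argument-count check proposed in the review.

```python
from typing import Dict, Sequence, Union

# Positional order expected by deploy.sh, as listed in the usage message above.
ARG_NAMES = ("MEGATRON_CHECKPOINT", "NUM_REPLICAS", "NUM_GPUS", "TP", "PP", "CP")


def parse_deploy_args(argv: Sequence[str]) -> Dict[str, Union[str, int]]:
    """Validate the six positional arguments and return them by name."""
    if len(argv) < len(ARG_NAMES):
        raise ValueError(
            f"expected {len(ARG_NAMES)} arguments ({' '.join(ARG_NAMES)}), got {len(argv)}"
        )
    if not argv[0]:
        raise ValueError("MEGATRON_CHECKPOINT must not be empty")

    parsed: Dict[str, Union[str, int]] = {"MEGATRON_CHECKPOINT": argv[0]}
    # Assumption: every remaining argument is a count or parallel size >= 1.
    for name, raw in zip(ARG_NAMES[1:], argv[1:]):
        try:
            value = int(raw)
        except ValueError:
            value = 0
        if value < 1:
            raise ValueError(f"{name} must be a positive integer, got {raw!r}")
        parsed[name] = value
    return parsed


print(parse_deploy_args(["/path/to/checkpoint", "1", "8", "2", "2", "2"]))
```

Failing early with a named argument in the message addresses the concern raised above, where a missing value would otherwise reach the Python deployment script as an empty string.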
examples/evaluation/launch_evaluation_pipeline.py (1)
14-14: Shebang should be on the first line.

The shebang #!/usr/bin/env python3 on line 14 should be the very first line of the file, before the copyright header, for the script to be directly executable.
Proposed fix

Move the shebang to line 1:
+#!/usr/bin/env python3
 # Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 ...
 # limitations under the License.
-#!/usr/bin/env python3
 """

chore(fix): Deployment parallelism

fdbb81d

Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot temporarily deployed to nemo-ci February 3, 2026 10:41 Inactive

coderabbitai bot reviewed Feb 3, 2026

View reviewed changes

fix

e824e12

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g requested review from a team, erhoo82 and malay-nagda as code owners February 3, 2026 13:16

copy-pr-bot bot temporarily deployed to nemo-ci February 3, 2026 13:17 Inactive

malay-nagda previously approved these changes Feb 3, 2026

View reviewed changes

ko3n1g added the r0.3.0 label (cherry-pick label for the r0.3.0 release branch) Feb 3, 2026

Merge branch 'main' into ko3n1g/ci/fix-evaluation

cb0e835

copy-pr-bot bot temporarily deployed to nemo-ci February 17, 2026 10:41 Inactive

copy-pr-bot bot temporarily deployed to test February 17, 2026 10:41 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci February 17, 2026 10:47 Failure

copy-pr-bot bot temporarily deployed to public February 17, 2026 11:08 Inactive

expert_model_parallel_size

b79b08c

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g dismissed malay-nagda’s stale review via b79b08c February 17, 2026 13:08

copy-pr-bot bot temporarily deployed to nemo-ci February 17, 2026 13:09 Inactive

copy-pr-bot bot temporarily deployed to test February 17, 2026 13:09 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 17, 2026 13:10 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci February 17, 2026 13:17 Failure

revert

ccf5e9e

Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot temporarily deployed to nemo-ci February 17, 2026 16:52 Inactive

ko3n1g merged commit 24cd876 into main Feb 17, 2026
12 checks passed

ko3n1g deleted the ko3n1g/ci/fix-evaluation branch February 17, 2026 16:53

copy-pr-bot bot temporarily deployed to test February 17, 2026 16:53 Inactive

ko3n1g added a commit that referenced this pull request Feb 17, 2026

chore(fix): Deployment parallelism (#2189)

02b5f5d

Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

coderabbitai bot mentioned this pull request Feb 17, 2026

cp: chore(fix): Deployment parallelism (2189) into r0.3.0 #2406

Closed

ko3n1g restored the ko3n1g/ci/fix-evaluation branch February 17, 2026 16:59

copy-pr-bot bot temporarily deployed to nemo-ci February 17, 2026 17:25 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci February 17, 2026 17:25 Failure

copy-pr-bot bot temporarily deployed to nemo-ci February 17, 2026 17:25 Inactive

pengdurice pushed a commit to pengdurice/Megatron-Bridge that referenced this pull request Feb 24, 2026

chore(fix): Deployment parallelism (NVIDIA-NeMo#2189)

a61ba98

Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: pengdurice <pengduhit@gmail.com>

ko3n1g mentioned this pull request Feb 26, 2026

260201: Cherrypick various changes #2509

Merged

5 tasks

ko3n1g added a commit that referenced this pull request Feb 26, 2026

chore(fix): Deployment parallelism (#2189)

57cbd51

Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

copy-pr-bot bot pushed a commit that referenced this pull request Mar 19, 2026

chore(fix): Deployment parallelism (#2189)

1440031

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g merged 5 commits into main from ko3n1g/ci/fix-evaluation
