chore(fix): Deployment parallelism#2189

Merged
ko3n1g merged 5 commits into main from ko3n1g/ci/fix-evaluation
Feb 17, 2026
Conversation

@ko3n1g ko3n1g commented Feb 3, 2026

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features
    • The evaluation pipeline now accepts configurable model-parallelism parameters. Tensor model parallel size, pipeline model parallel size, and context parallel size can be set when launching an evaluation, so a run can be matched to the available hardware.

Signed-off-by: oliver könig <okoenig@nvidia.com>

coderabbitai bot commented Feb 3, 2026

Walkthrough

The changes introduce configurable model parallel size parameters to the evaluation deployment workflow. deploy.sh now accepts tensor, pipeline, and context parallel sizes as positional arguments instead of using hard-coded values, and launch_evaluation_pipeline.py is reformatted to pass these parameters when invoking the deployment script.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Model Parallel Configuration: examples/evaluation/deploy.sh | Introduces positional parameters TP, PP, CP for tensor, pipeline, and context model parallel sizes, replacing hard-coded flags with parametrized values in the deploy_ray_inframework.py call. |
| Evaluation Pipeline: examples/evaluation/launch_evaluation_pipeline.py | Reformats evaluation job command to invoke deploy.sh with tensor, pipeline, and context parallel size arguments, restructuring from single-line to multi-line script format. |
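
A minimal sketch of how a launcher could assemble the deploy.sh invocation with the three new positional arguments. The argument order follows the assignments shown later in this review (MEGATRON_CHECKPOINT, NUM_REPLICAS, NUM_GPUS, TP, PP, CP). The function name `build_deploy_command`, the default sizes of 1, and the example values are hypothetical and are not taken from launch_evaluation_pipeline.py.

```python
import shlex


def build_deploy_command(
    checkpoint: str,
    num_replicas: int,
    num_gpus: int,
    tp: int = 1,
    pp: int = 1,
    cp: int = 1,
    script: str = "examples/evaluation/deploy.sh",
) -> list[str]:
    """Return argv for deploy.sh: checkpoint, replicas, GPUs, then TP, PP, CP."""
    return [
        "bash",
        script,
        checkpoint,
        str(num_replicas),
        str(num_gpus),
        str(tp),
        str(pp),
        str(cp),
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = build_deploy_command(
        "/path/to/checkpoint", num_replicas=1, num_gpus=8, tp=2, pp=2, cp=1
    )
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each positional argument separate, so a checkpoint path containing spaces cannot shift TP, PP, and CP into the wrong positions.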

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Possibly related PRs

  • chore: Add evaluation pipeline #1876: Directly related as it updates the evaluation deployment to accept and pass through tensor/pipeline/context parallel-size positional arguments to deploy.sh, modifying the same deployment command flow.

Suggested reviewers

  • suiyoubi
  • erhoo82
  • malay-nagda
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title 'chore(fix): Deployment parallelism' is vague and generic, using the broad term 'Deployment parallelism' without clearly specifying what deployment changes were made. | Clarify the title to specifically describe the changes, such as 'Add configurable model parallelism parameters to deployment scripts' or 'Make tensor/pipeline/context parallel sizes configurable in deployment'. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Test Results For Major Changes | ✅ Passed | Changes are minor configuration parameterization for deployment parallelism settings without affecting core logic or test documentation requirements. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
examples/evaluation/deploy.sh (2)

1-4: 🛠️ Refactor suggestion | 🟠 Major

Missing NVIDIA copyright header and shebang line.

As per the coding guidelines, shell scripts must include the NVIDIA copyright header at the top. Additionally, per the Google Shell Style Guide, a shebang line should be present.

Proposed fix
+#!/bin/bash
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # Unset SLURM/PMI/PMIX env vars to prevent MPI initialization issues

12-22: 🛠️ Refactor suggestion | 🟠 Major

Use uv run instead of bare python.

As per coding guidelines, scripts matching {**/*.sh,examples/**/*.py} should use uv run to execute scripts instead of calling python directly.

Also, there's trailing whitespace on line 22 after "$CP".

Proposed fix
-python \
+uv run python \
   /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
   --megatron_checkpoint "$MEGATRON_CHECKPOINT" \
   --model_id megatron_model \
   --host 0.0.0.0 \
   --port 8000 \
   --num_gpus "$NUM_GPUS" \
   --num_replicas "$NUM_REPLICAS" \
   --tensor_model_parallel_size "$TP" \
   --pipeline_model_parallel_size "$PP" \
-  --context_parallel_size "$CP" 
+  --context_parallel_size "$CP"
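
The trailing-whitespace finding above is easy to check mechanically. Here the space follows the last argument, so it is only a lint issue; the same stray space after a `\` would stop the shell from treating the line as continued. The following is a small sketch of such a check, with a hypothetical helper name that is not part of this repository.

```python
def find_trailing_whitespace(text: str) -> list[int]:
    """Return the 1-based numbers of lines that end in spaces or tabs."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


if __name__ == "__main__":
    # Two-line snippet modelled on the deploy.sh call; the second line
    # ends with a stray space after the closing quote.
    snippet = 'python \\\n  --context_parallel_size "$CP" \n'
    print(find_trailing_whitespace(snippet))
```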
examples/evaluation/launch_evaluation_pipeline.py (1)

158-165: ⚠️ Potential issue | 🟠 Major

Potential UnboundLocalError if no runs are found.

If runs is empty (line 158 condition is falsy), the variable run_id will not be defined, causing an UnboundLocalError on line 165 when passed to wandb.init().

Proposed fix
         if runs:
             run_id = runs[0].id
             print(f"Found run with ID: {run_id}")
+        else:
+            run_id = None
+            print("No existing run found, creating new run")

         wandb_run = wandb.init(
             project=args.wandb_project_name,
             entity=args.wandb_entity_name,
-            id=run_id,
+            id=run_id,  # None will create a new run
             resume="allow",
         )
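
The underlying pattern can be shown without W&B: bind the name on every code path before it is read. The sketch below uses `SimpleNamespace` objects as stand-ins for run objects; `select_run_id` is a hypothetical helper, not a function from launch_evaluation_pipeline.py.

```python
from types import SimpleNamespace
from typing import Optional, Sequence


def select_run_id(runs: Sequence) -> Optional[str]:
    """Return the first run's ID, or None when no runs were found."""
    run_id = None  # bound up front, so the name exists even if `runs` is empty
    if runs:
        run_id = runs[0].id
    return run_id


if __name__ == "__main__":
    existing = [SimpleNamespace(id="abc123")]  # stand-in for fetched runs
    print(select_run_id(existing))
    print(select_run_id([]))
```

Initializing before the `if` and adding an `else` branch are equivalent here; the early assignment tends to survive later edits better because it does not depend on every branch remembering to set the variable.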
🧹 Nitpick comments (2)
examples/evaluation/deploy.sh (1)

6-11: Consider adding input validation for required parameters.

The script now requires 6 positional arguments but has no validation. Missing arguments will silently pass empty strings to the Python script, causing cryptic errors.

Proposed fix
+if [[ $# -lt 6 ]]; then
+  echo "Usage: $0 MEGATRON_CHECKPOINT NUM_REPLICAS NUM_GPUS TP PP CP" >&2
+  exit 1
+fi
+
 MEGATRON_CHECKPOINT=$1
 NUM_REPLICAS=$2
 NUM_GPUS=$3
 TP=$4
 PP=$5
 CP=$6
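
The same check can also be done on the caller's side before the script is launched. This is a sketch under assumptions: the helper name `validate_deploy_args` is hypothetical, and the requirement that the five numeric arguments be positive integers is stricter than the shell check above and is not stated in this PR.

```python
def validate_deploy_args(argv: list[str]) -> dict:
    """Check the six positional arguments; raise ValueError on bad input."""
    names = ["MEGATRON_CHECKPOINT", "NUM_REPLICAS", "NUM_GPUS", "TP", "PP", "CP"]
    if len(argv) < len(names):
        raise ValueError("Usage: deploy.sh " + " ".join(names))
    if not argv[0]:
        raise ValueError("MEGATRON_CHECKPOINT must not be empty")

    parsed = {"MEGATRON_CHECKPOINT": argv[0]}
    for name, value in zip(names[1:], argv[1:6]):
        try:
            number = int(value)
        except ValueError:
            number = 0  # non-numeric input is rejected by the check below
        if number < 1:
            # Assumption: sizes and counts must be >= 1.
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
        parsed[name] = number
    return parsed


if __name__ == "__main__":
    print(validate_deploy_args(["/path/to/checkpoint", "1", "8", "2", "2", "1"]))
```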
examples/evaluation/launch_evaluation_pipeline.py (1)

14-14: Shebang should be on the first line.

The shebang #!/usr/bin/env python3 on line 14 should be the very first line of the file, before the copyright header, for the script to be directly executable.
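
The reason is that the operating system only looks for `#!` at the very start of the file; anywhere else it is just a comment. A small sketch of a check for this, using a hypothetical helper name:

```python
from typing import Optional


def shebang_line_number(text: str) -> Optional[int]:
    """Return the 1-based line number of the first '#!' line, or None."""
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("#!"):
            return number
    return None


if __name__ == "__main__":
    misplaced = "# Copyright header\n#!/usr/bin/env python3\n"
    # Any result other than 1 means the shebang will not be honored.
    print(shebang_line_number(misplaced))
```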

Proposed fix

Move the shebang to line 1:

+#!/usr/bin/env python3
 # Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 ...
 # limitations under the License.
-#!/usr/bin/env python3
 """

Signed-off-by: oliver könig <okoenig@nvidia.com>
malay-nagda previously approved these changes Feb 3, 2026
@ko3n1g ko3n1g added the r0.3.0 label (Cherry-pick label for r0.3.0 release branch) Feb 3, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g merged commit 24cd876 into main Feb 17, 2026
12 checks passed
@ko3n1g ko3n1g deleted the ko3n1g/ci/fix-evaluation branch February 17, 2026 16:53
ko3n1g added a commit that referenced this pull request Feb 17, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@ko3n1g ko3n1g restored the ko3n1g/ci/fix-evaluation branch February 17, 2026 16:59
pengdurice pushed a commit to pengdurice/Megatron-Bridge that referenced this pull request Feb 24, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
@ko3n1g ko3n1g mentioned this pull request Feb 26, 2026
5 tasks
ko3n1g added a commit that referenced this pull request Feb 26, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
copy-pr-bot bot pushed a commit that referenced this pull request Mar 19, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
Labels

r0.3.0 Cherry-pick label for r0.3.0 release branch

2 participants