Conversation
Signed-off-by: oliver könig <okoenig@nvidia.com>
📝 WalkthroughWalkthroughThe changes introduce configurable model parallel size parameters to the evaluation deployment workflow. Deploy.sh now accepts tensor, pipeline, and context parallel sizes as positional arguments instead of using hard-coded values, and launch_evaluation_pipeline.py is reformatted to pass these parameters when invoking the deployment script. Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~5 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Important Action Needed: IP Allowlist UpdateIf your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:
Failure to add the new IP will result in interrupted reviews. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
examples/evaluation/deploy.sh (2)
1-4: 🛠️ Refactor suggestion | 🟠 MajorMissing NVIDIA copyright header and shebang line.
As per coding guidelines, shell scripts must include the NVIDIA copyright header at the top. Additionally, per Google Shell Style Guide, a shebang line should be present.
Proposed fix
+#!/bin/bash +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + # Unset SLURM/PMI/PMIX env vars to prevent MPI initialization issues
12-22: 🛠️ Refactor suggestion | 🟠 MajorUse
uv runinstead of barepython.As per coding guidelines, scripts matching
{**/*.sh,examples/**/*.py}should useuv runto execute scripts instead of callingpythondirectly.Also, there's trailing whitespace on line 22 after
"$CP".Proposed fix
-python \ +uv run python \ /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \ --megatron_checkpoint "$MEGATRON_CHECKPOINT" \ --model_id megatron_model \ --host 0.0.0.0 \ --port 8000 \ --num_gpus "$NUM_GPUS" \ --num_replicas "$NUM_REPLICAS" \ --tensor_model_parallel_size "$TP" \ --pipeline_model_parallel_size "$PP" \ - --context_parallel_size "$CP" + --context_parallel_size "$CP"examples/evaluation/launch_evaluation_pipeline.py (1)
158-165:⚠️ Potential issue | 🟠 MajorPotential
UnboundLocalErrorif no runs are found.If
runsis empty (line 158 condition is falsy), the variablerun_idwill not be defined, causing anUnboundLocalErroron line 165 when passed towandb.init().Proposed fix
if runs: run_id = runs[0].id print(f"Found run with ID: {run_id}") + else: + run_id = None + print("No existing run found, creating new run") wandb_run = wandb.init( project=args.wandb_project_name, entity=args.wandb_entity_name, - id=run_id, + id=run_id, # None will create a new run resume="allow", )
🧹 Nitpick comments (2)
examples/evaluation/deploy.sh (1)
6-11: Consider adding input validation for required parameters.The script now requires 6 positional arguments but has no validation. Missing arguments will silently pass empty strings to the Python script, causing cryptic errors.
Proposed fix
+if [[ $# -lt 6 ]]; then + echo "Usage: $0 MEGATRON_CHECKPOINT NUM_REPLICAS NUM_GPUS TP PP CP" >&2 + exit 1 +fi + MEGATRON_CHECKPOINT=$1 NUM_REPLICAS=$2 NUM_GPUS=$3 TP=$4 PP=$5 CP=$6examples/evaluation/launch_evaluation_pipeline.py (1)
14-14: Shebang should be on the first line.The shebang
#!/usr/bin/env python3on line 14 should be the very first line of the file, before the copyright header, for the script to be directly executable.Proposed fix
Move the shebang to line 1:
+#!/usr/bin/env python3 # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); ... # limitations under the License. -#!/usr/bin/env python3 """
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit