cp: `ci(fix): Wheel build (2192)` into `r0.3.0` by ko3n1g · Pull Request #2238 · NVIDIA-NeMo/Megatron-Bridge

ko3n1g · 2026-02-05T15:04:05Z

beep boop [🤖]: Hi @ko3n1g 👋,

we've cherry-picked #2192 into `r0.3.0` for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

Chores
- Updated CI/CD workflow for building and publishing with enhanced configuration and improved result handling
- Updated dependencies including TransformerEngine (with CUDA support) and nvidia-resiliency-ext for better compatibility
Documentation
- Clarified return type annotations in model specifications

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

ko3n1g · 2026-02-05T15:04:08Z

/ok to test 9cd94dc

copy-pr-bot · 2026-02-05T15:04:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-02-05T15:08:13Z

📝 Walkthrough

This PR updates the GitHub Actions workflow to enable the build-test-publish-wheel job with updated preflight and workflow versions, adds new input parameters and secrets configuration, updates the container image to PyTorch 25.11, and refreshes dependency specifications in pyproject.toml including TransformerEngine and nvidia-resiliency-ext versions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GitHub Actions Workflow `.github/workflows/build-test-publish-wheel.yml` | Enables build-test-publish-wheel and build-test-publish-wheel-summary jobs; updates preflight workflow reference from v0.64.2 to v0.69.1; switches build job to v0.70.1; adds comprehensive input parameters and secrets passthrough (TWINE_USERNAME, TWINE_PASSWORD, SLACK_WEBHOOK, GH_TOKEN); updates container image to nvcr.io/nvidia/pytorch:25.11-py3; implements dynamic runner configuration from pre-flight outputs; restores result handling with failed job detection. |
| Project Dependencies `pyproject.toml` | Updates TransformerEngine to require CUDA core support (`transformer-engine[pytorch,core_cu13]>=2.10.0a0,<2.13.0`) with Git reference override; updates nvidia-resiliency-ext to compatible revision (~=0.4.1); removes prerelease modelopt dependency; consolidates uv.sources configuration. |
| Documentation `src/megatron/bridge/training/mlm_compat/model.py` | Updates docstring return type annotation from generic `transformer_layer_spec` to `ModuleSpec` for clarity. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

ci(fix): Wheel build #2192: Directly related workflow changes enabling build-test-publish-wheel jobs, updating workflow references, container image, and runner configuration.

Suggested labels

r0.3.0

Suggested reviewers

thomasdhc

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes a cherry-pick operation of a CI fix for wheel build from PR `#2192` into the r0.3.0 branch, matching the pull request's main purpose and changes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | This pull request contains minor changes focused on CI/CD infrastructure and dependency maintenance rather than major changes to core functionality. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In @.github/workflows/build-test-publish-wheel.yml:
- Line 85: The assignment to FAILED_JOBS needs a proper fallback when gh run
view fails: change the substitution so the fallback is produced inside the
command substitution (or set a default after assignment) — i.e., ensure the gh
run view invocation (the command producing the jq result) is followed by || echo
0 within the $(...) so FAILED_JOBS receives "0" on failure; reference the
FAILED_JOBS variable and the gh run view ... --json jobs --jq '[.jobs[] |
select(.status == "completed" and .conclusion != "success")] | length' command
when making this change.
- Around line 83-95: The Result step uses SKIPPING_IS_ALLOWED and the gh CLI
without defining required environment variables or authentication; add an env
block to that step that sets GH_TOKEN (for gh CLI authentication) and
SKIPPING_IS_ALLOWED (preserving existing workflow logic) so gh run view calls
succeed and the conditional [ "$SKIPPING_IS_ALLOWED" == "true" ] evaluates
correctly; update the step that contains the gh run view and condition to
include env: GH_TOKEN: ${{ secrets.GH_TOKEN }} and SKIPPING_IS_ALLOWED: ${{
env.SKIPPING_IS_ALLOWED }} (or the appropriate default) so the gh CLI and
SKIPPING_IS_ALLOWED reference are properly defined.
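
To make the filter referenced above concrete, here is a minimal sketch that runs the same jq expression against hand-written sample data. The JSON is invented for illustration and only mimics the `status` and `conclusion` fields the filter reads; it is not output from a real run, the job names are hypothetical, and the sketch assumes `jq` is installed locally.

```shell
# Hypothetical sample standing in for `gh run view --json jobs` output.
SAMPLE_JOBS='{"jobs":[
  {"name":"build","status":"completed","conclusion":"success"},
  {"name":"test","status":"completed","conclusion":"failure"},
  {"name":"publish","status":"completed","conclusion":"skipped"},
  {"name":"summary","status":"in_progress","conclusion":""}
]}'

# Same filter as in the workflow: completed jobs whose conclusion is not "success".
FILTER='[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length'

if command -v jq >/dev/null 2>&1; then
  FAILED_JOBS=$(printf '%s' "$SAMPLE_JOBS" | jq "$FILTER")
  echo "Jobs counted as failed: $FAILED_JOBS"
else
  echo "jq not found; skipping the demonstration"
fi
```

With this sample, both the failed job and the skipped job match the filter, because a skipped job is completed with a conclusion other than "success". The in-progress job is ignored.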

coderabbitai · 2026-02-05T15:08:16Z

.github/workflows/build-test-publish-wheel.yml

+      - name: Result
+        run: |
+          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0

-  #         if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
-  #             echo "✅ All previous jobs completed successfully"
-  #             exit 0
-  #         else
-  #             echo "❌ Found $FAILED_JOBS failed job(s)"
-  #             # Show which jobs failed
-  #             gh run view $GITHUB_RUN_ID --json jobs --jq '.jobs[] | select(.status == "completed" and .conclusion != "success") | .name'
-  #             exit 1
-  #         fi
+          if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
+              echo "✅ All previous jobs completed successfully"
+              exit 0
+          else
+              echo "❌ Found $FAILED_JOBS failed job(s)"
+              # Show which jobs failed
+              gh run view $GITHUB_RUN_ID --json jobs --jq '.jobs[] | select(.status == "completed" and .conclusion != "success") | .name'
+              exit 1
+          fi


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/build-test-publish-wheel.yml | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4295

🏁 Script executed:

# Search for SKIPPING_IS_ALLOWED and GH_TOKEN/GITHUB_TOKEN usage throughout the workflow
echo "=== Searching for SKIPPING_IS_ALLOWED ==="
grep -n "SKIPPING_IS_ALLOWED" .github/workflows/build-test-publish-wheel.yml
echo -e "\n=== Searching for GH_TOKEN or GITHUB_TOKEN ==="
grep -n "GH_TOKEN\|GITHUB_TOKEN" .github/workflows/build-test-publish-wheel.yml
echo -e "\n=== Checking if there's a global env block ==="
head -40 .github/workflows/build-test-publish-wheel.yml | grep -A 5 "^env:"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 338

🏁 Script executed:

# Check if there are any other workflow files that might define SKIPPING_IS_ALLOWED
find .github/workflows -name "*.yml" -type f -exec grep -l "SKIPPING_IS_ALLOWED" {} \;

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 255

🏁 Script executed:

# Check other workflow files for SKIPPING_IS_ALLOWED usage
echo "=== Searching all workflow files for SKIPPING_IS_ALLOWED ==="
grep -n "SKIPPING_IS_ALLOWED" .github/workflows/*.yml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1374

🏁 Script executed:

# Check if we can find documentation or examples about FW-CI-templates outputs
grep -rn "docs_only\|is_deployment_workflow" .github/workflows/

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3200

🏁 Script executed:

# Check the copyright-check.yml Result step
sed -n '39,60p' .github/workflows/copyright-check.yml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 935

🏁 Script executed:

# Check if build-docs.yml has env block with GH_TOKEN for the Result step
sed -n '50,65p' .github/workflows/build-docs.yml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 940

🏁 Script executed:

# Let me verify that github.token is the correct approach vs secrets.PAT
# Check what build-docs uses and if there's any documentation
echo "=== Checking build-docs.yml Result step ==="
sed -n '48,65p' .github/workflows/build-docs.yml
echo -e "\n=== Also checking if copyright-check.yml has the same issue ==="
sed -n '39,60p' .github/workflows/copyright-check.yml | head -25

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1978

Add missing environment variables to the Result step.

The Result step uses SKIPPING_IS_ALLOWED at line 87 and the gh CLI at lines 85 and 93, but the step defines neither the variable nor any authentication for gh. As a result, SKIPPING_IS_ALLOWED always expands to an empty string in the condition, and the gh CLI calls may fail for lack of a token.

The pattern is correctly implemented in other workflows (e.g., build-docs.yml). Add the missing environment variables:

Fix: Add env block with GH_TOKEN and SKIPPING_IS_ALLOWED

       - name: Result
+        env:
+          GH_TOKEN: ${{ github.token }}
+          SKIPPING_IS_ALLOWED: ${{ needs.pre-flight.outputs.docs_only == 'true' || needs.pre-flight.outputs.is_deployment_workflow == 'true' }}
         run: |
           FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0

           if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
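
The effect of the missing variable can be sketched in plain shell. `result_status` below is a hypothetical helper, not part of the workflow; it mirrors the step's condition, taking the failed-job count and the value of SKIPPING_IS_ALLOWED as arguments. It uses the POSIX `=` comparison where the workflow uses `==`.

```shell
# Hypothetical helper mirroring the condition in the Result step.
# $1: failed job count (may be empty)
# $2: value of SKIPPING_IS_ALLOWED (may be empty, as when no env block defines it)
result_status() {
  if [ "${1:-0}" -eq 0 ] || [ "${2:-}" = "true" ]; then
    echo "pass"
  else
    echo "fail"
  fi
}

# Two non-successful jobs, variable undefined: the skip override never applies.
result_status 2 ""

# Same count, variable set to "true": the step passes despite the count.
result_status 2 "true"

# No non-successful jobs: passes regardless of the variable.
result_status 0 ""
```

Under these assumptions, an undefined SKIPPING_IS_ALLOWED means the second half of the condition is always false, so the step's outcome depends on the job count alone.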


coderabbitai · 2026-02-05T15:08:16Z

.github/workflows/build-test-publish-wheel.yml

+    steps:
+      - name: Result
+        run: |
+          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0


⚠️ Potential issue | 🟡 Minor

Add fallback handling for gh run view failure.

The command uses `|| echo 0`, but this does not assign 0 to FAILED_JOBS on failure. Because the fallback sits outside the command substitution, it only echoes "0" to stdout, and FAILED_JOBS is left holding the empty output of the failed command.

🐛 Proposed fix for proper fallback assignment

-          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0
+          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length' 2>/dev/null) || FAILED_JOBS=0
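
The difference between the two forms can be reproduced without the gh CLI. In this minimal sketch, `failing_query` is a hypothetical stand-in for a `gh run view` call that exits non-zero, and `ORIGINAL_RESULT` and `FIXED_RESULT` are illustrative names only.

```shell
# Stand-in for a `gh run view ...` call that fails (hypothetical helper).
failing_query() { return 1; }

# Original form: the fallback runs outside $(...), so "0" is printed to
# stdout and FAILED_JOBS keeps the empty output of the failed command.
FAILED_JOBS=$(failing_query) || echo 0
ORIGINAL_RESULT=$FAILED_JOBS

# Proposed form: the fallback assigns the default directly.
FAILED_JOBS=$(failing_query 2>/dev/null) || FAILED_JOBS=0
FIXED_RESULT=$FAILED_JOBS

echo "original: '${ORIGINAL_RESULT}'"
echo "original with default expansion: '${ORIGINAL_RESULT:-0}'"
echo "fixed: '${FIXED_RESULT}'"
```

Note that the step's condition reads `${FAILED_JOBS:-0}`, which already treats an empty value as 0 in the comparison. The proposed form mainly makes the default explicit and keeps the stray "0" out of the log. With either form, a failed query is reported as zero failed jobs.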



ci(fix): Wheel build (#2192)

9cd94dc

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

ko3n1g requested a review from a team as a code owner February 5, 2026 15:04

ko3n1g added cherry-pick Run CICD labels Feb 5, 2026

copy-pr-bot bot temporarily deployed to nemo-ci February 5, 2026 15:04 Inactive

copy-pr-bot bot temporarily deployed to test February 5, 2026 15:04 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 5, 2026 15:06 Inactive

coderabbitai bot reviewed Feb 5, 2026


copy-pr-bot bot temporarily deployed to nemo-ci February 5, 2026 15:42 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 5, 2026 15:52 Inactive

ko3n1g merged commit 94af2ed into r0.3.0 Feb 6, 2026
54 checks passed

ko3n1g deleted the cherry-pick-2192-r0.3.0 branch February 6, 2026 01:01
