cp: ci(fix): Wheel build (2192) into r0.3.0 #2238

Merged
ko3n1g merged 1 commit into r0.3.0 from cherry-pick-2192-r0.3.0
Feb 6, 2026

Conversation

@ko3n1g ko3n1g commented Feb 5, 2026

beep boop [🤖]: Hi @ko3n1g 👋,

we've cherry picked #2192 into  for you! 🚀

Please review and approve this cherry pick at your convenience!

Summary by CodeRabbit

  • Chores

    • Updated CI/CD workflow for building and publishing with enhanced configuration and improved result handling
    • Updated dependencies including TransformerEngine (with CUDA support) and nvidia-resiliency-ext for better compatibility
  • Documentation

    • Clarified return type annotations in model specifications

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@ko3n1g ko3n1g requested a review from a team as a code owner February 5, 2026 15:04
ko3n1g commented Feb 5, 2026

/ok to test 9cd94dc

copy-pr-bot bot commented Feb 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

This PR updates the GitHub Actions workflow to enable the build-test-publish-wheel job with updated preflight and workflow versions, adds new input parameters and secrets configuration, updates the container image to PyTorch 25.11, and refreshes dependency specifications in pyproject.toml including TransformerEngine and nvidia-resiliency-ext versions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GitHub Actions Workflow (.github/workflows/build-test-publish-wheel.yml) | Enables build-test-publish-wheel and build-test-publish-wheel-summary jobs; updates preflight workflow reference from v0.64.2 to v0.69.1; switches build job to v0.70.1; adds comprehensive input parameters and secrets passthrough (TWINE_USERNAME, TWINE_PASSWORD, SLACK_WEBHOOK, GH_TOKEN); updates container image to nvcr.io/nvidia/pytorch:25.11-py3; implements dynamic runner configuration from pre-flight outputs; restores result handling with failed job detection. |
| Project Dependencies (pyproject.toml) | Updates TransformerEngine to require CUDA core support (transformer-engine[pytorch,core_cu13]>=2.10.0a0,<2.13.0) with Git reference override; updates nvidia-resiliency-ext to compatible revision (~=0.4.1); removes prerelease modelopt dependency; consolidates uv.sources configuration. |
| Documentation (src/megatron/bridge/training/mlm_compat/model.py) | Updates docstring return type annotation from generic transformer_layer_spec to ModuleSpec for clarity. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • ci(fix): Wheel build #2192: Directly related workflow changes enabling build-test-publish-wheel jobs, updating workflow references, container image, and runner configuration.

Suggested labels

r0.3.0

Suggested reviewers

  • thomasdhc
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes a cherry-pick operation of a CI fix for wheel build from PR #2192 into the r0.3.0 branch, matching the pull request's main purpose and changes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | This pull request contains minor changes focused on CI/CD infrastructure and dependency maintenance rather than major changes to core functionality. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/build-test-publish-wheel.yml:
- Line 85: The assignment to FAILED_JOBS needs a proper fallback when gh run
view fails: change the substitution so the fallback is produced inside the
command substitution (or set a default after assignment) — i.e., ensure the gh
run view invocation (the command producing the jq result) is followed by || echo
0 within the $(...) so FAILED_JOBS receives "0" on failure; reference the
FAILED_JOBS variable and the gh run view ... --json jobs --jq '[.jobs[] |
select(.status == "completed" and .conclusion != "success")] | length' command
when making this change.
- Around line 83-95: The Result step uses SKIPPING_IS_ALLOWED and the gh CLI
without defining required environment variables or authentication; add an env
block to that step that sets GH_TOKEN (for gh CLI authentication) and
SKIPPING_IS_ALLOWED (preserving existing workflow logic) so gh run view calls
succeed and the conditional [ "$SKIPPING_IS_ALLOWED" == "true" ] evaluates
correctly; update the step that contains the gh run view and condition to
include env: GH_TOKEN: ${{ secrets.GH_TOKEN }} and SKIPPING_IS_ALLOWED: ${{
env.SKIPPING_IS_ALLOWED }} (or the appropriate default) so the gh CLI and
SKIPPING_IS_ALLOWED reference are properly defined.

Comment on lines +83 to +95
      - name: Result
        run: |
          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0

          # if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
          # echo "✅ All previous jobs completed successfully"
          # exit 0
          # else
          # echo "❌ Found $FAILED_JOBS failed job(s)"
          # # Show which jobs failed
          # gh run view $GITHUB_RUN_ID --json jobs --jq '.jobs[] | select(.status == "completed" and .conclusion != "success") | .name'
          # exit 1
          # fi
          if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
            echo "✅ All previous jobs completed successfully"
            exit 0
          else
            echo "❌ Found $FAILED_JOBS failed job(s)"
            # Show which jobs failed
            gh run view $GITHUB_RUN_ID --json jobs --jq '.jobs[] | select(.status == "completed" and .conclusion != "success") | .name'
            exit 1
          fi
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/build-test-publish-wheel.yml | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4295


🏁 Script executed:

# Search for SKIPPING_IS_ALLOWED and GH_TOKEN/GITHUB_TOKEN usage throughout the workflow
echo "=== Searching for SKIPPING_IS_ALLOWED ==="
grep -n "SKIPPING_IS_ALLOWED" .github/workflows/build-test-publish-wheel.yml

echo -e "\n=== Searching for GH_TOKEN or GITHUB_TOKEN ==="
grep -n "GH_TOKEN\|GITHUB_TOKEN" .github/workflows/build-test-publish-wheel.yml

echo -e "\n=== Checking if there's a global env block ==="
head -40 .github/workflows/build-test-publish-wheel.yml | grep -A 5 "^env:"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 338


🏁 Script executed:

# Check if there are any other workflow files that might define SKIPPING_IS_ALLOWED
find .github/workflows -name "*.yml" -type f -exec grep -l "SKIPPING_IS_ALLOWED" {} \;

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 255


🏁 Script executed:

# Check other workflow files for SKIPPING_IS_ALLOWED usage
echo "=== Searching all workflow files for SKIPPING_IS_ALLOWED ==="
grep -n "SKIPPING_IS_ALLOWED" .github/workflows/*.yml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1374


🏁 Script executed:

# Check if we can find documentation or examples about FW-CI-templates outputs
grep -rn "docs_only\|is_deployment_workflow" .github/workflows/

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3200


🏁 Script executed:

# Check the copyright-check.yml Result step
sed -n '39,60p' .github/workflows/copyright-check.yml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 935


🏁 Script executed:

# Check if build-docs.yml has env block with GH_TOKEN for the Result step
sed -n '50,65p' .github/workflows/build-docs.yml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 940


🏁 Script executed:

# Let me verify that github.token is the correct approach vs secrets.PAT
# Check what build-docs uses and if there's any documentation
echo "=== Checking build-docs.yml Result step ==="
sed -n '48,65p' .github/workflows/build-docs.yml

echo -e "\n=== Also checking if copyright-check.yml has the same issue ==="
sed -n '39,60p' .github/workflows/copyright-check.yml | head -25

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1978


Add missing environment variables to the Result step.

The Result step uses SKIPPING_IS_ALLOWED at line 87 and the gh CLI at lines 85 and 93, but neither variable nor authentication is defined. This causes the condition to always evaluate SKIPPING_IS_ALLOWED as empty, and the gh CLI calls may fail due to missing authentication.

The pattern is correctly implemented in other workflows (e.g., build-docs.yml). Add the missing environment variables:

Fix: Add env block with GH_TOKEN and SKIPPING_IS_ALLOWED
      - name: Result
+        env:
+          GH_TOKEN: ${{ github.token }}
+          SKIPPING_IS_ALLOWED: ${{ needs.pre-flight.outputs.docs_only == 'true' || needs.pre-flight.outputs.is_deployment_workflow == 'true' }}
         run: |
           FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0

           if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
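
The effect of the missing variable can be reproduced outside the workflow. The following is a minimal sketch under stated assumptions: `result_exit_code` is a hypothetical helper written only for this illustration (it is not part of the workflow), and it uses the POSIX `=` operator where the workflow uses `==`.

```shell
# Hypothetical helper mirroring the Result step's condition (not part of the workflow).
# $1: number of failed jobs, $2: value of SKIPPING_IS_ALLOWED (may be empty).
result_exit_code() {
  failed_jobs=$1
  skipping_is_allowed=$2
  if [ "${failed_jobs:-0}" -eq 0 ] || [ "$skipping_is_allowed" = "true" ]; then
    echo 0   # the step would pass
  else
    echo 1   # the step would fail
  fi
}

# Without the env block the variable expands to an empty string,
# so one failed job fails the step even when skipping was intended.
UNSET_CASE=$(result_exit_code 1 "")

# With SKIPPING_IS_ALLOWED=true the same failure is tolerated.
ALLOWED_CASE=$(result_exit_code 1 "true")

# With no failed jobs the step passes regardless of the flag.
CLEAN_CASE=$(result_exit_code 0 "")

echo "unset=$UNSET_CASE allowed=$ALLOWED_CASE clean=$CLEAN_CASE"
```

Run as written, this prints `unset=1 allowed=0 clean=0`, which suggests the skip path is unreachable unless the step's `env` block defines the variable.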

    steps:
      - name: Result
        run: |
          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0
⚠️ Potential issue | 🟡 Minor

Add fallback handling for gh run view failure.

The command uses || echo 0, but the fallback sits outside the command substitution, so it never assigns 0 to FAILED_JOBS. On failure, "0" is echoed to stdout and FAILED_JOBS is left empty.

🐛 Proposed fix for proper fallback assignment
-          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0
+          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length' 2>/dev/null) || FAILED_JOBS=0
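
The three fallback forms discussed in this comment can be compared without the gh CLI. This is a minimal sketch, assuming a hypothetical `failing_cmd` function as a stand-in for a failing `gh run view` call; the variable names are also chosen only for this illustration.

```shell
# Hypothetical stand-in for a failing `gh run view` call (assumption, not from the PR).
failing_cmd() { return 1; }

# Form currently in the workflow: the fallback runs outside the substitution,
# so "0" is printed to stdout and never captured.
ORIGINAL=$(failing_cmd) || echo 0

# Proposed form: the fallback assigns the variable directly.
PROPOSED=$(failing_cmd 2>/dev/null) || PROPOSED=0

# Alternative: the fallback runs inside the substitution and is captured.
INSIDE=$(failing_cmd 2>/dev/null || echo 0)

echo "original='${ORIGINAL}' proposed='${PROPOSED}' inside='${INSIDE}'"
```

The final line prints `original='' proposed='0' inside='0'`. Note that `${ORIGINAL:-0}` still expands to `0`, because `:-` substitutes the default for empty as well as unset variables, so the `-eq 0` test in the Result step would likely treat a failed `gh` call as zero failed jobs in either form.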

@ko3n1g ko3n1g merged commit 94af2ed into r0.3.0 Feb 6, 2026
54 checks passed
@ko3n1g ko3n1g deleted the cherry-pick-2192-r0.3.0 branch February 6, 2026 01:01