cp: ci(fix): Wheel build (2192) into r0.3.0 #2238
Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
/ok to test 9cd94dc
📝 Walkthrough
This PR updates the GitHub Actions workflow to enable the build-test-publish-wheel job with updated preflight and workflow versions, adds new input parameters and secrets configuration, updates the container image to PyTorch 25.11, and refreshes dependency specifications in pyproject.toml, including the TransformerEngine and nvidia-resiliency-ext versions.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ Passed checks (4 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.github/workflows/build-test-publish-wheel.yml:
- Line 85: The assignment to FAILED_JOBS needs a proper fallback when gh run
view fails: change the substitution so the fallback is produced inside the
command substitution (or set a default after assignment) — i.e., ensure the gh
run view invocation (the command producing the jq result) is followed by || echo
0 within the $(...) so FAILED_JOBS receives "0" on failure; reference the
FAILED_JOBS variable and the gh run view ... --json jobs --jq '[.jobs[] |
select(.status == "completed" and .conclusion != "success")] | length' command
when making this change.
- Around line 83-95: The Result step uses SKIPPING_IS_ALLOWED and the gh CLI
without defining required environment variables or authentication; add an env
block to that step that sets GH_TOKEN (for gh CLI authentication) and
SKIPPING_IS_ALLOWED (preserving existing workflow logic) so gh run view calls
succeed and the conditional [ "$SKIPPING_IS_ALLOWED" == "true" ] evaluates
correctly; update the step that contains the gh run view and condition to
include env: GH_TOKEN: ${{ secrets.GH_TOKEN }} and SKIPPING_IS_ALLOWED: ${{
env.SKIPPING_IS_ALLOWED }} (or the appropriate default) so the gh CLI and
SKIPPING_IS_ALLOWED reference are properly defined.
    - name: Result
      run: |
        FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0

        # if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
        #   echo "✅ All previous jobs completed successfully"
        #   exit 0
        # else
        #   echo "❌ Found $FAILED_JOBS failed job(s)"
        #   # Show which jobs failed
        #   gh run view $GITHUB_RUN_ID --json jobs --jq '.jobs[] | select(.status == "completed" and .conclusion != "success") | .name'
        #   exit 1
        # fi
        if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
          echo "✅ All previous jobs completed successfully"
          exit 0
        else
          echo "❌ Found $FAILED_JOBS failed job(s)"
          # Show which jobs failed
          gh run view $GITHUB_RUN_ID --json jobs --jq '.jobs[] | select(.status == "completed" and .conclusion != "success") | .name'
          exit 1
        fi
🧩 Analysis chain
🏁 Script executed:
cat -n .github/workflows/build-test-publish-wheel.yml | head -100
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 4295
🏁 Script executed:
# Search for SKIPPING_IS_ALLOWED and GH_TOKEN/GITHUB_TOKEN usage throughout the workflow
echo "=== Searching for SKIPPING_IS_ALLOWED ==="
grep -n "SKIPPING_IS_ALLOWED" .github/workflows/build-test-publish-wheel.yml
echo -e "\n=== Searching for GH_TOKEN or GITHUB_TOKEN ==="
grep -n "GH_TOKEN\|GITHUB_TOKEN" .github/workflows/build-test-publish-wheel.yml
echo -e "\n=== Checking if there's a global env block ==="
head -40 .github/workflows/build-test-publish-wheel.yml | grep -A 5 "^env:"
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 338
🏁 Script executed:
# Check if there are any other workflow files that might define SKIPPING_IS_ALLOWED
find .github/workflows -name "*.yml" -type f -exec grep -l "SKIPPING_IS_ALLOWED" {} \;
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 255
🏁 Script executed:
# Check other workflow files for SKIPPING_IS_ALLOWED usage
echo "=== Searching all workflow files for SKIPPING_IS_ALLOWED ==="
grep -n "SKIPPING_IS_ALLOWED" .github/workflows/*.yml
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1374
🏁 Script executed:
# Check if we can find documentation or examples about FW-CI-templates outputs
grep -rn "docs_only\|is_deployment_workflow" .github/workflows/
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3200
🏁 Script executed:
# Check the copyright-check.yml Result step
sed -n '39,60p' .github/workflows/copyright-check.yml
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 935
🏁 Script executed:
# Check if build-docs.yml has env block with GH_TOKEN for the Result step
sed -n '50,65p' .github/workflows/build-docs.yml
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 940
🏁 Script executed:
# Let me verify that github.token is the correct approach vs secrets.PAT
# Check what build-docs uses and if there's any documentation
echo "=== Checking build-docs.yml Result step ==="
sed -n '48,65p' .github/workflows/build-docs.yml
echo -e "\n=== Also checking if copyright-check.yml has the same issue ==="
sed -n '39,60p' .github/workflows/copyright-check.yml | head -25
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1978
Add missing environment variables to the Result step.
The Result step uses SKIPPING_IS_ALLOWED at line 87 and the gh CLI at lines 85 and 93, but neither the variable nor gh authentication is defined for the step. As a result, SKIPPING_IS_ALLOWED always expands to an empty string in the condition, and the gh CLI calls may fail for lack of authentication.
The pattern is correctly implemented in other workflows (e.g., build-docs.yml). Add the missing environment variables:
Fix: Add env block with GH_TOKEN and SKIPPING_IS_ALLOWED
      - name: Result
    +   env:
    +     GH_TOKEN: ${{ github.token }}
    +     SKIPPING_IS_ALLOWED: ${{ needs.pre-flight.outputs.docs_only == 'true' || needs.pre-flight.outputs.is_deployment_workflow == 'true' }}
        run: |
          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0
          if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
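The effect of the missing variable can be reproduced outside the workflow. The following is a minimal sketch, assuming bash; `result_gate` is a hypothetical helper that mirrors the Result step's condition and is not part of the repository. With two failed jobs, the gate only passes when SKIPPING_IS_ALLOWED is actually set to "true"; when the variable is empty, the second test can never succeed.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the Result step's condition (illustration only).
result_gate() {
  local failed_jobs="$1"
  # SKIPPING_IS_ALLOWED is read from the environment, as in the workflow step.
  if [ "${failed_jobs:-0}" -eq 0 ] || [ "${SKIPPING_IS_ALLOWED:-}" == "true" ]; then
    echo "pass"
  else
    echo "fail"
  fi
}

# Case 1: the variable is empty, as when the step has no env block.
SKIPPING_IS_ALLOWED=""
without_env=$(result_gate 2)

# Case 2: the variable is provided, as an env block would do.
SKIPPING_IS_ALLOWED="true"
with_env=$(result_gate 2)

echo "without env: ${without_env}, with env: ${with_env}"
```

In the workflow itself the value would come from the step's env block rather than a shell assignment; the sketch only isolates the conditional logic.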
    steps:
      - name: Result
        run: |
          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0
Add fallback handling for gh run view failure.
The command uses || echo 0, but the fallback runs outside the $(...) substitution, so it never assigns 0 to FAILED_JOBS. On failure, the current syntax prints "0" to stdout while FAILED_JOBS is left empty.
🐛 Proposed fix for proper fallback assignment
    - FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0
    + FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length' 2>/dev/null) || FAILED_JOBS=0
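The difference between the two fallback forms can be checked in isolation. This is a minimal sketch, assuming bash; `failing_query` is a hypothetical stand-in for a `gh run view` call that exits non-zero, so no GitHub access is needed. The first form leaves the variable empty and merely prints 0, while the second assigns the default.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a `gh run view ...` call that fails.
failing_query() { return 1; }

# Original pattern: the fallback runs outside the substitution,
# so "0" goes to stdout and ORIGINAL keeps the (empty) command output.
ORIGINAL=$(failing_query) || echo 0

# Suggested pattern: the fallback assigns the default explicitly.
FIXED=$(failing_query 2>/dev/null) || FIXED=0

echo "original='${ORIGINAL}' fixed='${FIXED}'"
```

Because the condition in the Result step expands `${FAILED_JOBS:-0}`, an empty value is still treated as 0 there; the explicit assignment mainly makes the intent clear and keeps the stray 0 out of the job log.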
beep boop [🤖]: Hi @ko3n1g 👋,
Summary by CodeRabbit
Chores
Documentation