Merged
106 changes: 56 additions & 50 deletions .github/workflows/build-test-publish-wheel.yml
@@ -11,7 +11,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: Build, test, and publish a PyPi wheel (to testpypi).

on:
@@ -35,55 +34,62 @@ concurrency:

 jobs:
   pre-flight:
-    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_cicd_preflight.yml@v0.64.2
+    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_cicd_preflight.yml@v0.69.1
+    with:
+      default_runner_prefix: ${{ vars.DEFAULT_RUNNER_PREFIX }}
+      non_nvidia_runner_prefix: ${{ vars.NON_NVIDIA_RUNNER_PREFIX }}
+      default_test_data_path: ${{ vars.DEFAULT_TEST_DATA_PATH }}
+      non_nvidia_test_data_path: ${{ vars.NON_NVIDIA_TEST_DATA_PATH }}
+    secrets:
+      NVIDIA_MANAGEMENT_ORG_PAT: ${{ secrets.NVIDIA_MANAGEMENT_ORG_PAT }}

-  # build-test-publish-wheel:
-  #   needs: [pre-flight]
-  #   if: |
-  #     !(needs.pre-flight.outputs.docs_only == 'true'
-  #     || needs.pre-flight.outputs.is_deployment_workflow == 'true')
-  #   uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_build_test_publish_wheel.yml@v0.65.1
-  #   with:
-  #     dry-run: true
-  #     python-package: megatron.bridge
-  #     python-version: "3.10"
-  #     packaging: uv
-  #     no-publish: ${{ !(github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/r')) }}
-  #     has-src-dir: true
-  #     skip-test-wheel: true
-  #     custom-container: nvcr.io/nvidia/pytorch:25.05-py3
-  #     runner: self-hosted-nemo
-  #     no-build-isolation: true
-  #     submodules: recursive
-  #     container-options: "--gpus all --runtime=nvidia"
-  #   secrets:
-  #     TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
-  #     TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
-  #     SLACK_WEBHOOK: ${{ secrets.SLACK_RELEASE_ENDPOINT }}
-  #     SLACK_WEBHOOK_ADMIN: ${{ secrets.SLACK_WEBHOOK_ADMIN }}
-  #     GH_TOKEN: ${{ secrets.PAT }}
+  build-test-publish-wheel:
+    needs: [pre-flight]
+    if: |
+      !(needs.pre-flight.outputs.docs_only == 'true'
+      || needs.pre-flight.outputs.is_deployment_workflow == 'true')
+    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_build_test_publish_wheel.yml@v0.70.1
+    with:
+      dry-run: true
+      python-package: megatron.bridge
+      python-version: "3.10"
+      packaging: uv
+      no-publish: ${{ !(github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/r')) }}
+      has-src-dir: true
+      skip-test-wheel: true
+      custom-container: nvcr.io/nvidia/pytorch:25.11-py3
+      runner: ${{ needs.pre-flight.outputs.runner_prefix }}-gpu-x2-container
+      no-build-isolation: true
+      submodules: recursive
+      container-options: "--gpus all --runtime=nvidia"
+    secrets:
+      TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
+      TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
+      SLACK_WEBHOOK: ${{ secrets.SLACK_RELEASE_ENDPOINT }}
+      SLACK_WEBHOOK_ADMIN: ${{ secrets.SLACK_WEBHOOK_ADMIN }}
+      GH_TOKEN: ${{ secrets.PAT }}

-  # build-test-publish-wheel-summary:
-  #   needs: [pre-flight, build-test-publish-wheel]
-  #   if: |
-  #     (
-  #       needs.pre-flight.outputs.docs_only == 'true'
-  #       || needs.pre-flight.outputs.is_deployment_workflow == 'true'
-  #       || always()
-  #     )
-  #     && !cancelled()
-  #   runs-on: ubuntu-latest
-  #   steps:
-  #     - name: Result
-  #       run: |
-  #         FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0
+  build-test-publish-wheel-summary:
+    needs: [pre-flight, build-test-publish-wheel]
+    if: |
+      (
+        needs.pre-flight.outputs.docs_only == 'true'
+        || needs.pre-flight.outputs.is_deployment_workflow == 'true'
+        || always()
+      )
+      && !cancelled()
+    runs-on: ubuntu-latest
+    steps:
+      - name: Result
+        run: |
+          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0
⚠️ Potential issue | 🟡 Minor

Add fallback handling for gh run view failure.

The command ends with || echo 0, but that fallback sits outside the $(...) command substitution, so it never assigns 0 to FAILED_JOBS. If gh run view fails, "0" is echoed to stdout and FAILED_JOBS is left empty.
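A minimal sketch of the behaviour described above. It assumes a hypothetical failing_query function as a stand-in for a failing gh run view (the real command is not invoked), and the variable names ORIGINAL and PATCHED are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a failing `gh run view`: prints nothing, exits non-zero.
failing_query() { return 1; }

# Original pattern: the fallback runs after the assignment has already
# happened, so its "0" goes to stdout and never reaches the variable.
ORIGINAL=$(failing_query) || echo 0
echo "original pattern leaves: '${ORIGINAL}'"

# Patched pattern: the fallback is itself an assignment.
PATCHED=$(failing_query) || PATCHED=0
echo "patched pattern leaves: '${PATCHED}'"
```

The comparison later in the step reads the value as ${FAILED_JOBS:-0}, which treats an empty value as 0, so with the original pattern the visible effect is likely a stray "0" in the log rather than a wrong result. The patched form makes the fallback explicit instead of relying on that default.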

🐛 Proposed fix for proper fallback assignment
-          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0
+          FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length' 2>/dev/null) || FAILED_JOBS=0
🤖 Prompt for AI Agents
In @.github/workflows/build-test-publish-wheel.yml at line 85, The assignment to
FAILED_JOBS needs a proper fallback when gh run view fails: change the
substitution so the fallback is produced inside the command substitution (or set
a default after assignment) — i.e., ensure the gh run view invocation (the
command producing the jq result) is followed by || echo 0 within the $(...) so
FAILED_JOBS receives "0" on failure; reference the FAILED_JOBS variable and the
gh run view ... --json jobs --jq '[.jobs[] | select(.status == "completed" and
.conclusion != "success")] | length' command when making this change.
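The alternative mentioned in the prompt above, placing the fallback inside the command substitution, can be sketched as follows. The functions failing_query and partial_query are hypothetical stand-ins, not the real gh invocation; the second one illustrates a caveat that may or may not apply to gh run view.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in: fails without printing anything.
failing_query() { return 1; }

# Hypothetical stand-in: prints a value and then fails.
partial_query() { echo 3; return 1; }

# Fallback inside $(...): the echoed 0 is captured by the substitution.
INSIDE=$(failing_query || echo 0)
echo "fallback inside substitution: '${INSIDE}'"

# Caveat: if the command wrote output before failing, the captured value
# holds both that output and the fallback, separated by a newline.
MIXED=$(partial_query || echo 0)
printf 'mixed output: %s\n' "$MIXED"
```

Assigning the default after the substitution (|| FAILED_JOBS=0) avoids the mixed-output case, which is one reason to prefer it when the command's failure output is not known.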


-  #         if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
-  #           echo "✅ All previous jobs completed successfully"
-  #           exit 0
-  #         else
-  #           echo "❌ Found $FAILED_JOBS failed job(s)"
-  #           # Show which jobs failed
-  #           gh run view $GITHUB_RUN_ID --json jobs --jq '.jobs[] | select(.status == "completed" and .conclusion != "success") | .name'
-  #           exit 1
-  #         fi
+          if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
+            echo "✅ All previous jobs completed successfully"
+            exit 0
+          else
+            echo "❌ Found $FAILED_JOBS failed job(s)"
+            # Show which jobs failed
+            gh run view $GITHUB_RUN_ID --json jobs --jq '.jobs[] | select(.status == "completed" and .conclusion != "success") | .name'
+            exit 1
+          fi
Comment on lines +83 to +95
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/build-test-publish-wheel.yml | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4295


🏁 Script executed:

# Search for SKIPPING_IS_ALLOWED and GH_TOKEN/GITHUB_TOKEN usage throughout the workflow
echo "=== Searching for SKIPPING_IS_ALLOWED ==="
grep -n "SKIPPING_IS_ALLOWED" .github/workflows/build-test-publish-wheel.yml

echo -e "\n=== Searching for GH_TOKEN or GITHUB_TOKEN ==="
grep -n "GH_TOKEN\|GITHUB_TOKEN" .github/workflows/build-test-publish-wheel.yml

echo -e "\n=== Checking if there's a global env block ==="
head -40 .github/workflows/build-test-publish-wheel.yml | grep -A 5 "^env:"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 338


🏁 Script executed:

# Check if there are any other workflow files that might define SKIPPING_IS_ALLOWED
find .github/workflows -name "*.yml" -type f -exec grep -l "SKIPPING_IS_ALLOWED" {} \;

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 255


🏁 Script executed:

# Check other workflow files for SKIPPING_IS_ALLOWED usage
echo "=== Searching all workflow files for SKIPPING_IS_ALLOWED ==="
grep -n "SKIPPING_IS_ALLOWED" .github/workflows/*.yml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1374


🏁 Script executed:

# Check if we can find documentation or examples about FW-CI-templates outputs
grep -rn "docs_only\|is_deployment_workflow" .github/workflows/

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3200


🏁 Script executed:

# Check the copyright-check.yml Result step
sed -n '39,60p' .github/workflows/copyright-check.yml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 935


🏁 Script executed:

# Check if build-docs.yml has env block with GH_TOKEN for the Result step
sed -n '50,65p' .github/workflows/build-docs.yml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 940


🏁 Script executed:

# Let me verify that github.token is the correct approach vs secrets.PAT
# Check what build-docs uses and if there's any documentation
echo "=== Checking build-docs.yml Result step ==="
sed -n '48,65p' .github/workflows/build-docs.yml

echo -e "\n=== Also checking if copyright-check.yml has the same issue ==="
sed -n '39,60p' .github/workflows/copyright-check.yml | head -25

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1978


Add missing environment variables to the Result step.

The Result step reads SKIPPING_IS_ALLOWED at line 87 and calls the gh CLI at lines 85 and 93, but the step defines neither the variable nor any authentication for gh. As a result SKIPPING_IS_ALLOWED is always empty, so the [ "$SKIPPING_IS_ALLOWED" == "true" ] test can never succeed, and the gh CLI calls may fail for lack of a token.

The pattern is correctly implemented in other workflows (e.g., build-docs.yml). Add the missing environment variables:

Fix: Add env block with GH_TOKEN and SKIPPING_IS_ALLOWED
      - name: Result
+        env:
+          GH_TOKEN: ${{ github.token }}
+          SKIPPING_IS_ALLOWED: ${{ needs.pre-flight.outputs.docs_only == 'true' || needs.pre-flight.outputs.is_deployment_workflow == 'true' }}
         run: |
           FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion != "success")] | length') || echo 0

           if [ "${FAILED_JOBS:-0}" -eq 0 ] || [ "$SKIPPING_IS_ALLOWED" == "true" ]; then
🤖 Prompt for AI Agents
In @.github/workflows/build-test-publish-wheel.yml around lines 83 - 95, The
Result step reads SKIPPING_IS_ALLOWED and calls the gh CLI without defining
either the variable or authentication. Add an env block to that step, following
the pattern used in build-docs.yml: set GH_TOKEN: ${{ github.token }} so the gh
run view calls can authenticate, and set SKIPPING_IS_ALLOWED: ${{
needs.pre-flight.outputs.docs_only == 'true' ||
needs.pre-flight.outputs.is_deployment_workflow == 'true' }} so the conditional
[ "$SKIPPING_IS_ALLOWED" == "true" ] evaluates as intended.
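To illustrate why the missing variable matters, here is a small sketch of the step's pass/fail condition. The summary_passes helper and its arguments are hypothetical and exist only for this example; it uses = rather than == so that it also runs under a plain POSIX sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the condition in the Result step.
#   $1: number of failed jobs (may be empty)
#   $2: value of SKIPPING_IS_ALLOWED (may be empty)
summary_passes() {
  if [ "${1:-0}" -eq 0 ] || [ "$2" = "true" ]; then
    return 0
  fi
  return 1
}

# With the variable empty, only the failed-job count decides the outcome.
summary_passes 0 "" && echo "0 failures, flag empty: pass"
summary_passes 2 "" || echo "2 failures, flag empty: fail"

# With the variable set to "true", failures are tolerated.
summary_passes 2 "true" && echo "2 failures, flag true: pass"
```

In the workflow itself the second argument corresponds to the environment variable that the proposed env block would supply; without that block the step always behaves like the "flag empty" cases.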

14 changes: 5 additions & 9 deletions pyproject.toml
@@ -81,9 +81,9 @@ dependencies = [
     "hydra-core>1.3,<=1.3.2",
     "megatron-core[dev,mlm]>=0.15.0a0,<0.17.0",
     "qwen-vl-utils",
-    "transformer-engine[pytorch]>=2.10.0a0,<2.12.0",
+    "transformer-engine[pytorch,core_cu13]>=2.10.0a0,<2.13.0",
     "mamba-ssm",
-    "nvidia-resiliency-ext",
+    "nvidia-resiliency-ext~=0.4.1",
     "causal-conv1d",
     "flash-linear-attention",
     "timm",
@@ -108,21 +108,17 @@ no-build-isolation-package = [
]
prerelease = "allow"
override-dependencies = [
"nvidia-modelopt[torch]>=0.37.0",
"torch; sys_platform == 'never'",
"torchvision; sys_platform == 'never'",
"triton; sys_platform == 'never'",
"transformer-engine[pytorch]>=2.9.0a0,<2.10.0",
"transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@6a34b6574fa6c29d9d07fdcddf9812cbb1488878",

]

# uv.sources allows us to override dependencies with VCS commits.
# Lets use this only for debugging purposes, but not for production (main).
[tool.uv.sources]
transformer-engine = { git = "https://github.com/NVIDIA/TransformerEngine.git", rev = "6a34b6574fa6c29d9d07fdcddf9812cbb1488878" }
megatron-core = { path = "3rdparty/Megatron-LM/" }
nvidia-resiliency-ext = { git = "https://github.com/NVIDIA/nvidia-resiliency-ext.git", rev = "54f85fe422d296cf04ea524130014bd3a2c3add1" }
nvidia-modelopt = { git = "https://github.com/NVIDIA/TensorRT-Model-Optimizer.git", rev = "0a4f0a8b933121f7af080261a0a5a7717f2c5d49" }
# mamba-ssm = { git = "https://github.com/yfw/mamba", branch = "general_stride_fix" }
nvidia-resiliency-ext = { git = "https://github.com/NVIDIA/nvidia-resiliency-ext.git", rev = "v0.4.1" } # Requires a source install to compile cupti for cuda13

[project.optional-dependencies]
recipes = [
2 changes: 1 addition & 1 deletion src/megatron/bridge/training/mlm_compat/model.py
@@ -51,7 +51,7 @@ def _get_transformer_layer_spec(args: argparse.Namespace, use_te: bool, use_kitc
         use_kitchen: Whether to use kitchen extension

     Returns:
-        transformer_layer_spec: The transformer layer specification
+        ModuleSpec: The transformer layer specification
     """
     if use_te:
         return get_gpt_layer_with_transformer_engine_spec(