
Use triton wheel no fork #2959

Merged
Dewei-Wang-sh merged 19 commits into main from use_triton_wheel_no_fork on Apr 30, 2026

Conversation

@mengfei-jiang (Contributor) commented Apr 29, 2026

Motivation

Currently, triton is either built from source in CI (the build-triton job) or relies on whatever version is pre-installed in the Docker base image. This is slow, fragile, and inconsistent across workflows. AMD now publishes pre-built amd-triton wheels for ROCm 7.0, 7.1, and 7.2, making source builds unnecessary. This PR centralizes triton installation into a single shared script that auto-detects the ROCm version and installs the matching amd-triton wheel, ensuring all CI workflows and local development use the same triton distribution.

Additionally, there were no tests verifying that triton operators work correctly with torch.compile. This PR adds 10 torch.compile compatibility tests to catch regressions early.

Technical Details

  • Replace triton source builds with amd-triton from AMD PyPI: Introduce a shared install_triton.sh script that auto-detects the ROCm version via rocm-core and installs the matching amd-triton wheel from https://pypi.amd.com/triton/rocm-{major}.{minor}.0/simple/. This eliminates the need to build triton from source in CI, removing the build-triton job from triton-test.yaml (~80 lines). A rough sketch of this logic is shown after this list.
  • Unify triton installation across all CI workflows: Add install_triton.sh to aiter-test.yaml, atom-test.yaml, sglang_downstream.yaml, and vllm_benchmark.yaml, ensuring all workflows use the same amd-triton version. The script uninstalls all conflicting triton variants (triton, pytorch-triton, pytorch-triton-rocm, triton-rocm, amd-triton) before installing.
  • Auto-install in develop mode: setup.py now calls install_triton.sh during python setup.py develop, so developers get amd-triton installed automatically.
  • Add torch.compile compatibility tests: 10 new test files under op_tests/triton_tests/torch_compile/ verifying that triton operators work correctly with torch.compile(backend="inductor", fullgraph=True). Covers activation, fused_mul_add, gemm, moe_routing, quantization (per-tensor/per-token), rmsnorm, rope, softmax, and topk. A minimal test sketch also follows this list.
  • Deduplicate _get_compiled helper: Extract the shared _get_compiled function into torch_compile/__init__.py so all test files import from a single location.
  • Add torch_compile test times to split_tests.sh: Include FILE_TIMES for the 10 new tests (~90s total) to enable proper shard balancing in CI.
  • Update README: Add Triton section documenting amd-triton installation with AMD PyPI index URLs and the install_triton.sh script.
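
For reference, a rough sketch of the install logic in Python, written the way the earlier inline setup.py version handled it. The pypi.amd.com index URL and the list of conflicting packages come from this PR; the dpkg-query call used to read the rocm-core version is an assumption, and the actual install_triton.sh may detect the version differently.

```python
import re
import subprocess
import sys

# Triton variants that conflict with amd-triton (from this PR's description).
CONFLICTING = [
    "triton", "pytorch-triton", "pytorch-triton-rocm",
    "triton-rocm", "amd-triton",
]

def detect_rocm_version():
    # Assumption: read the version of the installed rocm-core package;
    # install_triton.sh may use a different query.
    out = subprocess.check_output(
        ["dpkg-query", "-W", "--showformat=${Version}", "rocm-core"], text=True
    )
    major, minor = re.match(r"(\d+)\.(\d+)", out).groups()
    return major, minor

def install_amd_triton():
    major, minor = detect_rocm_version()
    index_url = f"https://pypi.amd.com/triton/rocm-{major}.{minor}.0/simple/"
    for pkg in CONFLICTING:
        # Uninstall may fail for packages that are not installed; ignore it.
        subprocess.call([sys.executable, "-m", "pip", "uninstall", "-y", pkg])
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "amd-triton",
         "--extra-index-url", index_url]
    )

if __name__ == "__main__":
    install_amd_triton()
```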
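
And a minimal sketch of the torch.compile compatibility test pattern. The _get_compiled helper, the inductor/fullgraph settings, the [float16, bfloat16] dtypes, and the relaxed bf16 tolerance all come from this PR; the operator under test is a stand-in, since the exact aiter kernel signatures are not reproduced here.

```python
import pytest
import torch

def _get_compiled(fn):
    # Shared helper (the PR places it in torch_compile/__init__.py):
    # reset dynamo so fullgraph=True does not trip the recompile limit,
    # then compile with the inductor backend.
    torch._dynamo.reset()
    return torch.compile(fn, backend="inductor", fullgraph=True)

@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
def test_op_matches_eager_under_torch_compile(dtype):
    x = torch.randn(128, 256, dtype=dtype, device="cuda")

    def op(t):
        # Placeholder for a triton operator; the real tests exercise the
        # aiter triton kernels (activation, rmsnorm, softmax, ...).
        return torch.nn.functional.gelu(t)

    eager_out = op(x)
    compiled_out = _get_compiled(op)(x)

    # bfloat16 has only a 7-bit mantissa, so tolerances are relaxed to 0.1.
    tol = 0.1 if dtype == torch.bfloat16 else 1e-3
    torch.testing.assert_close(compiled_out, eager_out, atol=tol, rtol=tol)
```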

Test Plan

  • triton-test.yaml passes without the build-triton job (amd-triton installed via install_triton.sh)
  • aiter-test.yaml standard and multi-GPU tests pass with the Install amd-triton step
  • atom-test.yaml, sglang_downstream.yaml, and vllm_benchmark.yaml CI workflows pass
  • All 10 torch_compile tests pass on MI300X and MI35X runners

Test Result

All of the above tests pass.

Submission Checklist

@mengfei-jiang mengfei-jiang requested a review from a team April 29, 2026 09:40
@github-actions (Contributor):

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:triton-300x - Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
  • ci:sglang - SGLang integration tests
  • ci:atom - ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  • ci:vllm - vLLM benchmark
  • ci:all - All of the above

Add labels via the sidebar or gh pr edit 2959 --add-label <label>

mengfei-jiang and others added 19 commits April 29, 2026 15:50
Update build-triton job to first attempt downloading a pre-built wheel
from rocm.frameworks-nightlies.amd.com, falling back to source build
only when the download fails. Also bump TRITON_COMMIT to d1660454.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Update wheel URL format from triton-3.7.0+amd.git<commit>
to triton-3.7.0+rocm7.2.0.git<commit> to match the actual
naming convention on the nightly server.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The server returns 403 when '+' is used literally in the URL.
Percent-encode it as %2B while keeping the local filename with '+'.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
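
(As a side note, a quick Python illustration of the encoding behaviour this commit describes; the wheel filename is made up to match the naming convention above.)

```python
from urllib.parse import quote

# Illustrative filename only, following triton-3.7.0+rocm7.2.0.git<commit>.
wheel = "triton-3.7.0+rocm7.2.0.gitd1660454-py3-none-any.whl"

# In the request URL the '+' must be percent-encoded as %2B ...
print(quote(wheel))  # triton-3.7.0%2Brocm7.2.0.gitd1660454-py3-none-any.whl
# ... while the file keeps its literal '+' when saved locally.
```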
The wheel files are under gfx942-gfx950/ not gfx942-gfx950/triton/.
The triton/ subdirectory is a PEP 503 index page whose links point
to ../  (the parent directory).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- Add requirements-triton.txt with --extra-index-url for AMD PyPI
- Add pip install -r requirements-triton.txt in build_aiter_triton.sh
- Remove build-triton job from triton-test.yaml, use BUILD_TRITON=0
- Update README.md with Triton installation instructions

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Relax atol/rtol to 0.1 for bfloat16 due to lower precision (7-bit
mantissa). Add fullgraph=True to enforce full graph compilation
without eager fallback.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- Change dtype params to [float16, bfloat16] across all torch_compile tests
- Add torch._dynamo.reset() to prevent recompile limit with fullgraph=True
- Relax tolerance for bf16 in fused_mul_add and activation tests (atol=0.1)

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Move triton dependency from a separate requirements-triton.txt (using
AMD PyPI index) to the standard amd-triton package on PyPI, added as
both a build and runtime dependency. This simplifies installation by
making `pip install -e .` handle triton automatically.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
amd-triton is now available on PyPI directly, so the extra index URL
for AMD PyPI is no longer needed.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
If amd-triton is not yet installed, pip uninstall returns non-zero
which would abort setup.py. The reinstall call is kept as check_call
to ensure amd-triton is always installed with the latest content.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
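
(A minimal sketch of the call/check_call distinction this commit describes; the exact pip arguments used by setup.py are assumptions.)

```python
import subprocess
import sys

# The uninstall may legitimately fail when amd-triton is not installed yet,
# so use subprocess.call and ignore the return code.
subprocess.call([sys.executable, "-m", "pip", "uninstall", "-y", "amd-triton"])

# The (re)install must succeed; check_call raises CalledProcessError on a
# non-zero exit status, aborting setup.py if the install fails.
subprocess.check_call([sys.executable, "-m", "pip", "install", "amd-triton"])
```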
Use requirements-triton.txt for triton installation instead of
embedding it in pyproject.toml/setup.py. The file now references
amd-triton from PyPI directly, no extra index URL needed.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Replace requirements-triton.txt with inline ROCm version detection
in setup.py and CI script. Uninstall all conflicting triton packages
(triton, pytorch-triton, pytorch-triton-rocm, triton-rocm, amd-triton)
before installing amd-triton with the correct --extra-index-url based
on the detected ROCm version.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Add ROCm version detection and amd-triton installation to atom-test,
vllm_benchmark, and sglang_downstream workflows before pip install -e .
Wrap amd-triton install in setup.py with try/except to avoid build
failure in PEP 517 isolated environments where pip is unavailable.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
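
(Similarly, a sketch of the try/except guard this commit describes; install_amd_triton is a hypothetical stand-in for the inline install code in setup.py.)

```python
import subprocess
import sys

def install_amd_triton():
    # Hypothetical helper standing in for the inline pip-based install.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "amd-triton"])

try:
    install_amd_triton()
except Exception as exc:
    # In PEP 517 isolated build environments pip may be unavailable;
    # skip the triton install rather than failing the whole build.
    print(f"Skipping amd-triton install: {exc}")
```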
Consolidate duplicated ROCm version detection and amd-triton
installation logic into .github/scripts/install_triton.sh. Update
all CI workflows (build_aiter_triton, atom-test, vllm_benchmark,
sglang_downstream) and README to call the shared script.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Replace inline ROCm version detection and amd-triton install code
in setup.py with a call to the shared install_triton.sh script.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…nstall amd-triton in aiter-test CI

- Move _get_compiled into torch_compile/__init__.py so all test files
  import from a single location
- Add FILE_TIMES for the 10 torch_compile tests to split_tests.sh
- Add Install amd-triton step in aiter-test.yaml for standard and
  multi-gpu test jobs

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@mengfei-jiang force-pushed the use_triton_wheel_no_fork branch from 7d3a4db to 6bdadaa on April 29, 2026 at 15:57
@brunomazzottiamd (Contributor):

I've checked the new test files, LGTM! However, I don't have enough CI knowledge to comment on the scripts and workflows. Let's wait for an approval from the AITER CI team.

@brunomazzottiamd (Contributor):

Failures in Flash Attention Integration jobs are being addressed in #2695.

@brunomazzottiamd (Contributor):

Added the ci:triton-300x label to trigger execution of the Triton unit tests on gfx942 nodes.

@Dewei-Wang-sh (Contributor) left a comment:

overall, lgtm

@Dewei-Wang-sh (Contributor):

Motivation

Replace the triton dependency management with auto-detection of the ROCm version to install the correct amd-triton

However, this description is outdated and only covers one point; please update it accordingly. You need to clarify what this PR does and what it is for.

@Dewei-Wang-sh Dewei-Wang-sh requested a review from zhanglx13 April 30, 2026 01:38
@Dewei-Wang-sh Dewei-Wang-sh merged commit 77bda8d into main Apr 30, 2026
123 of 144 checks passed
@Dewei-Wang-sh Dewei-Wang-sh deleted the use_triton_wheel_no_fork branch April 30, 2026 03:01
gyohuangxin pushed a commit that referenced this pull request May 3, 2026
…2985)

PR #2959 introduced .github/scripts/install_triton.sh and added an
"Install amd-triton" step to aiter-test.yaml that calls the script
inside the docker container. The container's working directory is the
PR's checkout, so any PR opened or last synced before #2959 landed on
main does not contain the script and fails with:

  bash: line 1: ./.github/scripts/install_triton.sh: No such file
  ##[error]Process completed with exit code 127.

This blocks Standard Tests on every stale PR (e.g. #2969, all 9/10
shards failing), forcing authors to rebase just to get green CI.

Fix: in the Install amd-triton step, fall back to fetching the script
from the base ref via raw.githubusercontent.com when it is not present
in the runner workspace. Workflow files for PR events always come from
the base branch, so this stays consistent with the rest of the CI flow
and adds no security boundary crossing.

Applied symmetrically to the Standard Tests (1 GPU) and Multi-GPU
Tests (8 GPU) jobs. atom-test.yaml and sglang_downstream.yaml also
call the script after a fresh git clone of the PR sha and would
benefit from a similar fallback in a follow-up.
chun-wan pushed a commit that referenced this pull request May 4, 2026 (the same #2985 fix described above)
Liang-jianhao97 pushed a commit that referenced this pull request May 7, 2026 (commit message identical to the Motivation above)
Liang-jianhao97 pushed a commit that referenced this pull request May 7, 2026 (the same #2985 fix described above)