[AMD] Add rocm7.2.3 support #26010
Open
sogalin wants to merge 1 commit into
This was referenced May 27, 2026
This was referenced Jun 4, 2026
Motivation
Extend the AMD ROCm docker matrix to include the rocm/pytorch:rocm7.2.3_ubuntu22.04_py3.10_pytorch_release_2.9.1 base image so MI300X / MI355X users can build sglang against the newer ROCm 7.2.3 toolchain. ROCm 7.2.3 ships a different bundled torch + triton (triton==3.5.1+rocm7.2.3.gita272dfa8) compared to ROCm 7.2.0, and AITER kernels need a compatible Triton with two upstream cherry-picks to run correctly on top of it. We also take the opportunity to clean up the existing rocm7.2.0 torch hot patch.
Modifications
1. New rocm7.2.3 stages
- BASE_IMAGE_942_ROCM723, BASE_IMAGE_950_ROCM723
- gfx942-rocm723, gfx950-rocm723
- BUILD_TRITON=1
- AITER_COMMIT_DEFAULT synced with the upstream upgrade (32e1e6d76988... from [AMD] Upgrade AITER #25896)
2. Custom Triton build for rocm723 (gated by case "${GPU_ARCH}" in *rocm723*))
- https://github.com/ROCm/triton.git
- ba5c1517555d04f → triton-lang/triton#8991
- dd998b6 → triton-lang/triton#9541
- rocm720 / rocm700 stages keep the unchanged legacy Triton build (triton-lang/triton @ 42270451…).
3. Unified torch METADATA hot patch (refactor)
Both rocm/pytorch:rocm7.2.0 and rocm/pytorch:rocm7.2.3 ship a pre-installed torch wheel whose METADATA hard-pins triton:
Requires-Dist: triton==3.5.1+rocm7.2.x.git...; platform_system == "Linux" ...
Since this Dockerfile replaces triton with a custom build (BUILD_TRITON=1), the pin causes pip check and any future pip install to fail with a version conflict.
The previous solution (hack.py) read a .whl from /, extracted it, edited METADATA, re-zipped it, and reinstalled it with pip install --force --no-deps. That added ~3 minutes per build and kept the 1.6GB source wheel at /. It also doesn't work for rocm7.2.3, where the base image does not ship a wheel at / (only the installed torch).
This PR replaces the wheel-roundtrip with a small hack_inplace.py that edits the installed torch-*.dist-info/METADATA to relax the pin to triton>=3.5.1 and blanks the matching RECORD row. It is used by both rocm720 and rocm723 via case "${GPU_ARCH}" in *rocm720*|*rocm723*).
Diff summary:
- ARG TORCH_ROCM_FILE, the wheel-based hack.py heredoc, the wheel-based RUN flow
- hack_inplace.py heredoc + a unified case branch
- rm -f /torch-*.whl
4. amd-smi case extended
rocm/pytorch:rocm7.2.3 does not pre-install amd-smi either (verified inside the base image). Case extended:
5. libdrm-amdgpu case generalized
The case pattern changes from rocm720 to rocm72 so the entire ROCm 7.2.x family bypasses the libdrm-amdgpu install (all 7.2.x bases already ship the packages). Behavior is unchanged for rocm720.
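The in-place METADATA patch described in item 3 can be sketched roughly as below. This is a minimal illustration, not the PR's actual hack_inplace.py: the function names, the regular expression, and the hypothetical torch-2.9.1.dist-info directory name are assumptions. "Blanks the matching RECORD row" is interpreted here as clearing the hash and size fields of the METADATA entry, which may differ from what the real script does. fnmatch is used only to mirror the shell case globs.

```python
import re
import tempfile
from fnmatch import fnmatch
from pathlib import Path

# Mirrors the shell branch: case "${GPU_ARCH}" in *rocm720*|*rocm723*)
PATCHED_ARCH_GLOBS = ("*rocm720*", "*rocm723*")

# Matches a hard pin such as
#   Requires-Dist: triton==3.5.1+rocm7.2.3.gita272dfa8; platform_system == "Linux"
# and captures the "Requires-Dist: triton" prefix.
PIN_RE = re.compile(r"^(Requires-Dist:\s*triton)\s*==\s*[^;\s]+", re.MULTILINE)


def needs_patch(gpu_arch: str) -> bool:
    """Return True if the stage name falls into the rocm720/rocm723 branch."""
    return any(fnmatch(gpu_arch, glob) for glob in PATCHED_ARCH_GLOBS)


def relax_triton_pin(dist_info: Path, floor: str = "3.5.1") -> bool:
    """Rewrite triton==X to triton>=floor in METADATA and clear its RECORD hash.

    Returns False (and changes nothing) when no hard pin is found.
    """
    metadata = dist_info / "METADATA"
    text = metadata.read_text(encoding="utf-8")
    patched, count = PIN_RE.subn(rf"\g<1>>={floor}", text)
    if count == 0:
        return False
    metadata.write_text(patched, encoding="utf-8")

    # The stored hash no longer matches the edited file, so clear the
    # hash and size fields of the METADATA row (assumed meaning of "blank").
    record = dist_info / "RECORD"
    if record.exists():
        rel = f"{dist_info.name}/METADATA"
        rows = []
        for row in record.read_text(encoding="utf-8").splitlines():
            if row.split(",")[0] == rel:
                row = f"{rel},,"
            rows.append(row)
        record.write_text("\n".join(rows) + "\n", encoding="utf-8")
    return True


if __name__ == "__main__":
    # Build a throwaway dist-info directory (hypothetical name) to try it out.
    dist = Path(tempfile.mkdtemp()) / "torch-2.9.1.dist-info"
    dist.mkdir()
    (dist / "METADATA").write_text(
        "Name: torch\n"
        'Requires-Dist: triton==3.5.1+rocm7.2.3.gita272dfa8; platform_system == "Linux"\n',
        encoding="utf-8",
    )
    (dist / "RECORD").write_text(
        "torch-2.9.1.dist-info/METADATA,sha256=abc,123\n", encoding="utf-8"
    )
    if needs_patch("gfx942-rocm723"):
        relax_triton_pin(dist)
    print((dist / "METADATA").read_text(encoding="utf-8"))
```

Editing the installed metadata directly avoids the extract, re-zip, and reinstall steps of the wheel-based flow, and it works even when no wheel file is present in the image.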
Accuracy Tests
SGLANG_AITER_MLA_PERSIST=1 AITER_MXFP4_MOE_SF=1 SGLANG_USE_AITER=1 SGLANG_INT4_WEIGHT=0 SGLANG_MOE_PADDING=1 SGLANG_SET_CPU_AFFINITY=1 SGLANG_ROCM_FUSED_DECODE_MLA=1 SGLANG_USE_ROCM700A=1 python3 -m sglang.launch_server --model-path /dockerx/data/DeepSeek-R1-MXFP4-Preview/ --tensor-parallel-size 8 --trust-remote-code --host 0.0.0.0 --port 8000 --log-requests --mem-fraction-static 0.95 --chunked-prefill-size 131072 --attention-backend aiter --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-path /dockerx/data/DeepSeek-R1-NextN --max-running-requests 64 --disable-radix-cache --kv-cache-dtype fp8_e4m3
GSM8K: 0.942
python3 -m sglang.launch_server --model-path /data/amd/Kimi-K2.5-MXFP4/ --tensor-parallel-size 4 --trust-remote-code --mem-fraction-static 0.765 --disable-radix-cache --decode-attention-backend aiter --prefill-attention-backend aiter --kv-cache-dtype fp8_e4m3 --chunked-prefill-size 16384 --max-prefill-tokens 16384 --max-running-requests 1024 --cuda-graph-max-bs 1024 --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 --enable-aiter-allreduce-fusion --host 127.0.0.1 --port 8888
GSM8K: 0.93
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
CI States
Latest PR Test (Base): ✅ Run #26244096516
Latest PR Test (Extra): ❌ Run #26244096405