[AMD] Add rocm7.2.3 support #26010

Open
sogalin wants to merge 1 commit into sgl-project:main from sogalin:update-rocm723

@sogalin sogalin commented May 21, 2026


Motivation

Extend the AMD ROCm docker matrix to include the rocm/pytorch:rocm7.2.3_ubuntu22.04_py3.10_pytorch_release_2.9.1 base image so MI300X / MI355X users can build sglang against the newer ROCm 7.2.3 toolchain. ROCm 7.2.3 ships a different bundled torch + triton (triton==3.5.1+rocm7.2.3.gita272dfa8) compared to ROCm 7.2.0, and AITER kernels need a compatible Triton with two upstream cherry-picks to run correctly on top of it. We also take the opportunity to clean up the existing rocm7.2.0 torch hot patch.

Modifications

1. New rocm7.2.3 stages

  • New ARGs: BASE_IMAGE_942_ROCM723, BASE_IMAGE_950_ROCM723
  • New FROM stages: gfx942-rocm723, gfx950-rocm723
  • Usage examples added to the header comment block

2. Custom Triton build for rocm723 (gated by case "${GPU_ARCH}" in *rocm723*)

  • Repo: https://github.com/ROCm/triton.git
  • Commit: ba5c1517
  • Cherry-picks (from triton-lang/triton, reachable in the ROCm fork):

3. Unified torch METADATA hot patch (refactor)

Both rocm/pytorch:rocm7.2.0 and rocm/pytorch:rocm7.2.3 ship a pre-installed torch wheel whose METADATA hard-pins triton:
Requires-Dist: triton==3.5.1+rocm7.2.x.git...; platform_system == "Linux" ...

Since this Dockerfile replaces triton with a custom build (BUILD_TRITON=1), the pin causes pip check / future pip install to fail with a version conflict.
The previous solution (hack.py) read a .whl from /, extracted it, edited METADATA, re-zipped it, and reinstalled it with pip install --force --no-deps. That added ~3 minutes per build and kept the 1.6GB source wheel at /. It also doesn't work for rocm7.2.3, where the base image does not ship a wheel at / (only the installed torch).
This PR replaces the wheel round-trip with a small hack_inplace.py that edits the installed torch-*.dist-info/METADATA to relax the pin to triton>=3.5.1 and blanks the matching RECORD row. It is used by both rocm720 and rocm723 via case "${GPU_ARCH}" in *rocm720*|*rocm723*).
Diff summary:

  • Removed: ARG TORCH_ROCM_FILE, the wheel-based hack.py heredoc, the wheel-based RUN flow
  • Added: a single hack_inplace.py heredoc + a unified case branch
  • rocm720 build now finishes the patch in <1s instead of ~3 min, and the source wheel is cleaned up (rm -f /torch-*.whl)
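
For reviewers who want a feel for the in-place approach, here is a minimal sketch of what such a patch could look like. It is not the hack_inplace.py from this PR. The function name relax_triton_pin, the regex, and the reading of "blanks the matching RECORD row" as "empty the hash and size fields of the METADATA entry" are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches a hard pin such as:
#   Requires-Dist: triton==3.5.1+rocm7.2.3.gita272dfa8; platform_system == "Linux"
# and captures the "Requires-Dist: triton" prefix so it can be kept.
PIN = re.compile(r"^(Requires-Dist:[ \t]*triton)[ \t]*==[ \t]*[^;\s]+", re.MULTILINE)


def relax_triton_pin(dist_info: Path, floor: str = "3.5.1") -> bool:
    """Rewrite 'triton==<exact>' to 'triton>=<floor>' in METADATA.

    Returns True if a pin was found and rewritten, False otherwise.
    (Hypothetical helper; not the script shipped in the PR.)
    """
    metadata = dist_info / "METADATA"
    text = metadata.read_text(encoding="utf-8")
    patched, count = PIN.subn(rf"\g<1>>={floor}", text)
    if count == 0:
        return False  # nothing to relax; leave the install untouched
    metadata.write_text(patched, encoding="utf-8")

    # RECORD rows have the form "path,hash,size". Assumption: "blanking" the
    # row means keeping the path but emptying the stale hash and size.
    record = dist_info / "RECORD"
    if record.exists():
        target = f"{dist_info.name}/METADATA"
        rows = []
        for row in record.read_text(encoding="utf-8").splitlines():
            if row.split(",")[0] == target:
                row = f"{target},,"
            rows.append(row)
        record.write_text("\n".join(rows) + "\n", encoding="utf-8")
    return True
```

In a Dockerfile RUN step, a helper like this would be pointed at the torch-*.dist-info directory under site-packages. Because only two small text files are rewritten, no wheel has to be unpacked or reinstalled, which is consistent with the sub-second patch time reported above.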

4. amd-smi case extended

rocm/pytorch:rocm7.2.3 does not pre-install amd-smi either (verified inside the base image). Case extended:

-      *rocm720*) \
+      *rocm720*|*rocm723*) \
         echo "ROCm (GPU_ARCH=${GPU_ARCH}): installing amd-smi"; \
         cd /opt/rocm/share/amd_smi && python3 -m pip install --no-cache-dir . ;;

5. libdrm-amdgpu case generalized

The case pattern changes from rocm720 to rocm72 so that the entire ROCm 7.2.x family bypasses the libdrm-amdgpu install (all 7.2.x bases already ship the packages). Behavior is unchanged for rocm720.
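
The effect of widening the pattern can be checked quickly outside the Dockerfile. The sketch below uses Python's fnmatch as a stand-in for shell case globbing. It assumes the patterns are wrapped in * wildcards as in the amd-smi case above. gfx942-rocm723 and gfx950-rocm723 are the stage names added by this PR; the other GPU_ARCH values and the helper name skips_libdrm_install are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumed glob forms of the old and new case patterns.
OLD_PATTERN = "*rocm720*"
NEW_PATTERN = "*rocm72*"


def skips_libdrm_install(gpu_arch: str, pattern: str = NEW_PATTERN) -> bool:
    """Return True if gpu_arch matches the pattern, i.e. the
    libdrm-amdgpu install would be bypassed for that build."""
    return fnmatchcase(gpu_arch, pattern)


# The first and last values are illustrative assumptions.
for arch in ("gfx942-rocm720", "gfx942-rocm723", "gfx950-rocm723", "gfx942-rocm700"):
    print(arch, skips_libdrm_install(arch, OLD_PATTERN), skips_libdrm_install(arch))
```

Any GPU_ARCH that matched the old pattern also matches the new one, which is why rocm720 builds are unaffected.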

Accuracy Tests

SGLANG_AITER_MLA_PERSIST=1 AITER_MXFP4_MOE_SF=1 SGLANG_USE_AITER=1 SGLANG_INT4_WEIGHT=0 SGLANG_MOE_PADDING=1 SGLANG_SET_CPU_AFFINITY=1 SGLANG_ROCM_FUSED_DECODE_MLA=1 SGLANG_USE_ROCM700A=1 python3 -m sglang.launch_server --model-path /dockerx/data/DeepSeek-R1-MXFP4-Preview/ --tensor-parallel-size 8 --trust-remote-code --host 0.0.0.0 --port 8000 --log-requests --mem-fraction-static 0.95 --chunked-prefill-size 131072 --attention-backend aiter --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --speculative-draft-model-path /dockerx/data/DeepSeek-R1-NextN --max-running-requests 64 --disable-radix-cache --kv-cache-dtype fp8_e4m3
GSM8K: 0.942

python3 -m sglang.launch_server --model-path /data/amd/Kimi-K2.5-MXFP4/ --tensor-parallel-size 4 --trust-remote-code --mem-fraction-static 0.765 --disable-radix-cache --decode-attention-backend aiter --prefill-attention-backend aiter --kv-cache-dtype fp8_e4m3 --chunked-prefill-size 16384 --max-prefill-tokens 16384 --max-running-requests 1024 --cuda-graph-max-bs 1024 --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 --enable-aiter-allreduce-fusion --host 127.0.0.1 --port 8888
GSM8K: 0.93

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ✅ Run #26244096516
Latest PR Test (Extra): ❌ Run #26244096405
