
[AMD] Support --enable-aiter-allreduce-fusion on AMD GPUs #13747

Merged
HaiShaw merged 8 commits into sgl-project:main from hubertlu-tw:aiter_fused_ar
Feb 25, 2026

Conversation

@hubertlu-tw
Collaborator

@hubertlu-tw hubertlu-tw commented Nov 22, 2025

Motivation

This PR adds --enable-aiter-allreduce-fusion for ROCm so SGLang can use a fused tensor-parallel allreduce + residual RMSNorm kernel when the runtime eligibility checks pass.

The intent is to improve serving performance while preserving correctness and existing non-ROCm behavior.

Modifications

  • Added the --enable-aiter-allreduce-fusion server option and wired it to ROCm fused AR+RMSNorm dispatch.
  • Implemented fused-path selection through the RMSNorm fused forward path, with fallback to standard TP allreduce + RMSNorm when fused execution is unavailable.
  • Kept NVIDIA/FlashInfer behavior unchanged.
  • Added a deterministic guard: --enable-aiter-allreduce-fusion is disabled when deterministic inference is enabled.
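The fused-path selection with fallback described above can be sketched as follows. This is an illustrative sketch, not SGLang's actual code: `fused_op` and `allreduce` are hypothetical stand-ins for the AITER fused AR+RMSNorm kernel and the standard TP allreduce.

```python
import math

def rmsnorm(vec, weight, eps=1e-6):
    """RMSNorm over a single vector: x / rms(x) * weight."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms * w for v, w in zip(vec, weight)]

def allreduce_residual_rmsnorm(x, residual, weight, eps=1e-6,
                               fused_op=None, allreduce=None):
    """Return (normed, new_residual) after a tensor-parallel allreduce.

    If a fused kernel is available, a single call performs allreduce +
    residual add + RMSNorm; otherwise fall back to the standard path.
    """
    if fused_op is not None:
        # Fused path: one kernel for allreduce + residual add + RMSNorm.
        return fused_op(x, residual, weight, eps)
    # Fallback path: TP allreduce, then residual add, then RMSNorm.
    if allreduce is not None:
        x = allreduce(x)
    new_residual = [a + b for a, b in zip(x, residual)]
    return rmsnorm(new_residual, weight, eps), new_residual
```

Either path must return the same (normed, new_residual) pair, which is what the GSM8K accuracy gate below verifies end to end.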

Commands Used

Case A server (without fusion):

SGLANG_AITER_MLA_PERSIST=1 \
AITER_MXFP4_MOE_SF=1 \
SGLANG_INT4_WEIGHT=0 \
SGLANG_MOE_PADDING=1 \
SGLANG_SET_CPU_AFFINITY=1 \
SGLANG_ROCM_FUSED_DECODE_MLA=1 \
SGLANG_USE_ROCM700A=1 \
SGLANG_USE_AITER=1 \
python3 -m sglang.launch_server \
  --model-path amd/deepseek-ai/DeepSeek-R1-MXFP4-Preview \
  --tp 8 \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --attention-backend aiter \
  --mem-fraction-static 0.95 \
  --disable-radix-cache --kv-cache-dtype fp8_e4m3

Case B server (with fusion):

SGLANG_AITER_MLA_PERSIST=1 \
AITER_MXFP4_MOE_SF=1 \
SGLANG_INT4_WEIGHT=0 \
SGLANG_MOE_PADDING=1 \
SGLANG_SET_CPU_AFFINITY=1 \
SGLANG_ROCM_FUSED_DECODE_MLA=1 \
SGLANG_USE_ROCM700A=1 \
SGLANG_USE_AITER=1 \
python3 -m sglang.launch_server \
  --model-path amd/deepseek-ai/DeepSeek-R1-MXFP4-Preview \
  --tp 8 \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --attention-backend aiter \
  --mem-fraction-static 0.95 \
  --disable-radix-cache --kv-cache-dtype fp8_e4m3 \
  --enable-aiter-allreduce-fusion

Accuracy gate:

python3 benchmark/gsm8k/bench_sglang.py \
  --host http://127.0.0.1 --port 30000 \
  --num-shots 8 --num-questions 1319 --parallel 1319

Serving sweep:

CON="8 16 32 64"
COMBINATIONS=("8192/1024")
for combo in "${COMBINATIONS[@]}"; do
  IFS="/" read -r isl osl <<< "$combo"
  for con in $CON; do
    python3 -m sglang.bench_serving \
      --backend sglang \
      --host 127.0.0.1 \
      --port 10086 \
      --dataset-name random \
      --random-range-ratio 1 \
      --num-prompts $((con * 10)) \
      --random-input $isl \
      --random-output $osl \
      --max-concurrency $con
  done
done

Accuracy Tests

GSM8K gate results:

  • Case A: Accuracy=0.950, Latency=72.881s, Output throughput=1871.650 tok/s
  • Case B: Accuracy=0.951, Latency=61.457s, Output throughput=2238.694 tok/s

Benchmarking and Profiling

Workload: random dataset, input/output = 8192/1024, concurrency {8,16,32,64}.

| Concurrency | Case A total tok/s | Case B total tok/s | Δ | Case A median E2E (ms) | Case B median E2E (ms) | Δ | Case A median TTFT (ms) | Case B median TTFT (ms) | Δ | Case A median TPOT (ms) | Case B median TPOT (ms) | Δ | Case A median ITL (ms) | Case B median ITL (ms) | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 5347.68 | 5759.25 | +7.70% | 13777.02 | 12801.64 | -7.08% | 1108.45 | 1125.73 | +1.56% | 12.38 | 11.41 | -7.84% | 11.86 | 10.87 | -8.35% |
| 16 | 8190.25 | 8579.27 | +4.75% | 17996.47 | 17166.69 | -4.61% | 1904.98 | 1932.47 | +1.44% | 15.73 | 14.89 | -5.34% | 14.42 | 13.55 | -6.03% |
| 32 | 12367.67 | 12690.42 | +2.61% | 23842.63 | 23218.23 | -2.62% | 3491.71 | 3551.33 | +1.71% | 19.89 | 19.22 | -3.37% | 17.03 | 16.30 | -4.29% |
| 64 | 17011.06 | 16847.80 | -0.96% | 34639.83 | 34967.07 | +0.94% | 6663.76 | 6780.08 | +1.75% | 27.35 | 27.55 | +0.73% | 21.38 | 21.51 | +0.61% |

Summary:

  • Fusion improves throughput and latency at concurrency 8/16/32.
  • At concurrency 64, fusion is marginally slower in this setup.
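As a quick sanity check on the table above, the "Δ" columns are the percent change of Case B relative to Case A:

```python
def delta_pct(a, b):
    """Percent change of B relative to A."""
    return (b - a) / a * 100

# Concurrency 8, total tok/s (Case A vs Case B):
print(f"{delta_pct(5347.68, 5759.25):+.2f}%")  # +7.70%
# Concurrency 8, median TPOT in ms (lower is better):
print(f"{delta_pct(12.38, 11.41):+.2f}%")      # -7.84%
```

The same formula applied to the GSM8K gate gives roughly a +19.6% output-throughput gain for Case B (2238.694 vs 1871.650 tok/s).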

Checklist


@hubertlu-tw hubertlu-tw marked this pull request as draft November 22, 2025 00:18
@github-actions github-actions bot added the documentation label Nov 22, 2025
@hubertlu-tw hubertlu-tw marked this pull request as ready for review February 11, 2026 02:48

@hubertlu-tw
Collaborator Author

/rerun-failed-ci

@HaiShaw HaiShaw merged commit 17b0aff into sgl-project:main Feb 25, 2026
102 of 143 checks passed
@bingxche
Collaborator

Thanks for your contribution!

I think the tests haven't been added to AMD CI yet; we need to add them.
cc @yctseng0211 @michaelzhang-ai

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

Labels

amd · deepseek · documentation · run-ci

4 participants