
[AMD] Support --enable-aiter-allreduce-fusion on AMD GPUs #13747

Merged
HaiShaw merged 8 commits into sgl-project:main from hubertlu-tw:aiter_fused_ar
Feb 25, 2026

Conversation

@hubertlu-tw
Collaborator

@hubertlu-tw hubertlu-tw commented Nov 22, 2025

Motivation

This PR adds --enable-aiter-allreduce-fusion for ROCm so SGLang can use a fused tensor-parallel allreduce + residual RMSNorm kernel when the runtime eligibility checks pass.

The intent is to improve serving performance while preserving correctness and existing non-ROCm behavior.

Modifications

  • Added the --enable-aiter-allreduce-fusion server option and wired it to ROCm fused AR+RMSNorm dispatch.
  • Implemented fused-path selection through the RMSNorm fused forward path, with fallback to standard TP allreduce + RMSNorm when fused execution is unavailable.
  • Kept NVIDIA/FlashInfer behavior unchanged.
  • Added a deterministic guard: --enable-aiter-allreduce-fusion is disabled when deterministic inference is enabled.
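The fused-path selection with fallback described above can be sketched as follows. This is an illustrative sketch, not SGLang's actual code: `fused_op` and `allreduce` are hypothetical stand-ins for the AITER fused AR+RMSNorm kernel and the standard TP allreduce.

```python
import math

def rmsnorm(vec, weight, eps=1e-6):
    """RMSNorm over a single vector: x / rms(x) * weight."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms * w for v, w in zip(vec, weight)]

def allreduce_residual_rmsnorm(x, residual, weight, eps=1e-6,
                               fused_op=None, allreduce=None):
    """Return (normed, new_residual) after a tensor-parallel allreduce.

    If a fused kernel is available, a single call performs allreduce +
    residual add + RMSNorm; otherwise fall back to the standard path.
    """
    if fused_op is not None:
        # Fused path: one kernel for allreduce + residual add + RMSNorm.
        return fused_op(x, residual, weight, eps)
    # Fallback path: TP allreduce, then residual add, then RMSNorm.
    if allreduce is not None:
        x = allreduce(x)
    new_residual = [a + b for a, b in zip(x, residual)]
    return rmsnorm(new_residual, weight, eps), new_residual
```

Either path must return the same (normed, new_residual) pair, which is what the GSM8K accuracy gate below verifies end to end.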

Commands Used

Case A server (without fusion):

SGLANG_AITER_MLA_PERSIST=1 \
AITER_MXFP4_MOE_SF=1 \
SGLANG_INT4_WEIGHT=0 \
SGLANG_MOE_PADDING=1 \
SGLANG_SET_CPU_AFFINITY=1 \
SGLANG_ROCM_FUSED_DECODE_MLA=1 \
SGLANG_USE_ROCM700A=1 \
SGLANG_USE_AITER=1 \
python3 -m sglang.launch_server \
  --model-path amd/deepseek-ai/DeepSeek-R1-MXFP4-Preview \
  --tp 8 \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --attention-backend aiter \
  --mem-fraction-static 0.95 \
  --disable-radix-cache --kv-cache-dtype fp8_e4m3

Case B server (with fusion):

SGLANG_AITER_MLA_PERSIST=1 \
AITER_MXFP4_MOE_SF=1 \
SGLANG_INT4_WEIGHT=0 \
SGLANG_MOE_PADDING=1 \
SGLANG_SET_CPU_AFFINITY=1 \
SGLANG_ROCM_FUSED_DECODE_MLA=1 \
SGLANG_USE_ROCM700A=1 \
SGLANG_USE_AITER=1 \
python3 -m sglang.launch_server \
  --model-path amd/deepseek-ai/DeepSeek-R1-MXFP4-Preview \
  --tp 8 \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --attention-backend aiter \
  --mem-fraction-static 0.95 \
  --disable-radix-cache --kv-cache-dtype fp8_e4m3 \
  --enable-aiter-allreduce-fusion

Accuracy gate:

python3 benchmark/gsm8k/bench_sglang.py \
  --host http://127.0.0.1 --port 30000 \
  --num-shots 8 --num-questions 1319 --parallel 1319

Serving sweep:

CON="8 16 32 64"
COMBINATIONS=("8192/1024")
for combo in "${COMBINATIONS[@]}"; do
  IFS="/" read -r isl osl <<< "$combo"
  for con in $CON; do
    python3 -m sglang.bench_serving \
      --backend sglang \
      --host 127.0.0.1 \
      --port 10086 \
      --dataset-name random \
      --random-range-ratio 1 \
      --num-prompts $((con * 10)) \
      --random-input $isl \
      --random-output $osl \
      --max-concurrency $con
  done
done

Accuracy Tests

GSM8K gate results:

  • Case A: Accuracy=0.950, Latency=72.881s, Output throughput=1871.650 tok/s
  • Case B: Accuracy=0.951, Latency=61.457s, Output throughput=2238.694 tok/s

Benchmarking and Profiling

Workload: random dataset, input/output = 8192/1024, concurrency {8,16,32,64}.

| Concurrency | Case A total tok/s | Case B total tok/s | Δ | Case A median E2E (ms) | Case B median E2E (ms) | Δ | Case A median TTFT (ms) | Case B median TTFT (ms) | Δ | Case A median TPOT (ms) | Case B median TPOT (ms) | Δ | Case A median ITL (ms) | Case B median ITL (ms) | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 5347.68 | 5759.25 | +7.70% | 13777.02 | 12801.64 | -7.08% | 1108.45 | 1125.73 | +1.56% | 12.38 | 11.41 | -7.84% | 11.86 | 10.87 | -8.35% |
| 16 | 8190.25 | 8579.27 | +4.75% | 17996.47 | 17166.69 | -4.61% | 1904.98 | 1932.47 | +1.44% | 15.73 | 14.89 | -5.34% | 14.42 | 13.55 | -6.03% |
| 32 | 12367.67 | 12690.42 | +2.61% | 23842.63 | 23218.23 | -2.62% | 3491.71 | 3551.33 | +1.71% | 19.89 | 19.22 | -3.37% | 17.03 | 16.30 | -4.29% |
| 64 | 17011.06 | 16847.80 | -0.96% | 34639.83 | 34967.07 | +0.94% | 6663.76 | 6780.08 | +1.75% | 27.35 | 27.55 | +0.73% | 21.38 | 21.51 | +0.61% |

Summary:

  • Fusion improves throughput and latency at concurrency 8/16/32.
  • At concurrency 64, fusion is marginally slower in this setup.
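As a quick sanity check on the table above, the "Δ" columns are the percent change of Case B relative to Case A:

```python
def delta_pct(a, b):
    """Percent change of B relative to A."""
    return (b - a) / a * 100

# Concurrency 8, total tok/s (Case A vs Case B):
print(f"{delta_pct(5347.68, 5759.25):+.2f}%")  # +7.70%
# Concurrency 8, median TPOT in ms (lower is better):
print(f"{delta_pct(12.38, 11.41):+.2f}%")      # -7.84%
```

The same formula applied to the GSM8K gate gives roughly a +19.6% output-throughput gain for Case B (2238.694 vs 1871.650 tok/s).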

Checklist


@hubertlu-tw hubertlu-tw marked this pull request as draft November 22, 2025 00:18
@github-actions github-actions bot added the documentation label Nov 22, 2025
@hubertlu-tw hubertlu-tw marked this pull request as ready for review February 11, 2026 02:48

@hubertlu-tw
Collaborator Author

/rerun-failed-ci

@HaiShaw HaiShaw merged commit 17b0aff into sgl-project:main Feb 25, 2026
102 of 143 checks passed
@bingxche
Collaborator

Thanks for your contribution!

I think the tests haven't been added to AMD CI yet; we need to add them.
cc @yctseng0211 @michaelzhang-ai

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

Labels

amd · deepseek · documentation · run-ci

4 participants