
[ROCM] Using origin fp8 aiter kernel for Deepseek Model on MI300x #14043

Closed

benenzhu wants to merge 2 commits into sgl-project:main from benenzhu:fix_mi300x_deepseek

Conversation

@benenzhu benenzhu commented Nov 27, 2025

Motivation

The DeepSeek-R1 model has some performance degradation on AMD ROCm MI300x GPUs.
This PR ensures that gemm_a8w8_blockscale_triton (introduced in #13617) is only applied on CDNA4 architecture cards (e.g., MI355x), while MI300x cards continue to use the original aiter gemm_a8w8_blockscale kernel.

Modifications

  • Restrict gemm_a8w8_blockscale_triton to take effect only when _use_aiter_gfx95 is enabled (CDNA4 architecture)
  • MI300x (CDNA3) falls back to the original aiter gemm_a8w8_blockscale kernel
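The dispatch described above can be sketched roughly as follows. This is an illustrative snippet, not the PR's actual diff: `_use_aiter_gfx95` is the flag named in the description, but the helper function and string returns here are hypothetical stand-ins for the real kernel wiring in sglang.

```python
def select_blockscale_gemm(use_aiter_gfx95: bool) -> str:
    """Pick the fp8 block-scale GEMM backend by GPU architecture.

    Hypothetical sketch: the real code selects callables, not names.
    """
    if use_aiter_gfx95:
        # CDNA4 (e.g., MI355x / gfx95): use the Triton kernel from #13617
        return "gemm_a8w8_blockscale_triton"
    # CDNA3 (e.g., MI300x): fall back to the original aiter kernel
    return "gemm_a8w8_blockscale"
```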

Accuracy Tests

Model: DeepSeek-R1-0528

python3 benchmark/gsm8k/bench_sglang.py --num-questions 500 --parallel 30 --port 30000
| Metric | Before | After |
|---|---|---|
| GSM8K Accuracy | 0.964 | 0.962 |

Benchmarking and Profiling

Environment

  • Docker: rocm/sgl-dev:v0.5.5.post3-rocm700-mi30x-20251125
  • ROCm-SMI version: 4.0.0+1a5c7ec
  • ROCm-SMI-LIB version: 7.8.0

Benchmark Command

python3 -m sglang.launch_server --model-path=/data/DeepSeek-R1-0528 \
      --host=0.0.0.0 \
      --port=30000 \
      --trust-remote-code \
      --tensor-parallel-size=8 \
      --mem-fraction-static=0.8 \
      --cuda-graph-max-bs=128 \
      --chunked-prefill-size=196608 \
      --num-continuous-decode-steps=4 \
      --max-prefill-tokens=196608 --disable-radix-cache

python3 -m sglang.bench_serving --backend sglang --num-prompt 10

Results Summary

| Metric | Before | After |
|---|---|---|
| Mean ITL (ms) | 26.18 | 18.17 |
| Mean TPOT (ms) | 25.73 | 17.89 |
| Output throughput (tok/s) | 180.11 | 291.45 |
| Mean TTFT (ms) | 1951.16 | 220.22 |
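As a quick sanity check on the summary numbers above (values copied from the tables in this PR; the script itself is illustrative only):

```python
# Relative improvement implied by the before/after summary metrics.
before = {"itl_ms": 26.18, "tpot_ms": 25.73, "tok_per_s": 180.11, "ttft_ms": 1951.16}
after = {"itl_ms": 18.17, "tpot_ms": 17.89, "tok_per_s": 291.45, "ttft_ms": 220.22}

# Output throughput speedup from switching MI300x back to the aiter kernel.
speedup = after["tok_per_s"] / before["tok_per_s"]
print(f"Output throughput improvement: {speedup:.2f}x")  # ~1.62x

# Fractional reduction in mean time-to-first-token.
ttft_reduction = 1 - after["ttft_ms"] / before["ttft_ms"]
print(f"TTFT reduction: {ttft_reduction:.0%}")  # ~89%
```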
Detailed Results - Before
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     10
Benchmark duration (s):                  15.46
Total input tokens:                      1972
Total input text tokens:                 1972
Total input vision tokens:               0
Total generated tokens:                  2784
Total generated tokens (retokenized):    2780
Request throughput (req/s):              0.65
Input token throughput (tok/s):          127.58
Output token throughput (tok/s):         180.11
Total token throughput (tok/s):          307.69
Concurrency:                             5.96
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   9209.08
Median E2E Latency (ms):                 10112.96
---------------Time to First Token----------------
Mean TTFT (ms):                          1951.16
Median TTFT (ms):                        1948.86
P99 TTFT (ms):                           1998.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.73
Median TPOT (ms):                        26.30
P99 TPOT (ms):                           27.55
---------------Inter-Token Latency----------------
Mean ITL (ms):                           26.18
Median ITL (ms):                         26.21
P95 ITL (ms):                            27.60
P99 ITL (ms):                            27.89
Max ITL (ms):                            58.63
==================================================
Detailed Results - After
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     10
Benchmark duration (s):                  9.55
Total input tokens:                      1972
Total input text tokens:                 1972
Total input vision tokens:               0
Total generated tokens:                  2784
Total generated tokens (retokenized):    2782
Request throughput (req/s):              1.05
Input token throughput (tok/s):          206.44
Output token throughput (tok/s):         291.45
Total token throughput (tok/s):          497.89
Concurrency:                             5.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   5256.51
Median E2E Latency (ms):                 5893.98
---------------Time to First Token----------------
Mean TTFT (ms):                          220.22
Median TTFT (ms):                        220.79
P99 TTFT (ms):                           254.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.89
Median TPOT (ms):                        18.26
P99 TPOT (ms):                           19.23
---------------Inter-Token Latency----------------
Mean ITL (ms):                           18.17
Median ITL (ms):                         18.20
P95 ITL (ms):                            19.22
P99 ITL (ms):                            19.65
Max ITL (ms):                            62.31
==================================================


@benenzhu benenzhu closed this Dec 26, 2025