
[ROCM] Using origin fp8 aiter kernel for Deepseek Model on MI300x #14043

Closed

benenzhu wants to merge 2 commits into sgl-project:main from benenzhu:fix_mi300x_deepseek

Conversation

@benenzhu benenzhu commented Nov 27, 2025

Motivation

The DeepSeek-R1 model has some performance degradation on AMD ROCm MI300x GPUs.
This PR ensures that gemm_a8w8_blockscale_triton (introduced in #13617) is only applied on CDNA4 architecture cards (e.g., MI355x), while MI300x cards continue to use the original aiter gemm_a8w8_blockscale kernel.

Modifications

  • Restrict gemm_a8w8_blockscale_triton to take effect only when _use_aiter_gfx95 is enabled (CDNA4 architecture)
  • MI300x (CDNA3) falls back to the original aiter gemm_a8w8_blockscale kernel
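The dispatch described above can be sketched roughly as follows. This is an illustrative snippet, not the PR's actual diff: `_use_aiter_gfx95` is the flag named in the description, but the helper function and string returns here are hypothetical stand-ins for the real kernel wiring in sglang.

```python
def select_blockscale_gemm(use_aiter_gfx95: bool) -> str:
    """Pick the fp8 block-scale GEMM backend by GPU architecture.

    Hypothetical sketch: the real code selects callables, not names.
    """
    if use_aiter_gfx95:
        # CDNA4 (e.g., MI355x / gfx95): use the Triton kernel from #13617
        return "gemm_a8w8_blockscale_triton"
    # CDNA3 (e.g., MI300x): fall back to the original aiter kernel
    return "gemm_a8w8_blockscale"
```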

Accuracy Tests

Model: DeepSeek-R1-0528

python3 benchmark/gsm8k/bench_sglang.py --num-questions 500 --parallel 30 --port 30000
| Metric | Before | After |
|---|---|---|
| GSM8K Accuracy | 0.964 | 0.962 |

Benchmarking and Profiling

Environment

  • Docker: rocm/sgl-dev:v0.5.5.post3-rocm700-mi30x-20251125
  • ROCm-SMI version: 4.0.0+1a5c7ec
  • ROCm-SMI-LIB version: 7.8.0

Benchmark Command

python3 -m sglang.launch_server --model-path=/data/DeepSeek-R1-0528 \
      --host=0.0.0.0 \
      --port=30000 \
      --trust-remote-code \
      --tensor-parallel-size=8 \
      --mem-fraction-static=0.8 \
      --cuda-graph-max-bs=128 \
      --chunked-prefill-size=196608 \
      --num-continuous-decode-steps=4 \
      --max-prefill-tokens=196608 --disable-radix-cache

python3 -m sglang.bench_serving --backend sglang --num-prompt 10

Results Summary

| Metric | Before | After |
|---|---|---|
| Mean ITL (ms) | 26.18 | 18.17 |
| Mean TPOT (ms) | 25.73 | 17.89 |
| Output throughput (tok/s) | 180.11 | 291.45 |
| Mean TTFT (ms) | 1951.16 | 220.22 |
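As a quick sanity check on the summary numbers above (values copied from the tables in this PR; the script itself is illustrative only):

```python
# Relative improvement implied by the before/after summary metrics.
before = {"itl_ms": 26.18, "tpot_ms": 25.73, "tok_per_s": 180.11, "ttft_ms": 1951.16}
after = {"itl_ms": 18.17, "tpot_ms": 17.89, "tok_per_s": 291.45, "ttft_ms": 220.22}

# Output throughput speedup from switching MI300x back to the aiter kernel.
speedup = after["tok_per_s"] / before["tok_per_s"]
print(f"Output throughput improvement: {speedup:.2f}x")  # ~1.62x

# Fractional reduction in mean time-to-first-token.
ttft_reduction = 1 - after["ttft_ms"] / before["ttft_ms"]
print(f"TTFT reduction: {ttft_reduction:.0%}")  # ~89%
```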
Detailed Results - Before
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     10
Benchmark duration (s):                  15.46
Total input tokens:                      1972
Total input text tokens:                 1972
Total input vision tokens:               0
Total generated tokens:                  2784
Total generated tokens (retokenized):    2780
Request throughput (req/s):              0.65
Input token throughput (tok/s):          127.58
Output token throughput (tok/s):         180.11
Total token throughput (tok/s):          307.69
Concurrency:                             5.96
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   9209.08
Median E2E Latency (ms):                 10112.96
---------------Time to First Token----------------
Mean TTFT (ms):                          1951.16
Median TTFT (ms):                        1948.86
P99 TTFT (ms):                           1998.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.73
Median TPOT (ms):                        26.30
P99 TPOT (ms):                           27.55
---------------Inter-Token Latency----------------
Mean ITL (ms):                           26.18
Median ITL (ms):                         26.21
P95 ITL (ms):                            27.60
P99 ITL (ms):                            27.89
Max ITL (ms):                            58.63
==================================================
Detailed Results - After
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     10
Benchmark duration (s):                  9.55
Total input tokens:                      1972
Total input text tokens:                 1972
Total input vision tokens:               0
Total generated tokens:                  2784
Total generated tokens (retokenized):    2782
Request throughput (req/s):              1.05
Input token throughput (tok/s):          206.44
Output token throughput (tok/s):         291.45
Total token throughput (tok/s):          497.89
Concurrency:                             5.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   5256.51
Median E2E Latency (ms):                 5893.98
---------------Time to First Token----------------
Mean TTFT (ms):                          220.22
Median TTFT (ms):                        220.79
P99 TTFT (ms):                           254.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.89
Median TPOT (ms):                        18.26
P99 TPOT (ms):                           19.23
---------------Inter-Token Latency----------------
Mean ITL (ms):                           18.17
Median ITL (ms):                         18.20
P95 ITL (ms):                            19.22
P99 ITL (ms):                            19.65
Max ITL (ms):                            62.31
==================================================


@benenzhu benenzhu closed this Dec 26, 2025