[ROCM] Using origin fp8 aiter kernel for Deepseek Model on MI300x #14043
Closed
benenzhu wants to merge 2 commits into sgl-project:main
Motivation
The DeepSeek-R1 model shows a performance degradation on AMD ROCm MI300x GPUs.
This PR ensures that gemm_a8w8_blockscale_triton (introduced in #13617) is only applied on CDNA4 architecture cards (e.g., MI355x), while MI300x cards continue to use the original aiter gemm_a8w8_blockscale kernel.
Modifications
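The intended kernel selection can be sketched roughly as follows. This is only an illustration, not the code in this PR: the two stub functions and select_blockscale_gemm are hypothetical names, and the boolean argument stands in for the _use_aiter_gfx95 flag.

```python
from typing import Callable


# Hypothetical stand-ins for the two real kernels; each only reports which path ran.
def gemm_a8w8_blockscale_triton_stub() -> str:
    return "gemm_a8w8_blockscale_triton"


def aiter_gemm_a8w8_blockscale_stub() -> str:
    return "aiter gemm_a8w8_blockscale"


def select_blockscale_gemm(use_aiter_gfx95: bool) -> Callable[[], str]:
    """Pick the FP8 block-scale GEMM path (assumed gating, per the PR description)."""
    if use_aiter_gfx95:
        # CDNA4 cards (e.g., MI355x): use the Triton variant from #13617.
        return gemm_a8w8_blockscale_triton_stub
    # Other cards such as MI300x: keep the original aiter kernel.
    return aiter_gemm_a8w8_blockscale_stub


for card, flag in (("MI355x", True), ("MI300x", False)):
    print(f"{card}: {select_blockscale_gemm(flag)()}")
```

With the flag off, the MI300x path never touches the Triton kernel, which is the behavior the benchmark in this PR measures.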
Replace:
- gemm_a8w8_blockscale_triton: restricted to only take effect when _use_aiter_gfx95 is enabled (CDNA4 architecture)
- aiter gemm_a8w8_blockscale kernel: used otherwise (e.g., on MI300x)
Accuracy Tests
Model : DeepSeek-R1-0528/
Benchmarking and Profiling
Environment
rocm/sgl-dev:v0.5.5.post3-rocm700-mi30x-20251125
4.0.0+1a5c7ec
7.8.0
Benchmark Command
python3 -m sglang.launch_server --model-path=/data/DeepSeek-R1-0528 \
    --host=0.0.0.0 \
    --port=30000 \
    --trust-remote-code \
    --tensor-parallel-size=8 \
    --mem-fraction-static=0.8 \
    --cuda-graph-max-bs=128 \
    --chunked-prefill-size=196608 \
    --num-continuous-decode-steps=4 \
    --max-prefill-tokens=196608 \
    --disable-radix-cache

python3 -m sglang.bench_serving --backend sglang --num-prompt 10
Results Summary
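The headline changes can be recomputed from the Before and After outputs reported below. The following is a small sketch that copies a handful of those numbers and prints the improvement factor for each; the helper name improvement_factor and the choice of metrics are illustrative assumptions, not part of the benchmark tooling.

```python
# Values copied from the "Detailed Results" outputs in this PR.
before = {
    "Benchmark duration (s)": 15.46,
    "Total token throughput (tok/s)": 307.69,
    "Mean E2E Latency (ms)": 9209.08,
    "Mean TTFT (ms)": 1951.16,
    "Mean TPOT (ms)": 25.73,
}
after = {
    "Benchmark duration (s)": 9.55,
    "Total token throughput (tok/s)": 497.89,
    "Mean E2E Latency (ms)": 5256.51,
    "Mean TTFT (ms)": 220.22,
    "Mean TPOT (ms)": 17.89,
}

# Throughput improves when it goes up; every other metric here improves when it goes down.
HIGHER_IS_BETTER = {"Total token throughput (tok/s)"}


def improvement_factor(metric: str) -> float:
    """Return how many times better the After run is for the given metric."""
    b, a = before[metric], after[metric]
    return a / b if metric in HIGHER_IS_BETTER else b / a


for metric in before:
    print(f"{metric}: {before[metric]} -> {after[metric]} "
          f"({improvement_factor(metric):.2f}x better)")
```

On these ten-prompt runs this works out to roughly 1.6x higher total token throughput, about 1.4x lower mean TPOT, and close to 8.9x lower mean TTFT.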
Detailed Results - Before
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 10
Benchmark duration (s): 15.46
Total input tokens: 1972
Total input text tokens: 1972
Total input vision tokens: 0
Total generated tokens: 2784
Total generated tokens (retokenized): 2780
Request throughput (req/s): 0.65
Input token throughput (tok/s): 127.58
Output token throughput (tok/s): 180.11
Total token throughput (tok/s): 307.69
Concurrency: 5.96
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 9209.08
Median E2E Latency (ms): 10112.96
---------------Time to First Token----------------
Mean TTFT (ms): 1951.16
Median TTFT (ms): 1948.86
P99 TTFT (ms): 1998.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 25.73
Median TPOT (ms): 26.30
P99 TPOT (ms): 27.55
---------------Inter-Token Latency----------------
Mean ITL (ms): 26.18
Median ITL (ms): 26.21
P95 ITL (ms): 27.60
P99 ITL (ms): 27.89
Max ITL (ms): 58.63
==================================================
Detailed Results - After
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 10
Benchmark duration (s): 9.55
Total input tokens: 1972
Total input text tokens: 1972
Total input vision tokens: 0
Total generated tokens: 2784
Total generated tokens (retokenized): 2782
Request throughput (req/s): 1.05
Input token throughput (tok/s): 206.44
Output token throughput (tok/s): 291.45
Total token throughput (tok/s): 497.89
Concurrency: 5.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 5256.51
Median E2E Latency (ms): 5893.98
---------------Time to First Token----------------
Mean TTFT (ms): 220.22
Median TTFT (ms): 220.79
P99 TTFT (ms): 254.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 17.89
Median TPOT (ms): 18.26
P99 TPOT (ms): 19.23
---------------Inter-Token Latency----------------
Mean ITL (ms): 18.17
Median ITL (ms): 18.20
P95 ITL (ms): 19.22
P99 ITL (ms): 19.65
Max ITL (ms): 62.31
==================================================
Checklist