[ROCm][Deepseekv3.2][Perf] dsv3.2 further optimization on vllm #32649
ganyi1996ppo wants to merge 9 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces performance optimizations for Deepseek v3.2 on ROCm by adding new Triton kernels and a specialized backend. It also refactors the sparse_attn_indexer logic into a dedicated file, which is a good architectural improvement. However, I've identified a critical bug in the refactored CUDA path that could lead to an AttributeError, and a significant limitation in the new ROCm kernels due to a hardcoded value that restricts flexibility. Addressing these issues will improve the robustness and applicability of these optimizations.
```python
):
    return torch.ops.vllm.sparse_attn_indexer(
        hidden_states,
        self.k_cache.layer_prefix,
```
The k_cache object is of type DeepseekV32IndexerCache, which has a prefix attribute but not a layer_prefix attribute. Using self.k_cache.layer_prefix will result in an AttributeError. The HIP path correctly uses self.k_cache.prefix. This should be consistent.
```diff
-        self.k_cache.layer_prefix,
+        self.k_cache.prefix,
```
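A minimal sketch of why the CUDA path fails (the cache class here is a stand-in with only the attributes named in the review, not the real vLLM implementation):

```python
class DeepseekV32IndexerCache:
    """Stand-in for the real cache class: it exposes `prefix`,
    but no `layer_prefix` attribute."""

    def __init__(self, prefix: str):
        self.prefix = prefix


cache = DeepseekV32IndexerCache("model.layers.0.self_attn.indexer")

# The buggy CUDA path accesses a nonexistent attribute and raises.
try:
    _ = cache.layer_prefix
except AttributeError:
    pass  # 'DeepseekV32IndexerCache' object has no attribute 'layer_prefix'

# The HIP path's access works as intended.
assert cache.prefix == "model.layers.0.self_attn.indexer"
```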
```python
    chunk.cu_seqlen_ke,
)
num_rows = logits.shape[0]
assert topk_tokens == 2048, "top_k_per_row assumes size 2048"
```
The code asserts that topk_tokens must be 2048. This hardcoded value limits the flexibility of the sparse attention indexer. If this is a temporary limitation of the underlying custom C++ op, it should be noted with a TODO. For broader applicability, this should be made more flexible or at least provide a more informative error message if the value is unsupported.
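One way to address this (a hypothetical sketch, not the PR's code: the constant name and helper are made up for illustration) is to replace the bare assert with an explicit guard that documents the limitation and fails with an actionable message:

```python
# TODO: lift this once the underlying custom top_k_per_row op
# supports sizes other than 2048.
SUPPORTED_TOPK = 2048


def check_topk(topk_tokens: int) -> None:
    """Raise a descriptive error instead of a bare AssertionError
    when an unsupported top-k size is requested."""
    if topk_tokens != SUPPORTED_TOPK:
        raise ValueError(
            f"top_k_per_row currently supports only "
            f"topk_tokens={SUPPORTED_TOPK}, got {topk_tokens}"
        )


check_topk(2048)  # supported size: passes silently
```

Unlike an `assert`, this check also survives `python -O`, which strips assertions.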
Force-pushed from 40265e8 to 9707e58
Signed-off-by: ganyi <ygan@amd.com>
Force-pushed from 9707e58 to e569fa2
Rebase of PR vllm-project#32649 onto main. Key changes:
- Use SHUFFLE layout instead of NHD in Triton indexer kernels for ROCm
- Use Triton-based indexer_k_quant_and_cache and cp_gather functions instead of C++ ops for ROCm sparse attention indexer
- Use deepgemm_fp8_paged_mqa_logits (with Preshuffle/ChunkK/KVBlockSize params) instead of stage1+sum approach in rocm_fp8_paged_mqa_logits

Co-authored-by: ganyi <ygan@amd.com>
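The cp_gather step above can be pictured as collecting the scattered physical blocks of a paged KV cache into one contiguous buffer by walking the sequence's block table. A toy pure-Python sketch of that semantics (names, shapes, and the function signature are assumptions for illustration; the real implementation is a Triton kernel over tensors):

```python
def cp_gather(kv_cache, block_table, seq_len, block_size):
    """Gather the first `seq_len` cached entries of one sequence
    into a contiguous list, following its block table."""
    out = []
    for logical_block, physical_block in enumerate(block_table):
        start = logical_block * block_size
        if start >= seq_len:
            break
        take = min(block_size, seq_len - start)  # last block may be partial
        out.extend(kv_cache[physical_block][:take])
    return out


# Paged cache: 4 physical blocks of size 2, allocated out of order.
kv_cache = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"], ["d0", "d1"]]
block_table = [2, 0, 3]  # logical order: c-block, a-block, d-block

gathered = cp_gather(kv_cache, block_table, seq_len=5, block_size=2)
# -> ["c0", "c1", "a0", "a1", "d0"]
```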
Force-pushed from 51208a1 to ac122ee
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
Force-pushed from ac122ee to 210821a
Force-pushed from 9bb16c2 to 51208a1
…ze=64 + SHUFFLE layout)

Based on vllm-project#32649 by @ganyi1996ppo, rebased onto current main with structural adaptations.

Switch the ROCm sparse MLA backend from the stage1+reduce indexer path to the gluon preshuffle kernel (deepgemm_fp8_paged_mqa_logits with Preshuffle=True, KVBlockSize=64). This replaces a two-kernel pipeline (deepgemm_fp8_paged_mqa_logits_stage1 + reduce<float,sum>) with a single fused Triton kernel, yielding ~1 ms savings per decode iteration on MI355X TP4 at 1K context.

Key changes:
- ROCMAiterMLASparseBackend now inherits from AiterMLABackend to reuse FP8 KV cache infrastructure (dtype support, prefill path, metadata)
- ROCMAiterMLASparseImpl inherits from AiterMLAImpl; forward_mqa overridden for sparse decode via mla_decode_fwd with topk indices
- Added FP8 casting + q_scale/k_scale passing in _forward_sparse_mla
- KV cache flattened for mla_decode_fwd when block_size > 1
- Triton indexer kernels use SHUFFLE layout (was NHD)
- rocm_fp8_paged_mqa_logits uses gluon API when block_size > 1, falls back to stage1 otherwise
- DeepseekV32IndexerBackend returns block_size=64 (was 1 on ROCm)
- Parent-allocated oversized buffers released in metadata builder __init__ to save ~52 MB/layer

Profiled result (1K input / 100 output, TP4 MI355X): Baseline 21.9 ms/iter → Gluon 18.2 ms/iter (includes run-to-run noise; conservative estimate ~1.5-2.0 ms real).

Accuracy (GSM8K 5-shot): 0.9121 vs 0.9424 baseline, a 3pp regression under investigation (likely FP8 scale handling or layout numerics).

Signed-off-by: frida-andersson <fanderss@amd.com>
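The stage1+reduce vs. fused-kernel distinction in the commit above is about whether the per-head partial logits are materialized and then summed in a second kernel, or accumulated in a single pass; either way the result must match. A toy pure-Python sketch of that equivalence (plain lists instead of FP8 paged tensors; all names here are illustrative, not the real kernel APIs):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mqa_logits_two_stage(q_heads, kv):
    """Two-kernel pipeline: stage 1 writes per-head partial logits,
    stage 2 is a reduce<float, sum> across heads."""
    partial = [[dot(qh, k) for k in kv] for qh in q_heads]  # stage 1
    return [sum(col) for col in zip(*partial)]              # stage 2


def mqa_logits_fused(q_heads, kv):
    """Single fused pass: accumulate the head sum while scanning KV,
    never materializing the per-head partials."""
    return [sum(dot(qh, k) for qh in q_heads) for k in kv]


q_heads = [[1.0, 2.0], [0.5, -1.0]]               # 2 query heads, dim 2
kv = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]         # 3 cached KV entries

assert mqa_logits_two_stage(q_heads, kv) == mqa_logits_fused(q_heads, kv)
```

Fusing removes one kernel launch and the intermediate per-head buffer, which is where the ~1 ms/iteration saving comes from.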
Closing this PR; it is replaced by #41217.
Purpose

This PR moves some of the original features from #29287 here, including some kernels that depend on Triton 3.5.0, and adds further optimizations to ROCMAiterMLASparseBackend. This PR depends on #29287 being merged first.

Test Plan

gsm8k with 20 shots
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.