
[ROCm][Deepseekv3.2][Perf] dsv3.2 further optimization on vllm#32649

Closed
ganyi1996ppo wants to merge 9 commits into vllm-project:main from ROCm:ganyi/dsv3.2_further_opt

Conversation

Contributor

@ganyi1996ppo ganyi1996ppo commented Jan 20, 2026

Purpose

This PR moves some of the original features from #29287 here, including some kernels that depend on Triton 3.5.0, and adds further optimizations to ROCMAiterMLASparseBackend. This PR depends on #29287 being merged first.

Test Plan

gsm8k with 20-shot
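The PR does not include the exact test command, but a typical lm-evaluation-harness invocation for this plan might look like the following sketch (everything beyond the task and shot count is an assumption, not from the PR):

```shell
# Hedged sketch of the GSM8K 20-shot test command; the PR does not give
# the exact invocation or model args, so those are assumptions. Requires
# lm-eval (pip install lm-eval) and a working vLLM install.
CMD="lm_eval --model vllm --tasks gsm8k --num_fewshot 20"
if command -v lm_eval >/dev/null 2>&1; then
  $CMD
else
  echo "lm_eval not installed; would run: $CMD"
fi
```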

Test Result

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 20 | exact_match | 0.9484 | ± 0.0061 |
| gsm8k | 3 | strict-match | 20 | exact_match | 0.9484 | ± 0.0061 |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify Bot added deepseek Related to DeepSeek models rocm Related to AMD ROCm v1 labels Jan 20, 2026
Contributor

mergify Bot commented Jan 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ganyi1996ppo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jan 20, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces performance optimizations for Deepseek v3.2 on ROCm by adding new Triton kernels and a specialized backend. It also refactors the sparse_attn_indexer logic into a dedicated file, which is a good architectural improvement. However, I've identified a critical bug in the refactored CUDA path that could lead to an AttributeError, and a significant limitation in the new ROCm kernels due to a hardcoded value that restricts flexibility. Addressing these issues will improve the robustness and applicability of these optimizations.

):
return torch.ops.vllm.sparse_attn_indexer(
hidden_states,
self.k_cache.layer_prefix,

critical

The k_cache object is of type DeepseekV32IndexerCache, which has a prefix attribute but not a layer_prefix attribute. Using self.k_cache.layer_prefix will result in an AttributeError. The HIP path correctly uses self.k_cache.prefix. This should be consistent.

Suggested change:

```diff
-            self.k_cache.layer_prefix,
+            self.k_cache.prefix,
```
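The failure mode the reviewer describes can be reproduced with a minimal sketch; the class below only mirrors the attribute names under discussion, not the real shape of DeepseekV32IndexerCache:

```python
# Minimal sketch of the attribute mismatch flagged above. The real
# DeepseekV32IndexerCache is far more complex; this stub exists only to
# show why accessing `layer_prefix` raises AttributeError.
class DeepseekV32IndexerCache:
    def __init__(self, prefix: str):
        # The cache exposes `prefix`, not `layer_prefix`.
        self.prefix = prefix


k_cache = DeepseekV32IndexerCache(prefix="model.layers.0.self_attn.indexer")

# The HIP path's access works:
print(k_cache.prefix)

# The CUDA path's access fails:
try:
    _ = k_cache.layer_prefix
except AttributeError as e:
    print(f"AttributeError: {e}")
```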

chunk.cu_seqlen_ke,
)
num_rows = logits.shape[0]
assert topk_tokens == 2048, "top_k_per_row assumes size 2048"

high

The code asserts that topk_tokens must be 2048. This hardcoded value limits the flexibility of the sparse attention indexer. If this is a temporary limitation of the underlying custom C++ op, it should be noted with a TODO. For broader applicability, this should be made more flexible or at least provide a more informative error message if the value is unsupported.
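One way to realize the suggestion is to replace the bare assert with an explicit check that names the supported sizes; the helper below is illustrative only (the function name, supported set, and TODO are not part of the PR):

```python
# Illustrative validation helper in the spirit of the review comment:
# fail with an informative ValueError instead of a bare assert, and mark
# the hardcoded limit with a TODO. Names here are assumptions.
SUPPORTED_TOPK = (2048,)  # TODO: lift once top_k_per_row supports other sizes


def check_topk_tokens(topk_tokens: int) -> None:
    if topk_tokens not in SUPPORTED_TOPK:
        raise ValueError(
            f"top_k_per_row currently supports topk_tokens in "
            f"{SUPPORTED_TOPK}, got {topk_tokens}"
        )


check_topk_tokens(2048)  # supported size: passes silently
try:
    check_topk_tokens(1024)  # unsupported size: informative error
except ValueError as e:
    print(e)
```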

)

num_rows = logits.shape[0]
assert topk_tokens == 2048, "top_k_per_row assumes size 2048"

high

Similar to the prefill path, the decode path also asserts that topk_tokens must be 2048. This hardcoded value is restrictive and should be generalized if possible to support other values.

@ganyi1996ppo ganyi1996ppo force-pushed the ganyi/dsv3.2_further_opt branch from 40265e8 to 9707e58 Compare January 20, 2026 09:17
@mergify mergify Bot removed the needs-rebase label Jan 20, 2026
Signed-off-by: ganyi <ygan@amd.com>
@ganyi1996ppo ganyi1996ppo force-pushed the ganyi/dsv3.2_further_opt branch from 9707e58 to e569fa2 Compare January 20, 2026 15:32
Contributor

mergify Bot commented Jan 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ganyi1996ppo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jan 20, 2026
Signed-off-by: ganyi <ygan@amd.com>
whx-sjtu added a commit to ROCm/vllm that referenced this pull request Apr 22, 2026
Rebase of PR vllm-project#32649 onto main. Key changes:
- Use SHUFFLE layout instead of NHD in Triton indexer kernels for ROCm
- Use Triton-based indexer_k_quant_and_cache and cp_gather functions
  instead of C++ ops for ROCm sparse attention indexer
- Use deepgemm_fp8_paged_mqa_logits (with Preshuffle/ChunkK/KVBlockSize
  params) instead of stage1+sum approach in rocm_fp8_paged_mqa_logits

Co-authored-by: ganyi <ygan@amd.com>
@whx-sjtu whx-sjtu force-pushed the ganyi/dsv3.2_further_opt branch from 51208a1 to ac122ee Compare April 22, 2026 12:18
@mergify mergify Bot removed the needs-rebase label Apr 22, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Apr 22, 2026
whx-sjtu added a commit to ROCm/vllm that referenced this pull request Apr 22, 2026
@whx-sjtu whx-sjtu force-pushed the ganyi/dsv3.2_further_opt branch from ac122ee to 210821a Compare April 22, 2026 12:20
whx-sjtu added a commit to ROCm/vllm that referenced this pull request Apr 22, 2026
@whx-sjtu whx-sjtu force-pushed the ganyi/dsv3.2_further_opt branch 2 times, most recently from 9bb16c2 to 51208a1 Compare April 22, 2026 15:46
Contributor

mergify Bot commented Apr 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ganyi1996ppo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 22, 2026
frida-andersson added a commit to frida-andersson/vllm that referenced this pull request Apr 29, 2026
…ze=64 + SHUFFLE layout)

Based on vllm-project#32649 by @ganyi1996ppo — rebased onto
current main with structural adaptations.

Switch the ROCm sparse MLA backend from the stage1+reduce indexer path
to the gluon preshuffle kernel (deepgemm_fp8_paged_mqa_logits with
Preshuffle=True, KVBlockSize=64). This replaces a two-kernel pipeline
(deepgemm_fp8_paged_mqa_logits_stage1 + reduce<float,sum>) with a
single fused Triton kernel, yielding ~1 ms savings per decode iteration
on MI355X TP4 at 1K context.

Key changes:
- ROCMAiterMLASparseBackend now inherits from AiterMLABackend to reuse
  FP8 KV cache infrastructure (dtype support, prefill path, metadata)
- ROCMAiterMLASparseImpl inherits from AiterMLAImpl; forward_mqa
  overridden for sparse decode via mla_decode_fwd with topk indices
- Added FP8 casting + q_scale/k_scale passing in _forward_sparse_mla
- KV cache flattened for mla_decode_fwd when block_size > 1
- Triton indexer kernels use SHUFFLE layout (was NHD)
- rocm_fp8_paged_mqa_logits uses gluon API when block_size > 1,
  falls back to stage1 otherwise
- DeepseekV32IndexerBackend returns block_size=64 (was 1 on ROCm)
- Parent-allocated oversized buffers released in metadata builder
  __init__ to save ~52 MB/layer

Profiled result (1K input / 100 output, TP4 MI355X):
  Baseline: 21.9 ms/iter → Gluon: 18.2 ms/iter
  (includes run-to-run noise; conservative estimate ~1.5-2.0 ms real)

Accuracy (GSM8K 5-shot): 0.9121 vs 0.9424 baseline — 3pp regression
under investigation (likely FP8 scale handling or layout numerics).

Signed-off-by: frida-andersson <fanderss@amd.com>
@ganyi1996ppo
Contributor Author

close this PR and replace it in #41217

@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Apr 30, 2026

Labels

deepseek (Related to DeepSeek models), needs-rebase, rocm (Related to AMD ROCm), v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

1 participant