[ROCm][Deepseekv3.2][Perf] dsv3.2 further optimization on vllm #32649
ganyi1996ppo wants to merge 9 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces performance optimizations for Deepseek v3.2 on ROCm by adding new Triton kernels and a specialized backend. It also refactors the sparse_attn_indexer logic into a dedicated file, which is a good architectural improvement. However, I've identified a critical bug in the refactored CUDA path that could lead to an AttributeError, and a significant limitation in the new ROCm kernels due to a hardcoded value that restricts flexibility. Addressing these issues will improve the robustness and applicability of these optimizations.
```python
):
    return torch.ops.vllm.sparse_attn_indexer(
        hidden_states,
        self.k_cache.layer_prefix,
```
The k_cache object is of type DeepseekV32IndexerCache, which has a prefix attribute but not a layer_prefix attribute. Using self.k_cache.layer_prefix will result in an AttributeError. The HIP path correctly uses self.k_cache.prefix. This should be consistent.
```diff
-        self.k_cache.layer_prefix,
+        self.k_cache.prefix,
```
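A minimal sketch of why the CUDA path fails (the cache class here is a stand-in with only the attributes named in the review, not the real vLLM implementation):

```python
class DeepseekV32IndexerCache:
    """Stand-in for the real cache class: it exposes `prefix`,
    but no `layer_prefix` attribute."""

    def __init__(self, prefix: str):
        self.prefix = prefix


cache = DeepseekV32IndexerCache("model.layers.0.self_attn.indexer")

# The buggy CUDA path accesses a nonexistent attribute and raises.
try:
    _ = cache.layer_prefix
except AttributeError:
    pass  # 'DeepseekV32IndexerCache' object has no attribute 'layer_prefix'

# The HIP path's access works as intended.
assert cache.prefix == "model.layers.0.self_attn.indexer"
```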
```python
    chunk.cu_seqlen_ke,
)
num_rows = logits.shape[0]
assert topk_tokens == 2048, "top_k_per_row assumes size 2048"
```
The code asserts that topk_tokens must be 2048. This hardcoded value limits the flexibility of the sparse attention indexer. If this is a temporary limitation of the underlying custom C++ op, it should be noted with a TODO. For broader applicability, this should be made more flexible or at least provide a more informative error message if the value is unsupported.
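One way to address this (a hypothetical sketch, not the PR's code: the constant name and helper are made up for illustration) is to replace the bare assert with an explicit guard that documents the limitation and fails with an actionable message:

```python
# TODO: lift this once the underlying custom top_k_per_row op
# supports sizes other than 2048.
SUPPORTED_TOPK = 2048


def check_topk(topk_tokens: int) -> None:
    """Raise a descriptive error instead of a bare AssertionError
    when an unsupported top-k size is requested."""
    if topk_tokens != SUPPORTED_TOPK:
        raise ValueError(
            f"top_k_per_row currently supports only "
            f"topk_tokens={SUPPORTED_TOPK}, got {topk_tokens}"
        )


check_topk(2048)  # supported size: passes silently
```

Unlike an `assert`, this check also survives `python -O`, which strips assertions.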
Force-pushed from 40265e8 to 9707e58
Signed-off-by: ganyi <ygan@amd.com>
Force-pushed from 9707e58 to e569fa2
Rebase of PR vllm-project#32649 onto main. Key changes:
- Use SHUFFLE layout instead of NHD in Triton indexer kernels for ROCm
- Use Triton-based indexer_k_quant_and_cache and cp_gather functions instead of C++ ops for ROCm sparse attention indexer
- Use deepgemm_fp8_paged_mqa_logits (with Preshuffle/ChunkK/KVBlockSize params) instead of stage1+sum approach in rocm_fp8_paged_mqa_logits

Co-authored-by: ganyi <ygan@amd.com>
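The cp_gather step above can be pictured as collecting the scattered physical blocks of a paged KV cache into one contiguous buffer by walking the sequence's block table. A toy pure-Python sketch of that semantics (names, shapes, and the function signature are assumptions for illustration; the real implementation is a Triton kernel over tensors):

```python
def cp_gather(kv_cache, block_table, seq_len, block_size):
    """Gather the first `seq_len` cached entries of one sequence
    into a contiguous list, following its block table."""
    out = []
    for logical_block, physical_block in enumerate(block_table):
        start = logical_block * block_size
        if start >= seq_len:
            break
        take = min(block_size, seq_len - start)  # last block may be partial
        out.extend(kv_cache[physical_block][:take])
    return out


# Paged cache: 4 physical blocks of size 2, allocated out of order.
kv_cache = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"], ["d0", "d1"]]
block_table = [2, 0, 3]  # logical order: c-block, a-block, d-block

gathered = cp_gather(kv_cache, block_table, seq_len=5, block_size=2)
# -> ["c0", "c1", "a0", "a1", "d0"]
```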
Force-pushed from 51208a1 to ac122ee
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
Force-pushed from ac122ee to 210821a
Force-pushed from 9bb16c2 to 51208a1
…ze=64 + SHUFFLE layout)

Based on vllm-project#32649 by @ganyi1996ppo, rebased onto current main with structural adaptations.

Switch the ROCm sparse MLA backend from the stage1+reduce indexer path to the gluon preshuffle kernel (deepgemm_fp8_paged_mqa_logits with Preshuffle=True, KVBlockSize=64). This replaces a two-kernel pipeline (deepgemm_fp8_paged_mqa_logits_stage1 + reduce<float,sum>) with a single fused Triton kernel, yielding ~1 ms savings per decode iteration on MI355X TP4 at 1K context.

Key changes:
- ROCMAiterMLASparseBackend now inherits from AiterMLABackend to reuse FP8 KV cache infrastructure (dtype support, prefill path, metadata)
- ROCMAiterMLASparseImpl inherits from AiterMLAImpl; forward_mqa overridden for sparse decode via mla_decode_fwd with topk indices
- Added FP8 casting + q_scale/k_scale passing in _forward_sparse_mla
- KV cache flattened for mla_decode_fwd when block_size > 1
- Triton indexer kernels use SHUFFLE layout (was NHD)
- rocm_fp8_paged_mqa_logits uses gluon API when block_size > 1, falls back to stage1 otherwise
- DeepseekV32IndexerBackend returns block_size=64 (was 1 on ROCm)
- Parent-allocated oversized buffers released in metadata builder __init__ to save ~52 MB/layer

Profiled result (1K input / 100 output, TP4 MI355X): Baseline 21.9 ms/iter → Gluon 18.2 ms/iter (includes run-to-run noise; conservative estimate ~1.5-2.0 ms real).

Accuracy (GSM8K 5-shot): 0.9121 vs 0.9424 baseline, a 3pp regression under investigation (likely FP8 scale handling or layout numerics).

Signed-off-by: frida-andersson <fanderss@amd.com>
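The stage1+reduce vs. fused-kernel distinction in the commit above is about whether the per-head partial logits are materialized and then summed in a second kernel, or accumulated in a single pass; either way the result must match. A toy pure-Python sketch of that equivalence (plain lists instead of FP8 paged tensors; all names here are illustrative, not the real kernel APIs):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mqa_logits_two_stage(q_heads, kv):
    """Two-kernel pipeline: stage 1 writes per-head partial logits,
    stage 2 is a reduce<float, sum> across heads."""
    partial = [[dot(qh, k) for k in kv] for qh in q_heads]  # stage 1
    return [sum(col) for col in zip(*partial)]              # stage 2


def mqa_logits_fused(q_heads, kv):
    """Single fused pass: accumulate the head sum while scanning KV,
    never materializing the per-head partials."""
    return [sum(dot(qh, k) for qh in q_heads) for k in kv]


q_heads = [[1.0, 2.0], [0.5, -1.0]]               # 2 query heads, dim 2
kv = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]         # 3 cached KV entries

assert mqa_logits_two_stage(q_heads, kv) == mqa_logits_fused(q_heads, kv)
```

Fusing removes one kernel launch and the intermediate per-head buffer, which is where the ~1 ms/iteration saving comes from.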
Closing this PR; it is replaced by #41217.
Purpose

This PR moves some of the original features from #29287 here, including some kernels that depend on Triton 3.5.0, and adds further optimizations to ROCMAiterMLASparseBackend. This PR depends on #29287 being merged first.

Test Plan

gsm8k with 20 shots
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.