[ROCm][Perf] Enabled FP4Indexer for DSV4 #42908

Draft
tjtanaa wants to merge 12 commits into vllm-project:main from EmbeddedLLM:dsv4fp4indexer

@tjtanaa tjtanaa commented May 18, 2026

Member

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

tjtanaa added 8 commits May 15, 2026 21:56
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@mergify mergify Bot added rocm Related to AMD ROCm v1 labels May 18, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 18, 2026
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces ROCm support for MXFP4 quantization within the DeepSeek V4 sparse indexer, adding specialized Triton kernels for paged MQA logits and implementing optimizations for trivial top-k scenarios. Feedback from the review identified high-severity issues in the new FP4 MQA kernels, specifically regarding shape mismatches in tl.dot_scaled operations that necessitate transposing the RHS scale tensor.

Comment on lines +269 to +279
scores = tl.dot_scaled(
    q_packed,
    q_scale,
    "e2m1",
    k_packed,
    k_scale,
    "e2m1",
    lhs_k_pack=True,
    rhs_k_pack=True,
    out_dtype=tl.float32,
)

Contributor

high

The tl.dot_scaled operation expects the RHS scale tensor to have a shape of (K_scaled, N) when rhs_k_pack=True. In this kernel, k_scale is loaded with shape (BLOCK_KV, 4), which corresponds to (N, K_scaled). This mismatch will likely lead to incorrect results or compilation errors. You should transpose k_scale before passing it to tl.dot_scaled.

Suggested change
 scores = tl.dot_scaled(
     q_packed,
     q_scale,
     "e2m1",
     k_packed,
-    k_scale,
+    tl.trans(k_scale),
     "e2m1",
     lhs_k_pack=True,
     rhs_k_pack=True,
     out_dtype=tl.float32,
 )
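
For readers following the scale-shape discussion above, here is a minimal pure-Python reference model of a block-scaled FP4 dot product. It is a sketch, not the PR's kernel and not Triton's implementation. It assumes the usual MXFP4 convention (E2M1 values, one E8M0 scale per group of 32 elements), which is consistent with a `(BLOCK_KV, 4)` scale block covering 128 elements per key. The low-nibble-first packing order and all helper names (`unpack_fp4`, `dequantize`, `scaled_dot`, `transpose`) are assumptions made for illustration. The model only shows why the orientation of the scale block matters. It does not settle which orientation `tl.dot_scaled` expects, which should be checked against the Triton version in use.

```python
# Reference model only: plain Python, no Triton. All names are hypothetical.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP = 32  # elements sharing one scale (MX block size)


def decode_e2m1(nibble):
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def unpack_fp4(packed):
    """Unpack two FP4 values per byte (assumption: low nibble comes first)."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF))
        values.append(decode_e2m1(byte >> 4))
    return values


def decode_e8m0(byte):
    """Decode an E8M0 scale as a power of two with bias 127 (NaN code 255 ignored)."""
    return 2.0 ** (byte - 127)


def dequantize(packed, scales):
    """Apply one scale per GROUP consecutive elements of a single vector."""
    values = unpack_fp4(packed)
    if len(values) != GROUP * len(scales):
        raise ValueError(
            f"{len(values)} values need {len(values) // GROUP} scales, got {len(scales)}"
        )
    return [v * decode_e8m0(scales[i // GROUP]) for i, v in enumerate(values)]


def scaled_dot(q_packed, q_scales, k_packed_rows, k_scales_rows):
    """Score one query against N keys; k_scales_rows is laid out as (N, groups)."""
    q = dequantize(q_packed, q_scales)
    return [
        sum(a * b for a, b in zip(q, dequantize(kp, ks)))
        for kp, ks in zip(k_packed_rows, k_scales_rows)
    ]


def transpose(matrix):
    """Swap a list-of-lists between (N, groups) and (groups, N)."""
    return [list(col) for col in zip(*matrix)]


# One query of 128 elements, all equal to 1.0 (code 0x2), with unit scales.
q_packed = bytes([0x22] * 64)
q_scales = [127, 127, 127, 127]

# Two keys: the first has its second group scaled by 2, the second is 2.0 * 0.5.
k_packed_rows = [bytes([0x22] * 64), bytes([0x44] * 64)]
k_scales_rows = [[127, 128, 127, 127], [126, 126, 126, 126]]  # shape (N=2, groups=4)

scores = scaled_dot(q_packed, q_scales, k_packed_rows, k_scales_rows)
```

With these inputs `scores` is `[160.0, 128.0]`. Passing `transpose(k_scales_rows)`, which has shape (groups, N), makes `dequantize` raise a `ValueError`, because each row then carries 2 scales instead of 4. A real kernel may not fail this loudly, which is why a reference check of this kind can be useful when validating either orientation.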

Comment on lines +387 to +397
scores = tl.dot_scaled(
    q_packed,
    q_scale,
    "e2m1",
    k_packed,
    k_scale,
    "e2m1",
    lhs_k_pack=True,
    rhs_k_pack=True,
    out_dtype=tl.float32,
)

Contributor

high

Similar to the paged kernel, tl.dot_scaled here expects the RHS scale to be (K_scaled, N). Since k_scale is loaded as (BLOCK_KV, 4), it needs to be transposed to match the expected (4, BLOCK_KV) shape.

Suggested change
 scores = tl.dot_scaled(
     q_packed,
     q_scale,
     "e2m1",
     k_packed,
-    k_scale,
+    tl.trans(k_scale),
     "e2m1",
     lhs_k_pack=True,
     rhs_k_pack=True,
     out_dtype=tl.float32,
 )

tjtanaa added 3 commits May 18, 2026 09:50
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@mergify

mergify Bot commented May 23, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Labels

needs-rebase rocm Related to AMD ROCm v1

Projects

Status: Todo

1 participant