[MoE] Fix output_shape calculation in Attention layer to handle 3D query inputs#31596
Conversation
…ery inputs Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
The pull request modifies vllm/attention/layer.py to adjust the `output_shape` calculation in the forward method, explicitly extracting `num_tokens` from the query shape so that both 2D and 3D query inputs are handled.

Additionally, in vllm/model_executor/layers/quantization/fp8.py, the DeepGEMM backend activation logic was updated to first respect the user's VLLM_USE_DEEP_GEMM setting and then disable DeepGEMM if the platform does not support it, logging an informational message.

A review comment highlighted that the DeepGEMM logging logic could be misleading: if a user explicitly disables DeepGEMM, the current message incorrectly attributes the disabling to platform incompatibility. The suggested fix only logs the platform-incompatibility message if DeepGEMM was requested by the user but is not supported by the platform.
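As a rough sketch of the gating order described above (function and variable names are illustrative, not the actual vllm code), the key point is that the environment setting is consulted before the platform check, so the platform message only fires when DeepGEMM was actually requested:

```python
import os

def resolve_use_deep_gemm(platform_supported: bool) -> bool:
    """Illustrative sketch, not the actual vllm implementation."""
    # First honor the user's explicit VLLM_USE_DEEP_GEMM setting.
    use_deep_gemm = os.environ.get("VLLM_USE_DEEP_GEMM", "1") == "1"
    # Only report a platform problem when DeepGEMM was actually requested,
    # so an explicit opt-out never triggers the "not supported" message.
    if use_deep_gemm and not platform_supported:
        print("DeepGEMM was requested but is disabled because the "
              "platform does not support it.")
        use_deep_gemm = False
    return use_deep_gemm
```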
```python
if not is_deep_gemm_supported():
    use_deep_gemm = False
    logger.info_once(
        "DeepGEMM is disabled because the platform does not support it.",
        scope="local",
    )
```
The current logic for checking DeepGEMM support can produce a misleading log message. If a user explicitly disables DeepGEMM by setting VLLM_USE_DEEP_GEMM=0, is_deep_gemm_supported() will return False, causing the message "DeepGEMM is disabled because the platform does not support it" to be logged. This is inaccurate because the user disabled it, not the platform.
The check should only log a message if the user intended to use DeepGEMM, but it's not supported by the platform. I've suggested a change to correct this logic and make the log message more precise.
```diff
-if not is_deep_gemm_supported():
+if use_deep_gemm and not is_deep_gemm_supported():
     use_deep_gemm = False
     logger.info_once(
-        "DeepGEMM is disabled because the platform does not support it.",
+        "DeepGEMM was requested but is disabled because the platform does not support it.",
         scope="local",
     )
```
This is effectively the next check that is performed, and the message in the next if statement is the same as the proposed one. So I think this modification is unnecessary.
```python
logger.info_once(
    "DeepGEMM is disabled because the platform does not support it.",
    scope="local",
)
```
These changes are unrelated to the intent of the PR; why did you add this?
I can move it to a different PR if that's what you are asking. On ROCm right now, the message logged during a run is that DeepGEMM was requested but not found, which is not accurate because DeepGEMM is not a ROCm-supported feature. So I put together this short block that performs a more precise check first.
LGTM, thanks for the contribution!
…ery inputs (vllm-project#31596) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…im] (vllm-project#32274) Summary: The breakage was introduced in D89937241 (vllm-project#28775) and D90045073 (vllm-project#31596). We see reshaping errors with the return values of the attention layer. When the query shape is 4D, `[batch_size, num_tokens, num_heads, head_dim]`, the output shape is composed as `[batch_size, num_heads * head_dim]`; however, the correct shape should be `[batch_size, num_tokens, num_heads * head_dim]`. Test Plan: Patched this diff and tested vllm local services; it worked with no issue. Reviewed By: frank-wei Differential Revision: D90600898
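To make the reshaping error above concrete, here is a small sketch of the shape arithmetic using plain tuples (names are hypothetical, not the vllm code):

```python
# Illustrative shape arithmetic for the 4D case described above.
num_heads, head_dim = 8, 16
# [batch_size, num_tokens, num_heads, head_dim]
query_shape = (2, 7, num_heads, head_dim)

# Taking only query_shape[0] drops the num_tokens dimension:
wrong_shape = (query_shape[0], num_heads * head_dim)

# The output must keep every leading dimension up to the head axes:
correct_shape = (*query_shape[:-2], num_heads * head_dim)

assert wrong_shape == (2, 128)
assert correct_shape == (2, 7, 128)
```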
Fixes a regression introduced in #28775 where the `output_shape` calculation in `Attention.forward()` assumes 2D query input, causing failures when models pass 3D query tensors.

Problem
The new code added in #28775 derives `output_shape` from `query.shape[:-1]`, which breaks when `query` is 3D `[num_tokens, num_heads, head_dim]`:

- `query.shape[:-1]` = `(num_tokens, num_heads)`
- `output_shape` = `(num_tokens, num_heads, hidden_size)` ← incorrect

This causes `DeepseekV2Attention` (used by DeepSeek-V2/V3 with MLA disabled, and by MTP layers) to produce incorrect output shapes, leading to downstream failures in FP8 quantization kernels.

Fix
Use `query.shape[0]` to always get `num_tokens`, making the calculation robust to both 2D and 3D inputs.

Testing
On ROCm (8 × MI325 machine):

- `pytest -v -s tests/v1/e2e/test_spec_decode.py::test_mtp_correctness[deepseek]`
- `pytest -v -s tests/v1/e2e/test_spec_decode.py::test_eagle_correctness[TRITON_ATTN-deepseek_eagle]`

The issue was observed on ROCm, but the bug exists on all platforms.
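The calculation the PR describes can be sketched as follows (the helper name is illustrative, not the actual vllm code; only the `query.shape[0]` extraction mirrors the fix):

```python
def attn_output_shape(query_shape: tuple[int, ...],
                      hidden_size: int) -> tuple[int, int]:
    """Hypothetical helper mirroring the fix: take num_tokens from the
    leading dimension rather than from query_shape[:-1]."""
    num_tokens = query_shape[0]
    return (num_tokens, hidden_size)

# 2D query: [num_tokens, num_heads * head_dim]
assert attn_output_shape((7, 128), hidden_size=128) == (7, 128)
# 3D query: [num_tokens, num_heads, head_dim] -- the previously broken case
assert attn_output_shape((7, 8, 16), hidden_size=128) == (7, 128)
```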