[XPU] support MLA model on Intel GPU #37143
Conversation
Code Review
This pull request adds support for MLA models on Intel GPUs by enabling flash attention for prefill and Triton MLA for decode on the XPU platform. The changes involve updating XPU custom operations and adding platform-specific logic in the MLA attention and quantization layers. I've found a critical issue in the implementation of forward_xpu in the QuantFP8 layer that could lead to runtime errors. My review includes a suggestion to fix this.
```python
def forward_xpu(
    self,
    x: torch.Tensor,
    scale: torch.Tensor | None = None,
    scale_ub: torch.Tensor | None = None,
    use_triton: bool = False,
) -> tuple[torch.Tensor, torch.Tensor]:
    # XPU currently only supports native implementation.
    return self.forward_cuda(x, scale, scale_ub, use_triton)
```
The implementation of forward_xpu calls self.forward_cuda, but the accompanying comment states that "XPU currently only supports native implementation." This is contradictory and can lead to a critical runtime error.
Specifically, when this method is used by subclasses like _DecodeConcatQuantFP8 (in vllm/model_executor/layers/attention/mla_attention.py), which overrides forward_cuda with a different method signature, calling self.forward_cuda from the base class's forward_xpu will cause a parameter mismatch and a crash.
To fix this and align with the comment's intent, forward_xpu should call self.forward_native instead. This will correctly dispatch to the native PyTorch implementation, which is also correctly wrapped by subclasses like _DecodeConcatQuantFP8.
Suggested change:

```diff
 def forward_xpu(
     self,
     x: torch.Tensor,
     scale: torch.Tensor | None = None,
     scale_ub: torch.Tensor | None = None,
     use_triton: bool = False,
 ) -> tuple[torch.Tensor, torch.Tensor]:
     # XPU currently only supports native implementation.
-    return self.forward_cuda(x, scale, scale_ub, use_triton)
+    return self.forward_native(x, scale, scale_ub, use_triton)
```
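The failure mode behind this suggestion can be sketched in plain Python. The class and method names below mirror the review comment (`forward_native`, `forward_cuda`, `forward_xpu`, and a `_DecodeConcatQuantFP8`-style subclass), but the bodies are illustrative stand-ins, not the actual vLLM implementation:

```python
class QuantFP8Sketch:
    """Base layer: per-platform forward_* methods share one native path."""

    def forward_native(self, x, scale=None, scale_ub=None, use_triton=False):
        # Reference implementation usable on any platform.
        return ("native", x)

    def forward_cuda(self, x, scale=None, scale_ub=None, use_triton=False):
        return ("cuda", x)

    def forward_xpu(self, x, scale=None, scale_ub=None, use_triton=False):
        # Fixed version: dispatch to the native implementation, which
        # subclasses wrap with a consistent signature.
        return self.forward_native(x, scale, scale_ub, use_triton)


class DecodeConcatSketch(QuantFP8Sketch):
    # A subclass that repurposes forward_cuda with a different signature,
    # as _DecodeConcatQuantFP8 does. If the base class's forward_xpu
    # called self.forward_cuda(x, scale, scale_ub, use_triton), this
    # override would receive the wrong arguments and raise TypeError.
    def forward_cuda(self, q_nope, q_pe):
        return ("cuda-concat", q_nope, q_pe)


layer = DecodeConcatSketch()
# The fixed forward_xpu works regardless of the subclass override:
print(layer.forward_xpu(1.0))  # ('native', 1.0)
# Whereas the buggy dispatch path would crash:
try:
    layer.forward_cuda(1.0, None, None, False)  # base-class argument list
except TypeError:
    print("TypeError: signature mismatch in overridden forward_cuda")
```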
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Purpose
Before this PR, MLA models could only be enabled on XPU via `export VLLM_MLA_DISABLE=1`, which always falls back to the MHA backend. This PR instead uses FLASH_ATTN for prefill and TRITON_MLA for decode.
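The before/after behavior can be summarized as a shell fragment (the `VLLM_MLA_DISABLE` variable is the one named in this PR; the rest is illustrative):

```shell
# Before this PR: running an MLA model on XPU required disabling MLA
# entirely, forcing the fallback MHA backend.
export VLLM_MLA_DISABLE=1

# After this PR: MLA is supported natively on XPU (FLASH_ATTN for
# prefill, TRITON_MLA for decode), so the variable can be left unset.
unset VLLM_MLA_DISABLE
```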
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.