[MM Encoder]: Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface #30684
Conversation
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Code Review
This pull request refactors the `MultiHeadAttention` class used by multimodal encoders into a new `MMEncoderAttention` class, moving its definition to `vllm/attention/layers/mm_encoder_attention.py` and removing it from `vllm/attention/layer.py`. All usages and imports of `MultiHeadAttention` across the model implementations (e.g., AIMV2, BLIP, CLIP, GLM4V, Idefics2, InternViT, MLlama4, MoLMo, SigLIP, Step3-VL, Whisper) and their respective test files have been updated to use `MMEncoderAttention`.

The `MMEncoderAttention` class now directly integrates the Flash Attention backend selection logic and removes a redundant `reshape_qkv_to_3d` method.

However, a review comment points out a critical issue in `torch_sdpa_wrapper` within `vllm/attention/ops/vit_attn_wrappers.py`: `torch.split` is applied along the sequence-length dimension (`dim=1`), which assumes packed tensors. For batched inputs this causes a dimension mismatch and will lead to errors; the reviewer suggests splitting along the batch dimension (`dim=0`) or using an alternative approach for handling batched inputs with SDPA in variable-length attention.
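The split-dimension issue the review raises can be sketched as follows. This is an illustrative sketch, not vLLM's actual wrapper: `sdpa_varlen` and its signature are invented names, and the sketch assumes the packed layout `(1, total_tokens, heads, head_dim)`, where splitting along `dim=1` is correct. For batched `(B, S, H, D)` inputs, each batch entry is already a separate sequence, so a split would have to go along `dim=0` instead.

```python
import torch
import torch.nn.functional as F


def sdpa_varlen(q, k, v, seqlens):
    """Variable-length SDPA over packed (1, total_tokens, heads, head_dim)
    tensors. Hypothetical sketch: splitting on the token dimension (dim=1)
    is only valid for packed inputs; batched (B, S, H, D) inputs would need
    a split along the batch dimension (dim=0) instead."""
    outputs = []
    for q_i, k_i, v_i in zip(
        q.split(seqlens, dim=1), k.split(seqlens, dim=1), v.split(seqlens, dim=1)
    ):
        # SDPA expects (batch, heads, seq, head_dim), so swap seq and heads.
        out = F.scaled_dot_product_attention(
            q_i.transpose(1, 2), k_i.transpose(1, 2), v_i.transpose(1, 2)
        )
        outputs.append(out.transpose(1, 2))
    # Re-pack the per-sequence outputs along the token dimension.
    return torch.cat(outputs, dim=1)
```

Applying this per-sequence loop to a batched tensor with `split(..., dim=1)` would slice sequences apart mid-batch, which is the dimension mismatch the reviewer describes.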
This pull request has merge conflicts that must be resolved before it can be merged.
Tests finally passed now. 😅
…erAttention` interface (vllm-project#30684) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
1) Quick fix for upstream changes: [PR30684](vllm-project/vllm#30684)
2) Fix for upstream changes: vllm-project/vllm#28891 (Port: [PR751](#751))
3) Fix for the vllm-project/vllm#31036 issue: failed test case `run_qwen3_compressed_tensor_dynamic_scaling_test`:

```
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1487, in ensure_moe_quant_config_init
(EngineCore_DP0 pid=5792)     self.quant_method.get_fused_moe_quant_config(self)
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/quantization/fp8.py", line 1225, in get_fused_moe_quant_config
(EngineCore_DP0 pid=5792)     w1_scale=layer.w13_weight_scale,
(EngineCore_DP0 pid=5792)              ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5792)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
(EngineCore_DP0 pid=5792)     raise AttributeError(
(EngineCore_DP0 pid=5792) AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv'
```

This issue was already present, but it was not detected because Marlin was disabled. After the MoE refactor in vllm-project/vllm#31036, the parameter `self.use_marlin` was replaced by `self.fp8_backend`; `self.fp8_backend` is disabled now.

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
### What this PR does / why we need it?

Fix vLLM breakage:

1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement](vllm-project/vllm#29558)
   Fix: add the now-required `all2all_backend` parameter. Its only impact on the original `set_splitting_ops_for_v1` implementation is that graph mode is disabled in `vllm` when `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic.
2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface](vllm-project/vllm#30684)
   Fix: the GPU does not need to convert qkv to 3-D because its flash_attention operator accepts both the 4-D and 3-D layouts (`b s h d` and `s b (h d)`), but the NPU's flash_attention_unpad operator only supports the 3-D layout (`s b (h d)`). We therefore reintroduce the `reshape_qkv_to_3d` operation.
3. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it hits issue #5297 after the vLLM code upgrade.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Co-authored-by: zxwang <1476209578@qq.com>

- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
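The `reshape_qkv_to_3d` operation described above can be sketched minimally. This is an illustrative reading, not the vllm-ascend implementation: it assumes the 4-D input is `(batch, seq, heads, head_dim)` with equal sequence lengths, so batch and sequence flatten contiguously into a packed token dimension for the unpadded NPU kernel.

```python
import torch


def reshape_qkv_to_3d(x: torch.Tensor) -> torch.Tensor:
    """Collapse a 4-D (batch, seq, heads, head_dim) tensor into the 3-D
    packed (total_tokens, heads, head_dim) layout that an unpadded
    attention kernel expects. Sketch only: assumes every sequence has the
    same length, so batch and seq can simply be flattened together."""
    b, s, h, d = x.shape
    return x.reshape(b * s, h, d)
```

With ragged sequence lengths the real helper would additionally need per-sequence offsets (cumulative sequence lengths) so the kernel knows where each sequence starts in the packed dimension.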
Purpose
Migrate `MultiHeadAttention` usage to new `MMEncoderAttention`.
Test Plan
Test Result
Test should pass
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.