
[MM Encoder]: Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface#30684

Merged
DarkLight1337 merged 26 commits into vllm-project:main from Isotr0py:migrate-vit
Dec 18, 2025

Conversation

Member

@Isotr0py Isotr0py commented Dec 15, 2025

Purpose

Migrate the legacy ViT MultiHeadAttention to the new MMEncoderAttention interface.

Test Plan

pytest -s -v tests/kernels/attention/test_attention.py
pytest -s -v tests/kernels/attention/test_mha_attn.py

Test Result

Tests should pass.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify mergify bot added llama Related to Llama models v1 tpu Related to Google TPUs labels Dec 15, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the MultiHeadAttention class used by multimodal encoders into a new MMEncoderAttention class, moving its definition to vllm/attention/layers/mm_encoder_attention.py and removing it from vllm/attention/layer.py. All uses and imports of MultiHeadAttention across the model implementations (e.g., AIMV2, BLIP, CLIP, GLM4V, Idefics2, InternViT, MLlama4, MoLMo, SigLIP, Step3-VL, Whisper) and their test files have been updated to use MMEncoderAttention. The MMEncoderAttention class now integrates the Flash Attention backend selection logic directly and removes a redundant reshape_qkv_to_3d method.

However, a review comment flags a critical issue in torch_sdpa_wrapper in vllm/attention/ops/vit_attn_wrappers.py: torch.split is applied along the sequence-length dimension (dim=1), which assumes packed tensors. For batched inputs this causes a dimension mismatch and will raise errors; the reviewer suggests splitting along the batch dimension (dim=0), or using an alternative approach for handling batched inputs with SDPA in variable-length attention.
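To illustrate the reviewer's point, here is a minimal sketch (function name and shapes are assumptions, not the actual vLLM implementation) of running SDPA per sequence on batched inputs by splitting along the batch dimension (dim=0) rather than the sequence dimension:

```python
import torch
import torch.nn.functional as F


def sdpa_varlen_batched(q, k, v):
    """Run SDPA per sequence on batched (B, S, H, D) tensors.

    Splitting along dim=0 yields one (1, S, H, D) slice per sequence;
    splitting along dim=1 would instead slice the sequence axis, which is
    only valid for packed (1, total_tokens, H, D) tensors.
    """
    outs = []
    for qi, ki, vi in zip(q.split(1, dim=0), k.split(1, dim=0), v.split(1, dim=0)):
        # F.scaled_dot_product_attention expects (B, H, S, D).
        oi = F.scaled_dot_product_attention(
            qi.transpose(1, 2), ki.transpose(1, 2), vi.transpose(1, 2)
        )
        outs.append(oi.transpose(1, 2))
    return torch.cat(outs, dim=0)
```

Since attention is independent across batch elements, this per-sequence loop matches a single batched SDPA call when all sequences have equal length; its value is for the variable-length case, where each split can carry a different sequence length.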


mergify bot commented Dec 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 16, 2025
@mergify mergify bot removed the needs-rebase label Dec 16, 2025
@mergify mergify bot removed the needs-rebase label Dec 17, 2025
@Isotr0py Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 17, 2025

mergify bot commented Dec 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 18, 2025
@mergify mergify bot removed the needs-rebase label Dec 18, 2025
@Isotr0py
Member Author

Tests finally passed now. 😅

@DarkLight1337 DarkLight1337 merged commit 700a5ad into vllm-project:main Dec 18, 2025
58 of 59 checks passed
@Isotr0py Isotr0py deleted the migrate-vit branch December 18, 2025 23:26
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
…erAttention` interface (vllm-project#30684)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
adobrzyn pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Dec 23, 2025
1) Quick fix for upstream changes:
[PR30684](vllm-project/vllm#30684)
2) Fix for upstream changes:
vllm-project/vllm#28891 (Port:
[PR751](#751))
3) Fix for vllm-project/vllm#31036
issue: failed test case run_qwen3_compressed_tensor_dynamic_scaling_test
```
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1487, in ensure_moe_quant_config_init
(EngineCore_DP0 pid=5792)     self.quant_method.get_fused_moe_quant_config(self)
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/quantization/fp8.py", line 1225, in get_fused_moe_quant_config
(EngineCore_DP0 pid=5792)     w1_scale=layer.w13_weight_scale,
(EngineCore_DP0 pid=5792)              ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5792)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
(EngineCore_DP0 pid=5792)     raise AttributeError(
(EngineCore_DP0 pid=5792) AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv'
```

This issue was already present but went undetected because Marlin was disabled. After the MoE refactor in vllm-project/vllm#31036, the parameter self.use_marlin was replaced by self.fp8_backend, and self.fp8_backend is now disabled.
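The traceback shows the layer exposing w13_weight_scale_inv while the quantization config reads w13_weight_scale. A minimal, purely illustrative sketch (the helper name is hypothetical, not the actual vllm-gaudi fix) of resolving whichever form the layer actually registered:

```python
import torch
import torch.nn as nn


def get_w13_scale(layer: nn.Module):
    """Return (name, tensor) for the w13 weight scale, whichever form exists.

    Depending on the quantization layout, a FusedMoE-style layer may register
    either the plain scale or its inverse; probing with getattr avoids the
    AttributeError seen in the traceback above.
    """
    for name in ("w13_weight_scale", "w13_weight_scale_inv"):
        scale = getattr(layer, name, None)
        if scale is not None:
            return name, scale
    raise AttributeError(
        "layer has neither w13_weight_scale nor w13_weight_scale_inv"
    )
```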

---------

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
iboiko-habana added a commit to iboiko-habana/vllm-gaudi that referenced this pull request Dec 23, 2025
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Dec 23, 2025
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
Fix vLLM breaks:
1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement](vllm-project/vllm#29558)
Fix: add the now-required `all2all_backend` parameter. Its only effect on the original `set_splitting_ops_for_v1` implementation is that graph mode is disabled in `vllm` when `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic.

2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface](vllm-project/vllm#30684)
Fix: the GPU does not need to convert qkv to 3D because the GPU's flash_attention operator accepts both the 3D and 4D layouts (b s h d and s b (h d)), but the NPU's flash_attention_unpad operator only supports the 3D layout (s b (h d)). Therefore we need to reintroduce the reshape_qkv_to_3d operation.

3. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it hits the following issue after upgrading the vllm code:
#5297
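The reshape described in item 2 can be sketched as follows. This is an assumption-laden illustration, not the vllm-ascend implementation: it takes the helper's name from the note above, assumes batched (b, s, h, d) inputs of equal sequence length, and assumes the unpadded kernel wants a packed (tokens, h, d) layout plus cumulative sequence lengths:

```python
import torch


def reshape_qkv_to_3d(q, k, v):
    """Flatten batched (b, s, h, d) q/k/v into packed (b * s, h, d) tensors.

    Unpadded attention kernels typically consume a flat token dimension and
    recover per-sequence boundaries from cumulative sequence lengths.
    """
    b, s, h, d = q.shape
    # For b equal-length sequences the boundaries are [0, s, 2s, ..., b*s].
    cu_seqlens = torch.arange(0, (b + 1) * s, s, dtype=torch.int32, device=q.device)
    q3, k3, v3 = (t.reshape(b * s, h, d) for t in (q, k, v))
    return q3, k3, v3, cu_seqlens
```

With genuinely variable-length sequences the cu_seqlens tensor would instead be built from the per-sequence lengths, which is exactly the information a padded 4D layout does not carry.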

### How was this patch tested?


Co-authored-by: zxwang <1476209578@qq.com>

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
…erAttention` interface (vllm-project#30684)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…erAttention` interface (vllm-project#30684)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
rajanintel24 pushed a commit to rajanintel24/vllm-gaudi that referenced this pull request Feb 11, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…erAttention` interface (vllm-project#30684)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026

Labels

llama (Related to Llama models), ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1
