[MM Encoder]: Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface #30684
Conversation
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Code Review
This pull request refactors the `MultiHeadAttention` class used by multimodal encoders into a new `MMEncoderAttention` class, moving its definition to `vllm/attention/layers/mm_encoder_attention.py` and removing it from `vllm/attention/layer.py`. All usages and imports of `MultiHeadAttention` across the model implementations (e.g., AIMV2, BLIP, CLIP, GLM4V, Idefics2, InternViT, MLlama4, MoLMo, SigLIP, Step3-VL, Whisper) and their respective test files have been updated to use `MMEncoderAttention`.

The `MMEncoderAttention` class now directly integrates the Flash Attention backend selection logic and removes a redundant `reshape_qkv_to_3d` method.

However, a review comment points out a critical issue in `torch_sdpa_wrapper` within `vllm/attention/ops/vit_attn_wrappers.py`: `torch.split` is applied along the sequence-length dimension (`dim=1`), which assumes packed tensors. For batched inputs this causes a dimension mismatch and will lead to errors; the reviewer suggests splitting along the batch dimension (`dim=0`) or using an alternative approach for handling batched inputs with SDPA in variable-length attention.
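The split-dimension issue the review raises can be sketched as follows. This is an illustrative sketch, not vLLM's actual wrapper: `sdpa_varlen` and its signature are invented names, and the sketch assumes the packed layout `(1, total_tokens, heads, head_dim)`, where splitting along `dim=1` is correct. For batched `(B, S, H, D)` inputs, each batch entry is already a separate sequence, so a split would have to go along `dim=0` instead.

```python
import torch
import torch.nn.functional as F


def sdpa_varlen(q, k, v, seqlens):
    """Variable-length SDPA over packed (1, total_tokens, heads, head_dim)
    tensors. Hypothetical sketch: splitting on the token dimension (dim=1)
    is only valid for packed inputs; batched (B, S, H, D) inputs would need
    a split along the batch dimension (dim=0) instead."""
    outputs = []
    for q_i, k_i, v_i in zip(
        q.split(seqlens, dim=1), k.split(seqlens, dim=1), v.split(seqlens, dim=1)
    ):
        # SDPA expects (batch, heads, seq, head_dim), so swap seq and heads.
        out = F.scaled_dot_product_attention(
            q_i.transpose(1, 2), k_i.transpose(1, 2), v_i.transpose(1, 2)
        )
        outputs.append(out.transpose(1, 2))
    # Re-pack the per-sequence outputs along the token dimension.
    return torch.cat(outputs, dim=1)
```

Applying this per-sequence loop to a batched tensor with `split(..., dim=1)` would slice sequences apart mid-batch, which is the dimension mismatch the reviewer describes.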
This pull request has merge conflicts that must be resolved before it can be merged.
Tests finally passed now. 😅
…erAttention` interface (vllm-project#30684) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
1) Quick fix for upstream changes: [PR30684](vllm-project/vllm#30684)
2) Fix for upstream changes: vllm-project/vllm#28891 (Port: [PR751](#751))
3) Fix for the vllm-project/vllm#31036 issue: failed test case `run_qwen3_compressed_tensor_dynamic_scaling_test`:

```
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1487, in ensure_moe_quant_config_init
(EngineCore_DP0 pid=5792)     self.quant_method.get_fused_moe_quant_config(self)
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/quantization/fp8.py", line 1225, in get_fused_moe_quant_config
(EngineCore_DP0 pid=5792)     w1_scale=layer.w13_weight_scale,
(EngineCore_DP0 pid=5792)              ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5792)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
(EngineCore_DP0 pid=5792)     raise AttributeError(
(EngineCore_DP0 pid=5792) AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv'
```

This issue was already present, but it was not detected because Marlin was disabled. After the MoE refactor in vllm-project/vllm#31036, the parameter `self.use_marlin` was replaced by `self.fp8_backend`; `self.fp8_backend` is disabled now.

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
### What this PR does / why we need it?

Fix vLLM breakage:

1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement](vllm-project/vllm#29558)
   Fix: add the now-required `all2all_backend` parameter. Its only impact on the original `set_splitting_ops_for_v1` implementation is that graph mode is disabled in `vllm` when `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic.
2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface](vllm-project/vllm#30684)
   Fix: the GPU does not need to convert qkv to 3-D because its flash_attention operator accepts both the 4-D and 3-D layouts (`b s h d` and `s b (h d)`), but the NPU's flash_attention_unpad operator only supports the 3-D layout (`s b (h d)`). We therefore reintroduce the `reshape_qkv_to_3d` operation.
3. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it hits issue #5297 after the vLLM code upgrade.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Co-authored-by: zxwang <1476209578@qq.com>

- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
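The `reshape_qkv_to_3d` operation described above can be sketched minimally. This is an illustrative reading, not the vllm-ascend implementation: it assumes the 4-D input is `(batch, seq, heads, head_dim)` with equal sequence lengths, so batch and sequence flatten contiguously into a packed token dimension for the unpadded NPU kernel.

```python
import torch


def reshape_qkv_to_3d(x: torch.Tensor) -> torch.Tensor:
    """Collapse a 4-D (batch, seq, heads, head_dim) tensor into the 3-D
    packed (total_tokens, heads, head_dim) layout that an unpadded
    attention kernel expects. Sketch only: assumes every sequence has the
    same length, so batch and seq can simply be flattened together."""
    b, s, h, d = x.shape
    return x.reshape(b * s, h, d)
```

With ragged sequence lengths the real helper would additionally need per-sequence offsets (cumulative sequence lengths) so the kernel knows where each sequence starts in the packed dimension.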
Purpose
Migrate `MultiHeadAttention` usage to new `MMEncoderAttention`.
Test Plan
Test Result
Test should pass
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.