[trainer] fix: fallback vision tower to flash_attention_2 for Qwen2.5-VL when u…#4670

Merged
wuxibin89 merged 1 commit into verl-project:main from aoshen524:fix/qwen2.5vl-flash-attention-3-vit-fallback
Dec 29, 2025
Conversation

@aoshen524 (Contributor) commented Dec 25, 2025

Fix: Fallback Vision Tower to Flash Attention 2 for Qwen2.5-VL when using Flash Attention 3

Description

This PR adds a patch for Qwen2.5-VL models that falls back the vision tower's attention implementation to flash_attention_2 when the main model uses flash_attention_3.

Motivation

Qwen2.5-VL's vision tower does not support flash_attention_3 properly. When attn_implementation is set to flash_attention_3, using FA3 for the vision tower causes significant performance degradation compared to flash_attention_2.

Experimental Validation

We have tested this patch across the entire Qwen2.5-VL series (3B, 7B, 32B, and 72B models) using the Transformers library on an 8×H100 GPU machine with auto device placement.

Below is the performance comparison for Qwen2.5-VL-7B with input of one 1260×700 image + 150 tokens of text:

======================================================================
COMPARISON SUMMARY
======================================================================

Implementation            Avg Latency (ms)   Throughput (tok/s)
-------------------------------------------------------------
flash_attention_2         102.85             12503.46      
flash_attention_3         309.49             4155.19              

FA3 vs FA2 Speedup: 0.33x
Memory Difference: +0.00 GB
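The reported speedup is simply the ratio of the two average latencies. A quick sanity check of the summary numbers above (values copied from the table):

```python
# Sanity-check the comparison summary above.
fa2_latency_ms = 102.85
fa3_latency_ms = 309.49

# "FA3 vs FA2 Speedup" is FA2 latency divided by FA3 latency:
speedup = fa2_latency_ms / fa3_latency_ms
print(f"FA3 vs FA2 Speedup: {speedup:.2f}x")  # 0.33x

# Equivalently, FA3 is roughly 3x slower than FA2 on the vision tower:
slowdown = fa3_latency_ms / fa2_latency_ms
print(f"FA3 slowdown: {slowdown:.2f}x")  # 3.01x
```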

Test Environment:

  • Hardware: 8×H100 GPUs
  • Library: Transformers with auto device placement
  • Models tested: Qwen2.5-VL-3B, 7B, 32B, 72B

Key Findings:

  • Flash Attention 3 is 3x slower than Flash Attention 2 for the vision tower
  • No memory benefit from using FA3 for vision components
  • Consistent behavior observed across all model sizes (3B, 7B, 32B, 72B)

Changes

  • Added a check for qwen2_5_vl model type
  • When attn_implementation == "flash_attention_3", automatically set actor_model_config.vision_config._attn_implementation = "flash_attention_2" for the vision tower
  • This allows the language model to use FA3 while the vision tower uses FA2, achieving optimal performance

Impact

This change ensures that Qwen2.5-VL models can benefit from flash_attention_3 for text processing while maintaining optimal performance for vision encoding.

Technical Details

The patch is applied in verl/workers/fsdp_workers.py in the _build_model_optimizer method:

# patch for qwen2.5-vl: when using flash_attention_3, set vision tower to use flash_attention_2
# because the vision tower does not support flash_attention_3
if (
    getattr(actor_model_config, "model_type", None) == "qwen2_5_vl"
    and attn_implementation == "flash_attention_3"
    and hasattr(actor_model_config, "vision_config")
):
    actor_model_config.vision_config._attn_implementation = "flash_attention_2"
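The override works because Transformers honors a per-sub-config `_attn_implementation` when building the vision tower. A minimal, self-contained sketch of the same fallback logic, using a stand-in config object rather than the real Hugging Face config class:

```python
from types import SimpleNamespace

def fallback_vision_attn(config, attn_implementation):
    """Mirror the patch: if the model is qwen2_5_vl and the language model
    uses flash_attention_3, downgrade only the vision tower to FA2."""
    if (
        getattr(config, "model_type", None) == "qwen2_5_vl"
        and attn_implementation == "flash_attention_3"
        and hasattr(config, "vision_config")
    ):
        config.vision_config._attn_implementation = "flash_attention_2"
    return config

# Stand-in config with the same attribute shape as the HF model config.
cfg = SimpleNamespace(
    model_type="qwen2_5_vl",
    vision_config=SimpleNamespace(_attn_implementation="flash_attention_3"),
)
fallback_vision_attn(cfg, "flash_attention_3")
print(cfg.vision_config._attn_implementation)  # flash_attention_2
```

Note that the guard is deliberately narrow: any other model type, or any attention implementation other than flash_attention_3, leaves the config untouched.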

Testing

Tested on:

  • Qwen2.5-VL-3B
  • Qwen2.5-VL-7B
  • Qwen2.5-VL-32B
  • Qwen2.5-VL-72B

All models show consistent performance improvements with this patch when using flash_attention_3 for the language model.

…sing flash_attention_3

Qwen2.5-VL vision tower does not support flash_attention_3, so when
attn_implementation is set to flash_attention_3, we need to set the
vision tower's _attn_implementation to flash_attention_2 instead.
@gemini-code-assist bot left a comment
Code Review

The pull request introduces a code change in verl/workers/fsdp_workers.py within the _build_model_optimizer function. This change adds a specific patch for the qwen2_5_vl model. If the model type is qwen2_5_vl and the attention implementation is set to flash_attention_3, the patch overrides the vision tower's attention implementation to flash_attention_2, as the vision tower does not support flash_attention_3.

@wuxibin89 wuxibin89 changed the title fix: fallback vision tower to flash_attention_2 for Qwen2.5-VL when u… [trainer] fix: fallback vision tower to flash_attention_2 for Qwen2.5-VL when u… Dec 29, 2025
@wuxibin89 wuxibin89 merged commit cd4072d into verl-project:main Dec 29, 2025
47 of 49 checks passed
boren-ms pushed a commit to boren-ms/verl that referenced this pull request Dec 30, 2025
jsfanfanfan pushed a commit to meituan-search/verl that referenced this pull request Jan 9, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
sophiayyya pushed a commit to sophiayyya/verl that referenced this pull request Jan 25, 2026