[BugFix] Cosmos-Reason1-7B Model Flash Attention requires head dim to be a multiple of 32 #29615
ECMGit wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces a fix for a crash that occurs when using FlashAttention with a head dimension that is not a multiple of 32 in the Qwen2.5-VL vision encoder. The change correctly detects this condition and falls back to the TORCH_SDPA backend, which is a robust solution to prevent the runtime error. The implementation is clear and correctly placed within the Qwen2_5_VisionAttention module's initialization. The fix is well-contained and effectively resolves the reported issue.
    AttentionBackendEnum.ROCM_AITER_FA,
} and self.hidden_size_per_attention_head % 32 != 0:
    logger.warning(
        f"Flash attention backend requires head_dim to be a multiple of 32, "

if self.attn_backend in {
    AttentionBackendEnum.FLASH_ATTN,
    AttentionBackendEnum.ROCM_AITER_FA,
} and self.hidden_size_per_attention_head % 32 != 0:
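Consolidated from the fragments above, the fallback that the review describes boils down to something like the following (a self-contained sketch: the enum stand-in and the helper pick_vision_attn_backend are hypothetical, not vLLM code; in the PR the check lives inline in Qwen2_5_VisionAttention's initialization):

```python
from enum import Enum, auto
import logging

logger = logging.getLogger(__name__)

# Stand-in for vLLM's backend enum (assumption for this sketch only).
class AttentionBackendEnum(Enum):
    FLASH_ATTN = auto()
    ROCM_AITER_FA = auto()
    TORCH_SDPA = auto()

def pick_vision_attn_backend(attn_backend: AttentionBackendEnum,
                             head_dim: int) -> AttentionBackendEnum:
    """Sketch of the fallback described in the review: flash-attention-style
    backends need head_dim % 32 == 0, otherwise use torch SDPA."""
    if attn_backend in {
        AttentionBackendEnum.FLASH_ATTN,
        AttentionBackendEnum.ROCM_AITER_FA,
    } and head_dim % 32 != 0:
        logger.warning(
            "Flash attention backend requires head_dim to be a multiple of 32, "
            "got %d; falling back to TORCH_SDPA.", head_dim,
        )
        return AttentionBackendEnum.TORCH_SDPA
    return attn_backend

# Example: head_dim=80 (Cosmos-Reason1-7B's ViT) triggers the fallback.
assert pick_vision_attn_backend(
    AttentionBackendEnum.FLASH_ATTN, 80
) is AttentionBackendEnum.TORCH_SDPA
```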
I think #28763 added head_size=80 (used by Cosmos-7B's ViT) support to FA. Can you try installing the latest nightly wheel?
Thanks! I have verified that this bug does not occur in the latest nightly build, so I will close this PR and the issue.
self.use_upstream_fa = False
# Flash attention requires head_dim to be a multiple of 32
# Fall back to TORCH_SDPA if the head dimension is incompatible
if self.attn_backend in {
@ECMGit can you skip this check on ROCm for now? We have different conditions there.
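For illustration, the shape of that change might be roughly as follows (a hypothetical, self-contained sketch of the suggestion; the helper name and parameters are not from the PR):

```python
def should_fall_back_to_sdpa(uses_flash_attn: bool,
                             head_dim: int,
                             is_rocm: bool) -> bool:
    """Hypothetical sketch of the suggestion above: skip the multiple-of-32
    fallback on ROCm, where the aiter flash-attention path has different
    head_dim constraints."""
    if is_rocm:
        # ROCm is handled separately per the reviewer's note.
        return False
    return uses_flash_attn and head_dim % 32 != 0
```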
LucasWilkinson left a comment
+1 to this: https://github.com/vllm-project/vllm/pull/29615/files#r2569351681. FA should support multiples of 8 now.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Fix the issue reported in #29417: the Flash Attention backend requires head_dim to be a multiple of 32, which crashes the vision encoder in vllm/model_executor/models/qwen2_5_vl.py. Fall back to the TORCH_SDPA backend as a workaround for this issue.
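For context, a minimal way to exercise the affected code path might look like the sketch below (hypothetical; the model id and steps are assumptions rather than the exact test used for this PR):

```python
# Hypothetical reproduction sketch: loading a Qwen2.5-VL-style model whose
# vision encoder uses head_dim=80 previously failed with
# "Flash attention backend requires head_dim to be a multiple of 32".
from vllm import LLM

llm = LLM(model="nvidia/Cosmos-Reason1-7B")  # assumed model id
```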
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.