FA4 attention for ViT #29
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of tests runs automatically. You can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Force-pushed from b8372c5 to f6e3ae7.
wangshangsam left a comment:
Could you run `ruff check` and `ruff format` (you can see that the pre-commit action is failing)? Otherwise, LGTM.
Sure. Let me fix it.
* [Docker][Dev] Fix libnccl-dev version for the CUDA 13.0.1 devel image
* [Docker][Dev] Fix libnccl-dev version conflict for the CUDA 13.0.1 devel image (further update)
* feat: Support FA4 for mm-encoder-attn-backend for qwen models
* feat: Kernel warmup for vit fa4
* fix: Fix some minor conflicts due to the introduction of flash_attn.cute
* Revert "[Docker][Dev] Fix libnccl-dev version for the CUDA 13.0.1 devel image" (reverts commit ab76b28)
* chore: Update requirements and revert README.md
* chore: Install git for flash_attn cute installation
* lint: Fix linting
* Revert "[Improvement] Persist CUDA compat libraries paths to prevent reset on `apt-get` (vllm-project#30784)" (#31) (reverts commit 2a60ac9)

Co-authored-by: Shang Wang <shangw@nvidia.com>
FA4 Integration
(1) Support FA4 in vLLM.
From low-level to high-level:
- Add `FLASH_ATTN_CUTE` (FA4 / `flash_attn.cute`) to `vllm/v1/attention/backends/registry.py` (`AttentionBackendEnum`).
- Add `vllm/v1/attention/backends/fa4_utils.py` for the utils / imports for FA4 (keep imports lazy).
- Update `vllm/platforms/cuda.py`: FA4 is Blackwell-only (CC 10.x) and opt-in via `--mm-encoder-attn-backend FLASH_ATTN_CUTE`; the default remains FA2/3 or Torch SDPA.
- Update `vllm/v1/attention/ops/vit_attn_wrappers.py`.
- Update `vllm/model_executor/layers/attention/mm_encoder_attention.py` to add another `_forward_impl` method for FA4 (`FLASH_ATTN_CUTE`).
- Update `vllm/model_executor/models/qwen3_vl.py` and (optionally) `qwen2_5_vl.py` to accept `FLASH_ATTN_CUTE` and compute `max_seqlen` for it.

Notes:

- FA4 (`flash_attn.cute`) is only considered on Blackwell (compute capability 10.x) in this vLLM fork.
- It is opt-in via `--mm-encoder-attn-backend FLASH_ATTN_CUTE`.

(2) Do the kernel warmup in vLLM.
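As a rough illustration of the warmup described below: the candidate sequence lengths are capped by the model's position-embedding limit before the kernels are warmed up. This is a minimal sketch with hypothetical names (`filter_warmup_seq_lens` is illustrative, not the PR's actual helper in `fa4_warmup.py`):

```python
# Hypothetical sketch of FA4 warmup seq-len selection; the real logic lives
# in vllm/model_executor/warmup/fa4_warmup.py and may differ.
WARMUP_SEQ_LENS = [64, 256, 576, 1024, 2304, 4096, 9216, 16384, 36864, 65536]


def filter_warmup_seq_lens(num_position_embeddings: int) -> list[int]:
    """Drop warmup lengths above vision_config.num_position_embeddings."""
    return [s for s in WARMUP_SEQ_LENS if s <= num_position_embeddings]


# e.g. a ViT with 2304 position embeddings only warms up the first 5 sizes
print(filter_warmup_seq_lens(2304))  # → [64, 256, 576, 1024, 2304]
```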
- The warmup hooks into `vllm/model_executor/warmup/kernel_warmup.py` (see `vllm/model_executor/warmup/fa4_warmup.py`).
- It runs only when `--mm-encoder-attn-backend FLASH_ATTN_CUTE` is set.
- The warmup sequence lengths are `[64, 256, 576, 1024, 2304, 4096, 9216, 16384, 36864, 65536]` (filtered by `vision_config.num_position_embeddings` if that is smaller).

(3) Minor fixes for FA4 integration.
In `vllm/model_executor/layers/rotary_embedding/common.py` there is a check `if find_spec("flash_attn") is not None:`. However, in this setup the original `flash_attn` package is not actually installed; only `flash_attn.cute` is. Therefore, a minor fix is needed to handle the resulting import error.
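One hedged way to tighten such a guard (illustrative only; `flash_attn_fully_installed` is a hypothetical helper, not the PR's actual patch) is to attempt the import rather than rely on `find_spec` alone:

```python
# Illustrative sketch: find_spec("flash_attn") can report a spec even when
# only the flash_attn.cute subpackage is usable, so actually try the import.
from importlib.util import find_spec


def flash_attn_fully_installed() -> bool:
    """True only if the classic flash_attn package imports cleanly."""
    if find_spec("flash_attn") is None:
        return False
    try:
        import flash_attn  # noqa: F401  # may fail in cute-only installs
    except ImportError:
        return False
    return True
```

Gating the FA2/FA3 rotary path on a check like this keeps a `flash_attn.cute`-only environment from tripping the import at module load time.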