FA4 attention for ViT#29

Merged
wangshangsam merged 9 commits into mlperf-inf-mm-q3vl-v6.0 from zhanda/mlperf-inf-mm-q3vl-v6.0
Jan 25, 2026

Conversation


@zhandaz zhandaz commented Jan 23, 2026

FA4 Integration

(1) Support FA4 in vLLM.

From low-level to high-level:

  1. Add FLASH_ATTN_CUTE (FA4 / flash_attn.cute) to vllm/v1/attention/backends/registry.py (AttentionBackendEnum).
  2. Create a new file vllm/v1/attention/backends/fa4_utils.py for the FA4 utilities and imports (keeping the imports lazy).
  3. Register the new backend in vllm/platforms/cuda.py (FA4 is Blackwell-only (CC 10.x) and opt-in via --mm-encoder-attn-backend FLASH_ATTN_CUTE; default remains FA2/3 or Torch SDPA).
  4. Add the fa4 custom op under vllm/v1/attention/ops/vit_attn_wrappers.py.
  5. Update vllm/model_executor/layers/attention/mm_encoder_attention.py to add another _forward_impl method for fa4 (FLASH_ATTN_CUTE).
  6. Update vllm/model_executor/models/qwen3_vl.py and (optionally) qwen2_5_vl.py to accept FLASH_ATTN_CUTE and compute max_seqlen for it.
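
The lazy-import requirement in step 2 can be sketched as follows. This is an illustrative sketch, not the actual fa4_utils.py; the function names here (`is_fa4_available`, `get_fa4_module`) are hypothetical:

```python
from importlib.util import find_spec

# Hypothetical sketch of the fa4_utils.py lazy-import pattern: nothing from
# flash_attn.cute is imported at module load time, only on first use.

def is_fa4_available() -> bool:
    """Return True only if flash_attn.cute (FA4) can be located."""
    # Short-circuit on the parent package so the submodule probe never raises.
    return (find_spec("flash_attn") is not None
            and find_spec("flash_attn.cute") is not None)

_fa4_module = None

def get_fa4_module():
    """Import flash_attn.cute lazily and cache the module."""
    global _fa4_module
    if _fa4_module is None:
        import flash_attn.cute as _cute  # deferred: only runs when FA4 is used
        _fa4_module = _cute
    return _fa4_module
```

Keeping the import inside the accessor means machines without FA4 installed can still import the backend registry without error.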

Notes:

  • FA4 (flash_attn.cute) is only considered on Blackwell (compute capability 10.x) in this vLLM fork.
  • To force FA4 for ViT/MM encoder attention (Blackwell only): --mm-encoder-attn-backend FLASH_ATTN_CUTE.
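
The Blackwell-only gate in these notes can be illustrated with a minimal sketch (the function name and signature are hypothetical, not the actual vllm/platforms/cuda.py code):

```python
# Assumed sketch of the Blackwell-only eligibility check: FA4 is considered
# only on compute capability 10.x, per the notes above.

def fa4_eligible(compute_capability: tuple[int, int]) -> bool:
    """Return True only for Blackwell-class GPUs (CC major version 10)."""
    major, _minor = compute_capability
    return major == 10
```

On anything below 10.x the backend selection falls through to FA2/3 or Torch SDPA, matching the default described above.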

(2) Perform the kernel warmup in vLLM.

  • Add a FA4 ViT warmup in vllm/model_executor/warmup/kernel_warmup.py (see vllm/model_executor/warmup/fa4_warmup.py).
  • Scope: Qwen3-VL / Qwen3-VL-MoE vision transformer only, Blackwell-only, and only when --mm-encoder-attn-backend FLASH_ATTN_CUTE is set.
  • Candidate seqlens (only varying seqlen): [64, 256, 576, 1024, 2304, 4096, 9216, 16384, 36864, 65536] (filtered by vision_config.num_position_embeddings if smaller).
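
The candidate-seqlen filtering described above can be sketched as follows (names are illustrative, not the actual fa4_warmup.py code):

```python
# Illustrative sketch of the warmup seqlen selection: keep only candidates
# that fit within the model's number of vision position embeddings.

CANDIDATE_SEQLENS = [64, 256, 576, 1024, 2304, 4096, 9216, 16384, 36864, 65536]

def select_warmup_seqlens(num_position_embeddings: int) -> list[int]:
    """Filter the fixed candidate list down to seqlens the ViT can see."""
    return [s for s in CANDIDATE_SEQLENS if s <= num_position_embeddings]
```

Each surviving seqlen would then be run once through the FA4 ViT attention kernel so compilation happens before serving, not on the first request.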

(3) Minor fixes for FA4 integration.

  • In vllm/model_executor/layers/rotary_embedding/common.py, there is a check of the form if find_spec("flash_attn") is not None:.
    However, in this setup the original flash_attn package is not installed; only flash_attn.cute is.
    A minor fix is therefore needed to avoid the resulting import error.
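
One way to sketch the fix (an assumed illustration; the actual change in common.py may differ) is to probe for the specific interface module rather than the top-level package:

```python
from importlib.util import find_spec

# Assumed illustration of the fix: when only flash_attn.cute is installed,
# find_spec("flash_attn") can still succeed because the flash_attn namespace
# exists, so probe the classic interface submodule instead.

def has_classic_flash_attn() -> bool:
    """True only when the original FA2/FA3 interface module is importable."""
    if find_spec("flash_attn") is None:
        return False
    # Probing the submodule distinguishes the classic package from a
    # cute-only install that merely provides the flash_attn namespace.
    return find_spec("flash_attn.flash_attn_interface") is not None
```

This keeps the rotary-embedding fast path available on classic installs while avoiding an ImportError on FA4-only environments.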


@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@zhandaz zhandaz force-pushed the zhanda/mlperf-inf-mm-q3vl-v6.0 branch from b8372c5 to f6e3ae7 Compare January 23, 2026 20:06
@zhandaz zhandaz changed the title Zhanda/mlperf inf mm q3vl v6.0 FA4 attention for ViT Jan 24, 2026
@wangshangsam wangshangsam self-requested a review January 24, 2026 21:32
wangshangsam
wangshangsam previously approved these changes Jan 24, 2026

@wangshangsam wangshangsam left a comment


Could you run ruff check and ruff format (you can see that the pre-commit action is failing)? Otherwise, LGTM

@zhandaz

zhandaz commented Jan 24, 2026

Sure. Let me fix it.

@zhandaz zhandaz requested a review from wangshangsam January 25, 2026 01:28
@wangshangsam wangshangsam merged commit 72d77dc into mlperf-inf-mm-q3vl-v6.0 Jan 25, 2026
1 check passed
@zhandaz zhandaz deleted the zhanda/mlperf-inf-mm-q3vl-v6.0 branch January 25, 2026 17:14
zhandaz added a commit that referenced this pull request Feb 4, 2026
* [Docker][Dev] Fix libnccl-dev version for the CUDA 13.0.1 devel image

[Docker][Dev] Fix libnccl-dev version conflict for the CUDA 13.0.1 devel image

Further update

* feat: Support FA4 for mm-encoder-attn-backend for qwen models

* feat: Kernel warmup for vit fa4

* fix: Fix some minor conflicts due to the introduction of flash_attn.cute

* Revert "[Docker][Dev] Fix libnccl-dev version for the CUDA 13.0.1 devel image"

This reverts commit ab76b28.

* chore: Update requirements and revert README.md

* chore: Install git for flash_attn cute installation

* lint: Fix linting

* Revert "[Improvement] Persist CUDA compat libraries paths to prevent reset on `apt-get` (vllm-project#30784)" (#31)

This reverts commit 2a60ac9.

---------

Co-authored-by: Shang Wang <shangw@nvidia.com>
