FA4 attention for ViT#29

Merged
wangshangsam merged 9 commits into mlperf-inf-mm-q3vl-v6.0 from zhanda/mlperf-inf-mm-q3vl-v6.0
Jan 25, 2026

Conversation


@zhandaz zhandaz commented Jan 23, 2026

FA4 Integration

(1) Support FA4 in vLLM.

From low-level to high-level:

  1. Add FLASH_ATTN_CUTE (FA4 / flash_attn.cute) to vllm/v1/attention/backends/registry.py (AttentionBackendEnum).
  2. Create a new file vllm/v1/attention/backends/fa4_utils.py for the FA4 utilities and imports (keeping the imports lazy).
  3. Register the new backend in vllm/platforms/cuda.py (FA4 is Blackwell-only (CC 10.x) and opt-in via --mm-encoder-attn-backend FLASH_ATTN_CUTE; default remains FA2/3 or Torch SDPA).
  4. Add the fa4 custom op under vllm/v1/attention/ops/vit_attn_wrappers.py.
  5. Update vllm/model_executor/layers/attention/mm_encoder_attention.py to add another _forward_impl method for fa4 (FLASH_ATTN_CUTE).
  6. Update vllm/model_executor/models/qwen3_vl.py and (optionally) qwen2_5_vl.py to accept FLASH_ATTN_CUTE and compute max_seqlen for it.
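
The lazy-import requirement in step 2 can be sketched as follows. This is an illustrative sketch, not the actual fa4_utils.py; the function names here (`is_fa4_available`, `get_fa4_module`) are hypothetical:

```python
from importlib.util import find_spec

# Hypothetical sketch of the fa4_utils.py lazy-import pattern: nothing from
# flash_attn.cute is imported at module load time, only on first use.

def is_fa4_available() -> bool:
    """Return True only if flash_attn.cute (FA4) can be located."""
    # Short-circuit on the parent package so the submodule probe never raises.
    return (find_spec("flash_attn") is not None
            and find_spec("flash_attn.cute") is not None)

_fa4_module = None

def get_fa4_module():
    """Import flash_attn.cute lazily and cache the module."""
    global _fa4_module
    if _fa4_module is None:
        import flash_attn.cute as _cute  # deferred: only runs when FA4 is used
        _fa4_module = _cute
    return _fa4_module
```

Keeping the import inside the accessor means machines without FA4 installed can still import the backend registry without error.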

Notes:

  • FA4 (flash_attn.cute) is only considered on Blackwell (compute capability 10.x) in this vLLM fork.
  • To force FA4 for ViT/MM encoder attention (Blackwell only): --mm-encoder-attn-backend FLASH_ATTN_CUTE.
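
The Blackwell-only gate in these notes can be illustrated with a minimal sketch (the function name and signature are hypothetical, not the actual vllm/platforms/cuda.py code):

```python
# Assumed sketch of the Blackwell-only eligibility check: FA4 is considered
# only on compute capability 10.x, per the notes above.

def fa4_eligible(compute_capability: tuple[int, int]) -> bool:
    """Return True only for Blackwell-class GPUs (CC major version 10)."""
    major, _minor = compute_capability
    return major == 10
```

On anything below 10.x the backend selection falls through to FA2/3 or Torch SDPA, matching the default described above.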

(2) Perform the kernel warmup in vLLM.

  • Add a FA4 ViT warmup in vllm/model_executor/warmup/kernel_warmup.py (see vllm/model_executor/warmup/fa4_warmup.py).
  • Scope: Qwen3-VL / Qwen3-VL-MoE vision transformer only, Blackwell-only, and only when --mm-encoder-attn-backend FLASH_ATTN_CUTE is set.
  • Candidate seqlens (only varying seqlen): [64, 256, 576, 1024, 2304, 4096, 9216, 16384, 36864, 65536] (filtered by vision_config.num_position_embeddings if smaller).
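
The candidate-seqlen filtering described above can be sketched as follows (names are illustrative, not the actual fa4_warmup.py code):

```python
# Illustrative sketch of the warmup seqlen selection: keep only candidates
# that fit within the model's number of vision position embeddings.

CANDIDATE_SEQLENS = [64, 256, 576, 1024, 2304, 4096, 9216, 16384, 36864, 65536]

def select_warmup_seqlens(num_position_embeddings: int) -> list[int]:
    """Filter the fixed candidate list down to seqlens the ViT can see."""
    return [s for s in CANDIDATE_SEQLENS if s <= num_position_embeddings]
```

Each surviving seqlen would then be run once through the FA4 ViT attention kernel so compilation happens before serving, not on the first request.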

(3) Minor fixes for FA4 integration.

  • In vllm/model_executor/layers/rotary_embedding/common.py, there is a check of the form if find_spec("flash_attn") is not None:.
    However, in this setup the original flash_attn package is not installed; only flash_attn.cute is.
    A minor fix is therefore needed to avoid the resulting import error.
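
One way to sketch the fix (an assumed illustration; the actual change in common.py may differ) is to probe for the specific interface module rather than the top-level package:

```python
from importlib.util import find_spec

# Assumed illustration of the fix: when only flash_attn.cute is installed,
# find_spec("flash_attn") can still succeed because the flash_attn namespace
# exists, so probe the classic interface submodule instead.

def has_classic_flash_attn() -> bool:
    """True only when the original FA2/FA3 interface module is importable."""
    if find_spec("flash_attn") is None:
        return False
    # Probing the submodule distinguishes the classic package from a
    # cute-only install that merely provides the flash_attn namespace.
    return find_spec("flash_attn.flash_attn_interface") is not None
```

This keeps the rotary-embedding fast path available on classic installs while avoiding an ImportError on FA4-only environments.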


@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@zhandaz zhandaz force-pushed the zhanda/mlperf-inf-mm-q3vl-v6.0 branch from b8372c5 to f6e3ae7 Compare January 23, 2026 20:06
@zhandaz zhandaz changed the title Zhanda/mlperf inf mm q3vl v6.0 FA4 attention for ViT Jan 24, 2026
@wangshangsam wangshangsam self-requested a review January 24, 2026 21:32
wangshangsam
wangshangsam previously approved these changes Jan 24, 2026

@wangshangsam wangshangsam left a comment


Could you run ruff check and ruff format (you can see that the pre-commit action is failing)? Otherwise, LGTM

@zhandaz

zhandaz commented Jan 24, 2026

Sure. Let me fix it.

@zhandaz zhandaz requested a review from wangshangsam January 25, 2026 01:28
@wangshangsam wangshangsam merged commit 72d77dc into mlperf-inf-mm-q3vl-v6.0 Jan 25, 2026
1 check passed
@zhandaz zhandaz deleted the zhanda/mlperf-inf-mm-q3vl-v6.0 branch January 25, 2026 17:14
zhandaz added a commit that referenced this pull request Feb 4, 2026
* [Docker][Dev] Fix libnccl-dev version for the CUDA 13.0.1 devel image

[Docker][Dev] Fix libnccl-dev version conflict for the CUDA 13.0.1 devel image

Further update

* feat: Support FA4 for mm-encoder-attn-backend for qwen models

* feat: Kernel warmup for vit fa4

* fix: Fix some minor conflicts due to the introduction of flash_attn.cute

* Revert "[Docker][Dev] Fix libnccl-dev version for the CUDA 13.0.1 devel image"

This reverts commit ab76b28.

* chore: Update requirements and revert README.md

* chore: Install git for flash_attn cute installation

* lint: Fix linting

* Revert "[Improvement] Persist CUDA compat libraries paths to prevent reset on `apt-get` (vllm-project#30784)" (#31)

This reverts commit 2a60ac9.

---------

Co-authored-by: Shang Wang <shangw@nvidia.com>
