
[Kernel] FA4 Integration. #26371

Closed
zyongye wants to merge 14 commits into vllm-project:main from zyongye:fa4

Conversation

@zyongye
Member

@zyongye zyongye commented Oct 7, 2025

Ongoing integration for Flash Attention 4.

How to run

# Clone FA repo and install FA4
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/flash_attn/cute
uv pip install -v . --no-build-isolation
# run with vLLM
VLLM_FLASH_ATTN_VERSION=4 vllm serve Qwen/Qwen3-0.6B --block-size 128

Accuracy bench (openai/gpt-oss-20b, GPQA):

Reasoning effort    Score
Low                 56.6
Medium              67.0

Perf benchmark: FlashInfer is significantly faster than FA4 for now, probably because split-KV hasn't been implemented yet.
Qwen3-0.6B, 1000:1000x256
FA4:

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  31.72     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              8.07      
Output token throughput (tok/s):         8071.34   
Peak output token throughput (tok/s):    10261.00  
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          16142.67  
---------------Time to First Token----------------
Mean TTFT (ms):                          562.07    
Median TTFT (ms):                        526.62    
P99 TTFT (ms):                           1021.21   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          30.93     
Median TPOT (ms):                        30.97     
P99 TPOT (ms):                           31.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           30.94     
Median ITL (ms):                         30.77     
P99 ITL (ms):                            39.04     
==================================================
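As a quick sanity check on the tables (an illustration, not part of the PR), the headline throughput numbers follow directly from the token counts and benchmark duration:

```python
# Reproduce the FA4 table's throughput figures from its raw counts.
duration_s = 31.72
input_tokens = 256_000
output_tokens = 256_000
num_requests = 256

req_per_s = num_requests / duration_s                        # request throughput
out_tok_per_s = output_tokens / duration_s                   # output token throughput
total_tok_per_s = (input_tokens + output_tokens) / duration_s

print(f"{req_per_s:.2f} req/s, {out_tok_per_s:.0f} out tok/s, {total_tok_per_s:.0f} total tok/s")
```

The small differences against the reported 8071.34 / 16142.67 tok/s come from the duration being rounded to two decimals in the table.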

FlashInfer

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  12.12     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              21.13     
Output token throughput (tok/s):         21129.71  
Peak output token throughput (tok/s):    26496.00  
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          42259.43  
---------------Time to First Token----------------
Mean TTFT (ms):                          501.19    
Median TTFT (ms):                        483.11    
P99 TTFT (ms):                           806.13    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.49     
Median TPOT (ms):                        11.52     
P99 TPOT (ms):                           11.61     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.50     
Median ITL (ms):                         11.40     
P99 ITL (ms):                            18.56     
==================================================

8000:1x256
FA4

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  5.52      
Total input tokens:                      2048000   
Total generated tokens:                  256       
Request throughput (req/s):              46.36     
Output token throughput (tok/s):         46.36     
Peak output token throughput (tok/s):    51.00     
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          370944.79 
---------------Time to First Token----------------
Mean TTFT (ms):                          2922.54   
Median TTFT (ms):                        2938.03   
P99 TTFT (ms):                           5372.98   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P99 ITL (ms):                            0.00      
==================================================

Flashinfer

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  5.80      
Total input tokens:                      2048000   
Total generated tokens:                  256       
Request throughput (req/s):              44.14     
Output token throughput (tok/s):         44.14     
Peak output token throughput (tok/s):    49.00     
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          353135.59 
---------------Time to First Token----------------
Mean TTFT (ms):                          3049.76   
Median TTFT (ms):                        3049.41   
P99 TTFT (ms):                           5666.89   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P99 ITL (ms):                            0.00      
==================================================

@mergify

mergify bot commented Oct 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
.
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye marked this pull request as ready for review October 13, 2025 15:56

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 56 to +66
@classmethod
def get_supported_head_sizes(cls) -> list[int]:
    # FIXME (zyongye): change this once FA4 supports more head_dim values
    if envs.VLLM_FLASH_ATTN_VERSION == 4:
        return [64, 96, 128]
    return [32, 64, 96, 128, 160, 192, 224, 256]

@staticmethod
def get_supported_kernel_block_size() -> list[int | MultipleOf]:
    if envs.VLLM_FLASH_ATTN_VERSION == 4:
        return [128]
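A caller-side sketch of what this gating implies (hypothetical helper, not code from the PR): models whose head_dim falls outside FA4's list would need to fall back to another backend, which is also why the serve command above pins --block-size 128.

```python
# Hypothetical validation mirroring the branches above; names are illustrative.
FA4_HEAD_SIZES = [64, 96, 128]
DEFAULT_HEAD_SIZES = [32, 64, 96, 128, 160, 192, 224, 256]

def supports_head_size(head_dim: int, fa_version: int) -> bool:
    supported = FA4_HEAD_SIZES if fa_version == 4 else DEFAULT_HEAD_SIZES
    return head_dim in supported

print(supports_head_size(128, fa_version=4))  # True
print(supports_head_size(160, fa_version=4))  # False: FA4 does not cover head_dim=160 yet
```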


P1: Accept FA4 in flash attention version selection

These new branches rely on VLLM_FLASH_ATTN_VERSION == 4 to activate FA4-specific behavior, but get_flash_attn_version still asserts that the environment variable is only 2 or 3. Setting VLLM_FLASH_ATTN_VERSION=4 to reach this code path currently triggers an AssertionError during backend initialization, so the FA4 code here is unreachable and the feature cannot be enabled. The version-selection logic needs to be updated to admit 4 (and handle unsupported hardware) before these branches will ever execute.
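A minimal sketch of the fix this review is asking for, assuming the selector reads VLLM_FLASH_ATTN_VERSION directly (the real get_flash_attn_version in vLLM also checks device capability, which is omitted here):

```python
import os

def get_flash_attn_version_sketch(default: int = 3) -> int:
    # Admit 4 alongside 2 and 3 so the FA4-specific branches become reachable.
    raw = os.environ.get("VLLM_FLASH_ATTN_VERSION")
    if raw is None:
        return default
    version = int(raw)
    if version not in (2, 3, 4):
        raise ValueError(f"VLLM_FLASH_ATTN_VERSION must be 2, 3, or 4, got {version}")
    return version

os.environ["VLLM_FLASH_ATTN_VERSION"] = "4"
print(get_flash_attn_version_sketch())  # 4
```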


Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye changed the title FA4 Integration. [Kernel] FA4 Integration. Oct 14, 2025
@mergify

mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 14, 2025
Collaborator

@LucasWilkinson LucasWilkinson left a comment


Thanks for doing this! I think it's fine to import from flash_attn upstream for now since it's currently only activated by an env var.

Apologies for the delay, but I will update our fork soon (just need #24002).

assert device_capability is not None

# 1. default version depending on platform
fa_version = (
Collaborator


should we update this if we are requiring the env var to be set to enable FA4?

Member Author


You're right. I will revert it for now.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@mergify mergify bot removed the needs-rebase label Oct 14, 2025
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye
Member Author

zyongye commented Oct 14, 2025

I also have this PR in the FA repo to add FA4 to the supported versions.

@mergify

mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 11, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Feb 10, 2026
@zyongye
Member Author

zyongye commented Mar 2, 2026

Closed due to #32974.

@zyongye zyongye closed this Mar 2, 2026
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Mar 2, 2026

Labels

needs-rebase nvidia stale Over 90 days of inactivity v1

Projects

Status: Done
