
[Kernel] FA4 Integration. #26371

Closed
zyongye wants to merge 14 commits into vllm-project:main from zyongye:fa4

Conversation

@zyongye
Member

@zyongye zyongye commented Oct 7, 2025

Ongoing integration for Flash Attention 4.

How to run

# Clone FA repo and install FA4
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/flash_attn/cute
uv pip install -v . --no-build-isolation
# run with vLLM
VLLM_FLASH_ATTN_VERSION=4 vllm serve Qwen/Qwen3-0.6B --block-size 128

Accuracy bench (openai/gpt-oss-20b, GPQA):

Reasoning effort    Score
Low                 56.6
Medium              67.0

Perf benchmark: FlashInfer is significantly faster than FA4 for now, probably because split-KV hasn't been implemented yet.
Qwen3-0.6B, 1000:1000x256
FA4:

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  31.72     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              8.07      
Output token throughput (tok/s):         8071.34   
Peak output token throughput (tok/s):    10261.00  
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          16142.67  
---------------Time to First Token----------------
Mean TTFT (ms):                          562.07    
Median TTFT (ms):                        526.62    
P99 TTFT (ms):                           1021.21   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          30.93     
Median TPOT (ms):                        30.97     
P99 TPOT (ms):                           31.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           30.94     
Median ITL (ms):                         30.77     
P99 ITL (ms):                            39.04     
==================================================
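As a quick sanity check on the tables (an illustration, not part of the PR), the headline throughput numbers follow directly from the token counts and benchmark duration:

```python
# Reproduce the FA4 table's throughput figures from its raw counts.
duration_s = 31.72
input_tokens = 256_000
output_tokens = 256_000
num_requests = 256

req_per_s = num_requests / duration_s                        # request throughput
out_tok_per_s = output_tokens / duration_s                   # output token throughput
total_tok_per_s = (input_tokens + output_tokens) / duration_s

print(f"{req_per_s:.2f} req/s, {out_tok_per_s:.0f} out tok/s, {total_tok_per_s:.0f} total tok/s")
```

The small differences against the reported 8071.34 / 16142.67 tok/s come from the duration being rounded to two decimals in the table.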

FlashInfer

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  12.12     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              21.13     
Output token throughput (tok/s):         21129.71  
Peak output token throughput (tok/s):    26496.00  
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          42259.43  
---------------Time to First Token----------------
Mean TTFT (ms):                          501.19    
Median TTFT (ms):                        483.11    
P99 TTFT (ms):                           806.13    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.49     
Median TPOT (ms):                        11.52     
P99 TPOT (ms):                           11.61     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.50     
Median ITL (ms):                         11.40     
P99 ITL (ms):                            18.56     
==================================================

8000:1x256
FA4

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  5.52      
Total input tokens:                      2048000   
Total generated tokens:                  256       
Request throughput (req/s):              46.36     
Output token throughput (tok/s):         46.36     
Peak output token throughput (tok/s):    51.00     
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          370944.79 
---------------Time to First Token----------------
Mean TTFT (ms):                          2922.54   
Median TTFT (ms):                        2938.03   
P99 TTFT (ms):                           5372.98   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P99 ITL (ms):                            0.00      
==================================================

Flashinfer

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  5.80      
Total input tokens:                      2048000   
Total generated tokens:                  256       
Request throughput (req/s):              44.14     
Output token throughput (tok/s):         44.14     
Peak output token throughput (tok/s):    49.00     
Peak concurrent requests:                256.00    
Total Token throughput (tok/s):          353135.59 
---------------Time to First Token----------------
Mean TTFT (ms):                          3049.76   
Median TTFT (ms):                        3049.41   
P99 TTFT (ms):                           5666.89   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P99 ITL (ms):                            0.00      
==================================================

@mergify

mergify bot commented Oct 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
.
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye marked this pull request as ready for review October 13, 2025 15:56

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 56 to +66
@classmethod
def get_supported_head_sizes(cls) -> list[int]:
    # FIXME (zyongye): change this once FA4 supports more head_dim values
    if envs.VLLM_FLASH_ATTN_VERSION == 4:
        return [64, 96, 128]
    return [32, 64, 96, 128, 160, 192, 224, 256]

@staticmethod
def get_supported_kernel_block_size() -> list[int | MultipleOf]:
    if envs.VLLM_FLASH_ATTN_VERSION == 4:
        return [128]
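A caller-side sketch of what this gating implies (hypothetical helper, not code from the PR): models whose head_dim falls outside FA4's list would need to fall back to another backend, which is also why the serve command above pins --block-size 128.

```python
# Hypothetical validation mirroring the branches above; names are illustrative.
FA4_HEAD_SIZES = [64, 96, 128]
DEFAULT_HEAD_SIZES = [32, 64, 96, 128, 160, 192, 224, 256]

def supports_head_size(head_dim: int, fa_version: int) -> bool:
    supported = FA4_HEAD_SIZES if fa_version == 4 else DEFAULT_HEAD_SIZES
    return head_dim in supported

print(supports_head_size(128, fa_version=4))  # True
print(supports_head_size(160, fa_version=4))  # False: FA4 does not cover head_dim=160 yet
```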


P1: Accept FA4 in flash attention version selection

These new branches rely on VLLM_FLASH_ATTN_VERSION == 4 to activate FA4-specific behavior, but get_flash_attn_version still asserts that the environment variable is only 2 or 3. Setting VLLM_FLASH_ATTN_VERSION=4 to reach this code path currently triggers an AssertionError during backend initialization, so the FA4 code here is unreachable and the feature cannot be enabled. The version-selection logic needs to be updated to admit 4 (and handle unsupported hardware) before these branches will ever execute.
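A minimal sketch of the fix this review is asking for, assuming the selector reads VLLM_FLASH_ATTN_VERSION directly (the real get_flash_attn_version in vLLM also checks device capability, which is omitted here):

```python
import os

def get_flash_attn_version_sketch(default: int = 3) -> int:
    # Admit 4 alongside 2 and 3 so the FA4-specific branches become reachable.
    raw = os.environ.get("VLLM_FLASH_ATTN_VERSION")
    if raw is None:
        return default
    version = int(raw)
    if version not in (2, 3, 4):
        raise ValueError(f"VLLM_FLASH_ATTN_VERSION must be 2, 3, or 4, got {version}")
    return version

os.environ["VLLM_FLASH_ATTN_VERSION"] = "4"
print(get_flash_attn_version_sketch())  # 4
```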


Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye zyongye changed the title FA4 Integration. [Kernel] FA4 Integration. Oct 14, 2025
@mergify

mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 14, 2025
Collaborator

@LucasWilkinson LucasWilkinson left a comment


Thanks for doing this! I think it's fine to import from flash_attn upstream for now since it's currently only activated by an env var.

Apologies for the delay, but I will update our fork soon (just need #24002).

assert device_capability is not None

# 1. default version depending on platform
fa_version = (
Collaborator


should we update this if we are requiring the env var to be set to enable FA4?

Member Author


You're right. I will revert it for now.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@mergify mergify bot removed the needs-rebase label Oct 14, 2025
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye
Member Author

zyongye commented Oct 14, 2025

I also have this PR in the FA repo to add FA4 to the supported versions.

@mergify

mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 11, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Feb 10, 2026
@zyongye
Member Author

zyongye commented Mar 2, 2026

Closed due to #32974.

@zyongye zyongye closed this Mar 2, 2026
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Mar 2, 2026

Labels

needs-rebase nvidia stale Over 90 days of inactivity v1

Projects

Status: Done
