[Attention] FA3 Attention Sinks Perf Boost #22478

Merged
LucasWilkinson merged 1 commit into vllm-project:main from neuralmagic:lwilkinson/attn-sink-perf-boost
Aug 15, 2025

Conversation

@LucasWilkinson
Collaborator

@LucasWilkinson LucasWilkinson commented Aug 8, 2025

Purpose

vLLM side of vllm-project/flash-attention#78 (merge that first)

Shout-out to @jayhshah (the performance wizard 🪄) for the implementation

Co-authored-by: Jay Shah jayhshah@gmail.com

Test Plan

Test Result

```bash
vllm bench serve --dataset-name random --random-input-len=1000 --random-output-len=100 --num-prompts 1000 --port 3333 --model openai/gpt-oss-20b --request-rate 100
```

## PR

```
============ Serving Benchmark Result ============
Successful requests:                     1000
Request rate configured (RPS):           100.00
Benchmark duration (s):                  11.17
Total input tokens:                      998750
Total generated tokens:                  98124
Request throughput (req/s):              89.54
Output token throughput (tok/s):         8785.70
Total Token throughput (tok/s):          98210.51
---------------Time to First Token----------------
Mean TTFT (ms):                          48.26
Median TTFT (ms):                        47.00
P99 TTFT (ms):                           66.62
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.87
Median TPOT (ms):                        15.27
P99 TPOT (ms):                           15.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.87
Median ITL (ms):                         15.05
P99 ITL (ms):                            19.59
==================================================
```

## Main

```
============ Serving Benchmark Result ============
Successful requests:                     1000
Request rate configured (RPS):           100.00
Benchmark duration (s):                  11.38
Total input tokens:                      998750
Total generated tokens:                  98134
Request throughput (req/s):              87.84
Output token throughput (tok/s):         8619.95
Total Token throughput (tok/s):          96348.73
---------------Time to First Token----------------
Mean TTFT (ms):                          69.56
Median TTFT (ms):                        56.66
P99 TTFT (ms):                           472.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.57
Median TPOT (ms):                        17.07
P99 TPOT (ms):                           25.69
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.49
Median ITL (ms):                         17.36
P99 ITL (ms):                            27.75
==================================================
```
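To put the two runs above side by side, here is a small sketch that computes the relative latency improvements from the reported numbers (the metric values are copied from the benchmark tables; the dict names are just for illustration):

```python
# Relative improvements computed from the benchmark tables above (PR vs main).
# Values are copied verbatim from the two Serving Benchmark Result blocks.
main = {"mean_ttft_ms": 69.56, "p99_ttft_ms": 472.49, "mean_tpot_ms": 18.57}
pr   = {"mean_ttft_ms": 48.26, "p99_ttft_ms": 66.62,  "mean_tpot_ms": 14.87}

for metric in main:
    # Percentage reduction relative to main.
    pct = (main[metric] - pr[metric]) / main[metric] * 100
    print(f"{metric}: {pct:.1f}% lower")
```

Mean TTFT drops by roughly 30%, mean TPOT by roughly 20%, and the P99 TTFT tail improves most dramatically (472 ms down to 67 ms).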

(Optional) Documentation Update

@github-actions

github-actions bot commented Aug 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Aug 8, 2025
@LucasWilkinson LucasWilkinson changed the title update vllm-FA [WIP][Attention] FA3 Attention Sinks Perf Boost Aug 8, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the flash-attention dependency to a newer commit. Based on the PR description and the commit hash, this update is intended to bring in support for FlashAttention v3. The provided patch only reflects this dependency change in the CMake configuration. The full file contents suggest there are other related changes in the Python source code to support FA3 and also to introduce a new FlashMLA backend. Since I can only comment on the provided patch, my review is limited to the dependency update itself, which seems correct.

However, while reviewing the full code for context, I found a critical issue in vllm/attention/backends/flashmla.py that I believe is part of this change.

In vllm/attention/backends/flashmla.py, the __init__ method of FlashMLAImpl has the following assertion:

```python
assert is_flashmla_supported(), \
    "FlashMLA is not supported on this device"
```

The function is_flashmla_supported() returns a tuple (bool, Optional[str]). If FlashMLA is not supported, it returns (False, "reason string"). In Python, a non-empty tuple is truthy, so assert (False, "reason") will pass silently. This will lead to a runtime error later. This is a critical bug.

I suggest changing it to:

```python
supported, reason = is_flashmla_supported()
if not supported:
    raise NotImplementedError(
        f"FlashMLA is not supported on this device: {reason}")
```

This will correctly check for support and provide a helpful error message.
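The pitfall the review describes is easy to reproduce in isolation. The sketch below uses a hypothetical stand-in for `is_flashmla_supported()` (the real function lives in vLLM) to show that `assert` on a `(False, reason)` tuple never fires, while the unpack-and-raise pattern does:

```python
def is_supported():
    # Hypothetical stand-in for is_flashmla_supported(): simulates the
    # "unsupported" case by returning a (bool, reason) tuple.
    return (False, "requires a Hopper (SM90) GPU")

def buggy_check():
    # BUG: a non-empty tuple is always truthy, so this assert passes
    # even though the first element is False.
    assert is_supported(), "FlashMLA is not supported on this device"
    return "passed silently"

def fixed_check():
    # Correct: unpack the tuple and test the boolean explicitly.
    supported, reason = is_supported()
    if not supported:
        raise NotImplementedError(
            f"FlashMLA is not supported on this device: {reason}")

print(buggy_check())  # prints "passed silently" despite support being False

try:
    fixed_check()
except NotImplementedError as e:
    print(e)  # reports the reason string
```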

@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/attn-sink-perf-boost branch from a02895f to 642cb08 on August 9, 2025 04:07
@LucasWilkinson LucasWilkinson changed the title [WIP][Attention] FA3 Attention Sinks Perf Boost [Attention] FA3 Attention Sinks Perf Boost Aug 9, 2025
@LucasWilkinson LucasWilkinson marked this pull request as ready for review August 9, 2025 04:08
@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 9, 2025
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/attn-sink-perf-boost branch from 642cb08 to 8bee389 on August 15, 2025 05:47
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/attn-sink-perf-boost branch from 8bee389 to d1b81c5 on August 15, 2025 15:27
@LucasWilkinson LucasWilkinson merged commit 177e55e into vllm-project:main Aug 15, 2025
72 checks passed
666even666 pushed a commit to 666even666/vllm that referenced this pull request Aug 18, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Yiwen Chen <yiwen66@berkeley.edu>
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
divakar-amd pushed a commit to divakar-amd/vllm_upstream that referenced this pull request Aug 20, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed


2 participants