
DO_NOT_COMMIT: benchmark fa2 vs tree attention #23143

Closed

foolusion wants to merge 3 commits into vllm-project:main from samsung-cnct:apo

Conversation

@foolusion

This is just to share the benchmark code in the flash-attention repo; it was hacked together very quickly.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces benchmarking capabilities to compare tree attention with flash attention v2. The changes include updating the flash-attention dependency to a fork, adding benchmarking utilities to a test file, and modifying the flash attention backend to support tree attention metadata.

My review has identified a few critical issues:

  1. The build configuration points to a non-official fork of flash-attention, which is a security concern.
  2. A potential division-by-zero error exists in the benchmark's speedup calculation.
  3. The attention sink functionality appears to be disabled in the flash attention backend, which could be a regression.

These issues should be addressed before considering this code for any use beyond temporary benchmarking.

Comment on lines +40 to +41
GIT_REPOSITORY https://github.com/samsung-cnct/flash-attention.git
GIT_TAG feaab457d8d58243f19bf234a42a498647de0e6f
Contributor


critical

The dependency for vllm-flash-attn has been changed to a personal fork (samsung-cnct/flash-attention) from the official vllm-project/flash-attention. This introduces a potential security risk and dependency issue. While this might be intentional for benchmarking purposes in a "DO_NOT_COMMIT" context, it's a critical change that should not be merged into a production branch. Using unverified third-party forks can expose the project to vulnerabilities.

k_descale=layer._k_scale.expand(descale_shape),
v_descale=layer._v_scale.expand(descale_shape),
num_splits=attn_metadata.max_num_splits,
# s_aux=self.sinks,
Contributor


critical

The s_aux=self.sinks parameter has been commented out in the call to flash_attn_varlen_func. This appears to be a regression that will break functionality for models that use attention sinks. If this was intentional for the benchmark, it should be clearly documented why. Otherwise, it should be restored to prevent breaking existing models.

Suggested change
# s_aux=self.sinks,
s_aux=self.sinks,

print(f"fa2_tree_attention average time: {fa2_time:.6f} seconds")

# Calculate speedup
speedup = tree_time / fa2_time
Contributor


high

The calculation speedup = tree_time / fa2_time could lead to a division by zero error if fa2_time is zero or very close to it. This can happen in benchmarks, especially with very fast operations or measurement noise. To prevent this, a small epsilon should be added to the denominator.

Suggested change
speedup = tree_time / fa2_time
speedup = tree_time / (fa2_time + 1e-9)
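The epsilon guard above can be folded into small timing helpers. A minimal sketch (the `bench` and `speedup` helpers here are illustrative, not part of the PR; real GPU benchmarks would also need to synchronize the device, e.g. `torch.cuda.synchronize()`, before reading the timer):

```python
import time

def bench(fn, warmup=3, iters=10):
    # Average wall-clock time per call. Warmup iterations are discarded
    # so one-time setup costs do not skew the measurement.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

def speedup(baseline_time, candidate_time, eps=1e-9):
    # Epsilon keeps the ratio finite when the candidate time is zero
    # or below the timer's resolution.
    return baseline_time / (candidate_time + eps)
```

For example, `speedup(tree_time, fa2_time)` stays finite even when `fa2_time` measures as exactly zero.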

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify

mergify bot commented Aug 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @foolusion.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 26, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Nov 26, 2025
@github-actions

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

@github-actions github-actions bot closed this Dec 26, 2025
@IdoCohenTD

Hey @foolusion, I tried using this PR when serving qwen3_32b with the eagle model from the speculators repo. The code runs and I can see that the propose_tree path is taken in the EagleProposer. For some reason the acceptance rate drops, and so does the generation time. The generation time can drop in some cases, but shouldn't the acceptance rate be lower-bounded by the regular acceptance rate?
Have you tried running this in a full vllm run?

@foolusion
Author

Hi @IdoCohenTD, we have a private branch that has diverged pretty significantly from this code. I don't actually know if the code here is correct; I was just swapping the attention mechanism for a benchmark comparison.

Are you saying that the fa2 tree_attention is getting a lower acceptance rate than the triton tree_attention? I hope our team can make our changes public soon. One thing to note about the acceptance rate: the current implementation of the tree rejection sampler uses argmax to match draft tokens. We have top-k and are working on top-p, which improve the acceptance rate if you are willing to trade off some accuracy.

Hoping to release the changes soon!
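The argmax-vs-top-k distinction above can be illustrated with a toy acceptance check. This is a sketch only, not vLLM's actual rejection-sampler code; the function names and list-based logits are made up for illustration:

```python
def accept_argmax(draft_token, target_logits):
    # Strict criterion: accept only when the draft token is the target
    # model's single most likely token.
    best = max(range(len(target_logits)), key=lambda i: target_logits[i])
    return draft_token == best

def accept_topk(draft_token, target_logits, k=4):
    # Looser criterion: accept when the draft token falls anywhere in
    # the target model's top-k, raising the acceptance rate at some
    # accuracy cost.
    order = sorted(range(len(target_logits)),
                   key=lambda i: target_logits[i], reverse=True)
    return draft_token in order[:k]
```

With logits `[0.1, 3.0, 2.9, 0.2]`, draft token 2 is rejected by argmax (token 1 wins) but accepted by top-k with k=2.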

@foolusion
Author

@IdoCohenTD I think this is a better place to test the fa2 changes #25511

@IdoCohenTD

@foolusion I only compared it to no tree attention. I am seeing a significant rise in the acceptance rate at position 0, but then a drop at position 1 and onward. I will take a look at #25511.
Hope to see your release soon!

