
DO_NOT_COMMIT: benchmark fa2 vs tree attention #23143

Closed

foolusion wants to merge 3 commits into vllm-project:main from samsung-cnct:apo

Conversation

@foolusion

This is just to share the benchmark code in the flash-attention repo; it was hacked together very quickly.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces benchmarking capabilities to compare tree attention with flash attention v2. The changes include updating the flash-attention dependency to a fork, adding benchmarking utilities to a test file, and modifying the flash attention backend to support tree attention metadata.

My review has identified a few critical issues:

  1. The build configuration points to a non-official fork of flash-attention, which is a security concern.
  2. A potential division-by-zero error exists in the benchmark's speedup calculation.
  3. The attention sink functionality appears to be disabled in the flash attention backend, which could be a regression.

These issues should be addressed before considering this code for any use beyond temporary benchmarking.

Comment on lines +40 to +41
GIT_REPOSITORY https://github.com/samsung-cnct/flash-attention.git
GIT_TAG feaab457d8d58243f19bf234a42a498647de0e6f
Contributor


critical

The dependency for vllm-flash-attn has been changed to a personal fork (samsung-cnct/flash-attention) from the official vllm-project/flash-attention. This introduces a potential security risk and dependency issue. While this might be intentional for benchmarking purposes in a "DO_NOT_COMMIT" context, it's a critical change that should not be merged into a production branch. Using unverified third-party forks can expose the project to vulnerabilities.

k_descale=layer._k_scale.expand(descale_shape),
v_descale=layer._v_scale.expand(descale_shape),
num_splits=attn_metadata.max_num_splits,
# s_aux=self.sinks,
Contributor


critical

The s_aux=self.sinks parameter has been commented out in the call to flash_attn_varlen_func. This appears to be a regression that will break functionality for models that use attention sinks. If this was intentional for the benchmark, it should be clearly documented why. Otherwise, it should be restored to prevent breaking existing models.

Suggested change
# s_aux=self.sinks,
s_aux=self.sinks,

print(f"fa2_tree_attention average time: {fa2_time:.6f} seconds")

# Calculate speedup
speedup = tree_time / fa2_time
Contributor


high

The calculation speedup = tree_time / fa2_time could lead to a division by zero error if fa2_time is zero or very close to it. This can happen in benchmarks, especially with very fast operations or measurement noise. To prevent this, a small epsilon should be added to the denominator.

Suggested change
speedup = tree_time / fa2_time
speedup = tree_time / (fa2_time + 1e-9)
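The epsilon guard above can be folded into small timing helpers. A minimal sketch (the `bench` and `speedup` helpers here are illustrative, not part of the PR; real GPU benchmarks would also need to synchronize the device, e.g. `torch.cuda.synchronize()`, before reading the timer):

```python
import time

def bench(fn, warmup=3, iters=10):
    # Average wall-clock time per call. Warmup iterations are discarded
    # so one-time setup costs do not skew the measurement.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

def speedup(baseline_time, candidate_time, eps=1e-9):
    # Epsilon keeps the ratio finite when the candidate time is zero
    # or below the timer's resolution.
    return baseline_time / (candidate_time + eps)
```

For example, `speedup(tree_time, fa2_time)` stays finite even when `fa2_time` measures as exactly zero.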

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify

mergify bot commented Aug 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @foolusion.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 26, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Nov 26, 2025
@github-actions

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

@github-actions github-actions bot closed this Dec 26, 2025
@IdoCohenTD

Hey @foolusion, I tried using this PR when serving qwen3_32b with the eagle model from the speculators repo. The code runs and I can see that the propose_tree path is taken in the EagleProposer. For some reason the acceptance rate drops, and so does the generation time. The generation time can drop in some cases, but shouldn't the acceptance rate be lower-bounded by the regular acceptance rate?
Have you tried running this in a full vllm run?

@foolusion
Author

Hi @IdoCohenTD, we have a private branch that has diverged pretty significantly from this code. I don't actually know if the code here is correct; I was just swapping the attention mechanism for a benchmark comparison.

Are you saying that the fa2 tree_attention is getting a lower acceptance rate than the triton tree_attention? I hope our team can make our changes public soon. One thing to note about the acceptance rate: the current implementation of the tree rejection sampler uses argmax to match draft tokens. We have top-k and are working on top-p, which improve the acceptance rate if you are willing to trade off some accuracy.

Hoping to release the changes soon!
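The argmax-vs-top-k distinction above can be illustrated with a toy acceptance check. This is a sketch only, not vLLM's actual rejection-sampler code; the function names and list-based logits are made up for illustration:

```python
def accept_argmax(draft_token, target_logits):
    # Strict criterion: accept only when the draft token is the target
    # model's single most likely token.
    best = max(range(len(target_logits)), key=lambda i: target_logits[i])
    return draft_token == best

def accept_topk(draft_token, target_logits, k=4):
    # Looser criterion: accept when the draft token falls anywhere in
    # the target model's top-k, raising the acceptance rate at some
    # accuracy cost.
    order = sorted(range(len(target_logits)),
                   key=lambda i: target_logits[i], reverse=True)
    return draft_token in order[:k]
```

With logits `[0.1, 3.0, 2.9, 0.2]`, draft token 2 is rejected by argmax (token 1 wins) but accepted by top-k with k=2.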

@foolusion
Author

@IdoCohenTD I think this is a better place to test the fa2 changes #25511

@IdoCohenTD

@foolusion I only compared it to no tree attention. I am seeing a significant rise in the acceptance rate at position 0, but then a drop at position 1 and onward. I will take a look at #25511.
Hope to see your release soon!

