Add tree attention backend for v1 (part 1) #20401
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: add 🚀 …
Summary of Changes
Hello @TheEpicDolphin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates the initial phase of a Tree Attention backend into v1 of the attention system, specifically to support EAGLE speculative decoding. The changes enable the efficient validation of draft tokens by implementing a tree-based attention mechanism that correctly applies necessary attention biases. This work involves significant additions to the attention backend infrastructure, updates to model architecture to utilize the new backend, and includes a correctness test to ensure functionality.
Highlights
- **New Tree Attention Backend:** Introduced `TreeAttentionBackend` and `TreeAttentionImpl` to add support for tree attention, which is a key component for EAGLE speculative decoding in v1 of the attention system.
- **Attention Bias Implementation:** The `TreeAttentionImpl` leverages `xformers.ops.tree_attention` and correctly applies both prefix and speculative (suffix) attention biases, essential for managing attention between draft tokens and their ancestors or prompt tokens.
- **Dynamic Backend Selection and Draft Model Support:** The attention backend selection logic has been updated to include `TREE_ATTN` and now incorporates an `is_draft` flag, allowing the system to differentiate and select appropriate attention backends for draft models within the speculative decoding framework.
- **Optimized Batch Processing:** A new `TreeAttentionMetadataBuilder` was added to reorder batches, prioritizing decode requests, and to efficiently construct attention metadata for both prefill (handled by FlashAttention) and speculative decode phases (see the sketch after this list).
- **Correctness Validation:** A new test, `test_tree_attn_correctness`, was implemented to verify the numerical correctness of the `TreeAttentionBackend` by comparing its output against `FlashAttentionBackend` across various configurations.
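The decode-first reordering mentioned above amounts to partitioning the scheduled batch so that speculative-decode requests precede prefills. A minimal, generic sketch of that idea is below; it is not the actual `TreeAttentionMetadataBuilder` code, and the `ScheduledRequest` type and threshold parameter are purely illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class ScheduledRequest:
    """Illustrative stand-in for a scheduled request; not vLLM's actual type."""
    request_id: str
    num_scheduled_tokens: int  # 1 + num_spec_tokens for decodes, a prompt chunk for prefills


def reorder_decodes_first(
    requests: list[ScheduledRequest], max_decode_query_len: int
) -> list[ScheduledRequest]:
    """Partition a batch so (speculative) decode requests come before prefills.

    A request counts as a decode if it schedules at most max_decode_query_len
    tokens (e.g. 1 + the number of speculative tokens); longer requests are
    treated as prefills, which the backend hands off to FlashAttention.
    """
    decodes = [r for r in requests if r.num_scheduled_tokens <= max_decode_query_len]
    prefills = [r for r in requests if r.num_scheduled_tokens > max_decode_query_len]
    return decodes + prefills
```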
Code Review
This pull request introduces a new TreeAttentionBackend for speculative decoding, which is a significant feature addition. The implementation is well-structured, reusing FlashAttentionImpl for prefill requests and using xformers for the tree attention part. The new test file provides good coverage for correctness verification.
I've identified a critical issue with duplicated fields in a dataclass and a few medium-severity issues related to code correctness, performance, and maintainability. Addressing these will improve the quality and robustness of the new backend. Overall, this is a great first step towards enabling tree attention.
This pull request has merge conflicts that must be resolved before it can be merged.
sgrigory left a comment
Thanks for integrating tree attention! Left a few comments. Regarding performance, maybe look at the profiles to see what takes the most time - it could be the tree attention itself, but it could also be metadata processing (which we can then take out of the decoding loop, at least partially)
This simulates a situation in which pages are actually ordered contiguously in physical memory. Would the test also work in a more complex scenario? For example, you could swap two pages, or even shuffle them all.
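A minimal sketch of the kind of shuffling being suggested, assuming the test owns a `block_table` of physical block ids and paged `key_cache`/`value_cache` tensors indexed by block number (all names here are illustrative, not the actual test's variables):

```python
import torch


def shuffle_kv_cache_blocks(
    block_table: torch.Tensor,  # (num_seqs, max_blocks_per_seq), physical block ids
    key_cache: torch.Tensor,    # (num_blocks, block_size, num_kv_heads, head_dim)
    value_cache: torch.Tensor,  # same layout as key_cache
):
    """Randomly permute physical blocks so pages are no longer contiguous.

    The logical contents seen through the block table stay the same, because
    every block-table entry is remapped to the block's new physical position.
    """
    num_blocks = key_cache.shape[0]
    device = key_cache.device
    perm = torch.randperm(num_blocks, device=device)

    # new_position[old_block_id] gives where that block lives after shuffling.
    new_position = torch.empty_like(perm)
    new_position[perm] = torch.arange(num_blocks, device=device)

    shuffled_key_cache = key_cache[perm]
    shuffled_value_cache = value_cache[perm]
    remapped_block_table = new_position[block_table.long()].to(block_table.dtype)
    return remapped_block_table, shuffled_key_cache, shuffled_value_cache
```

Because the block table is remapped with the inverse permutation, each request still reads exactly the same KV content even though the physical pages are no longer contiguous.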
Fixed!
vllm/engine/arg_utils.py
Nit: is the comment above "No XFormers so far" still true if you are importing tree attention from xFormers?
Why not just call it XFORMERS?
Can we leave prefix_op as None and rely on the heuristic https://github.com/facebookresearch/xformers/blob/80250b32516b019b72bb44be04ca9a8741b42faa/xformers/ops/tree_attention.py#L469C5-L469C21 to choose the prefix op?
I tried this at first, but received the following error:
File "/data/users/gdelfin/gitrepos/vllm/vllm/v1/attention/backends/tree_attn.py", line 515, in forward
output[:num_decode_tokens] = tree_attention(
^^^^^^^^^^^^^^^
File "/home/gdelfin/.conda/envs/py312conda/lib/python3.12/site-packages/xformers/ops/tree_attention.py", line 606, in tree_attention
prefix_op = select_prefix_op(
^^^^^^^^^^^^^^^^^
File "/home/gdelfin/.conda/envs/py312conda/lib/python3.12/site-packages/xformers/ops/tree_attention.py", line 491, in select_prefix_op
fa3_supported = isinstance(attn_bias, flash3.FwOp.SUPPORTED_ATTN_BIAS_TYPES) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union
It seems like the current xformers version (0.0.30) in the vllm/requirements/cuda.txt file has a type error that needs to be fixed to enable the prefix_op heuristic. Either that, or we bump the required xformers version. I can look into doing this later after finishing up the xformers v1 backend!
cc @bottler
Failure is due to a flaky test discussed in https://vllm-dev.slack.com/archives/C07R5PAL2L9/p1754127415660409. It is not caused by this PR. Will need help to force-merge this.
Thanks for the heads up, I'll look into this test issue. Edit: I have a draft PR with the fix here: #22207. Will publish shortly.
@DarkLight1337 Fix is ready here: #22207
I have a question. vLLM uses mixed scheduling for the prefill and decode stages, but your current operator completely separates prefill and decode. So, as I understand it, when validating in the main model (and there is a new prefill request at that time), it cannot be handled properly, right?
I believe that this method: https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/tree_attn.py#L186 …
Part 2, which enables end-to-end support for tree spec decoding in V1, is here: #22752
Purpose
Add support for the tree attention v1 backend. Tree attention is used in EAGLE speculative decoding by the target model to validate a set of draft tokens. Draft tokens only attend to ancestor tokens, so an attention bias must be used to omit attention between non-descendant tokens. To support that, I added a new parameter to the triton `unified_attention` kernel called `qq_bias`. This parameter enables applying a query-on-query attention bias using a 2D (q_len, q_len) tensor. The feature is only enabled if a non-None value is provided for that parameter; otherwise, it is disabled (the default case).

I also implemented the logic for tree draft proposal in eagle.py. For chain drafts, it behaves the same as before. However, if a tree of speculative tokens is specified (via the speculative config), then this system can leverage the TreeAttentionBackend for drafting. Top-K is used to select the drafted child tokens at each level of the tree.
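To make the qq_bias idea concrete, here is a minimal sketch of how an ancestor-only bias could be built for a token tree. It assumes the tree is described by each token's parent index and that the bias is additive, with -inf masking non-ancestor pairs; this is an illustration of the concept, not the PR's actual code:

```python
import torch


def build_tree_qq_bias(parent: list[int], dtype=torch.float32) -> torch.Tensor:
    """Build a (q_len, q_len) additive bias for a token tree.

    parent[i] is the index of token i's parent within the tree, or -1 if the
    token has no parent among these draft tokens. Token i may attend to itself
    and its ancestors; every other (query, key) pair stays at -inf so it is
    masked out when the bias is added to the attention scores.
    """
    q_len = len(parent)
    bias = torch.full((q_len, q_len), float("-inf"), dtype=dtype)
    for i in range(q_len):
        node = i
        while node != -1:
            bias[i, node] = 0.0  # allow attention to self and each ancestor
            node = parent[node]
    return bias


# Example: a root token with two children, each of which has one child.
#       0
#      / \
#     1   2
#     |   |
#     3   4
qq_bias = build_tree_qq_bias([-1, 0, 0, 1, 2])
# Row 3 permits columns {0, 1, 3}; row 4 permits columns {0, 2, 4}.
```

Each row of the resulting tensor lets a draft token attend only to itself and the tokens on its root path, which is the masking tree validation needs.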
NOTE: This PR does NOT change the existing behavior of v1 EAGLE. It simply adds the capability to use the TreeAttentionBackend, which can validate a tree of draft tokens. However, since tree scoring is still not implemented (I am working on it right now), only chain drafts are supported at this moment. But this is the first step to unlocking tree drafting and scoring functionality!
Test Plan
Benchmark
In addition, I used the following command to run the LLM service and benchmark TreeAttentionBackend vs FlashAttentionBackend:
Server
Client
Results
This benchmarking helped me verify that this PR did NOT regress performance on v1 spec decoding.
Improvements still need to be made for tree attention; I will investigate further how to close the gap.
Manual Testing
Used the code below to send a completion request to the vLLM service running with TREE_ATTN backend:
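The original snippet isn't reproduced above; as a hedged sketch, a completion request against a locally served vLLM instance could look like the following, assuming the OpenAI-compatible server on the default port and an illustrative model name:

```python
from openai import OpenAI

# Assumes a vLLM server launched with the TREE_ATTN backend and serving the
# OpenAI-compatible API on the default port; the model name is illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Explain the theory of relativity in simple terms.",
    max_tokens=128,
    temperature=0.0,
)
print(completion.choices[0].text)
```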
Flash Attention Output
Tree Attention Output
Tree Drafts
I tested generating a tree with the following structure:
Represented by the following list of tuples:
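The tree and tuple list from the original post are not reproduced here. Purely as an illustration, and assuming the tuples encode root-to-node paths of child indices (an assumption about the format, not confirmed by this excerpt), a small binary tree and a helper converting it to the parent-index form used in the bias sketch above might look like:

```python
# Hypothetical path-tuple encoding of a depth-2 binary tree: each tuple is the
# sequence of child indices taken from the root to reach that draft token.
tree_paths = [(0,), (1,), (0, 0), (0, 1), (1, 0), (1, 1)]


def paths_to_parent_indices(paths: list[tuple[int, ...]]) -> list[int]:
    """Convert path tuples to a parent-index list (-1 means the implicit root)."""
    index_of = {path: i for i, path in enumerate(paths)}
    parents = []
    for path in paths:
        # Dropping the last step of a path gives the parent's path; length-1
        # paths hang directly off the implicit root token.
        parents.append(index_of[path[:-1]] if len(path) > 1 else -1)
    return parents


print(paths_to_parent_indices(tree_paths))  # [-1, -1, 0, 0, 1, 1]
```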
For the input prompt, "Explain the theory of relativity in simple terms.", the backend proposed the following speculative tokens:
And also for the input prompt, "Write the first line of a novel that doesn’t exist yet.":
The paths in both draft trees are coherent.
NOTE: There is currently no way to sample tokens from a tree, so when this token tree was used, only the first few tokens were ever accepted.
Eagle Test
Added a test case for the tree attention backend. All tests pass:
Tree Attention Correctness Test
Also added a test case to test_attention_backends for tree attention.
Tree Attention vs Triton Attention
Given that the tree attention backend currently uses triton attention under the hood, but with a custom query-on-query tree attention bias, I decided to measure the performance difference between the two for various batch sizes, sequence lengths, and query lengths. In this case, Seqlen Q could represent the tree of tokens that is being validated by the target model. Here are the results:
This demonstrates that the addition of the custom tree attention bias does not significantly regress overall performance. I expect that the increase in average accepted token length from tree draft tokens will more than compensate for the minor increase in attention latency.
TODOs
The following actions still need to be taken to fully enable this backend:
- As of this diff, only chain drafts are supported by TreeAttentionBackend. This is because EagleProposer still only generates draft chains.