[V1][Spec Decode] KV cache slots for eagle heads #16370
```diff
@@ -164,7 +164,8 @@ def allocate_slots(
         self,
         request: Request,
         num_tokens: int,
-        new_computed_blocks: Optional[list[KVCacheBlock]] = None
+        new_computed_blocks: Optional[list[KVCacheBlock]] = None,
+        num_spec_tokens: int = 0,
     ) -> Optional[list[KVCacheBlock]]:
         """Add slots for a request with new tokens to append.
 
```
```diff
@@ -174,6 +175,9 @@ def allocate_slots(
                 not include the tokens that have already been computed.
             new_computed_blocks: A list of new computed blocks just hitting the
                 prefix caching.
+            num_spec_tokens: The number of speculative tokens to allocate.
+                This field is only used by eagle. We allocate the slots for
+                the propose heads.
 
         Blocks layout:
         -----------------------------------------------------------------------
```
```diff
@@ -211,8 +215,9 @@ def allocate_slots(
 
         # the new prefix caching hits
         num_computed_tokens = (request.num_computed_tokens +
                                len(new_computed_blocks) * self.block_size)
-        num_required_blocks = cdiv(num_computed_tokens + num_tokens,
-                                   self.block_size)
+        num_required_blocks = cdiv(
+            num_computed_tokens + num_tokens + num_spec_tokens,
+            self.block_size)
         num_new_blocks = (num_required_blocks - len(req_blocks) -
                           len(new_computed_blocks))
```
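To make the accounting concrete, here is a minimal sketch of the arithmetic above, using hypothetical numbers (`block_size = 16`, `K = 4` eagle draft tokens):

```python
def cdiv(a: int, b: int) -> int:
    # Ceiling division, as in vllm.utils.
    return -(a // -b)

block_size = 16
num_computed_tokens = 30  # tokens already in the KV cache
num_tokens = 1            # new token appended this step
num_spec_tokens = 4       # eagle drafts proposed at the end of the step

# Without this PR: 31 tokens -> 2 blocks. The drafts need slots 31..34,
# and slots 32..34 fall in a third block that was never allocated.
print(cdiv(num_computed_tokens + num_tokens, block_size))  # 2

# With this PR: 35 tokens -> 3 blocks, so the draft queries get real slots.
print(cdiv(num_computed_tokens + num_tokens + num_spec_tokens,
           block_size))  # 3
```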
@luyuzhe111 @wwl2755 - Moving the discussion of why this PR is expected to improve the AL here. I have a hypothesis: without this PR, the queries in the draft can go out of bounds in the block_table and pick up an incorrect address and value, which will corrupt the answer. block_table is used in the FA CUDA kernels, and maybe we don't check for illegal memory accesses there. Let's say the page size is 16. The corruption will arise when we have < K slots left in the last block; the preallocate-block computation (extra 4 blocks) won't trigger in this case since the last block is not full. As K increases, the chances of this increase, so K=4 has a higher chance of hitting it than K=2, which is reflected here. But then block_table is gathered here too to form the slot_mapping for the queries, so an out-of-bounds index should have given an error, which it did not when using bs=1 with MTBench, so I am not sure the hypothesis above is correct. Let me know what you guys think.

@WoosukKwon @LiuXiaoxuanPKU - can you also share your insight as to why this PR is expected to increase AL?

QQ: is the statement "this PR can increase AL" already benchmarked, or is it set up as a goal of this PR?

From a high level, without this PR the current scheduler does not actually allocate slots for the proposed tokens; it only allocates slots for verification. Therefore, it is not guaranteed that the KV cache of the propose heads is uncontaminated.

@LiuXiaoxuanPKU can you help us understand, at a bit deeper level, which code line would be at fault? My understanding is that if the scheduler doesn't allocate slots for the proposed tokens, then torch should have thrown an error here when the newly proposed tokens become the query. However, that didn't happen in our MTBench benchmark, so probably there is no corruption without this PR?

Thanks for asking! here will not trigger an error, because block_table is always a tensor of shape
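The shape argument in the (truncated) last comment can be illustrated with a small sketch. This is a hedged illustration, not vLLM's actual code: it assumes a dense, zero-padded row of a fixed-shape block_table tensor, with made-up sizes, and shows why spilled draft queries read padding instead of raising an IndexError:

```python
import torch

block_size = 16
max_blocks_per_req = 4  # hypothetical stand-in for the padded row width

# A single request's row of a fixed-shape, zero-padded block_table.
block_table = torch.zeros(1, max_blocks_per_req, dtype=torch.int64)
block_table[0, :2] = torch.tensor([7, 12])  # only 2 blocks really allocated

# Draft-token query positions: 30..31 fit in the allocated slots (0..31),
# but 32..34 index block 2, which was never allocated.
positions = torch.arange(30, 35)
block_idx = positions // block_size                 # [1, 1, 2, 2, 2]
slot_mapping = (block_table[0, block_idx] * block_size
                + positions % block_size)

# No IndexError: index 2 is still inside the padded tensor, so the spilled
# queries silently land in physical block 0, which may belong to another
# request -- silent KV corruption rather than a crash.
print(slot_mapping)  # tensor([206, 207,   0,   1,   2])
```

Under that assumption, both observations in the thread are consistent: the gather never raises, yet the drafts' KV can be written into blocks owned by other requests.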
```diff
@@ -7,7 +7,8 @@
 from collections.abc import Iterable
 from typing import Optional, Union
 
-from vllm.config import CacheConfig, LoRAConfig, ModelConfig, SchedulerConfig
+from vllm.config import (CacheConfig, LoRAConfig, ModelConfig, SchedulerConfig,
+                         SpeculativeConfig)
 from vllm.logger import init_logger
 from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalRegistry
 from vllm.v1.core.encoder_cache_manager import (EncoderCacheManager,
```
```diff
@@ -38,6 +39,7 @@ def __init__(
         cache_config: CacheConfig,
         lora_config: Optional[LoRAConfig],
         kv_cache_config: KVCacheConfig,
+        speculative_config: SpeculativeConfig,
         structured_output_manager: StructuredOutputManager,
         mm_registry: MultiModalRegistry = MULTIMODAL_REGISTRY,
         include_finished_set: bool = False,
```
```diff
@@ -112,6 +114,10 @@ def __init__(
         self.encoder_cache_manager = EncoderCacheManager(
             cache_size=encoder_cache_size)
 
+        self.num_spec_tokens = 0
+        if speculative_config and speculative_config.method == "eagle":
+            self.num_spec_tokens = speculative_config.num_speculative_tokens
+
     def schedule(self) -> SchedulerOutput:
         # NOTE(woosuk) on the scheduling algorithm:
         # There's no "decoding phase" nor "prefill phase" in the scheduler.
```
```diff
@@ -188,7 +194,9 @@ def schedule(self) -> SchedulerOutput:
 
             while True:
                 new_blocks = self.kv_cache_manager.allocate_slots(
-                    request, num_new_tokens)
+                    request,
+                    num_new_tokens,
+                    num_spec_tokens=self.num_spec_tokens)
                 if new_blocks is None:
                     # The request cannot be scheduled.
                     # Preempt the lowest-priority request.
```
I have two points to discuss:

I have seen the term `lookahead_tokens` before. Can you share why this is more general than `spec_tokens`? Is it because it can also mean jump tokens?

No, jump tokens should be in new_tokens. I just feel `num_spec_tokens` is confusing because it actually means the spec tokens we're going to propose by the end of this step. However, we also have `spec_tokens` in `Request`, but those `spec_tokens` were generated by the last step, for verification.

+1 to @comaniac. I have the same two questions, too.

If we go with `num_lookahead_tokens`, the preallocation will change here: `preallocated_blocks -= num_lookahead_tokens // block_size`.

We might have to revert this once the number of draft tokens becomes large, especially with tree attention, since then the number of draft tokens ~= the number of preallocated tokens, which would lead to frequent block allocations.
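A minimal sketch of that trade-off, with hypothetical numbers (`num_preallocate_blocks = 4` and `block_size = 16` stand in for the KV cache manager's actual constants; the adjustment mirrors the line quoted above):

```python
block_size = 16
num_preallocate_blocks = 4  # assumed preallocated headroom

for num_lookahead_tokens in (2, 16, 48, 64):
    # Lookahead slots already cover part of the preallocated headroom,
    # so preallocate correspondingly fewer extra blocks.
    remaining = max(
        0, num_preallocate_blocks - num_lookahead_tokens // block_size)
    print(num_lookahead_tokens, remaining)

# 2  -> 4 blocks of headroom left
# 16 -> 3
# 48 -> 1
# 64 -> 0: with no headroom, every step triggers fresh block allocation,
#          which is the "frequent block allocations" concern above.
```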