[Feature] Support prefix cache retention patch#10198
Conversation
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces support for local prefix-cache retention in vLLM-Ascend, backported from the core vLLM project. By enabling sparse sliding-window retention, the changes optimize memory usage for long-context inference tasks. This implementation ensures that KV cache management is more efficient by allowing selective retention of blocks rather than dense checkpointing, which is critical for maintaining performance in large-scale deployments. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
Suggested PR Title:\n\nmarkdown\n[Attention][Feature] Support prefix cache retention interval for sliding-window KV cache\n\n\nSuggested PR Summary:\n\nmarkdown\n### What this PR does / why we need it?\nThis pull request implements support for prefix cache retention intervals for sliding-window KV cache groups, mirroring vLLM's `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` environment variable. It introduces the `patch_prefix_cache_retention.py` module to patch upstream coordinators, block pools, and single-type KV cache managers to respect the retention interval.\n\nSeveral critical issues were identified in the review:\n1. In `BlockStored` event generation, `token_ids` are contiguous even when some blocks are skipped via `block_mask`, leading to mismatched mapping of block hashes to token IDs.\n2. The reference count of the singleton null block is incorrectly decremented in `free_blocks`, which can corrupt reference counts.\n3. The singleton null block is not skipped in the `free` method of `SingleTypeKVCacheManager`, causing it to be queued for freeing.\n\n### Does this PR introduce _any_ user-facing change?\nYes, it introduces support for the `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` environment variable to control local sliding-window KV checkpoint retention.\n\n### How was this patch tested?\nThe patch includes new unit tests in `test_prefix_cache_cp_patches.py` covering environment registration, interval validation, reachable block mask logic, and coordinator routing.\n
| if self.enable_kv_cache_events: | ||
| parent_block_hash = ( | ||
| None | ||
| if num_cached_blocks == 0 | ||
| else block_pool_mod.maybe_convert_block_hash(block_hashes[num_cached_blocks - 1]) | ||
| ) | ||
| start_token_idx = num_cached_blocks * block_size | ||
| end_token_idx = num_full_blocks * block_size | ||
| extra_keys_list: list[tuple[Any, ...] | None] = [] | ||
| curr_mm_idx = 0 | ||
| for i in range(num_cached_blocks, num_full_blocks): | ||
| if not block_mask[i - num_cached_blocks] or blocks[i].is_null: | ||
| continue | ||
| block_start = i * block_size | ||
| block_end = block_start + block_size | ||
| extra_keys, curr_mm_idx = block_pool_mod.generate_block_hash_extra_keys( | ||
| request, block_start, block_end, curr_mm_idx | ||
| ) | ||
| extra_keys_list.append(extra_keys) | ||
|
|
||
| self.kv_event_queue.append( | ||
| block_pool_mod.BlockStored( | ||
| block_hashes=new_hashes, | ||
| parent_block_hash=parent_block_hash, | ||
| token_ids=request.all_token_ids[start_token_idx:end_token_idx], | ||
| block_size=block_size, | ||
| lora_id=request.lora_request.adapter_id if request.lora_request else None, | ||
| medium=block_pool_mod.MEDIUM_GPU, | ||
| lora_name=request.lora_request.name if request.lora_request else None, | ||
| extra_keys=extra_keys_list if extra_keys_list else None, | ||
| ) | ||
| ) |
There was a problem hiding this comment.
When block_mask is used, some blocks are skipped and not stored in the cache. However, the BlockStored event is currently constructed with a contiguous slice of token_ids from start_token_idx to end_token_idx. This creates a mismatch between the number of block_hashes (which only contains hashes of stored blocks) and the number of blocks represented by token_ids. Any downstream consumer of BlockStored (such as external cache managers or metrics collectors) that chunks token_ids by block_size will incorrectly map block hashes to the wrong token IDs.
To fix this, we should only include the token IDs of the blocks that were actually stored in token_ids.
if self.enable_kv_cache_events:
parent_block_hash = (
None
if num_cached_blocks == 0
else block_pool_mod.maybe_convert_block_hash(block_hashes[num_cached_blocks - 1])
)
extra_keys_list: list[tuple[Any, ...] | None] = []
stored_token_ids = []
curr_mm_idx = 0
for i in range(num_cached_blocks, num_full_blocks):
if not block_mask[i - num_cached_blocks] or blocks[i].is_null:
continue
block_start = i * block_size
block_end = block_start + block_size
extra_keys, curr_mm_idx = block_pool_mod.generate_block_hash_extra_keys(
request, block_start, block_end, curr_mm_idx
)
extra_keys_list.append(extra_keys)
stored_token_ids.extend(request.all_token_ids[block_start:block_end])
self.kv_event_queue.append(
block_pool_mod.BlockStored(
block_hashes=new_hashes,
parent_block_hash=parent_block_hash,
token_ids=stored_token_ids,
block_size=block_size,
lora_id=request.lora_request.adapter_id if request.lora_request else None,
medium=block_pool_mod.MEDIUM_GPU,
lora_name=request.lora_request.name if request.lora_request else None,
extra_keys=extra_keys_list if extra_keys_list else None,
)
)| blocks_list = list(ordered_blocks) | ||
| for block in blocks_list: | ||
| block.ref_cnt -= 1 | ||
|
|
||
| freed_blocks = [block for block in blocks_list if block.ref_cnt == 0 and not block.is_null] |
There was a problem hiding this comment.
In free_blocks, the ref_cnt of all blocks in ordered_blocks is decremented. However, ordered_blocks can contain the singleton null block (self._null_block), which is used for padding skipped blocks. Decrementing the ref_cnt of the null block will corrupt its reference count over time. We should only decrement the reference count for non-null blocks.
blocks_list = list(ordered_blocks)
for block in blocks_list:
if not block.is_null:
block.ref_cnt -= 1
freed_blocks = [block for block in blocks_list if not block.is_null and block.ref_cnt == 0]| def free(self: SingleTypeKVCacheManager, request_id: str) -> None: | ||
| req_blocks = self.req_to_blocks.pop(request_id, []) | ||
| if req_blocks: | ||
| cached_blocks: list[KVCacheBlock] = [] | ||
| uncached_blocks: list[KVCacheBlock] = [] | ||
| for block in reversed(req_blocks): | ||
| if block.block_hash is None: | ||
| uncached_blocks.append(block) | ||
| else: | ||
| cached_blocks.append(block) | ||
| self.block_pool.free_blocks(cached_blocks) | ||
| self.block_pool.free_blocks(uncached_blocks, prepend=True) | ||
| self.num_cached_block.pop(request_id, None) |
There was a problem hiding this comment.
In free, we iterate through req_blocks and categorize them into cached_blocks and uncached_blocks based on whether block_hash is None. However, req_blocks can contain the singleton null block (self._null_block), which has block_hash as None. This causes the null block to be added to uncached_blocks and passed to free_blocks. We should explicitly skip null blocks to avoid passing them to the free queue.
| def free(self: SingleTypeKVCacheManager, request_id: str) -> None: | |
| req_blocks = self.req_to_blocks.pop(request_id, []) | |
| if req_blocks: | |
| cached_blocks: list[KVCacheBlock] = [] | |
| uncached_blocks: list[KVCacheBlock] = [] | |
| for block in reversed(req_blocks): | |
| if block.block_hash is None: | |
| uncached_blocks.append(block) | |
| else: | |
| cached_blocks.append(block) | |
| self.block_pool.free_blocks(cached_blocks) | |
| self.block_pool.free_blocks(uncached_blocks, prepend=True) | |
| self.num_cached_block.pop(request_id, None) | |
| def free(self: SingleTypeKVCacheManager, request_id: str) -> None: | |
| req_blocks = self.req_to_blocks.pop(request_id, []) | |
| if req_blocks: | |
| cached_blocks: list[KVCacheBlock] = [] | |
| uncached_blocks: list[KVCacheBlock] = [] | |
| for block in reversed(req_blocks): | |
| if block.is_null: | |
| continue | |
| if block.block_hash is None: | |
| uncached_blocks.append(block) | |
| else: | |
| cached_blocks.append(block) | |
| self.block_pool.free_blocks(cached_blocks) | |
| self.block_pool.free_blocks(uncached_blocks, prepend=True) | |
| self.num_cached_block.pop(request_id, None) |
|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
What
Backport the local prefix-cache retention support from vllm-project/vllm#43447 into vLLM-Ascend's platform patch layer for current vLLM 0.20.x based deployments.
This keeps the change separate from the AscendStore external-store retention work and focuses on local sliding-window prefix-cache retention semantics.
Why
Current vLLM-Ascend main still carries Ascend-specific KV cache coordinator and cache manager patches. For DeepSeek V4 long-context runs, those patches need to understand the same selective sliding-window retention behavior as vLLM core, otherwise local prefix-cache checkpointing can remain dense and over-retain blocks.
Changes
VLLM_PREFIX_CACHE_RETENTION_INTERVALfor current vLLM versions that do not expose it yet.BlockPoolmasked caching/freeing support used by sparse retention.Validation
vLLM version: v0.20.1
vLLM main: vllm-project/vllm@c7aa186