Skip to content

[Feature] Support prefix cache retention patch#10198

Draft
Pz1116 wants to merge 1 commit into
vllm-project:mainfrom
Pz1116:codex/prefix-cache-retention-main
Draft

[Feature] Support prefix cache retention patch#10198
Pz1116 wants to merge 1 commit into
vllm-project:mainfrom
Pz1116:codex/prefix-cache-retention-main

Conversation

@Pz1116

@Pz1116 Pz1116 commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

What

Backport the local prefix-cache retention support from vllm-project/vllm#43447 into vLLM-Ascend's platform patch layer for current vLLM 0.20.x based deployments.

This keeps the change separate from the AscendStore external-store retention work and focuses on local sliding-window prefix-cache retention semantics.

Why

Current vLLM-Ascend main still carries Ascend-specific KV cache coordinator and cache manager patches. For DeepSeek V4 long-context runs, those patches need to understand the same selective sliding-window retention behavior as vLLM core, otherwise local prefix-cache checkpointing can remain dense and over-retain blocks.

Changes

  • Register VLLM_PREFIX_CACHE_RETENTION_INTERVAL for current vLLM versions that do not expose it yet.
  • Patch free-queue prepend and BlockPool masked caching/freeing support used by sparse retention.
  • Add sliding-window reachable-block retention masks following [Prefix Caching] DeepSeekv4 - Support selective prefix-cache retention for sliding-window KV cache vllm#43447.
  • Propagate retention interval through Ascend's hybrid KV cache coordinator.
  • Keep compressed MLA groups dense while allowing sliding-window groups to retain sparse checkpoints.
  • Add focused unit coverage for env registration, validation, SWA masks, and AscendHybrid propagation.

Validation

vLLM version: v0.20.1
vLLM main: vllm-project/vllm@c7aa186

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for local prefix-cache retention in vLLM-Ascend, backported from the core vLLM project. By enabling sparse sliding-window retention, the changes optimize memory usage for long-context inference tasks. This implementation ensures that KV cache management is more efficient by allowing selective retention of blocks rather than dense checkpointing, which is critical for maintaining performance in large-scale deployments.

Highlights

  • Environment Configuration: Added the VLLM_PREFIX_CACHE_RETENTION_INTERVAL environment variable to allow fine-grained control over local prefix-cache retention.
  • Sparse Retention Logic: Implemented patching for BlockPool and FreeKVCacheBlockQueue to support sparse retention, enabling masked caching and efficient block freeing.
  • Sliding Window Optimization: Introduced sliding-window reachable-block masks to optimize KV cache usage, specifically targeting long-context models like DeepSeek V4.
  • Coordinator Integration: Updated the AscendHybridKVCacheCoordinator to propagate retention intervals and manage cache blocks based on the new sparse retention logic.
  • Test Coverage: Added comprehensive unit tests to validate environment registration, SWA mask generation, and the propagation of retention settings through the cache coordinator.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Suggested PR Title:\n\nmarkdown\n[Attention][Feature] Support prefix cache retention interval for sliding-window KV cache\n\n\nSuggested PR Summary:\n\nmarkdown\n### What this PR does / why we need it?\nThis pull request implements support for prefix cache retention intervals for sliding-window KV cache groups, mirroring vLLM's `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` environment variable. It introduces the `patch_prefix_cache_retention.py` module to patch upstream coordinators, block pools, and single-type KV cache managers to respect the retention interval.\n\nSeveral critical issues were identified in the review:\n1. In `BlockStored` event generation, `token_ids` are contiguous even when some blocks are skipped via `block_mask`, leading to mismatched mapping of block hashes to token IDs.\n2. The reference count of the singleton null block is incorrectly decremented in `free_blocks`, which can corrupt reference counts.\n3. The singleton null block is not skipped in the `free` method of `SingleTypeKVCacheManager`, causing it to be queued for freeing.\n\n### Does this PR introduce _any_ user-facing change?\nYes, it introduces support for the `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` environment variable to control local sliding-window KV checkpoint retention.\n\n### How was this patch tested?\nThe patch includes new unit tests in `test_prefix_cache_cp_patches.py` covering environment registration, interval validation, reachable block mask logic, and coordinator routing.\n

Comment on lines +190 to +221
if self.enable_kv_cache_events:
parent_block_hash = (
None
if num_cached_blocks == 0
else block_pool_mod.maybe_convert_block_hash(block_hashes[num_cached_blocks - 1])
)
start_token_idx = num_cached_blocks * block_size
end_token_idx = num_full_blocks * block_size
extra_keys_list: list[tuple[Any, ...] | None] = []
curr_mm_idx = 0
for i in range(num_cached_blocks, num_full_blocks):
if not block_mask[i - num_cached_blocks] or blocks[i].is_null:
continue
block_start = i * block_size
block_end = block_start + block_size
extra_keys, curr_mm_idx = block_pool_mod.generate_block_hash_extra_keys(
request, block_start, block_end, curr_mm_idx
)
extra_keys_list.append(extra_keys)

self.kv_event_queue.append(
block_pool_mod.BlockStored(
block_hashes=new_hashes,
parent_block_hash=parent_block_hash,
token_ids=request.all_token_ids[start_token_idx:end_token_idx],
block_size=block_size,
lora_id=request.lora_request.adapter_id if request.lora_request else None,
medium=block_pool_mod.MEDIUM_GPU,
lora_name=request.lora_request.name if request.lora_request else None,
extra_keys=extra_keys_list if extra_keys_list else None,
)
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

When block_mask is used, some blocks are skipped and not stored in the cache. However, the BlockStored event is currently constructed with a contiguous slice of token_ids from start_token_idx to end_token_idx. This creates a mismatch between the number of block_hashes (which only contains hashes of stored blocks) and the number of blocks represented by token_ids. Any downstream consumer of BlockStored (such as external cache managers or metrics collectors) that chunks token_ids by block_size will incorrectly map block hashes to the wrong token IDs.

To fix this, we should only include the token IDs of the blocks that were actually stored in token_ids.

        if self.enable_kv_cache_events:
            parent_block_hash = (
                None
                if num_cached_blocks == 0
                else block_pool_mod.maybe_convert_block_hash(block_hashes[num_cached_blocks - 1])
            )
            extra_keys_list: list[tuple[Any, ...] | None] = []
            stored_token_ids = []
            curr_mm_idx = 0
            for i in range(num_cached_blocks, num_full_blocks):
                if not block_mask[i - num_cached_blocks] or blocks[i].is_null:
                    continue
                block_start = i * block_size
                block_end = block_start + block_size
                extra_keys, curr_mm_idx = block_pool_mod.generate_block_hash_extra_keys(
                    request, block_start, block_end, curr_mm_idx
                )
                extra_keys_list.append(extra_keys)
                stored_token_ids.extend(request.all_token_ids[block_start:block_end])

            self.kv_event_queue.append(
                block_pool_mod.BlockStored(
                    block_hashes=new_hashes,
                    parent_block_hash=parent_block_hash,
                    token_ids=stored_token_ids,
                    block_size=block_size,
                    lora_id=request.lora_request.adapter_id if request.lora_request else None,
                    medium=block_pool_mod.MEDIUM_GPU,
                    lora_name=request.lora_request.name if request.lora_request else None,
                    extra_keys=extra_keys_list if extra_keys_list else None,
                )
            )

Comment on lines +122 to +126
blocks_list = list(ordered_blocks)
for block in blocks_list:
block.ref_cnt -= 1

freed_blocks = [block for block in blocks_list if block.ref_cnt == 0 and not block.is_null]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In free_blocks, the ref_cnt of all blocks in ordered_blocks is decremented. However, ordered_blocks can contain the singleton null block (self._null_block), which is used for padding skipped blocks. Decrementing the ref_cnt of the null block will corrupt its reference count over time. We should only decrement the reference count for non-null blocks.

        blocks_list = list(ordered_blocks)
        for block in blocks_list:
            if not block.is_null:
                block.ref_cnt -= 1

        freed_blocks = [block for block in blocks_list if not block.is_null and block.ref_cnt == 0]

Comment on lines +399 to +411
def free(self: SingleTypeKVCacheManager, request_id: str) -> None:
req_blocks = self.req_to_blocks.pop(request_id, [])
if req_blocks:
cached_blocks: list[KVCacheBlock] = []
uncached_blocks: list[KVCacheBlock] = []
for block in reversed(req_blocks):
if block.block_hash is None:
uncached_blocks.append(block)
else:
cached_blocks.append(block)
self.block_pool.free_blocks(cached_blocks)
self.block_pool.free_blocks(uncached_blocks, prepend=True)
self.num_cached_block.pop(request_id, None)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In free, we iterate through req_blocks and categorize them into cached_blocks and uncached_blocks based on whether block_hash is None. However, req_blocks can contain the singleton null block (self._null_block), which has block_hash as None. This causes the null block to be added to uncached_blocks and passed to free_blocks. We should explicitly skip null blocks to avoid passing them to the free queue.

Suggested change
def free(self: SingleTypeKVCacheManager, request_id: str) -> None:
req_blocks = self.req_to_blocks.pop(request_id, [])
if req_blocks:
cached_blocks: list[KVCacheBlock] = []
uncached_blocks: list[KVCacheBlock] = []
for block in reversed(req_blocks):
if block.block_hash is None:
uncached_blocks.append(block)
else:
cached_blocks.append(block)
self.block_pool.free_blocks(cached_blocks)
self.block_pool.free_blocks(uncached_blocks, prepend=True)
self.num_cached_block.pop(request_id, None)
def free(self: SingleTypeKVCacheManager, request_id: str) -> None:
req_blocks = self.req_to_blocks.pop(request_id, [])
if req_blocks:
cached_blocks: list[KVCacheBlock] = []
uncached_blocks: list[KVCacheBlock] = []
for block in reversed(req_blocks):
if block.is_null:
continue
if block.block_hash is None:
uncached_blocks.append(block)
else:
cached_blocks.append(block)
self.block_pool.free_blocks(cached_blocks)
self.block_pool.free_blocks(uncached_blocks, prepend=True)
self.num_cached_block.pop(request_id, None)

@github-actions

Copy link
Copy Markdown
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant