[Feature] Support prefix cache retention patch by Pz1116 · Pull Request #10198 · vllm-project/vllm-ascend

Pz1116 · 2026-06-08T13:18:34Z

What

Backport the local prefix-cache retention support from vllm-project/vllm#43447 into vLLM-Ascend's platform patch layer for current vLLM 0.20.x based deployments.

This keeps the change separate from the AscendStore external-store retention work and focuses on local sliding-window prefix-cache retention semantics.

Why

Current vLLM-Ascend main still carries Ascend-specific KV cache coordinator and cache manager patches. For DeepSeek V4 long-context runs, those patches need to understand the same selective sliding-window retention behavior as vLLM core, otherwise local prefix-cache checkpointing can remain dense and over-retain blocks.

Changes

Register VLLM_PREFIX_CACHE_RETENTION_INTERVAL for current vLLM versions that do not expose it yet.
Patch free-queue prepend and BlockPool masked caching/freeing support used by sparse retention.
Add sliding-window reachable-block retention masks following vllm#43447 ("[Prefix Caching] DeepSeekv4 - Support selective prefix-cache retention for sliding-window KV cache").
Propagate retention interval through Ascend's hybrid KV cache coordinator.
Keep compressed MLA groups dense while allowing sliding-window groups to retain sparse checkpoints.
Add focused unit coverage for env registration, validation, SWA masks, and AscendHybrid propagation.
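
As a rough illustration of the sparse retention idea listed above, the sketch below builds a boolean mask over the full blocks of a sliding-window group. It is not taken from the patch. It assumes that prefix hits are honored only at block counts that are multiples of the retention interval, and that a sliding-window layer needs only the last `window_blocks` blocks before such a boundary. The helper name `reachable_block_mask` and its parameters are hypothetical.

```python
def reachable_block_mask(
    num_blocks: int, window_blocks: int, retention_interval: int
) -> list[bool]:
    """Hypothetical helper: mark blocks that a later prefix hit could still need.

    Assumption (not from the patch): checkpoints sit at prefix lengths of
    retention_interval, 2 * retention_interval, ... blocks, and each
    checkpoint needs the `window_blocks` blocks that end at its boundary.
    """
    if retention_interval <= 0 or window_blocks <= 0:
        raise ValueError("retention_interval and window_blocks must be positive")
    mask = [False] * num_blocks
    # Walk over every checkpoint boundary that fits in the request.
    for boundary in range(retention_interval, num_blocks + 1, retention_interval):
        start = max(0, boundary - window_blocks)
        # Keep the window of blocks immediately before the boundary.
        for i in range(start, boundary):
            mask[i] = True
    return mask


# With an interval of 1 the mask is dense; larger intervals leave gaps.
print(reachable_block_mask(4, 1, 1))
print(reachable_block_mask(8, 1, 4))
```

Under these assumptions a compressed MLA group would simply use an all-True mask, which matches the "keep dense" item in the list.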

Validation

vLLM version: v0.20.1
vLLM main: vllm-project/vllm@c7aa186

vLLM version: v0.21.0
vLLM main: vllm-project/vllm@9090368

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>

github-actions · 2026-06-08T13:20:19Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist · 2026-06-08T13:26:49Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for local prefix-cache retention in vLLM-Ascend, backported from the core vLLM project. By enabling sparse sliding-window retention, the changes optimize memory usage for long-context inference tasks. This implementation ensures that KV cache management is more efficient by allowing selective retention of blocks rather than dense checkpointing, which is critical for maintaining performance in large-scale deployments.

Highlights

Environment Configuration: Added the VLLM_PREFIX_CACHE_RETENTION_INTERVAL environment variable to allow fine-grained control over local prefix-cache retention.
Sparse Retention Logic: Implemented patching for BlockPool and FreeKVCacheBlockQueue to support sparse retention, enabling masked caching and efficient block freeing.
Sliding Window Optimization: Introduced sliding-window reachable-block masks to optimize KV cache usage, specifically targeting long-context models like DeepSeek V4.
Coordinator Integration: Updated the AscendHybridKVCacheCoordinator to propagate retention intervals and manage cache blocks based on the new sparse retention logic.
Test Coverage: Added comprehensive unit tests to validate environment registration, SWA mask generation, and the propagation of retention settings through the cache coordinator.
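
For readers unfamiliar with the pattern, here is a minimal sketch of how an integer interval could be read from `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` and validated. The function name `read_retention_interval`, the "unset means disabled" default, and the "must be a positive integer" rule are assumptions made for illustration. The PR itself does not state the default or the accepted range.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

ENV_NAME = "VLLM_PREFIX_CACHE_RETENTION_INTERVAL"


def read_retention_interval(environ: Mapping[str, str] = os.environ) -> int | None:
    """Hypothetical parser: return the interval, or None when it is not set."""
    raw = environ.get(ENV_NAME)
    if raw is None or raw.strip() == "":
        # Assumption: an unset or empty variable leaves sparse retention off.
        return None
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{ENV_NAME} must be an integer, got {raw!r}") from exc
    if value <= 0:
        # Assumption: zero and negative intervals are rejected.
        raise ValueError(f"{ENV_NAME} must be positive, got {value}")
    return value


print(read_retention_interval({ENV_NAME: "8"}))
print(read_retention_interval({}))
```

Passing the environment as a mapping keeps the parser easy to unit test without touching the real process environment.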

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

Suggested PR Title:

[Attention][Feature] Support prefix cache retention interval for sliding-window KV cache

Suggested PR Summary:

### What this PR does / why we need it?
This pull request implements support for prefix cache retention intervals for sliding-window KV cache groups, mirroring vLLM's `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` environment variable. It introduces the `patch_prefix_cache_retention.py` module to patch upstream coordinators, block pools, and single-type KV cache managers to respect the retention interval.

Several critical issues were identified in the review:
1. In `BlockStored` event generation, `token_ids` are contiguous even when some blocks are skipped via `block_mask`, leading to mismatched mapping of block hashes to token IDs.
2. The reference count of the singleton null block is incorrectly decremented in `free_blocks`, which can corrupt reference counts.
3. The singleton null block is not skipped in the `free` method of `SingleTypeKVCacheManager`, causing it to be queued for freeing.

### Does this PR introduce _any_ user-facing change?
Yes, it introduces support for the `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` environment variable to control local sliding-window KV checkpoint retention.

### How was this patch tested?
The patch includes new unit tests in `test_prefix_cache_cp_patches.py` covering environment registration, interval validation, reachable block mask logic, and coordinator routing.

gemini-code-assist · 2026-06-08T13:28:55Z

+        if self.enable_kv_cache_events:
+            parent_block_hash = (
+                None
+                if num_cached_blocks == 0
+                else block_pool_mod.maybe_convert_block_hash(block_hashes[num_cached_blocks - 1])
+            )
+            start_token_idx = num_cached_blocks * block_size
+            end_token_idx = num_full_blocks * block_size
+            extra_keys_list: list[tuple[Any, ...] | None] = []
+            curr_mm_idx = 0
+            for i in range(num_cached_blocks, num_full_blocks):
+                if not block_mask[i - num_cached_blocks] or blocks[i].is_null:
+                    continue
+                block_start = i * block_size
+                block_end = block_start + block_size
+                extra_keys, curr_mm_idx = block_pool_mod.generate_block_hash_extra_keys(
+                    request, block_start, block_end, curr_mm_idx
+                )
+                extra_keys_list.append(extra_keys)
+
+            self.kv_event_queue.append(
+                block_pool_mod.BlockStored(
+                    block_hashes=new_hashes,
+                    parent_block_hash=parent_block_hash,
+                    token_ids=request.all_token_ids[start_token_idx:end_token_idx],
+                    block_size=block_size,
+                    lora_id=request.lora_request.adapter_id if request.lora_request else None,
+                    medium=block_pool_mod.MEDIUM_GPU,
+                    lora_name=request.lora_request.name if request.lora_request else None,
+                    extra_keys=extra_keys_list if extra_keys_list else None,
+                )
+            )


When block_mask is used, some blocks are skipped and not stored in the cache. However, the BlockStored event is currently constructed with a contiguous slice of token_ids from start_token_idx to end_token_idx. This creates a mismatch between the number of block_hashes (which only contains hashes of stored blocks) and the number of blocks represented by token_ids. Any downstream consumer of BlockStored (such as external cache managers or metrics collectors) that chunks token_ids by block_size will incorrectly map block hashes to the wrong token IDs.

To fix this, we should only include the token IDs of the blocks that were actually stored in token_ids.

        if self.enable_kv_cache_events:
            parent_block_hash = (
                None
                if num_cached_blocks == 0
                else block_pool_mod.maybe_convert_block_hash(block_hashes[num_cached_blocks - 1])
            )
            extra_keys_list: list[tuple[Any, ...] | None] = []
            stored_token_ids = []
            curr_mm_idx = 0
            for i in range(num_cached_blocks, num_full_blocks):
                if not block_mask[i - num_cached_blocks] or blocks[i].is_null:
                    continue
                block_start = i * block_size
                block_end = block_start + block_size
                extra_keys, curr_mm_idx = block_pool_mod.generate_block_hash_extra_keys(
                    request, block_start, block_end, curr_mm_idx
                )
                extra_keys_list.append(extra_keys)
                stored_token_ids.extend(request.all_token_ids[block_start:block_end])

            self.kv_event_queue.append(
                block_pool_mod.BlockStored(
                    block_hashes=new_hashes,
                    parent_block_hash=parent_block_hash,
                    token_ids=stored_token_ids,
                    block_size=block_size,
                    lora_id=request.lora_request.adapter_id if request.lora_request else None,
                    medium=block_pool_mod.MEDIUM_GPU,
                    lora_name=request.lora_request.name if request.lora_request else None,
                    extra_keys=extra_keys_list if extra_keys_list else None,
                )
            )
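
To make the mismatch concrete, here is a toy reproduction with made-up values and plain lists. It does not use any vLLM types. The `chunk` helper stands in for a hypothetical downstream consumer that splits `token_ids` by `block_size` and pairs each chunk with a block hash in order.

```python
def chunk(tokens: list[int], size: int) -> list[list[int]]:
    """Split a flat token list into consecutive blocks of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


block_size = 2
all_token_ids = [10, 11, 20, 21, 30, 31, 40, 41]  # four full blocks
block_mask = [True, False, False, True]           # only blocks 0 and 3 are stored
new_hashes = ["h0", "h3"]                         # stand-in hashes of the stored blocks

# Contiguous slice, as in the quoted diff: four blocks of tokens for two hashes.
contiguous = all_token_ids[0:len(block_mask) * block_size]
naive_pairs = list(zip(new_hashes, chunk(contiguous, block_size)))

# Filtered slice, as in the suggestion: only the tokens of stored blocks.
stored_token_ids: list[int] = []
for i, keep in enumerate(block_mask):
    if keep:
        stored_token_ids.extend(all_token_ids[i * block_size:(i + 1) * block_size])
fixed_pairs = list(zip(new_hashes, chunk(stored_token_ids, block_size)))

print(naive_pairs)  # "h3" ends up paired with block 1's tokens [20, 21]
print(fixed_pairs)  # "h3" is paired with block 3's tokens [40, 41]
```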

gemini-code-assist · 2026-06-08T13:28:55Z

+        blocks_list = list(ordered_blocks)
+        for block in blocks_list:
+            block.ref_cnt -= 1
+
+        freed_blocks = [block for block in blocks_list if block.ref_cnt == 0 and not block.is_null]


In free_blocks, the ref_cnt of all blocks in ordered_blocks is decremented. However, ordered_blocks can contain the singleton null block (self._null_block), which is used for padding skipped blocks. Decrementing the ref_cnt of the null block will corrupt its reference count over time. We should only decrement the reference count for non-null blocks.

        blocks_list = list(ordered_blocks)
        for block in blocks_list:
            if not block.is_null:
                block.ref_cnt -= 1

        freed_blocks = [block for block in blocks_list if not block.is_null and block.ref_cnt == 0]
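
The drift described above can be reproduced with a small stand-in model. The `Block` dataclass below is hypothetical and only mirrors the two fields used in the quoted snippet. It also assumes, as the review comment does, that one shared null block pads several slots without its count being incremented per slot.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a KV cache block; not vLLM's KVCacheBlock."""
    block_id: int
    ref_cnt: int = 0
    is_null: bool = False


def free_blocks_unguarded(blocks: list[Block]) -> list[Block]:
    """Mirror of the quoted diff: every block is decremented, null included."""
    for block in blocks:
        block.ref_cnt -= 1
    return [b for b in blocks if b.ref_cnt == 0 and not b.is_null]


def free_blocks_guarded(blocks: list[Block]) -> list[Block]:
    """Mirror of the suggestion: the null block's count is left untouched."""
    for block in blocks:
        if not block.is_null:
            block.ref_cnt -= 1
    return [b for b in blocks if not b.is_null and b.ref_cnt == 0]


# One shared null block pads two skipped slots of a single request.
null_a = Block(0, ref_cnt=1, is_null=True)
freed_a = free_blocks_unguarded([null_a, null_a, Block(1, ref_cnt=1)])
print(null_a.ref_cnt, [b.block_id for b in freed_a])

null_b = Block(0, ref_cnt=1, is_null=True)
freed_b = free_blocks_guarded([null_b, null_b, Block(1, ref_cnt=1)])
print(null_b.ref_cnt, [b.block_id for b in freed_b])
```

In both variants only the real block is returned for freeing; the difference is that the unguarded version leaves the shared null block with a negative count.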

gemini-code-assist · 2026-06-08T13:28:55Z

+    def free(self: SingleTypeKVCacheManager, request_id: str) -> None:
+        req_blocks = self.req_to_blocks.pop(request_id, [])
+        if req_blocks:
+            cached_blocks: list[KVCacheBlock] = []
+            uncached_blocks: list[KVCacheBlock] = []
+            for block in reversed(req_blocks):
+                if block.block_hash is None:
+                    uncached_blocks.append(block)
+                else:
+                    cached_blocks.append(block)
+            self.block_pool.free_blocks(cached_blocks)
+            self.block_pool.free_blocks(uncached_blocks, prepend=True)
+        self.num_cached_block.pop(request_id, None)


In free, we iterate through req_blocks and categorize them into cached_blocks and uncached_blocks based on whether block_hash is None. However, req_blocks can contain the singleton null block (self._null_block), which has block_hash as None. This causes the null block to be added to uncached_blocks and passed to free_blocks. We should explicitly skip null blocks to avoid passing them to the free queue.

Suggested change

    def free(self: SingleTypeKVCacheManager, request_id: str) -> None:
        req_blocks = self.req_to_blocks.pop(request_id, [])
        if req_blocks:
            cached_blocks: list[KVCacheBlock] = []
            uncached_blocks: list[KVCacheBlock] = []
            for block in reversed(req_blocks):
                if block.is_null:
                    continue
                if block.block_hash is None:
                    uncached_blocks.append(block)
                else:
                    cached_blocks.append(block)
            self.block_pool.free_blocks(cached_blocks)
            self.block_pool.free_blocks(uncached_blocks, prepend=True)
        self.num_cached_block.pop(request_id, None)
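
The split into cached and uncached blocks matters because the two lists are returned to the free queue differently. The toy model below uses a plain `collections.deque` and assumes that allocation pops from the head of the queue and that `prepend=True` places blocks at the head. Both points are assumptions about the patched queue, and the `release` helper is hypothetical.

```python
from collections import deque


def release(free_queue: deque, cached: list[str], uncached: list[str]) -> None:
    """Hypothetical model of returning a request's blocks to the free queue."""
    # Cached blocks go to the tail so they survive longer for prefix hits.
    free_queue.extend(cached)
    # Uncached blocks go to the head so they are reused first.
    # extendleft reverses its input, so reverse first to preserve order.
    free_queue.extendleft(reversed(uncached))


queue = deque(["f0"])
release(queue, cached=["c2", "c1"], uncached=["u4", "u3"])
print(list(queue))

# The next allocation takes a block with no reusable contents.
next_block = queue.popleft()
print(next_block)
```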

github-actions · 2026-06-11T14:54:47Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Support prefix cache retention patch

0e41d97

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>

github-actions Bot added module:tests module:core labels Jun 8, 2026

gemini-code-assist Bot reviewed Jun 8, 2026

github-actions Bot added the merge-conflicts label Jun 11, 2026

Pz1116 wants to merge 1 commit into vllm-project:main from Pz1116:codex/prefix-cache-retention-main
