[Prefix Caching] DeepSeekv4 - Support selective prefix-cache retention for sliding-window KV cache#43447
Code Review
This pull request introduces a sparse local retention mechanism for prefix caching, specifically targeting hybrid KV cache models. It adds a new configuration option, prefix_cache_retention_interval, which allows the system to retain sliding-window KV checkpoints at specified token intervals or at the latest prompt boundary. Key changes include updates to the block pool and free queue to support prioritized block reuse, and modifications to the KV cache managers to handle the lifecycle of these retained checkpoint blocks. I have no feedback to provide as there were no review comments.
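To make the idea of "checkpoints at token intervals or at the latest prompt boundary" concrete, here is a minimal sketch, not the PR's implementation. It assumes a checkpoint at token position `end` is replayable when the full blocks overlapping the last `window` tokens before `end` stay cached. The function name `retained_block_mask` and all of its parameters are hypothetical.

```python
from typing import List, Optional


def retained_block_mask(
    num_tokens: int,
    block_size: int,
    window: int,
    interval: Optional[int] = None,
    latest_boundary: Optional[int] = None,
) -> List[bool]:
    """Return one flag per full block: True if the block stays in the prefix cache.

    Hypothetical sketch: a checkpoint at position `end` keeps the full blocks
    that overlap the sliding window [end - window, end).
    """
    num_blocks = num_tokens // block_size
    checkpoints = set()
    if interval:  # None or 0 means "no periodic checkpoints"
        checkpoints.update(range(interval, num_tokens + 1, interval))
    if latest_boundary is not None and latest_boundary <= num_tokens:
        checkpoints.add(latest_boundary)

    mask = [False] * num_blocks
    for end in checkpoints:
        end_block = end // block_size  # only full blocks before `end`
        start_block = max(0, end - window) // block_size
        for i in range(start_block, end_block):
            mask[i] = True
    return mask


if __name__ == "__main__":
    mask = retained_block_mask(1024, block_size=64, window=128, interval=512)
    print([i for i, keep in enumerate(mask) if keep])
```

With a 128-token window and 64-token blocks, each checkpoint keeps two blocks; everything else in the sliding-window group is left uncached and can be reused immediately.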
ac5c6f4 to 12741ac
12741ac to bccfd67
ivanium left a comment
Thanks for the PR! This is a valuable direction. I left a few comments. Happy to discuss.
| logger.warning( | ||
| "--prefix-cache-retention-interval is only effective when " | ||
| "prefix caching is enabled. This flag is ignored." | ||
| ) |
Maybe also set self.cache_config.prefix_cache_retention_interval = None here for safety.
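A minimal sketch of what this suggestion could look like, using a stand-in dataclass rather than the real vLLM config object. `CacheConfigSketch` and `normalize_retention` are hypothetical names; only the field name and the warning text come from the snippet above.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class CacheConfigSketch:
    # Stand-in for the real cache config; only the two fields used here.
    enable_prefix_caching: bool = False
    prefix_cache_retention_interval: Optional[int] = None


def normalize_retention(cfg: CacheConfigSketch) -> CacheConfigSketch:
    """Warn and reset the interval when prefix caching is disabled."""
    if (
        cfg.prefix_cache_retention_interval is not None
        and not cfg.enable_prefix_caching
    ):
        logger.warning(
            "--prefix-cache-retention-interval is only effective when "
            "prefix caching is enabled. This flag is ignored."
        )
        # Resetting the value means downstream code cannot act on a
        # setting that the warning claims is ignored.
        cfg.prefix_cache_retention_interval = None
    return cfg
```

Resetting the field keeps the warning and the actual behavior consistent, so later code paths only need to check for None.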
| has_sliding_window_group = any( | ||
| isinstance(manager, SlidingWindowManager) | ||
| for manager in self.single_type_managers | ||
| ) |
Just wanted to leave a note that this should also work for Mamba groups
I suggest we revisit Mamba in a separate PR to limit the scope of this one, especially since we need to align with the behaviors of the different Mamba modes, e.g., "align", "all", etc.
| self.num_cached_block.pop(request_id, None) | ||
| return | ||
|
|
||
| ordered_blocks = list(reversed(req_blocks)) |
Minor nit: this should be unnecessary as free_blocks() handles it.
On second thought, I feel we can prepend blocks without a block_hash to the free list here, similar to the logic below in the KV cache manager.
Actually, the changes here are a leftover from my initial version, which used the same logic as the sliding window: prepend non-cached blocks and append cached ones.
I eventually simplified it because I think the non-cached blocks should already be handled by remove_skipped_blocks before free is called. Let me check whether it is still needed here as well.
I've added similar logic to free() of the SWA KV manager.
| @@ -456,9 +503,16 @@ def remove_skipped_blocks( | |||
| # should also have been set to null blocks by the previous calls | |||
| # to this function. | |||
| break | |||
| - removed_blocks.append(blocks[i]) | |||
| + if blocks[i].block_hash is None: | |||
| + removed_uncached_blocks.append(blocks[i]) | |||
| + else: | |||
| + removed_cached_blocks.append(blocks[i]) | |||
| blocks[i] = self._null_block | |||
| - self.block_pool.free_blocks(removed_blocks) | |||
| + # `prepend=True` makes uncached scratch blocks the next allocation | |||
| + # candidates, while cached blocks stay behind them as best-effort | |||
| + # prefix-cache entries. | |||
| + self.block_pool.free_blocks(removed_cached_blocks) | |||
| + self.block_pool.free_blocks(removed_uncached_blocks, prepend=True) | |||
This part is nice, and I think it actually helps the current DSV4 even without this retention mechanism. We have masked out non-256-boundary SWA blocks, so they don't have a block_hash, but previously we didn't put them at the front of the free list, so they might not be the first to be reallocated.
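The ordering effect described here can be illustrated with a toy free queue. This is a sketch built on `collections.deque`, not vLLM's `FreeKVCacheBlockQueue`; `Block`, `FreeQueueSketch`, and `free_split` are hypothetical stand-ins.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Block:
    block_id: int
    block_hash: Optional[str] = None  # None means "not in the prefix cache"


class FreeQueueSketch:
    """Toy free queue: allocation pops from the left (front)."""

    def __init__(self) -> None:
        self._q: deque = deque()

    def append_n(self, blocks: Iterable[Block]) -> None:
        self._q.extend(blocks)

    def prepend_n(self, blocks: List[Block]) -> None:
        # extendleft() inserts one by one and so reverses its input;
        # reversing first preserves the caller's order at the front.
        self._q.extendleft(reversed(blocks))

    def popleft(self) -> Block:
        return self._q.popleft()

    def ids(self) -> List[int]:
        return [b.block_id for b in self._q]


def free_split(queue: FreeQueueSketch, blocks: List[Block]) -> None:
    """Cached blocks go to the back; uncached scratch blocks go to the front."""
    cached = [b for b in blocks if b.block_hash is not None]
    uncached = [b for b in blocks if b.block_hash is None]
    queue.append_n(cached)
    queue.prepend_n(uncached)


if __name__ == "__main__":
    q = FreeQueueSketch()
    q.append_n([Block(0, "h0")])
    free_split(q, [Block(1), Block(2, "h2"), Block(3)])
    print(q.ids())
```

Because the next allocation pops from the front, the hashless blocks 1 and 3 are reused before any cached block is evicted.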
| # optionally cache the latest prompt boundary. It is fixed for the | ||
| # lifetime of the request (derived from num_prompt_tokens), so cache it | ||
| # at most once to avoid redundant tail-mask work on every decode step. | ||
| if ( | ||
| latest_boundary_token is not None | ||
| and latest_boundary_token <= num_tokens | ||
| and request_id not in self._latest_retention_cached | ||
| ): | ||
| self._cache_tail_at_boundary(request, latest_boundary_token) | ||
| self._latest_retention_cached.add(request_id) |
This is what we discussed offline on caching at the end of each turn's prompt, and I personally think this is the most helpful part
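The "cache it at most once" guard in the snippet above can be sketched in isolation. This is a hypothetical stand-in class, not the PR's manager: `TailCacherSketch`, `maybe_cache`, and the cleanup in `free` are assumptions, while `_latest_retention_cached` and `_cache_tail_at_boundary` mirror the names shown in the diff.

```python
from typing import List, Optional, Set, Tuple


class TailCacherSketch:
    """Caches the latest prompt boundary once per request."""

    def __init__(self) -> None:
        self._latest_retention_cached: Set[str] = set()
        self.calls: List[Tuple[str, int]] = []  # records work done, for illustration

    def _cache_tail_at_boundary(self, request_id: str, boundary: int) -> None:
        # Placeholder for the real tail-mask work.
        self.calls.append((request_id, boundary))

    def maybe_cache(
        self,
        request_id: str,
        num_tokens: int,
        latest_boundary_token: Optional[int],
    ) -> None:
        if (
            latest_boundary_token is not None
            and latest_boundary_token <= num_tokens
            and request_id not in self._latest_retention_cached
        ):
            self._cache_tail_at_boundary(request_id, latest_boundary_token)
            self._latest_retention_cached.add(request_id)

    def free(self, request_id: str) -> None:
        # Assumption: the latch is cleared when the request is freed,
        # so the set does not grow without bound.
        self._latest_retention_cached.discard(request_id)
```

The boundary is derived from the prompt length, so once it has been reached and cached, every later decode step skips the work with a single set lookup.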
Hi there, does this work with MTP? I'm noticing a 0% prefix cache hit rate with MTP enabled.
Thanks for raising the issue; it's a good point. I think this PR doesn't support MTP yet because MTP additionally drops a block, but this should be fixable.
fcf7434 to fbd05dd
@dafeliton I just added support for MTP. Could you test it again?
641c1b6 to 7c909f8
I've updated the PR with suggestions from @ivanium. It now drops the CLI knob For now, a good default value for
The custom cquil/vllm-openai image integrates vllm-project/vllm#43447, which fixes the DSv4 sliding-window prefix-cache eviction issue. But the fix is opt-in via VLLM_PREFIX_CACHE_RETENTION_INTERVAL — without setting it, vllm falls back to the legacy cache-every-segment path that this PR was written to repair, so the trace-replay cache hit rate stays near 0% even though the patched code is loaded. Sets the env var to 32768 (32k tokens), matching the value the PR author validated to take cache hit rate from 0% -> 74% on a comparable agentic trace-replay benchmark. On stock vllm images that don't carry the patch, the env var is simply ignored — safe to land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ention-interval env 7ead0a0 only carried the "Prepend uncached blocks in SWA free()" hunk of PR vllm-project/vllm#43447 — it did NOT modify vllm/envs.py to register the VLLM_PREFIX_CACHE_RETENTION_INTERVAL env var. That registration didn't land until commit 7c909f8 in the PR, and 6c529f30 is the latest merge of main into the PR branch. Effect: the export in dsv4_fp4_b300_vllm.sh (1bccc5c) finally takes effect — vllm stops logging "Unknown vLLM environment variable detected" and actually activates the SWA prefix-cache retention path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…v4 SWA

Backports the core mechanism of vllm-project/vllm#43447 ("Selective prefix-cache retention for sliding-window KV cache") onto the v0.20.2rc vllm base used by vllm-ascend.

Mechanism: when sliding-window blocks roll out of the window, the manager hands them back to the free queue. Baseline behavior appends them all to the back, where in concurrent long-context workloads the *uncached* scratch blocks of a new request push out *cached* prefix blocks of older requests.

The fix:
- Uncached (block_hash is None) -> prepend to free queue front
- Cached -> append (default) to queue back

On vllm main's DSv4 trace replay this lifts prefix_cache_hit from 0% to 74.3% under 16-concurrency 1M-context traffic and gives 3-4x throughput plus ~19x TTFT improvement.

Implementation:
- vllm_ascend/patch/worker/patch_prefix_cache_retention.py monkey-patches four call sites under VLLM_ASCEND_ENABLE_PREFIX_CACHE_RETENTION env gate:
  1. FreeKVCacheBlockQueue.prepend_n (new method)
  2. BlockPool.free_blocks (adds prepend=False kwarg)
  3. SlidingWindowManager.remove_skipped_blocks (split cached/uncached)
  4. SlidingWindowManager.free (same split on request free)
- Env switch defaults to 0: when off, all paths route to the original methods bit-for-bit, so accuracy is unaffected.
- Targets SlidingWindowManager only (DSv4 SWA cache + indexer compressor register SlidingWindowMLASpec which inherits SlidingWindowSpec). CompressAttentionManager (MLAAttentionSpec) and FullAttentionManager paths are untouched.

Selective retention (VLLM_PREFIX_CACHE_RETENTION_INTERVAL sparse SWA tail checkpointing) is intentionally NOT in this first pass; the free-queue ordering change alone captures most of the hit-rate win in PR #43447's benchmark.

Activation: VLLM_ASCEND_ENABLE_PREFIX_CACHE_RETENTION=1 vllm serve ...

Expected effect on DSv4 + 1M-context concurrent traces: prefix cache hit rate goes from near-0% to 70%+, throughput 3-4x.

Co-authored-by: Claude
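An env-gated monkey-patch of the kind this commit message describes might look roughly like the sketch below. It patches a local stand-in class rather than vLLM's `BlockPool`; `BlockPoolStandIn` and `install_patch` are hypothetical, and only the environment variable name and the `prepend=False` keyword come from the message.

```python
import os
from typing import List, Mapping


class BlockPoolStandIn:
    """Stand-in for the class being patched; the free list is a plain list."""

    def __init__(self) -> None:
        self.free: List[int] = []

    def free_blocks(self, blocks: List[int]) -> None:
        self.free.extend(blocks)


def install_patch(env: Mapping[str, str] = os.environ) -> bool:
    """Install the patch only when the env gate is set to "1"."""
    if env.get("VLLM_ASCEND_ENABLE_PREFIX_CACHE_RETENTION", "0") != "1":
        return False  # gate off: the class is left untouched

    original = BlockPoolStandIn.free_blocks

    def free_blocks(self, blocks: List[int], prepend: bool = False) -> None:
        if not prepend:
            # Default path delegates to the original method unchanged.
            return original(self, blocks)
        self.free[:0] = list(blocks)  # insert at the queue head

    BlockPoolStandIn.free_blocks = free_blocks
    return True
```

Keeping `prepend=False` as the default is what lets existing callers behave exactly as before; only call sites that opt in see the new ordering.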
…n for sliding-window KV cache (vllm-project#43447) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: JisoLya <523420504@qq.com>
Resolves the recurring envs.py merge conflict per docs/superpowers/specs/2026-05-14-envs-merge-conflict-resolution-design.md. The legacy `if TYPE_CHECKING:` block and `environment_variables: dict[str, Callable]` runtime mapping were dropped on the branch in favor of pydantic `*Settings(BaseSettings)` subclasses. Every main-side edit to either location therefore conflicts mechanically; structural resolution is `--ours` for vllm/envs.py, then port the semantic delta as new `Field(...)` declarations on the appropriate sub-model. Main-side commits since merge base afcb580, with port disposition: - c73b0d0 (vllm-project#44669) — adds VLLM_RAY_DP_PLACEMENT_NODE_IPS (str=""). Ported to DistributedSettings.ray_dp_placement_node_ips. - 165b786 (vllm-project#40426) — adds VLLM_ROCM_USE_AITER_LINEAR_HIPBMM (bool=False). Ported to RocmSettings.rocm_use_aiter_linear_hipbmm. Native pydantic bool parsing replaces the `.lower() in ("true","1")` lambda. - 38fd240 (vllm-project#41980) — adds VLLM_DISTRIBUTED_USE_SPLIT_GROUP (bool=False). Ported to DistributedSettings.distributed_use_split_group. Native pydantic bool parsing replaces the `bool(int(...))` lambda. - a618356 (vllm-project#43447) — adds VLLM_PREFIX_CACHE_RETENTION_INTERVAL (int|None=None, tri-state). Ported to ServerSettings.prefix_cache_retention_interval; pydantic's unset-vs-explicit-zero handling matches the original `"X" in os.environ` guard. - bd98e97 (vllm-project#44128) — removes dead VLLM_RPC_TIMEOUT. Mirrored on the branch by deleting ServerSettings.rpc_timeout. Verification: vllm.envs imports cleanly; all four new vars read defaults and parse env-set values (incl. tri-state INTERVAL=0); VLLM_RPC_TIMEOUT correctly raises AttributeError; pre-commit passes ruff/format/mypy. Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Backport selective prefix-cache retention for sliding-window KV cache (vllm-project/vllm#43447) onto vllm-ascend releases/v0.20.2rc.

Mechanism: when sliding-window blocks roll out of the window or a request finishes, the manager hands them back to the free queue. Under concurrent long-context workloads, the *uncached* scratch blocks of a new request flush away the *cached* prefix blocks of older requests because both sit in the same queue and the cached ones are older.

This patch installs the upstream mechanism via import-time monkey-patches on a single file: vllm_ascend/patch/platform/patch_kv_cache_coordinator.py

What gets patched:
1. FreeKVCacheBlockQueue.prepend_n -- new method (queue-head insert).
2. BlockPool.cache_full_blocks -- adds block_mask=None kwarg.
3. BlockPool.free_blocks -- adds prepend=False kwarg.
4. SingleTypeKVCacheManager.cache_blocks -- adds retention_interval + alignment_tokens kwargs.
5. SingleTypeKVCacheManager.remove_skipped_blocks -- split cached / uncached on window-slide free.
6. SlidingWindowManager._contiguous_blocks_for_hit -- helper.
7. SlidingWindowManager.reachable_block_mask -- core selective retention algorithm (segment tails + replay boundary tail).
8. SlidingWindowManager.free -- split cached / uncached on request finish.
9. MambaManager.cache_blocks -- signature sync (passthrough).
10. CompressAttentionManager.cache_blocks -- signature sync + divide num_tokens by compress_ratio for the DSv4 indexer path.

Plus, on AscendHybridKVCacheCoordinator:
- __init__ reads VLLM_PREFIX_CACHE_RETENTION_INTERVAL, validates, and stores self.local_kv_retention_interval.
- _init_prefix_cache_retention_metadata() pre-seeds per-manager _prefix_cache_alignment_tokens and _prefix_cache_use_eagle (the latter set true for EAGLE/MTP groups).
- cache_blocks() override threads local_kv_retention_interval through to every manager.

Activation:
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=auto  -> AUTO_RETENTION_INTERVAL=32768
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=<int> -> fixed interval (must be a multiple of alignment)
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=0     -> keep only the latest replayable prompt boundary
  unset                                      -> dense legacy behavior

Accuracy: untouched. All patches fall through to the original methods when the env is unset, so unset preserves bit-for-bit behavior. When set, only the set of blocks that land in the global hash->block dict changes - block contents and the rest of the cache lookup path are unchanged.

Effect (vllm-project/vllm#43447 trace replay, GPU side):
  prefix_cache_hit  0% -> 74.3% under 16-concurrency 1M-context traffic
  TTFT p50          86 s -> 4.5 s (~19x)
  input throughput  11k tok/s -> 37k tok/s (~3.4x)
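The four activation states listed above can be sketched as a small parser. This is an illustration of the described behavior, not the patch's code: `parse_retention_interval` and its `alignment` parameter are hypothetical, and the exact validation errors in the real patch are not shown in this thread.

```python
import os
from typing import Mapping, Optional

AUTO_RETENTION_INTERVAL = 32768  # value quoted in the commit message


def parse_retention_interval(
    env: Mapping[str, str] = os.environ, alignment: int = 1
) -> Optional[int]:
    """Map the env var to None (unset), 0 (latest boundary only), or an interval."""
    raw = env.get("VLLM_PREFIX_CACHE_RETENTION_INTERVAL")
    if raw is None:
        return None  # unset -> dense legacy behavior
    if raw.strip().lower() == "auto":
        return AUTO_RETENTION_INTERVAL
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError("retention interval must be >= 0")
    if value % alignment != 0:
        raise ValueError(
            f"retention interval {value} must be a multiple of {alignment}"
        )
    return value  # 0 keeps only the latest replayable prompt boundary
```

Distinguishing "unset" (None) from an explicit 0 matters here, because the two select different behaviors rather than both meaning "off".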
Brings DSv4 sliding-window prefix-cache retention to vllm-ascend on the v0.20.2rc base. Closes the prefix-cache-hit-rate gap at 16K+ contexts where dense SWA caching has older requests' cached blocks flushed by concurrent requests' scratch allocations.

The reference mechanism comes from vllm-project/vllm#43447 (which lands the same mechanism but inside the vllm core). This change is NOT a mechanical mirror of #43447 -- the deltas matter:

Delivery surface
----------------
vllm #43447: edits five vllm core files (envs.py, block_pool.py, kv_cache_utils.py, kv_cache_coordinator.py, single_type_kv_cache_manager.py) and ships them in a vllm release.
here: single-file edit on vllm-ascend, all hooks installed via import-time monkey-patches with hasattr() guards. No vllm source change required; works against the unmodified vllm v0.20.2 pinned by v0.20.2rc.

Decoupling from vllm
--------------------
AUTO_RETENTION_BASE / AUTO_RETENTION_INTERVAL / SlidingWindowManager imports are wrapped in a try/except. When vllm lacks #43447 symbols (current v0.20.2 case), we fall back to literal 1024 / 32768 + the local SlidingWindowManager so the monkey-patch installs cleanly anyway. Result: the env switch works on vllm v0.20.2 today, and will continue to work transparently if vllm later ships #43447.

Ascend-specific signature work that #43447 does not need
--------------------------------------------------------
- CompressAttentionManager.cache_blocks: signature sync PLUS num_tokens //= compress_ratio before delegating to super(). This is the DSv4 indexer path; vllm #43447 has no equivalent because it does not have a compressor manager.
- MambaManager.cache_blocks: signature sync (transparent passthrough) so the new retention_interval / alignment_tokens kwargs do not raise TypeError when the coordinator threads them through every manager.
- SingleTypeKVCacheManager.cache_blocks: retention_interval AND alignment_tokens kwargs (the latter is needed because vllm-ascend hybrid groups carry their own per-manager alignment, distinct from the coordinator's lcm_block_size).
- SingleTypeKVCacheManager.reachable_block_mask base hook returning None (safe default) so any non-SWA manager that did not override it stays transparent.

AscendHybridKVCacheCoordinator integration
------------------------------------------
- __init__: reads VLLM_PREFIX_CACHE_RETENTION_INTERVAL when the kwarg is None, validates against ascend's lcm_block_size (not the upstream scheduler_block_size which does not exist on v0.20.2), and stores self.local_kv_retention_interval.
- _init_prefix_cache_retention_metadata: pre-seeds every manager with _prefix_cache_alignment_tokens (= lcm_block_size) and _prefix_cache_use_eagle (set True on EAGLE/MTP groups). #43447 carries EAGLE handling inside SlidingWindowManager directly; vllm-ascend's multi-group EAGLE detection lives at the coordinator level, hence the per-manager metadata seeding here.
- cache_blocks override: threads local_kv_retention_interval to every manager.cache_blocks call so the new retention path is reachable from the existing ascend manager classes.

Implementation detail of block_mask handling
--------------------------------------------
vllm #43447 adds a native block_mask parameter to BlockPool.cache_full_blocks. Here we cannot edit BlockPool, so the patched cache_full_blocks temporarily marks masked-out blocks with .is_null = True, delegates to the unmodified original (which already has an is_null skip path), and restores the flag in finally. Same semantic outcome, no vllm core change required.

What gets patched (single file, all in vllm_ascend/patch/platform/patch_kv_cache_coordinator.py):
1. FreeKVCacheBlockQueue.prepend_n -- new method.
2. BlockPool.cache_full_blocks -- adds block_mask=None via is_null delegation trick.
3. BlockPool.free_blocks -- adds prepend=False.
4. SingleTypeKVCacheManager.cache_blocks -- adds retention_interval + alignment_tokens kwargs.
5. SingleTypeKVCacheManager.remove_skipped_blocks -- split cached/uncached on window-slide free.
6. SlidingWindowManager._contiguous_blocks_for_hit -- helper.
7. SlidingWindowManager.reachable_block_mask -- selective retention algo.
8. SlidingWindowManager.free -- split cached/uncached on request finish.
9. MambaManager.cache_blocks -- signature sync.
10. CompressAttentionManager.cache_blocks -- signature sync + DSv4 compress_ratio handling.

Activation:
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=auto  -> AUTO_RETENTION_INTERVAL=32768
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=<int> -> fixed token interval
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=0     -> keep only the latest replayable prompt boundary
  unset                                      -> dense legacy behavior

Accuracy: env-unset path is bit-for-bit identical (every monkey-patch falls through to the original on None / no-mask). When set, only the set of blocks recorded in the global hash->block dict changes -- block contents and the cache lookup path are unchanged.
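The is_null delegation trick described under "Implementation detail of block_mask handling" can be sketched with stand-in objects. Nothing here is vLLM code: `BlockStandIn`, `cache_full_blocks_original`, and `cache_full_blocks_masked` are hypothetical, and the only behavior assumed of the original is that it skips blocks whose `is_null` flag is set.

```python
from typing import List, Optional, Sequence


class BlockStandIn:
    def __init__(self, block_id: int) -> None:
        self.block_id = block_id
        self.is_null = False
        self.cached = False


def cache_full_blocks_original(blocks: Sequence[BlockStandIn]) -> None:
    """Stand-in for the unmodified method: skips null blocks, caches the rest."""
    for block in blocks:
        if block.is_null:
            continue
        block.cached = True


def cache_full_blocks_masked(
    blocks: Sequence[BlockStandIn],
    block_mask: Optional[Sequence[bool]] = None,
) -> None:
    """Apply block_mask without editing the original method."""
    if block_mask is None:
        return cache_full_blocks_original(blocks)  # no mask: plain passthrough

    # Only flip blocks that were not already null, so the restore step
    # never clears a flag that was genuinely set.
    flipped: List[BlockStandIn] = [
        b for b, keep in zip(blocks, block_mask) if not keep and not b.is_null
    ]
    for b in flipped:
        b.is_null = True
    try:
        cache_full_blocks_original(blocks)
    finally:
        for b in flipped:  # restore even if the original raises
            b.is_null = False
```

The `finally` clause is the important part: a temporary flag flip that is not restored on an exception would leave real blocks looking like null blocks.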
Brings DSv4 sliding-window prefix-cache retention to vllm-ascend on the v0.20.2rc base. Closes the prefix-cache-hit-rate gap at 16K+ contexts where dense SWA caching has older requests' cached blocks flushed by concurrent requests' scratch allocations.

The mechanism mirrors vllm-project/vllm#43447 (which lands the same idea inside vllm core). This change is NOT a mechanical mirror of that PR -- the deltas matter:

Delivery surface
----------------
#43447 edits five vllm core files (envs.py, block_pool.py, kv_cache_utils.py, kv_cache_coordinator.py, single_type_kv_cache_manager.py) and ships them in a vllm release.
here: eight vllm-ascend files; all hooks installed via import-time monkey-patches with hasattr() guards. No vllm source change required; works against unmodified vllm v0.20.2 pinned by v0.20.2rc.

Decoupling from vllm
--------------------
AUTO_RETENTION_BASE / AUTO_RETENTION_INTERVAL / SlidingWindowManager imports are wrapped in try/except. When vllm lacks #43447 symbols (the current v0.20.2 case), the patch falls back to literal 1024 / 32768 and the local SlidingWindowManager so the monkey-patch installs cleanly. The env switch works on vllm v0.20.2 today, and keeps working transparently if vllm later ships #43447.

Ascend-specific signature work that #43447 does not need
--------------------------------------------------------
- CompressAttentionManager.cache_blocks: signature sync PLUS num_tokens //= compress_ratio before delegating to super(). DSv4 indexer path; #43447 has no equivalent because it has no compressor.
- MambaManager.cache_blocks: signature sync (transparent passthrough) so the new retention_interval / alignment_tokens kwargs do not raise TypeError when the coordinator threads them through every manager.
- SingleTypeKVCacheManager.cache_blocks: retention_interval AND alignment_tokens kwargs (the latter is needed because vllm-ascend hybrid groups carry their own per-manager alignment, distinct from the coordinator's lcm_block_size).
- SingleTypeKVCacheManager.reachable_block_mask base hook returning None (safe default) so non-SWA managers stay transparent.

AscendHybridKVCacheCoordinator integration
------------------------------------------
- __init__: reads VLLM_PREFIX_CACHE_RETENTION_INTERVAL when the kwarg is None, validates against ascend's lcm_block_size (not the upstream scheduler_block_size which does not exist on v0.20.2), and stores self.local_kv_retention_interval.
- _init_prefix_cache_retention_metadata pre-seeds every manager with _prefix_cache_alignment_tokens (= lcm_block_size) and _prefix_cache_use_eagle (set True on EAGLE/MTP groups). #43447 carries EAGLE handling inside SlidingWindowManager directly; vllm-ascend's multi-group EAGLE detection lives at the coordinator level, hence the per-manager metadata seeding here.
- cache_blocks override threads local_kv_retention_interval to every manager.cache_blocks call.

Implementation detail of block_mask handling
--------------------------------------------
#43447 adds a native block_mask parameter to BlockPool.cache_full_blocks. Here we cannot edit BlockPool, so the patched cache_full_blocks temporarily marks masked-out blocks with .is_null = True, delegates to the unmodified original (which already has an is_null skip path), and restores the flag in finally. Same semantic outcome, no vllm core change required.

What got changed (eight files):
  vllm_ascend/core/single_type_kv_cache_manager.py          (+87 / -9)
  vllm_ascend/models/layer/attention/layer.py               (+4 / -2)
  vllm_ascend/patch/platform/patch_kv_cache_coordinator.py  (+88 / -6)
  vllm_ascend/patch/platform/patch_kv_cache_interface.py    (+79 / -6)
  vllm_ascend/patch/platform/patch_kv_cache_utils.py        (+36 / -4)
  vllm_ascend/patch/worker/patch_deepseek_compressor.py     (+5 / -2)
  vllm_ascend/worker/block_table.py                         (+5 / -6)
  vllm_ascend/worker/model_runner_v1.py                     (+24 / -10)

Activation:
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=auto  -> AUTO_RETENTION_INTERVAL=32768
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=<int> -> fixed token interval
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=0     -> keep only latest replay tail
  unset                                      -> dense legacy behavior

Accuracy: env-unset path is bit-for-bit identical (every monkey-patch falls through to the original on None / no-mask). When set, only the set of blocks recorded in the global hash->block dict changes -- block contents and the cache lookup path are unchanged.
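The "signature sync" and compress_ratio handling described for CompressAttentionManager.cache_blocks can be sketched with stand-in classes. `BaseManagerSketch` and `CompressManagerSketch` are hypothetical; the sketch only shows the two points the message states, namely that the new keyword arguments are accepted and forwarded, and that num_tokens is divided by compress_ratio before delegating. Whether the real patch also rescales the interval is not stated here, so it is passed through unchanged.

```python
from typing import Optional, Tuple


class BaseManagerSketch:
    """Stand-in base manager that records what it was asked to cache."""

    def __init__(self) -> None:
        self.seen: Optional[Tuple[str, int, Optional[int], Optional[int]]] = None

    def cache_blocks(
        self,
        request_id: str,
        num_tokens: int,
        retention_interval: Optional[int] = None,
        alignment_tokens: Optional[int] = None,
    ) -> None:
        self.seen = (request_id, num_tokens, retention_interval, alignment_tokens)


class CompressManagerSketch(BaseManagerSketch):
    def __init__(self, compress_ratio: int) -> None:
        super().__init__()
        self.compress_ratio = compress_ratio

    def cache_blocks(
        self,
        request_id: str,
        num_tokens: int,
        retention_interval: Optional[int] = None,
        alignment_tokens: Optional[int] = None,
    ) -> None:
        # The compressed cache holds one entry per `compress_ratio` tokens,
        # so the token count is scaled down before delegating.
        num_tokens //= self.compress_ratio
        super().cache_blocks(
            request_id,
            num_tokens,
            retention_interval=retention_interval,
            alignment_tokens=alignment_tokens,
        )
```

Accepting the extra keyword arguments on every manager is what allows a coordinator to call `cache_blocks` uniformly without a TypeError from managers that ignore retention.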
…n for sliding-window KV cache (vllm-project#43447) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai>
Brings DSv4 sliding-window prefix-cache retention to vllm-ascend on the v0.20.2rc base. Closes the prefix-cache-hit-rate gap at 16K+ contexts where dense SWA caching has older requests' cached blocks flushed by concurrent requests' scratch allocations.

The mechanism mirrors vllm-project/vllm#43447 (which lands the same idea inside vllm core). This change is NOT a mechanical mirror of that PR -- the deltas matter:

Delivery surface
----------------
#43447 edits five vllm core files (envs.py, block_pool.py, kv_cache_utils.py, kv_cache_coordinator.py, single_type_kv_cache_manager.py) and ships them in a vllm release.
here: eight vllm-ascend files; all hooks installed via import-time monkey-patches with hasattr() guards. No vllm source change required; works against unmodified vllm v0.20.2 pinned by v0.20.2rc.

Decoupling from vllm
--------------------
AUTO_RETENTION_BASE / AUTO_RETENTION_INTERVAL / SlidingWindowManager imports are wrapped in try/except. When vllm lacks #43447 symbols (the current v0.20.2 case), the patch falls back to literal 1024 / 32768 and the local SlidingWindowManager so the monkey-patch installs cleanly. The env switch works on vllm v0.20.2 today, and keeps working transparently if vllm later ships #43447.

Ascend-specific signature work that #43447 does not need
--------------------------------------------------------
- CompressAttentionManager.cache_blocks: signature sync PLUS num_tokens //= compress_ratio before delegating to super(). DSv4 indexer path; #43447 has no equivalent because it has no compressor.
- MambaManager.cache_blocks: signature sync (transparent passthrough) so the new retention_interval / alignment_tokens kwargs do not raise TypeError when the coordinator threads them through every manager.
- SingleTypeKVCacheManager.cache_blocks: retention_interval AND alignment_tokens kwargs (the latter is needed because vllm-ascend hybrid groups carry their own per-manager alignment, distinct from the coordinator's lcm_block_size).
- SingleTypeKVCacheManager.reachable_block_mask base hook returning None (safe default) so non-SWA managers stay transparent.

AscendHybridKVCacheCoordinator integration
------------------------------------------
- __init__: reads VLLM_PREFIX_CACHE_RETENTION_INTERVAL when the kwarg is None, validates against ascend's lcm_block_size (not the upstream scheduler_block_size which does not exist on v0.20.2), and stores self.local_kv_retention_interval.
- _init_prefix_cache_retention_metadata pre-seeds every manager with _prefix_cache_alignment_tokens (= lcm_block_size) and _prefix_cache_use_eagle (set True on EAGLE/MTP groups). #43447 carries EAGLE handling inside SlidingWindowManager directly; vllm-ascend's multi-group EAGLE detection lives at the coordinator level, hence the per-manager metadata seeding here.
- cache_blocks override threads local_kv_retention_interval to every manager.cache_blocks call.

Implementation detail of block_mask handling
--------------------------------------------
#43447 adds a native block_mask parameter to BlockPool.cache_full_blocks. Here we cannot edit BlockPool, so the patched cache_full_blocks temporarily marks masked-out blocks with .is_null = True, delegates to the unmodified original (which already has an is_null skip path), and restores the flag in finally. Same semantic outcome, no vllm core change required.

What got changed (eight files):
  vllm_ascend/core/single_type_kv_cache_manager.py          (+87 / -9)
  vllm_ascend/models/layer/attention/layer.py               (+4 / -2)
  vllm_ascend/patch/platform/patch_kv_cache_coordinator.py  (+88 / -6)
  vllm_ascend/patch/platform/patch_kv_cache_interface.py    (+79 / -6)
  vllm_ascend/patch/platform/patch_kv_cache_utils.py        (+36 / -4)
  vllm_ascend/patch/worker/patch_deepseek_compressor.py     (+5 / -2)
  vllm_ascend/worker/block_table.py                         (+5 / -6)
  vllm_ascend/worker/model_runner_v1.py                     (+24 / -10)

Activation:
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=auto   -> AUTO_RETENTION_INTERVAL=32768
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=<int>  -> fixed token interval
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=0      -> keep only latest replay tail
  unset                                       -> dense legacy behavior

Accuracy: env-unset path is bit-for-bit identical (every monkey-patch falls through to the original on None / no-mask). When set, only the set of blocks recorded in the global hash->block dict changes -- block contents and the cache lookup path are unchanged.

Signed-off-by: liuchenbing <chenliumail@163.com>
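The block_mask handling described in the commit message above (flag masked-out blocks as null, delegate to the unmodified method, restore the flags in finally) can be illustrated with a small self-contained sketch. The `Block` and `BlockPool` classes below are simplified stand-ins written for this example, not the real vllm classes, and the two-argument `cache_full_blocks` signature is an assumption made for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy stand-in for a KV cache block (hypothetical, not the vllm class)."""
    block_id: int
    is_null: bool = False


@dataclass
class BlockPool:
    """Toy pool whose source we pretend we cannot edit."""
    cached: dict[str, Block] = field(default_factory=dict)

    def cache_full_blocks(self, blocks: list[Block], hashes: list[str]) -> None:
        for block, block_hash in zip(blocks, hashes):
            if block.is_null:  # pre-existing skip path for null blocks
                continue
            self.cached[block_hash] = block


def install_masked_cache_full_blocks(pool_cls: type) -> None:
    """Monkey-patch pool_cls.cache_full_blocks to accept an optional mask."""
    original = pool_cls.cache_full_blocks

    def patched(self, blocks, hashes, block_mask=None):
        if block_mask is None:
            # No mask: fall through to the original, behavior unchanged.
            return original(self, blocks, hashes)
        # Flag only blocks that are masked out and not already null,
        # so that restoring never clears a genuine null flag.
        flagged = [
            b for b, keep in zip(blocks, block_mask) if not keep and not b.is_null
        ]
        for b in flagged:
            b.is_null = True
        try:
            return original(self, blocks, hashes)
        finally:
            for b in flagged:
                b.is_null = False

    pool_cls.cache_full_blocks = patched


install_masked_cache_full_blocks(BlockPool)
```

Calling `pool.cache_full_blocks(blocks, hashes, block_mask=[True, False, False, True])` on this toy pool records only the first and last block, and every block's `is_null` flag is back to its original value afterwards, even if the delegated call raises.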
…n for sliding-window KV cache (vllm-project#43447)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Co-author: @ivanium
Purpose
DeepSeek v4 now exhibits very low effective prefix cache capacity. For example, on TP8 with 8xB300, the reported KV cache capacity is ~14.5x concurrency. However, a microbenchmark that sends 1M-context requests sequentially shows that after the second request is sent, replaying the first request already begins to miss the prefix cache. This means the practical prefix-cache retention capacity is much lower than the expected KV cache capacity.
After investigation, the root cause is that as a new request progresses, its sliding-window cache block allocations, although continually freed as the window moves, flush the existing prefix cache of older requests out of the free queue.
This PR addresses the issue by prepending non-cached blocks to the front of the free queue and appending cached blocks to the back. Non-cached blocks are therefore reused first, which prevents cached blocks from being flushed by the transient sliding-window block allocations of concurrent requests.
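The effect of the queue ordering can be seen in a toy simulation. This is a minimal sketch under stated assumptions: `Block`, `FreeQueue`, and `surviving_cached_blocks` are hypothetical names, a `collections.deque` stands in for the real free-block queue, and a sliding window of one block is modeled by freeing the held block before allocating the next one.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    block_hash: str | None = None  # None means "not in the prefix cache"


class FreeQueue:
    """Toy free queue: allocation always takes from the front."""

    def __init__(self, blocks: list[Block]) -> None:
        self._queue: deque[Block] = deque(blocks)

    def free(self, blocks: list[Block], prioritize_uncached: bool) -> None:
        if not prioritize_uncached:
            # Old behavior: everything goes to the back.
            self._queue.extend(blocks)
            return
        uncached = [b for b in blocks if b.block_hash is None]
        cached = [b for b in blocks if b.block_hash is not None]
        # extendleft reverses its input, so reverse first to keep order.
        self._queue.extendleft(reversed(uncached))
        self._queue.extend(cached)

    def allocate(self) -> Block:
        block = self._queue.popleft()
        block.block_hash = None  # reusing a block evicts its cached content
        return block


def surviving_cached_blocks(prioritize_uncached: bool, steps: int = 8) -> int:
    """Count an old request's cached blocks left after a new request runs."""
    cached = [Block(i, block_hash=f"h{i}") for i in range(4)]
    scratch = Block(4)
    queue = FreeQueue([scratch, *cached])
    held = queue.allocate()  # the new request's single window block
    for _ in range(steps):
        # The window moves: release the old block, then take a new one.
        queue.free([held], prioritize_uncached)
        held = queue.allocate()
    return sum(b.block_hash is not None for b in cached)
```

With back-of-queue freeing, each window step pulls a cached block off the front and evicts it; with prepending, the same scratch block is recycled and the older request's cached blocks are never touched.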
In addition, this PR adds a selective cache retention feature that saves SWA cache checkpoints only at certain checkpointing intervals, allowing for greater request-level KV cache capacity. This is enabled through VLLM_PREFIX_CACHE_RETENTION_INTERVAL:
- None (default): preserve the old behavior (all windows of block-size-aligned tokens are cached).
- 0: retain only the latest replayable prompt boundary.

After this change, the same microbenchmark shows >95% prefix-cache hit rate for 14 concurrent 1M-context requests across prompt fractions from 10% to 100%, matching the reported ~14.5x KV cache capacity.
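As a rough illustration of the retention settings, the sketch below computes which token boundaries would keep a sliding-window checkpoint for a given prompt. The function name `retained_checkpoints` is hypothetical, and two points are assumptions rather than confirmed behavior of this PR: that a positive interval must be a multiple of the block size, and that the latest block-aligned boundary is always retained alongside the interval checkpoints.

```python
from __future__ import annotations


def retained_checkpoints(
    num_prompt_tokens: int,
    block_size: int,
    interval: int | None,
) -> list[int]:
    """Return token positions at which a sliding-window checkpoint is kept.

    interval=None -> every block-aligned boundary (dense, old behavior)
    interval=0    -> only the latest block-aligned boundary
    interval=N>0  -> multiples of N, plus the latest boundary (assumption)
    """
    latest = (num_prompt_tokens // block_size) * block_size
    if latest == 0:
        return []  # prompt shorter than one full block: nothing to cache
    if interval is None:
        return list(range(block_size, latest + 1, block_size))
    if interval < 0 or interval % block_size != 0:
        # Assumed validation rule for this sketch only.
        raise ValueError("interval must be a non-negative multiple of block_size")
    points = {latest}
    if interval > 0:
        points.update(range(interval, latest + 1, interval))
    return sorted(points)
```

For a 100-token prompt with 16-token blocks, the dense setting keeps six boundaries, an interval of 64 keeps two (64 and 96), and an interval of 0 keeps only 96, which is where the request-level capacity gain comes from.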
Test Plan
tests/v1/core/test_prefix_caching.py

Test Result
GSM8K:
Trace replay - semianalysisai/cc-traces-weka-no-subagents-051826:
main:
branch: