[Prefix Caching] DeepSeekv4 - Support selective prefix-cache retention for sliding-window KV cache #43447

Merged
ywang96 merged 13 commits into vllm-project:main from wzhao18:wzhao/dsv4-prefix-caching-retention
Jun 4, 2026

@wzhao18 wzhao18 commented May 22, 2026

Contributor

Co-author: @ivanium

Purpose

DeepSeek v4 now exhibits very low effective prefix cache capacity. For example, on TP8 with 8xB300, the reported KV cache capacity is ~14.5x concurrency. However, a microbenchmark that sends 1M-context requests sequentially shows that after the second request is sent, replaying the first request already begins to miss the prefix cache. This means the practical prefix-cache retention capacity is much lower than the expected KV cache capacity.

After investigation, the root cause is that as a new request progresses, its sliding-window cache block allocations, despite being freed continuously as the window moves, flush the existing prefix cache of older requests out of the free queue.

This PR addresses the issue by prepending the non-cached blocks to the front of the free queue while appending the cached blocks to the back of the queue. This way, the non-cached blocks are prioritized for reuse, preventing the cached blocks from being flushed away by the transient sliding-window block allocations of concurrent requests.
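
The effect of the ordering change can be illustrated with a toy model. This is a minimal sketch, not vLLM's block pool: the free queue is a plain `deque` whose front is the next allocation candidate, `None` stands for an uncached block, and the function name and parameters are hypothetical. A new request repeatedly allocates one scratch block and frees the block that just slid out of its window.

```python
from collections import deque


def surviving_cached(prepend_uncached: bool, fresh: int = 4,
                     cached: int = 8, steps: int = 50) -> int:
    """Count old cached blocks still in the free queue after `steps` allocations."""
    # Front of the deque is allocated first. Fresh (uncached) blocks sit in
    # front of the cached blocks left behind by an older, finished request.
    free = deque([None] * fresh + [f"hash-{i}" for i in range(cached)])
    holding = False  # does the new request currently hold an in-window block?
    for _ in range(steps):
        free.popleft()  # allocate; a cached block popped here is evicted
        if holding:
            # The previously held block slid out of the window and is freed
            # without a hash (it is transient scratch space).
            if prepend_uncached:
                free.appendleft(None)
            else:
                free.append(None)
        holding = True
    return sum(entry is not None for entry in free)


print(surviving_cached(prepend_uncached=False))  # baseline ordering
print(surviving_cached(prepend_uncached=True))   # ordering described in this PR
```

In this model the baseline walks through the whole queue and evicts every cached block, while prepending lets the request keep recycling the same scratch block, so the cached blocks are never touched.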

In addition, this PR adds a selective cache retention feature that saves SWA cache checkpoints only at certain checkpointing intervals, allowing for greater request-level KV cache capacity. This is enabled through VLLM_PREFIX_CACHE_RETENTION_INTERVAL:

  • None (default): preserve the old behavior (all windows of block-size-aligned tokens are cached).
  • 0: retain only the latest replayable prompt boundary.
  • positive integer: retain checkpoint tails at the configured intervals, plus the latest replayable prompt boundary.
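
The three modes can be sketched as a function that lists which token boundaries keep a checkpoint. This is a simplified illustration under stated assumptions, not the PR's implementation: the function name is hypothetical, and the "latest replayable prompt boundary" is modeled here as the largest block-aligned token count not exceeding the prompt length, which may differ from the real rule (for example with MTP).

```python
from typing import List, Optional


def retained_boundaries(num_prompt_tokens: int, block_size: int,
                        interval: Optional[int]) -> List[int]:
    """Token boundaries whose sliding-window tail would be kept (toy model)."""
    if interval is None:
        # Old behavior: every block-size-aligned boundary is cached.
        return list(range(block_size, num_prompt_tokens + 1, block_size))
    if interval < 0 or interval % block_size != 0:
        raise ValueError("interval must be a non-negative multiple of block_size")
    kept = set()
    # Assumed definition of the latest replayable prompt boundary.
    latest = (num_prompt_tokens // block_size) * block_size
    if latest > 0:
        kept.add(latest)
    if interval > 0:
        # Checkpoint tails at every multiple of the configured interval.
        kept.update(range(interval, num_prompt_tokens + 1, interval))
    return sorted(kept)


print(retained_boundaries(1000, 256, None))  # dense: all aligned boundaries
print(retained_boundaries(1000, 256, 0))     # only the latest boundary
print(retained_boundaries(1000, 256, 512))   # interval checkpoints plus latest
```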

After this change, the same microbenchmark shows >95% prefix-cache hit rate for 14 concurrent 1M-context requests across prompt fractions from 10% to 100%, matching the reported ~14.5x KV cache capacity.

Test Plan

  • tests/v1/core/test_prefix_caching.py
  • E2E DeepSeek V4 prefix caching eval + performance

Test Result

VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768 VLLM_ENGINE_READY_TIMEOUT_S=3600 vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --block-size 256 \
  --enable-prefix-caching \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --attention_config.use_fp4_indexer_cache True \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --max-cudagraph-capture-size 2048 \
  --max-num-batched-tokens 2048 \
  --no-enable-flashinfer-autotune \
  --tensor-parallel-size 8

GSM8K:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9553|±  |0.0057|
|     |       |strict-match    |     5|exact_match|↑  |0.9560|±  |0.0056|

Trace replay - semianalysisai/cc-traces-weka-no-subagents-051826:

uv run --active aiperf profile \
    --scenario inferencex-agentx-mvp \
    --url localhost:8000 \
    --model deepseek-ai/DeepSeek-V4-Pro \
    --tokenizer deepseek-ai/DeepSeek-V4-Pro \
    --num-dataset-entries 739 \
    --endpoint-type chat \
    --streaming \
    --public-dataset semianalysis_cc_traces_weka_no_subagents \
    --benchmark-duration 1800 \
    --concurrency 16 \
    --use-server-token-count \
    --ui simple

main:

rps=0.2 (avg 0.1) tput_in=11025/s tput_out=44/s done=161 ok=161 err=0
ttft p50=86023ms p75=108751ms p95=127065ms p99=134300ms
itl  p50=155ms  p75=169ms  p95=179ms  p99=190ms
e2e  p50=127751ms p75=173693ms p95=410403ms p99=755707ms
tin  p50=1,072  p75=2,381  p95=6,253  p99=9,344 (tok/s/user)
tout p50=6      p75=7      p95=7      p99=16 (tok/s/user)
seq  isl_avg=123,177    osl_avg=489
tot  in=19,831,512     out=78,775
srv  prefix_cache_hit=0.0% kv_usage=10.3%

branch:

rps=0.4 (avg 0.3) tput_in=37266/s tput_out=196/s done=560 ok=560 err=0
ttft p50=4499ms p75=7092ms p95=12696ms p99=24552ms
itl  p50=54ms   p75=69ms   p95=119ms  p99=156ms
e2e  p50=20679ms p75=39971ms p95=142013ms p99=302469ms
tin  p50=19,381 p75=38,365 p95=139,270 p99=160,284 (tok/s/user)
tout p50=19     p75=26     p95=51     p99=61 (tok/s/user)
seq  isl_avg=120,992    osl_avg=637
tot  in=67,755,382     out=356,899
srv  prefix_cache_hit=74.3% kv_usage=3.8%
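
As a quick sanity check, the headline ratios between the two runs can be derived directly from the reported numbers. The snippet only restates figures from the main and branch summaries above; the dictionary keys are illustrative names, not aiperf fields.

```python
main = {"ttft_p50_ms": 86023, "tput_in": 11025, "tput_out": 44, "done": 161}
branch = {"ttft_p50_ms": 4499, "tput_in": 37266, "tput_out": 196, "done": 560}

# Lower TTFT is better, so the speedup is main / branch.
ttft_speedup = main["ttft_p50_ms"] / branch["ttft_p50_ms"]
# Higher throughput and completed-request counts are better: branch / main.
input_gain = branch["tput_in"] / main["tput_in"]
output_gain = branch["tput_out"] / main["tput_out"]
completed_gain = branch["done"] / main["done"]

print(f"TTFT p50 speedup:      {ttft_speedup:.1f}x")    # about 19.1x
print(f"input throughput gain: {input_gain:.2f}x")      # about 3.38x
print(f"output throughput gain:{output_gain:.2f}x")
print(f"completed requests:    {completed_gain:.2f}x")
```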

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@wzhao18 wzhao18 changed the title [Prefix Caching] DeepSeekv4 retain sliding window cache only at specified interval [Prefix Caching] DeepSeekv4 only retain sliding window cache at specified interval boundaries May 22, 2026
@mergify mergify Bot added deepseek Related to DeepSeek models v1 labels May 22, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a sparse local retention mechanism for prefix caching, specifically targeting hybrid KV cache models. It adds a new configuration option, prefix_cache_retention_interval, which allows the system to retain sliding-window KV checkpoints at specified token intervals or at the latest prompt boundary. Key changes include updates to the block pool and free queue to support prioritized block reuse, and modifications to the KV cache managers to handle the lifecycle of these retained checkpoint blocks. I have no feedback to provide as there were no review comments.

@wzhao18 wzhao18 changed the title [Prefix Caching] DeepSeekv4 only retain sliding window cache at specified interval boundaries [Prefix Caching] DeepSeekv4 - Support selective prefix-cache retention for sliding-window KV cache May 23, 2026
@wzhao18 wzhao18 force-pushed the wzhao/dsv4-prefix-caching-retention branch 2 times, most recently from ac5c6f4 to 12741ac Compare May 24, 2026 04:55
@wzhao18 wzhao18 marked this pull request as ready for review May 24, 2026 22:20
@wzhao18 wzhao18 force-pushed the wzhao/dsv4-prefix-caching-retention branch from 12741ac to bccfd67 Compare May 25, 2026 18:25

@ivanium ivanium left a comment

Collaborator

Thanks for the PR! This is a valuable direction. I left a few comments. Happy to discuss.

Comment thread vllm/config/vllm.py Outdated
Comment on lines +2153 to +2156
logger.warning(
"--prefix-cache-retention-interval is only effective when "
"prefix caching is enabled. This flag is ignored."
)

Collaborator

Maybe also set self.cache_config.prefix_cache_retention_interval = None here for safety

Contributor Author

done

Comment thread vllm/v1/core/kv_cache_coordinator.py Outdated
Comment on lines +446 to +449
has_sliding_window_group = any(
isinstance(manager, SlidingWindowManager)
for manager in self.single_type_managers
)

Collaborator

Just wanted to leave a note that this should also work for Mamba groups

@wzhao18 wzhao18 May 26, 2026

Contributor Author

I suggest we revisit Mamba in a separate PR to limit the scope of this one, especially since we need to align with the behaviors of the different Mamba modes, e.g., "align", "all", etc.

self.num_cached_block.pop(request_id, None)
return

ordered_blocks = list(reversed(req_blocks))

Collaborator

Minor nit: this should be unnecessary as free_blocks() handles it.
On second thought, I feel we can prepend blocks without a block_hash to the free list here, similar to the logic below in the KV cache manager.

@wzhao18 wzhao18 May 26, 2026

Contributor Author

Actually, the changes here are a leftover from my initial version, which performed logic similar to the sliding-window path: prepend non-cached blocks and append cached ones.

I eventually simplified it because I think the non-cached blocks should already be handled by remove_skipped_blocks before free is called. Let me check whether there is still a need to do that here as well.

Contributor Author

I've added similar logic to free() of the SWA KV manager.

Comment on lines 450 to +515
@@ -456,9 +503,16 @@ def remove_skipped_blocks(
                 # should also have been set to null blocks by the previous calls
                 # to this function.
                 break
-            removed_blocks.append(blocks[i])
+            if blocks[i].block_hash is None:
+                removed_uncached_blocks.append(blocks[i])
+            else:
+                removed_cached_blocks.append(blocks[i])
             blocks[i] = self._null_block
-        self.block_pool.free_blocks(removed_blocks)
+        # `prepend=True` makes uncached scratch blocks the next allocation
+        # candidates, while cached blocks stay behind them as best-effort
+        # prefix-cache entries.
+        self.block_pool.free_blocks(removed_cached_blocks)
+        self.block_pool.free_blocks(removed_uncached_blocks, prepend=True)

Collaborator

This part is nice, and I think it actually helps the current DSV4 even without this retention mechanism. We already mask out non-256-boundary SWA blocks so they have no block_hash, but previously we didn't put them at the front of the free list, so they might not be reallocated first.

Comment thread vllm/v1/core/single_type_kv_cache_manager.py Outdated
Comment on lines +746 to +755
# optionally cache the latest prompt boundary. It is fixed for the
# lifetime of the request (derived from num_prompt_tokens), so cache it
# at most once to avoid redundant tail-mask work on every decode step.
if (
latest_boundary_token is not None
and latest_boundary_token <= num_tokens
and request_id not in self._latest_retention_cached
):
self._cache_tail_at_boundary(request, latest_boundary_token)
self._latest_retention_cached.add(request_id)

Collaborator

This is what we discussed offline on caching at the end of each turn's prompt, and I personally think this is the most helpful part
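
The "cache at most once per request" guard in the snippet above can be sketched in isolation. This is a hedged illustration, not the manager's code: the class, `maybe_cache`, `free`, and `cached_boundaries` are hypothetical stand-ins, and clearing the set on free is an assumption about how stale request IDs would be cleaned up. Only the condition itself mirrors the quoted diff.

```python
from typing import List, Optional, Set, Tuple


class BoundaryCacher:
    """Toy stand-in that records when the latest prompt boundary gets cached."""

    def __init__(self) -> None:
        self._latest_retention_cached: Set[str] = set()
        self.cached_boundaries: List[Tuple[str, int]] = []

    def maybe_cache(self, request_id: str,
                    latest_boundary_token: Optional[int],
                    num_tokens: int) -> bool:
        # Same three-part condition as the reviewed snippet: a boundary exists,
        # the request has computed past it, and it has not been cached yet.
        if (
            latest_boundary_token is not None
            and latest_boundary_token <= num_tokens
            and request_id not in self._latest_retention_cached
        ):
            self.cached_boundaries.append((request_id, latest_boundary_token))
            self._latest_retention_cached.add(request_id)
            return True
        return False

    def free(self, request_id: str) -> None:
        # Assumed cleanup so the set does not grow with finished requests.
        self._latest_retention_cached.discard(request_id)


cacher = BoundaryCacher()
cacher.maybe_cache("req-0", 512, 300)  # boundary not reached yet: skipped
cacher.maybe_cache("req-0", 512, 512)  # boundary reached: cached once
cacher.maybe_cache("req-0", 512, 513)  # later decode steps: skipped
```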

@dafeliton


Hi there,

Does this work with mtp? I'm noticing a 0% cache prefix hit rate with mtp enabled.

ivanium commented May 26, 2026

Collaborator

> Hi there,
>
> Does this work with mtp? I'm noticing a 0% cache prefix hit rate with mtp enabled.

Thanks for raising the issue. It's a good point. I think this PR doesn't support MTP yet because MTP will additionally drop a block. But this should be fixable.

@wzhao18 wzhao18 force-pushed the wzhao/dsv4-prefix-caching-retention branch 2 times, most recently from fcf7434 to fbd05dd Compare May 26, 2026 04:08

wzhao18 commented May 26, 2026

Contributor Author

@dafeliton I just added support for MTP. Could you test it again?

@wzhao18 wzhao18 force-pushed the wzhao/dsv4-prefix-caching-retention branch 2 times, most recently from 641c1b6 to 7c909f8 Compare June 2, 2026 21:52

wzhao18 commented Jun 2, 2026

Contributor Author

I've updated the PR with suggestions from @ivanium. It now drops the CLI knob --prefix-cache-retention-interval and instead uses env var VLLM_PREFIX_CACHE_RETENTION_INTERVAL. We will leave the design of a proper interface for the cache retention strategy to a future PR.

For now, a good default value for VLLM_PREFIX_CACHE_RETENTION_INTERVAL may be 0 if only the full prompt will be reused, or 32k if we also want partial prompts to have cache hits.
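
Because unset and 0 mean different things, any code reading the variable has to keep the two apart. The following is a minimal standard-library sketch of that tri-state read, not the code in vllm/envs.py; the function name and the rejection of negative values are assumptions.

```python
import os
from typing import Mapping, Optional

ENV_NAME = "VLLM_PREFIX_CACHE_RETENTION_INTERVAL"


def read_retention_interval(environ: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return None when unset, otherwise the configured non-negative integer."""
    raw = environ.get(ENV_NAME)
    if raw is None:
        # Unset: keep the old dense caching behavior.
        return None
    value = int(raw)  # raises ValueError for non-integer input
    if value < 0:
        raise ValueError(f"{ENV_NAME} must be >= 0, got {value}")
    return value


print(read_retention_interval({}))                   # unset
print(read_retention_interval({ENV_NAME: "0"}))      # latest boundary only
print(read_retention_interval({ENV_NAME: "32768"}))  # 32k-token interval
```

Testing `value is None` rather than truthiness matters here: a plain `if not value` check would treat an explicit 0 the same as unset.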

cquil11 added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Jun 2, 2026
The custom cquil/vllm-openai image integrates vllm-project/vllm#43447,
which fixes the DSv4 sliding-window prefix-cache eviction issue. But the
fix is opt-in via VLLM_PREFIX_CACHE_RETENTION_INTERVAL — without setting
it, vllm falls back to the legacy cache-every-segment path that this PR
was written to repair, so the trace-replay cache hit rate stays near 0%
even though the patched code is loaded.

Sets the env var to 32768 (32k tokens), matching the value the PR author
validated to take cache hit rate from 0% -> 74% on a comparable agentic
trace-replay benchmark.

On stock vllm images that don't carry the patch, the env var is simply
ignored — safe to land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cquil11 added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Jun 3, 2026
…ention-interval env

7ead0a0 only carried the "Prepend uncached blocks in SWA free()" hunk
of PR vllm-project/vllm#43447 — it did NOT modify vllm/envs.py to
register the VLLM_PREFIX_CACHE_RETENTION_INTERVAL env var. That
registration didn't land until commit 7c909f8 in the PR, and 6c529f30
is the latest merge of main into the PR branch.

Effect: the export in dsv4_fp4_b300_vllm.sh (1bccc5c) finally takes
effect — vllm stops logging "Unknown vLLM environment variable detected"
and actually activates the SWA prefix-cache retention path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@njhill njhill left a comment

Member

Thanks @wzhao18 @ivanium! Great work

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 3, 2026
FutureSkyFly pushed a commit to FutureSkyFly/vllm-ascend that referenced this pull request Jun 3, 2026
…v4 SWA

Backports the core mechanism of vllm-project/vllm#43447 ("Selective
prefix-cache retention for sliding-window KV cache") onto the v0.20.2rc
vllm base used by vllm-ascend.

Mechanism: when sliding-window blocks roll out of the window, the manager
hands them back to the free queue. Baseline behavior appends them all to
the back, where in concurrent long-context workloads the *uncached*
scratch blocks of a new request push out *cached* prefix blocks of older
requests. The fix:

  - Uncached (block_hash is None) -> prepend to free queue front
  - Cached                         -> append (default) to queue back

On vllm main's DSv4 trace replay this lifts prefix_cache_hit from 0% to
74.3% under 16-concurrency 1M-context traffic and gives 3-4x throughput
plus ~19x TTFT improvement.

Implementation:
  - vllm_ascend/patch/worker/patch_prefix_cache_retention.py monkey-patches
    four call sites under VLLM_ASCEND_ENABLE_PREFIX_CACHE_RETENTION env gate:
      1. FreeKVCacheBlockQueue.prepend_n (new method)
      2. BlockPool.free_blocks (adds prepend=False kwarg)
      3. SlidingWindowManager.remove_skipped_blocks (split cached/uncached)
      4. SlidingWindowManager.free (same split on request free)
  - Env switch defaults to 0: when off, all paths route to the original
    methods bit-for-bit, so accuracy is unaffected.
  - Targets SlidingWindowManager only (DSv4 SWA cache + indexer compressor
    register SlidingWindowMLASpec which inherits SlidingWindowSpec).
    CompressAttentionManager (MLAAttentionSpec) and FullAttentionManager
    paths are untouched.

Selective retention (VLLM_PREFIX_CACHE_RETENTION_INTERVAL sparse SWA tail
checkpointing) is intentionally NOT in this first pass; the free-queue
ordering change alone captures most of the hit-rate win in PR #43447's
benchmark.

Activation:
  VLLM_ASCEND_ENABLE_PREFIX_CACHE_RETENTION=1 vllm serve ...

Expected effect on DSv4 + 1M-context concurrent traces: prefix cache hit
rate goes from near-0% to 70%+, throughput 3-4x.

Co-authored-by: Claude

ilyaters commented Jun 4, 2026


@wzhao18 @ivanium Thank you! Grateful work!

@ywang96 ywang96 merged commit a618356 into vllm-project:main Jun 4, 2026
67 of 69 checks passed
JisoLya pushed a commit to JisoLya/vllm that referenced this pull request Jun 5, 2026
…n for sliding-window KV cache (vllm-project#43447)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: JisoLya <523420504@qq.com>
vrdn-23 added a commit to vrdn-23/vllm that referenced this pull request Jun 5, 2026
Resolves the recurring envs.py merge conflict per
docs/superpowers/specs/2026-05-14-envs-merge-conflict-resolution-design.md.

The legacy `if TYPE_CHECKING:` block and `environment_variables: dict[str,
Callable]` runtime mapping were dropped on the branch in favor of pydantic
`*Settings(BaseSettings)` subclasses. Every main-side edit to either
location therefore conflicts mechanically; structural resolution is
`--ours` for vllm/envs.py, then port the semantic delta as new `Field(...)`
declarations on the appropriate sub-model.

Main-side commits since merge base afcb580, with port disposition:

- c73b0d0 (vllm-project#44669) — adds VLLM_RAY_DP_PLACEMENT_NODE_IPS (str=""). Ported
  to DistributedSettings.ray_dp_placement_node_ips.
- 165b786 (vllm-project#40426) — adds VLLM_ROCM_USE_AITER_LINEAR_HIPBMM (bool=False).
  Ported to RocmSettings.rocm_use_aiter_linear_hipbmm. Native pydantic bool
  parsing replaces the `.lower() in ("true","1")` lambda.
- 38fd240 (vllm-project#41980) — adds VLLM_DISTRIBUTED_USE_SPLIT_GROUP (bool=False).
  Ported to DistributedSettings.distributed_use_split_group. Native
  pydantic bool parsing replaces the `bool(int(...))` lambda.
- a618356 (vllm-project#43447) — adds VLLM_PREFIX_CACHE_RETENTION_INTERVAL
  (int|None=None, tri-state). Ported to
  ServerSettings.prefix_cache_retention_interval; pydantic's
  unset-vs-explicit-zero handling matches the original
  `"X" in os.environ` guard.
- bd98e97 (vllm-project#44128) — removes dead VLLM_RPC_TIMEOUT. Mirrored on the
  branch by deleting ServerSettings.rpc_timeout.

Verification: vllm.envs imports cleanly; all four new vars read defaults
and parse env-set values (incl. tri-state INTERVAL=0); VLLM_RPC_TIMEOUT
correctly raises AttributeError; pre-commit passes ruff/format/mypy.

Signed-off-by: Vinay Damodaran <vrdn@hey.com>
FutureSkyFly pushed a commit to FutureSkyFly/vllm-ascend that referenced this pull request Jun 6, 2026
Backport selective prefix-cache retention for sliding-window KV cache
(vllm-project/vllm#43447) onto vllm-ascend releases/v0.20.2rc.

Mechanism: when sliding-window blocks roll out of the window or a request
finishes, the manager hands them back to the free queue. Under concurrent
long-context workloads, the *uncached* scratch blocks of a new request
flush away the *cached* prefix blocks of older requests because both sit
in the same queue and the cached ones are older.

This patch installs the upstream mechanism via import-time monkey-patches
on a single file:

  vllm_ascend/patch/platform/patch_kv_cache_coordinator.py

What gets patched:
  1. FreeKVCacheBlockQueue.prepend_n          -- new method (queue-head insert).
  2. BlockPool.cache_full_blocks              -- adds block_mask=None kwarg.
  3. BlockPool.free_blocks                    -- adds prepend=False kwarg.
  4. SingleTypeKVCacheManager.cache_blocks    -- adds retention_interval +
                                                 alignment_tokens kwargs.
  5. SingleTypeKVCacheManager.remove_skipped_blocks
                                              -- split cached / uncached
                                                 on window-slide free.
  6. SlidingWindowManager._contiguous_blocks_for_hit  -- helper.
  7. SlidingWindowManager.reachable_block_mask
                                              -- core selective retention
                                                 algorithm (segment tails
                                                 + replay boundary tail).
  8. SlidingWindowManager.free                -- split cached / uncached on
                                                 request finish.
  9. MambaManager.cache_blocks                -- signature sync (passthrough).
 10. CompressAttentionManager.cache_blocks    -- signature sync + divide
                                                 num_tokens by compress_ratio
                                                 for the DSv4 indexer path.

Plus, on AscendHybridKVCacheCoordinator:
  - __init__ reads VLLM_PREFIX_CACHE_RETENTION_INTERVAL, validates, and
    stores self.local_kv_retention_interval.
  - _init_prefix_cache_retention_metadata() pre-seeds per-manager
    _prefix_cache_alignment_tokens and _prefix_cache_use_eagle (the latter
    set true for EAGLE/MTP groups).
  - cache_blocks() override threads local_kv_retention_interval through
    to every manager.

Activation:
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=auto     -> AUTO_RETENTION_INTERVAL=32768
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=<int>    -> fixed interval (must be a
                                                   multiple of alignment)
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=0        -> keep only the latest
                                                   replayable prompt boundary
  unset                                          -> dense legacy behavior

Accuracy: untouched. All patches fall through to the original methods when
the env is unset, so unset preserves bit-for-bit behavior. When set, only
the set of blocks that land in the global hash->block dict changes -
block contents and the rest of the cache lookup path are unchanged.

Effect (vllm-project/vllm#43447 trace replay, GPU side):
  prefix_cache_hit 0% -> 74.3% under 16-concurrency 1M-context traffic
  TTFT p50         86 s -> 4.5 s  (~19x)
  input throughput 11k tok/s -> 37k tok/s (~3.4x)
FutureSkyFly pushed a commit to FutureSkyFly/vllm-ascend that referenced this pull request Jun 6, 2026
Brings DSv4 sliding-window prefix-cache retention to vllm-ascend on the
v0.20.2rc base. Closes the prefix-cache-hit-rate gap at 16K+ contexts
where dense SWA caching has older requests' cached blocks flushed by
concurrent requests' scratch allocations.

The reference mechanism comes from vllm-project/vllm#43447 (which lands
the same mechanism but inside the vllm core). This change is NOT a
mechanical mirror of #43447 -- the deltas matter:

  Delivery surface
  ----------------
  vllm #43447: edits five vllm core files (envs.py, block_pool.py,
    kv_cache_utils.py, kv_cache_coordinator.py, single_type_kv_cache_manager.py)
    and ships them in a vllm release.
  here:        single-file edit on vllm-ascend, all hooks installed via
    import-time monkey-patches with hasattr() guards. No vllm source change
    required; works against the unmodified vllm v0.20.2 pinned by v0.20.2rc.

  Decoupling from vllm
  --------------------
  AUTO_RETENTION_BASE / AUTO_RETENTION_INTERVAL / SlidingWindowManager imports
  are wrapped in a try/except. When vllm lacks #43447 symbols (current
  v0.20.2 case), we fall back to literal 1024 / 32768 + the local
  SlidingWindowManager so the monkey-patch installs cleanly anyway.
  Result: the env switch works on vllm v0.20.2 today, and will continue to
  work transparently if vllm later ships #43447.

  Ascend-specific signature work that #43447 does not need
  --------------------------------------------------------
  - CompressAttentionManager.cache_blocks: signature sync PLUS num_tokens //=
    compress_ratio before delegating to super(). This is the DSv4 indexer
    path; vllm #43447 has no equivalent because it does not have a compressor
    manager.
  - MambaManager.cache_blocks: signature sync (transparent passthrough) so
    the new retention_interval / alignment_tokens kwargs do not raise
    TypeError when the coordinator threads them through every manager.
  - SingleTypeKVCacheManager.cache_blocks: retention_interval AND
    alignment_tokens kwargs (the latter is needed because vllm-ascend hybrid
    groups carry their own per-manager alignment, distinct from the
    coordinator's lcm_block_size).
  - SingleTypeKVCacheManager.reachable_block_mask base hook returning None
    (safe default) so any non-SWA manager that did not override it stays
    transparent.

  AscendHybridKVCacheCoordinator integration
  ------------------------------------------
  - __init__: reads VLLM_PREFIX_CACHE_RETENTION_INTERVAL when the kwarg is
    None, validates against ascend's lcm_block_size (not the upstream
    scheduler_block_size which does not exist on v0.20.2), and stores
    self.local_kv_retention_interval.
  - _init_prefix_cache_retention_metadata: pre-seeds every manager with
    _prefix_cache_alignment_tokens (= lcm_block_size) and
    _prefix_cache_use_eagle (set True on EAGLE/MTP groups). #43447 carries
    EAGLE handling inside SlidingWindowManager directly; vllm-ascend's
    multi-group EAGLE detection lives at the coordinator level, hence the
    per-manager metadata seeding here.
  - cache_blocks override: threads local_kv_retention_interval to every
    manager.cache_blocks call so the new retention path is reachable from
    the existing ascend manager classes.

  Implementation detail of block_mask handling
  --------------------------------------------
  vllm #43447 adds a native block_mask parameter to BlockPool.cache_full_blocks.
  Here we cannot edit BlockPool, so the patched cache_full_blocks temporarily
  marks masked-out blocks with .is_null = True, delegates to the unmodified
  original (which already has an is_null skip path), and restores the flag
  in finally. Same semantic outcome, no vllm core change required.
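
  The flag-and-restore pattern described here can be sketched with toy stand-ins. This is an illustration of the technique only, not the vllm-ascend patch: `Block`, `original_cache_full_blocks`, and `cache_full_blocks_with_mask` are hypothetical, and the "original" simply returns the IDs it would cache.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Block:
    block_id: int
    is_null: bool = False


def original_cache_full_blocks(blocks: Sequence[Block]) -> List[int]:
    # Stand-in for the unmodified method: it already skips null blocks.
    return [b.block_id for b in blocks if not b.is_null]


def cache_full_blocks_with_mask(blocks: Sequence[Block],
                                block_mask: Optional[Sequence[bool]] = None) -> List[int]:
    if block_mask is None:
        # No mask: fall straight through to the original behavior.
        return original_cache_full_blocks(blocks)
    # Only flip blocks that were not null already, so restoring is exact.
    masked = [b for b, keep in zip(blocks, block_mask) if not keep and not b.is_null]
    for b in masked:
        b.is_null = True
    try:
        return original_cache_full_blocks(blocks)
    finally:
        # Restore the flags even if the original raises.
        for b in masked:
            b.is_null = False


blocks = [Block(0), Block(1), Block(2)]
print(cache_full_blocks_with_mask(blocks, [True, False, True]))
```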

What gets patched (single file, all in
vllm_ascend/patch/platform/patch_kv_cache_coordinator.py):
  1.  FreeKVCacheBlockQueue.prepend_n             -- new method.
  2.  BlockPool.cache_full_blocks                 -- adds block_mask=None via
                                                     is_null delegation trick.
  3.  BlockPool.free_blocks                       -- adds prepend=False.
  4.  SingleTypeKVCacheManager.cache_blocks       -- adds retention_interval +
                                                     alignment_tokens kwargs.
  5.  SingleTypeKVCacheManager.remove_skipped_blocks
                                                  -- split cached/uncached
                                                     on window-slide free.
  6.  SlidingWindowManager._contiguous_blocks_for_hit  -- helper.
  7.  SlidingWindowManager.reachable_block_mask   -- selective retention algo.
  8.  SlidingWindowManager.free                   -- split cached/uncached on
                                                     request finish.
  9.  MambaManager.cache_blocks                   -- signature sync.
  10. CompressAttentionManager.cache_blocks       -- signature sync + DSv4
                                                     compress_ratio handling.

Activation:
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=auto    -> AUTO_RETENTION_INTERVAL=32768
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=<int>   -> fixed token interval
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=0       -> keep only the latest
                                                  replayable prompt boundary
  unset                                         -> dense legacy behavior

Accuracy: env-unset path is bit-for-bit identical (every monkey-patch falls
through to the original on None / no-mask). When set, only the set of
blocks recorded in the global hash->block dict changes -- block contents
and the cache lookup path are unchanged.
FutureSkyFly pushed a commit to FutureSkyFly/vllm-ascend that referenced this pull request Jun 6, 2026
Brings DSv4 sliding-window prefix-cache retention to vllm-ascend on the
v0.20.2rc base. Closes the prefix-cache-hit-rate gap at 16K+ contexts
where dense SWA caching has older requests' cached blocks flushed by
concurrent requests' scratch allocations.

The mechanism mirrors vllm-project/vllm#43447 (which lands the same idea
inside vllm core). This change is NOT a mechanical mirror of that PR --
the deltas matter:

  Delivery surface
  ----------------
  #43447  edits five vllm core files (envs.py, block_pool.py,
          kv_cache_utils.py, kv_cache_coordinator.py,
          single_type_kv_cache_manager.py) and ships them in a vllm release.
  here:   eight vllm-ascend files; all hooks installed via import-time
          monkey-patches with hasattr() guards. No vllm source change
          required; works against unmodified vllm v0.20.2 pinned by v0.20.2rc.
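
An import-time monkey-patch with a hasattr() guard might look like the sketch below. `Manager`, `install_patch`, and `make_wrapper` are hypothetical names used only for illustration, and the non-None branch is a placeholder rather than the real selective-retention logic.

```python
import functools


class Manager:
    """Stand-in for an upstream class; not vLLM's real manager."""

    def cache_blocks(self, num_tokens):
        return num_tokens


def install_patch(target_cls, method_name, make_wrapper) -> bool:
    """Wrap target_cls.method_name only if it exists; report whether patched."""
    if not hasattr(target_cls, method_name):
        return False  # upstream layout differs: leave the class untouched
    original = getattr(target_cls, method_name)
    setattr(target_cls, method_name, make_wrapper(original))
    return True


def make_wrapper(original):
    @functools.wraps(original)
    def patched(self, num_tokens, retention_interval=None):
        if retention_interval is None:
            # Env unset: fall through to the original, unchanged behavior.
            return original(self, num_tokens)
        # Placeholder for the selective path; here we only record the value.
        self.retention_interval = retention_interval
        return original(self, num_tokens)

    return patched


patched_ok = install_patch(Manager, "cache_blocks", make_wrapper)
missing_ok = install_patch(Manager, "no_such_method", make_wrapper)
manager = Manager()
```

The guard means a missing method simply skips the patch instead of failing at import time.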

  Decoupling from vllm
  --------------------
  AUTO_RETENTION_BASE / AUTO_RETENTION_INTERVAL / SlidingWindowManager
  imports are wrapped in try/except. When vllm lacks #43447 symbols (the
  current v0.20.2 case), the patch falls back to literal 1024 / 32768 and
  the local SlidingWindowManager so the monkey-patch installs cleanly. The
  env switch works on vllm v0.20.2 today, and keeps working transparently
  if vllm later ships #43447.
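
The fallback described above can be sketched as a guarded lookup. The module path that holds these symbols is not stated here, so `load_retention_constants` takes it as a parameter; the helper and the fake module used to exercise it are hypothetical, while the fallback literals 1024 / 32768 come from the text above.

```python
import importlib
import sys
import types


def load_retention_constants(module_name: str) -> tuple:
    """Return (base, interval), falling back to literals if symbols are missing."""
    try:
        module = importlib.import_module(module_name)
        return module.AUTO_RETENTION_BASE, module.AUTO_RETENTION_INTERVAL
    except (ImportError, AttributeError):
        # Upstream lacks the #43447 symbols: use the literal defaults.
        return 1024, 32768


# Case 1: the module does not exist, so the literals are used.
fallback = load_retention_constants("hypothetical_module_without_43447")

# Case 2: a fake "upstream" module that does ship the symbols.
fake = types.ModuleType("fake_upstream_with_43447")
fake.AUTO_RETENTION_BASE = 2048
fake.AUTO_RETENTION_INTERVAL = 65536
sys.modules[fake.__name__] = fake
upstream = load_retention_constants(fake.__name__)
```

Catching AttributeError as well as ImportError covers the case where the module imports fine but predates the new constants.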

  Ascend-specific signature work that #43447 does not need
  --------------------------------------------------------
  - CompressAttentionManager.cache_blocks: signature sync PLUS
    num_tokens //= compress_ratio before delegating to super(). DSv4
    indexer path; #43447 has no equivalent because it has no compressor.
  - MambaManager.cache_blocks: signature sync (transparent passthrough)
    so the new retention_interval / alignment_tokens kwargs do not raise
    TypeError when the coordinator threads them through every manager.
  - SingleTypeKVCacheManager.cache_blocks: retention_interval AND
    alignment_tokens kwargs (the latter is needed because vllm-ascend
    hybrid groups carry their own per-manager alignment, distinct from
    the coordinator's lcm_block_size).
  - SingleTypeKVCacheManager.reachable_block_mask base hook returning
    None (safe default) so non-SWA managers stay transparent.
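
The signature sync can be illustrated with toy classes that borrow the manager names from the list above. This is only a sketch: the real classes take more arguments, and `request_id` and `cached_token_counts` are hypothetical.

```python
class SingleTypeKVCacheManager:
    """Toy base class; only the kwargs named in the list above are modeled."""

    def __init__(self):
        self.cached_token_counts = []

    def cache_blocks(self, request_id, num_tokens,
                     retention_interval=None, alignment_tokens=None):
        self.cached_token_counts.append(num_tokens)

    def reachable_block_mask(self, *args, **kwargs):
        return None  # safe default: non-SWA managers apply no mask


class MambaManager(SingleTypeKVCacheManager):
    def cache_blocks(self, request_id, num_tokens,
                     retention_interval=None, alignment_tokens=None):
        # Transparent passthrough: accept the new kwargs so no TypeError.
        super().cache_blocks(request_id, num_tokens,
                             retention_interval=retention_interval,
                             alignment_tokens=alignment_tokens)


class CompressAttentionManager(SingleTypeKVCacheManager):
    def __init__(self, compress_ratio):
        super().__init__()
        self.compress_ratio = compress_ratio

    def cache_blocks(self, request_id, num_tokens,
                     retention_interval=None, alignment_tokens=None):
        # Compressed KV holds one entry per compress_ratio tokens.
        num_tokens //= self.compress_ratio
        super().cache_blocks(request_id, num_tokens,
                             retention_interval=retention_interval,
                             alignment_tokens=alignment_tokens)


managers = [SingleTypeKVCacheManager(), MambaManager(), CompressAttentionManager(4)]
for m in managers:
    m.cache_blocks("req-0", 1024, retention_interval=256, alignment_tokens=128)
counts = [m.cached_token_counts for m in managers]
```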

  AscendHybridKVCacheCoordinator integration
  ------------------------------------------
  - __init__: reads VLLM_PREFIX_CACHE_RETENTION_INTERVAL when the kwarg
    is None, validates against ascend's lcm_block_size (not the upstream
    scheduler_block_size which does not exist on v0.20.2), and stores
    self.local_kv_retention_interval.
  - _init_prefix_cache_retention_metadata pre-seeds every manager with
    _prefix_cache_alignment_tokens (= lcm_block_size) and
    _prefix_cache_use_eagle (set True on EAGLE/MTP groups). #43447 carries
    EAGLE handling inside SlidingWindowManager directly; vllm-ascend's
    multi-group EAGLE detection lives at the coordinator level, hence
    the per-manager metadata seeding here.
  - cache_blocks override threads local_kv_retention_interval to every
    manager.cache_blocks call.
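
A toy coordinator showing the three steps above. It is a sketch with loud assumptions: `ToyManager` and `ToyCoordinator` are hypothetical, lcm_block_size is assumed to be the least common multiple of the per-manager block sizes, and "validates against" is assumed to mean that a non-zero interval must be a multiple of lcm_block_size.

```python
import math


class ToyManager:
    def __init__(self, block_size, is_eagle=False):
        self.block_size = block_size
        self.is_eagle = is_eagle
        self.calls = []

    def cache_blocks(self, request_id, num_tokens, retention_interval=None):
        self.calls.append((request_id, num_tokens, retention_interval))


class ToyCoordinator:
    def __init__(self, managers, retention_interval=None):
        self.managers = managers
        self.lcm_block_size = math.lcm(*(m.block_size for m in managers))
        # Assumed validation rule: non-zero intervals must align to the lcm.
        if retention_interval not in (None, 0) and \
                retention_interval % self.lcm_block_size != 0:
            raise ValueError(
                f"interval {retention_interval} is not a multiple of "
                f"lcm_block_size {self.lcm_block_size}")
        self.local_kv_retention_interval = retention_interval
        # Pre-seed per-manager metadata, as the commit message describes.
        for m in managers:
            m._prefix_cache_alignment_tokens = self.lcm_block_size
            m._prefix_cache_use_eagle = m.is_eagle

    def cache_blocks(self, request_id, num_tokens):
        # Thread the interval through to every manager.
        for m in self.managers:
            m.cache_blocks(request_id, num_tokens,
                           retention_interval=self.local_kv_retention_interval)


coordinator = ToyCoordinator(
    [ToyManager(64), ToyManager(128, is_eagle=True)], retention_interval=256)
coordinator.cache_blocks("req-0", 512)
```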

  Implementation detail of block_mask handling
  --------------------------------------------
  #43447 adds a native block_mask parameter to BlockPool.cache_full_blocks.
  Here we cannot edit BlockPool, so the patched cache_full_blocks temporarily
  marks masked-out blocks with .is_null = True, delegates to the unmodified
  original (which already has an is_null skip path), and restores the
  flag in finally. Same semantic outcome, no vllm core change required.
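
The is_null trick can be sketched as follows. `Block`, `original_cache_full_blocks`, and `masked_cache_full_blocks` are simplified stand-ins, not the real BlockPool API; the point is only the flip, delegate, restore pattern.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    is_null: bool = False


def original_cache_full_blocks(blocks, cached_ids):
    """Stand-in for the unmodified original: it already skips null blocks."""
    for block in blocks:
        if block.is_null:
            continue
        cached_ids.append(block.block_id)


def masked_cache_full_blocks(blocks, cached_ids, block_mask=None):
    if block_mask is None:
        # No mask: identical to the original path.
        return original_cache_full_blocks(blocks, cached_ids)
    # Flip only blocks that are masked out and not already null,
    # so the restore step cannot clobber a genuine null block.
    flipped = [b for b, keep in zip(blocks, block_mask)
               if not keep and not b.is_null]
    for b in flipped:
        b.is_null = True
    try:
        original_cache_full_blocks(blocks, cached_ids)
    finally:
        for b in flipped:
            b.is_null = False  # restore even if the original raises


blocks = [Block(i) for i in range(4)]
cached = []
masked_cache_full_blocks(blocks, cached, block_mask=[True, False, False, True])
```

Tracking exactly which blocks were flipped keeps the restore step from clearing the flag on a block that was null before the call.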

What got changed (eight files):
  vllm_ascend/core/single_type_kv_cache_manager.py          (+87 / -9)
  vllm_ascend/models/layer/attention/layer.py               (+4  / -2)
  vllm_ascend/patch/platform/patch_kv_cache_coordinator.py  (+88 / -6)
  vllm_ascend/patch/platform/patch_kv_cache_interface.py    (+79 / -6)
  vllm_ascend/patch/platform/patch_kv_cache_utils.py        (+36 / -4)
  vllm_ascend/patch/worker/patch_deepseek_compressor.py     (+5  / -2)
  vllm_ascend/worker/block_table.py                         (+5  / -6)
  vllm_ascend/worker/model_runner_v1.py                     (+24 / -10)

Activation:
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=auto    -> AUTO_RETENTION_INTERVAL=32768
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=<int>   -> fixed token interval
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=0       -> keep only latest replay tail
  unset                                         -> dense legacy behavior

Accuracy: env-unset path is bit-for-bit identical (every monkey-patch
falls through to the original on None / no-mask). When set, only the
set of blocks recorded in the global hash->block dict changes -- block
contents and the cache lookup path are unchanged.
knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026
…n for sliding-window KV cache (vllm-project#43447)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai>
FutureSkyFly pushed a commit to FutureSkyFly/vllm-ascend that referenced this pull request Jun 8, 2026

Signed-off-by: liuchenbing <chenliumail@163.com>
@wzhao18 wzhao18 mentioned this pull request Jun 9, 2026
32 tasks
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
…n for sliding-window KV cache (vllm-project#43447)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>

Labels

deepseek, ready, v1

7 participants