[BugFix][v0.20.2rc] Port DSv4 SWA prefix-cache retention to vllm-ascend by FutureSkyFly · Pull Request #10193 · vllm-project/vllm-ascend

FutureSkyFly · 2026-06-08T12:12:44Z

What this PR does / why we need it?

Lifts DSv4 SWA prefix-cache hit rate on vllm-ascend releases/v0.20.2rc
from near-zero to ~50% by allowing the DSv4 compressor's KV-cache
block_size to be configured per-deployment instead of being hard-pinned
to 128 tokens.

Symptom

--max-model-len 16384+, --enable-prefix-caching
8K-input × concurrency 32 × repeat 0.9 sequential replay:
  prefix_cache_hit:        ~0%
  TTFT avg:                ~16400 ms (= cold prefill)

Root cause

The DSv4 compressor's KV-cache block size is hard-coded to 128 tokens
in two places:

- vllm_ascend/models/layer/attention/layer.py::DSAAttention.get_kv_cache_spec() returns MLAAttentionSpec(block_size=128, ...).
- vllm_ascend/patch/worker/patch_deepseek_compressor.py::AscendDeepseekV4IndexerCache.get_kv_cache_spec() returns AscendMLAAttentionSpec(block_size=128, ...).

Prompts whose token count is not an exact multiple of 128 leave their
tail block uncached. Under the standard prefix-matching loop, the moment
one tail block is uncached the matcher stops returning more blocks past
that point, so the entire trailing region of the prompt also misses
even when the body would otherwise have hit. Concurrent long-context
workloads aggravate this because every request has its own un-aligned
tail.
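The early-stop behaviour of a prefix matcher can be illustrated with a minimal sketch. It assumes a simplified cache modelled as a set of chained block hashes; `block_hashes` and `longest_cache_hit` are hypothetical names chosen for illustration and are not the vLLM API.

```python
import hashlib


def block_hashes(tokens: list[int], block_size: int) -> list[str]:
    """Chain-hash every full block; a partial tail block gets no hash."""
    hashes: list[str] = []
    parent = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        # Each hash depends on its parent, so a block is only reusable
        # when every block before it matches as well.
        digest = hashlib.sha256((parent + repr(block)).encode()).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes


def longest_cache_hit(hashes: list[str], cache: set[str]) -> int:
    """Count leading blocks found in the cache, stopping at the first miss."""
    hits = 0
    for block_hash in hashes:
        if block_hash not in cache:
            break  # nothing past the first miss is returned
        hits += 1
    return hits


prompt = list(range(300))            # 300 tokens: two full 128-token blocks + 44-token tail
hashes = block_hashes(prompt, 128)
cache = set(hashes)
print(len(hashes), longest_cache_hit(hashes, cache))
```

In this toy model the 44-token tail never receives a hash, and removing any one hash from `cache` cuts the hit count at that position regardless of what follows.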

Fix

Make the DSv4 compressor's KV-cache block_size configurable through a
new env switch (default 128, so legacy behaviour is preserved
bit-for-bit). At block_size = 32 the alignment requirement shrinks by
4×, so 4× more prompt prefixes can match exactly.
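As a rough worked example of the alignment arithmetic, the sketch below takes the token-granularity framing of this description at face value and computes how many prompt tokens fall inside full blocks for each allowed block size. `cacheable_tokens` is a hypothetical helper, not a function from the PR.

```python
def cacheable_tokens(prompt_len: int, block_size: int) -> int:
    """Tokens covered by full blocks; the remainder is the uncached tail."""
    return (prompt_len // block_size) * block_size


prompt_len = 8191  # one token short of 8K, i.e. the worst case for 128-alignment
for block_size in (128, 64, 32):
    covered = cacheable_tokens(prompt_len, block_size)
    print(f"block_size={block_size}: covered={covered}, tail={prompt_len - covered}")
```

The uncached tail is always smaller than the block size, so the worst-case tail drops from 127 tokens at 128 to 31 tokens at 32.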

Single env switch:

VLLM_ASCEND_DSV4_COMPRESSED_KV_BLOCK_SIZE=32 vllm serve ...
# valid: 32 / 64 / 128
# unset -> legacy 128 behaviour
# only effective when compress_ratio == 128

What changed

- patch_kv_cache_interface.py: new get_dsv4_compressed_kv_block_size(compress_ratio) resolver. Reads the env, validates that the value is in {32, 64, 128}, and short-circuits to the default 128 when compress_ratio != 128. Also relaxes the DSV4 C128 block_size % compress_ratio == 0 check (which previously rejected block_size < compress_ratio).
- models/layer/attention/layer.py: DSAAttention.get_kv_cache_spec() calls the new resolver instead of hard-coding 128.
- patch/worker/patch_deepseek_compressor.py: AscendDeepseekV4IndexerCache.get_kv_cache_spec() calls the same resolver so the indexer cache stays consistent with the DSA attention cache spec.
- patch_kv_cache_utils.py: KV-cache page size is now aggregated across all KV groups (not only the first full-MLA group), and the KVCacheTensor allocator skips tensors whose shared_by is empty. Required because at block_size = 32 different KV groups can have different page sizes.
- core/single_type_kv_cache_manager.py: CompressAttentionManager tracks _prefix_block_size separately from the allocation block_size. Prefix hashing keeps using the original token range so cache entries stay comparable across deployments with different compressed block sizes.
- worker/model_runner_v1.py: a kv_cache_raw_tensors layer-name mismatch now raises a structured error with a missing / extra / tensor summary instead of a bare assert. Diagnostic-only; no change to inference math.
- patch_kv_cache_coordinator.py + worker/block_table.py: small _prefix_block_size plumbing and debug-print updates to keep the multi-KV-group page-size logic legible.

NPU production measurements

Validated on Ascend A3 (TP8) with DeepSeek-V4-Flash-w8a8-mtp, served on
vllm v0.20.2. Workload: input 8192 x output 1024 x concurrency 32 x repeat 0.9. Baseline 367b8e62 vs this PR.

Prefix cache hit rate + serving performance

Metric	baseline	this PR	Gain
prefix cache hit rate	`0.00%`	`81.54%`	+81.54 pp
TTFT avg	`14593.2 ms`	`6098.9 ms`	2.39x faster
TTFT P90	`24798.3 ms`	`7018.8 ms`	3.53x faster
TPOT avg	`36.9 ms`	`27.9 ms`	1.32x faster (not degraded)
TPOT SLO_P90	`46.0 ms`	`30.0 ms`	1.53x faster
E2E time	`53.13 s`	`36.96 s`	1.44x faster
QPS	`0.6023`	`0.8658`	+43.75%
output throughput	`616.8 tok/s`	`886.6 tok/s`	+43.74%
E2E throughput	`5553.6 tok/s`	`7982.9 tok/s`	+43.74%
prefill token throughput	`561.6 tok/s`	`1343.8 tok/s`	2.39x
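The Gain column can be re-derived from the two measurement columns. The snippet below is a small sanity check using figures copied from the table; `speedup` and `percent_gain` are hypothetical helper names.

```python
def speedup(baseline: float, new: float) -> float:
    """Latency-style metric: how many times faster the new value is."""
    return baseline / new


def percent_gain(baseline: float, new: float) -> float:
    """Throughput-style metric: relative increase in percent."""
    return (new / baseline - 1.0) * 100.0


print(f"TTFT avg: {speedup(14593.2, 6098.9):.2f}x faster")
print(f"TTFT P90: {speedup(24798.3, 7018.8):.2f}x faster")
print(f"QPS:      {percent_gain(0.6023, 0.8658):+.2f}%")
```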

Key point: at output_len=1024 the TPOT improves rather than
regresses, so the old "more hits -> worse TPOT" shape no longer applies.

Second workload (baseline already partially hitting)

Where the baseline already hit 44.53% on the same workload, this PR
pushes it to 77.70% (+33.17 pp), still with no TPOT degradation:

Metric	baseline	this PR	Gain
prefix cache hit rate	`44.53%`	`77.70%`	+33.17 pp
TTFT avg	`14298.8 ms`	`8623.7 ms`	1.66x faster
TPOT avg	`35.4 ms`	`33.7 ms`	not degraded
E2E throughput	`20945.7 tok/s`	`24432.4 tok/s`	+16.65%
prefill token throughput	`2291.9 tok/s`	`3800.2 tok/s`	+65.81%

Accuracy (regression check)

GPQA Diamond (198 questions, temperature=0, 0-shot CoT):

item	value
`GPQA_diamond.accuracy`	`89.8989898989899`
summary	`89.90`
completed	`198 / 198`
failed	`0`

GPQA is a non-repeating workload (does not consume prefix-cache entries);
it is the precision regression check. Accuracy matches the production
baseline -- no precision regression from the smaller KV-cache block size.

Service environment used during validation

item	value
platform	`Ascend A3 (8 cards, TP8)`
image	`quay.io/ascend/vllm-ascend:nightly-releases-v0.20.2rc-a3`
model weights	`DeepSeek-V4-Flash-w8a8-mtp`
`--max-model-len`	`135000`
`--max-num-batched-tokens`	`8192`
`--tensor-parallel-size`	`8`
`--enable-expert-parallel`	yes

Expected impact

Scenario	hit-rate change	Reason
Prompts already 128-aligned	unchanged	No new tail to recover
Prompts with 1-127 token tail past the last 128-boundary	improves	The tail rounds to a 32-block boundary instead of being thrown away
Repeat-prompt traffic (`repeat_rate=0.9`, c32, 8K)	`0% -> 81.54%`	Validated above
Short context (< 32 tokens)	unchanged	Below `block_size = 32` floor; no caching possible either way

Does this PR introduce any user-facing change?

No, when VLLM_ASCEND_DSV4_COMPRESSED_KV_BLOCK_SIZE is unset or set to
128: the resolver returns the legacy 128, the cache spec is
bit-for-bit identical, and the rest of the modified files take their
no-op default branches.

When set to 32 or 64, the only effective change is the page-layout
of the DSv4 compressor KV cache. Block contents, the cache lookup
math, and the attention compute itself are untouched. A previously
cacheable replay either still hits (its tail block survives) or misses
(its tail block landed in a smaller bucket that was already evicted);
on miss the request recomputes the same tokens it would have
recomputed under any cache miss.

References

vLLM version: v0.20.2 (commit bc150f50299199599673614f80d12a196f377655)

gemini-code-assist · 2026-06-08T12:20:50Z

Summary of Changes

This pull request addresses a critical performance degradation in DSv4 models where long-context requests suffer from poor prefix cache hit rates. The issue stems from the vLLM v0.20.2 base lacking an upstream retention mechanism, causing cached blocks to be evicted by concurrent scratch block allocations. By porting the necessary logic and adapting it for the vllm-ascend platform through platform-specific monkey-patches, this change ensures that tail blocks are preserved, thereby improving cache efficiency. The solution is configurable and maintains bit-for-bit parity with legacy behavior when the new retention interval is unset.

Highlights

Prefix Cache Retention: Ported the DSv4 sliding-window prefix-cache retention mechanism to vllm-ascend to prevent premature eviction of cached blocks during scratch allocations.
Non-Intrusive Implementation: Implemented the fix using import-time monkey-patches to ensure compatibility with the pinned vLLM v0.20.2 core without requiring direct modifications.
Configuration: Introduced the VLLM_PREFIX_CACHE_RETENTION_INTERVAL environment variable to allow users to control retention behavior, with an 'auto' mode for optimal defaults.
Performance Optimization: Resolved the near-zero prefix cache hit rate issue for 16K+ context replays, significantly reducing replay latency.

gemini-code-assist

Code Review

This pull request refactors the KV cache management and coordinator logic in vllm-ascend to support DeepSeek-V4 compressed MLA with dynamic storage block sizes and prefix caching. Key changes include introducing storage_block_size and _prefix_block_size helpers, dynamically resolving block sizes from environment variables, updating memory usage and page size calculations, and adding extensive debug logging. The code review feedback highlights a critical bug in AscendHybridKVCacheCoordinator._prefix_block_size where UniformTypeKVCacheSpecs is not unpacked, leading to incorrect block size alignment and cache misses. Additionally, the reviewer noted that the PR title and description should be updated to conform to the repository's Pull Request Summary Style Guide.

gemini-code-assist · 2026-06-08T12:23:26Z

+    @staticmethod
+    def _prefix_block_size(spec: KVCacheSpec) -> int:
+        compress_ratio = max(getattr(spec, "compress_ratio", 1), 1)
+        storage_block_size = getattr(spec, "storage_block_size", None)
+        if compress_ratio > 1 and storage_block_size is not None:
+            return storage_block_size * compress_ratio
+        return spec.block_size


In AscendHybridKVCacheCoordinator._prefix_block_size, if spec is an instance of UniformTypeKVCacheSpecs, it does not have compress_ratio or storage_block_size as direct attributes. This causes getattr(spec, "compress_ratio", 1) to return 1 and fall back to spec.block_size (e.g., 128), instead of computing the correct prefix block size (e.g., 2048 for C128).

This will lead to an incorrect self.lcm_block_size and cause alignment issues or cache misses in find_longest_cache_hit.

We should unpack UniformTypeKVCacheSpecs first, matching the logic used in model_runner_v1.py.

@staticmethod
def _prefix_block_size(spec: KVCacheSpec) -> int:
    from vllm.v1.kv_cache_interface import UniformTypeKVCacheSpecs
    if isinstance(spec, UniformTypeKVCacheSpecs):
        spec = next(iter(spec.kv_cache_specs.values()))
    compress_ratio = max(getattr(spec, "compress_ratio", 1), 1)
    storage_block_size = getattr(spec, "storage_block_size", None)
    if compress_ratio > 1 and storage_block_size is not None:
        return storage_block_size * compress_ratio
    return spec.block_size
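To make the difference concrete, here is a self-contained sketch that reproduces the fallback with stand-in classes. `FakeCompressedSpec` and `FakeUniformSpecs` are hypothetical substitutes for the real vLLM spec types, and storage_block_size=16 is an assumed value chosen only so that the result matches the 2048 figure quoted in the comment.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCompressedSpec:
    """Stand-in for a compressed-MLA spec (hypothetical)."""
    block_size: int
    storage_block_size: int
    compress_ratio: int


@dataclass
class FakeUniformSpecs:
    """Stand-in for a wrapper that groups per-layer specs (hypothetical)."""
    block_size: int
    kv_cache_specs: dict = field(default_factory=dict)


def prefix_block_size(spec, unpack: bool) -> int:
    # With unpack=True the wrapper is replaced by its first inner spec,
    # mirroring the suggested change; with unpack=False it is used as-is.
    if unpack and isinstance(spec, FakeUniformSpecs):
        spec = next(iter(spec.kv_cache_specs.values()))
    compress_ratio = max(getattr(spec, "compress_ratio", 1), 1)
    storage_block_size = getattr(spec, "storage_block_size", None)
    if compress_ratio > 1 and storage_block_size is not None:
        return storage_block_size * compress_ratio
    return spec.block_size


inner = FakeCompressedSpec(block_size=128, storage_block_size=16, compress_ratio=128)
wrapped = FakeUniformSpecs(block_size=128, kv_cache_specs={"layer.0": inner})
print(prefix_block_size(wrapped, unpack=False))  # wrapper lacks compress_ratio
print(prefix_block_size(wrapped, unpack=True))   # inner spec is consulted
```

Because `getattr` silently supplies defaults, the un-unpacked path raises no error; it simply returns the wrapper's block_size.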

gemini-code-assist · 2026-06-08T12:23:26Z

 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 import itertools
+import os


The Pull Request title and description do not adhere to the repository's Pull Request Summary Style Guide.

Please update them to match the required formats:

Suggested PR Title:

[v0.20.2rc][Attention][BugFix] Port DSv4 SWA prefix-cache retention to vllm-ascend

Suggested PR Summary:

### What this PR does / why we need it?

This PR ports the DeepSeek-V4 SWA prefix-cache retention mechanism to `vllm-ascend` on the `releases/v0.20.2rc` branch. It addresses the near-zero prefix cache hit rate at 16K+ contexts by implementing:

1. **Sparse checkpoint retention**: Only caching blocks at aligned boundaries plus the latest replay tail, reducing cached blocks per request from ~128 to ~1-2.
2. **Cached-vs-uncached free-queue split**: Routing uncached scratch blocks to the front of the free queue (reused first) and cached blocks to the back (surviving scratch churn).

### Does this PR introduce _any_ user-facing change?

No, when `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` is unset. When set, it optimizes prefix cache retention behavior without changing block contents or attention computation.

### How was this patch tested?

Tested on Ascend A3 (8 cards, TP8) with DeepSeek-V4-Flash-w8a8-mtp.

- Verified prefix cache hit rate of 49.67% on aisbench prefix-cache workload.
- Verified accuracy on GPQA Diamond (198 questions, 0-shot CoT) with 89.90% accuracy (no regression).

References

The Pull Request Summary Style Guide requires the PR title to follow the format '[Branch][Module][Action] Pull Request Title' and the summary to follow a specific H3-based markdown template.
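The cached-vs-uncached free-queue split named in the suggested summary can be sketched with a plain double-ended queue. This is an illustrative model only: `FreeQueue` and its methods are hypothetical and do not reproduce vLLM's FreeKVCacheBlockQueue.

```python
from collections import deque


class FreeQueue:
    """Toy free list: allocation always pops from the front."""

    def __init__(self) -> None:
        self._queue: deque[int] = deque()

    def free(self, block_ids: list[int], cached: bool) -> None:
        if cached:
            # Cached blocks go to the back so they are evicted last.
            self._queue.extend(block_ids)
        else:
            # Uncached scratch blocks go to the front (order preserved)
            # so they are reused before any cached block is touched.
            self._queue.extendleft(reversed(block_ids))

    def allocate(self) -> int:
        return self._queue.popleft()


queue = FreeQueue()
queue.free([1, 2], cached=True)
queue.free([3, 4], cached=False)
print([queue.allocate() for _ in range(4)])
```

In this model scratch churn consumes blocks 3 and 4 first, so the cached blocks 1 and 2 survive until the scratch supply is exhausted.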

jhonzy0928 · 2026-06-09T08:20:44Z

In a PD-disaggregated deployment (8 machines, large EP), prefill completes but decoding then loops indefinitely, so inference never finishes.

…ention + partial hit

Lifts the near-zero prefix-cache hit rate on DSv4 long-context workloads served on v0.20.2rc by combining five mechanisms.

1. Configurable DSv4 compressor block size (32 / 64 / 128)
   - New lookup table DSV4_BLOCK_SIZES driven by --block-size CLI flag.
   - Synchronously updates: MLA block_size, SWA block_size, C4 state cache, C128 state cache, page_size_padded.
   - utils.refresh_block_size caches the resolved size in _ENIGINE_CORE_BLOCK_SIZE so multi-process EngineCore does not overwrite it. Defaults to 32 when unset.

2. AscendHybridKVCacheCoordinator interface alignment
   - __init__ accepts and threads local_kv_retention_interval, scheduler_block_size and eagle_attn_layer_names kwargs through to super(), so v0.22+ vllm with PR #43447 can bind directly without TypeError.

3. vLLM prefix-cache core mechanism backport (monkey-patch)
   - New patch_prefix_cache_core.py installs at import time:
       FreeKVCacheBlockQueue.prepend_n
       BlockPool.free_blocks(prepend=)
       BlockPool._maybe_evict_cached_block hook
       SlidingWindowManager.remove_skipped_blocks (cached / uncached split)
       SlidingWindowManager._cache_block_mask (sparse checkpoint)
       SlidingWindowManager.free (cached / uncached split)
       KVCacheManager.__init__ (scheduler_block_size plumbing)
       KVCacheManager.take_copy_block_ids (partial-hit copy queue)
       Scheduler.__init__ (scheduler_block_size derivation, lcm of group sizes)
       Scheduler copy-blocks dispatch
   - Guarded by _source_contains() so vllm that already ships the mechanism is not double-patched.

4. Scheduler block size alignment
   - scheduler_block_size = lcm(block sizes across all kv-cache groups) is threaded from Scheduler -> KVCacheManager -> Coordinator.
   - Fixes admission gates and prefix-hit length checks that mis-aligned when different KV groups use different block sizes (e.g. compressor at 32 + main MLA at 128).

5. DSv4 compressor partial prefix-cache hits
   - New ComputedBlockList class plus _hash_range / _insert_partial_cache / get_partial_cached_block helpers in single_type_kv_cache_manager.
   - CompressAttentionManager learns _num_partial_hit_blocks / _cache_partial_block_boundaries / take_copy_block_ids, and registers each request's partial-prefix boundaries exactly once (avoids the per-decode-step re-hash that previously stalled the scheduler loop).
   - Block-pool eviction cleans up partial cache entries for evicted blocks.
   - model_runner_v1 gains _copy_prefix_cache_blocks / _prefix_cache_data_ptrs / _copy_prefix_cache_tensor to perform the (src_block_id, dst_block_id, num_tokens) copies on-device.
   - Effect: when a new request shares only part of a 16K-token compressor block with a cached request, the shared prefix is reused via an in-cache memcpy instead of being recomputed.

Activation:

  vllm serve --block-size 32 ...   # 32 / 64 / 128, defaults to 32
  # optional upstream-aligned env:
  VLLM_PREFIX_CACHE_RETENTION_INTERVAL=auto

Measured performance (Ascend A3 / TP8, DeepSeek-V4-Flash-w8a8-mtp, input 8192 x output 1024 x concurrency 32 x repeat 0.9; baseline 367b8e6 vs this change):

  prefix cache hit rate   0.00% -> 81.54%
  TTFT avg                14593 ms -> 6099 ms (2.39x faster)
  TTFT P90                24798 ms -> 7019 ms (3.53x faster)
  TPOT avg                36.9 ms -> 27.9 ms (1.32x faster, NOT degraded)
  TPOT SLO_P90            46.0 ms -> 30.0 ms (1.53x faster)
  E2E time                53.1 s -> 37.0 s (1.44x faster)
  QPS                     0.6023 -> 0.8658 (+43.75%)
  output throughput       617 tok/s -> 887 tok/s (+43.74%)
  prefill throughput      562 tok/s -> 1344 tok/s (2.39x)

Second workload where the baseline already hit 44.53%: this change pushes the hit rate to 77.70% (+33.17 pp) with TTFT 1.66x faster and E2E throughput +16.65%, TPOT still not degraded (35.4 -> 33.7 ms).

Note: at output_len=1024 the TPOT improves rather than regresses, so the old "more hits -> worse TPOT" shape no longer applies.

Net effect on DSv4 long-context workloads: lifts prefix_cache_hit from ~0% (every replay paying cold-prefill cost) into a usable range, without changing the attention computation or KV layout outside the chosen block_size.

Co-authored-by: wangzhao-11a <73340653+wangzhao-11a@users.noreply.github.com>
Signed-off-by: liuchenbing <chenliumail@163.com>
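The scheduler block size derivation in item 4 amounts to a least-common-multiple computation followed by rounding hit lengths down to that granularity. The sketch below illustrates the arithmetic only; `scheduler_block_size` and `align_hit_length` are hypothetical function names and do not come from the patch.

```python
import math


def scheduler_block_size(group_block_sizes: list[int]) -> int:
    """Smallest token count that is block-aligned in every KV-cache group."""
    return math.lcm(*group_block_sizes)


def align_hit_length(num_hit_tokens: int, block_size: int) -> int:
    """Round a prefix-hit length down to a multiple of block_size."""
    return (num_hit_tokens // block_size) * block_size


# Example from the commit message: compressor at 32, main MLA at 128.
size = scheduler_block_size([32, 128])
print(size, align_hit_length(8191, size))
```

Aligning to the lcm guarantees that a reported hit length ends on a block boundary in every group, which is the property the admission gates and hit-length checks rely on.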

github-actions · 2026-06-11T09:00:30Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

FutureSkyFly requested review from MengqingCao and wangxiyuan as code owners June 8, 2026 12:12

gemini-code-assist Bot reviewed Jun 8, 2026

FutureSkyFly force-pushed the v4_prefix branch 2 times, most recently from c50e497 to 46913c7 Compare June 8, 2026 13:36

FutureSkyFly force-pushed the v4_prefix branch from 46913c7 to 646bb7a Compare June 10, 2026 01:51

FutureSkyFly requested review from weijinqian0 and whx-sjtu as code owners June 10, 2026 01:51

FutureSkyFly force-pushed the v4_prefix branch 2 times, most recently from aafece2 to b321c85 Compare June 10, 2026 05:16

FutureSkyFly force-pushed the v4_prefix branch from b321c85 to 5f66ba6 Compare June 10, 2026 05:17

github-actions Bot added merge-conflicts labels Jun 11, 2026

FutureSkyFly wants to merge 1 commit into vllm-project:releases/v0.20.2rc from FutureSkyFly:v4_prefix
