[BugFix][v0.20.2rc] Port DSv4 SWA prefix-cache retention to vllm-ascend#10193
FutureSkyFly wants to merge 1 commit into
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical performance degradation in DSv4 models where long-context requests suffer from poor prefix cache hit rates. The issue stems from the vLLM v0.20.2 base lacking an upstream retention mechanism, causing cached blocks to be evicted by concurrent scratch block allocations. By porting the necessary logic and adapting it for the vllm-ascend platform through platform-specific monkey-patches, this change ensures that tail blocks are preserved, thereby improving cache efficiency. The solution is configurable and maintains bit-for-bit parity with legacy behavior when the new retention interval is unset.
Code Review
This pull request refactors the KV cache management and coordinator logic in vllm-ascend to support DeepSeek-V4 compressed MLA with dynamic storage block sizes and prefix caching. Key changes include introducing storage_block_size and _prefix_block_size helpers, dynamically resolving block sizes from environment variables, updating memory usage and page size calculations, and adding extensive debug logging. The code review feedback highlights a critical bug in AscendHybridKVCacheCoordinator._prefix_block_size where UniformTypeKVCacheSpecs is not unpacked, leading to incorrect block size alignment and cache misses. Additionally, the reviewer noted that the PR title and description should be updated to conform to the repository's Pull Request Summary Style Guide.
@staticmethod
def _prefix_block_size(spec: KVCacheSpec) -> int:
    compress_ratio = max(getattr(spec, "compress_ratio", 1), 1)
    storage_block_size = getattr(spec, "storage_block_size", None)
    if compress_ratio > 1 and storage_block_size is not None:
        return storage_block_size * compress_ratio
    return spec.block_size
In AscendHybridKVCacheCoordinator._prefix_block_size, if spec is an instance of UniformTypeKVCacheSpecs, it does not have compress_ratio or storage_block_size as direct attributes. This causes getattr(spec, "compress_ratio", 1) to return 1 and fall back to spec.block_size (e.g., 128), instead of computing the correct prefix block size (e.g., 2048 for C128).
This will lead to an incorrect self.lcm_block_size and cause alignment issues or cache misses in find_longest_cache_hit.
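The failure mode can be reproduced in isolation. The sketch below is only an illustration: `InnerSpec` and `WrapperSpec` are hypothetical stand-ins for the real spec classes, and the values (a wrapper `block_size` of 128, a `storage_block_size` of 16, a `compress_ratio` of 128) are assumptions chosen to match the 128 vs. 2048 figures quoted above.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InnerSpec:
    """Hypothetical stand-in for a per-layer compressed-MLA spec."""
    block_size: int
    compress_ratio: int = 1
    storage_block_size: Optional[int] = None


@dataclass
class WrapperSpec:
    """Hypothetical stand-in for UniformTypeKVCacheSpecs.

    Only block_size is a direct attribute; the compression fields
    live on the wrapped per-layer specs.
    """
    block_size: int
    kv_cache_specs: Dict[str, InnerSpec] = field(default_factory=dict)


def prefix_block_size_without_unpack(spec) -> int:
    # Mirrors the reviewed code: getattr silently falls back to the defaults
    # when the attributes are missing on the wrapper.
    compress_ratio = max(getattr(spec, "compress_ratio", 1), 1)
    storage_block_size = getattr(spec, "storage_block_size", None)
    if compress_ratio > 1 and storage_block_size is not None:
        return storage_block_size * compress_ratio
    return spec.block_size


def prefix_block_size_with_unpack(spec) -> int:
    # Unwrap first, then apply the same logic to the per-layer spec.
    if isinstance(spec, WrapperSpec):
        spec = next(iter(spec.kv_cache_specs.values()))
    return prefix_block_size_without_unpack(spec)


inner = InnerSpec(block_size=128, compress_ratio=128, storage_block_size=16)
wrapped = WrapperSpec(block_size=128, kv_cache_specs={"layer.0": inner})

buggy = prefix_block_size_without_unpack(wrapped)   # falls back to 128
fixed = prefix_block_size_with_unpack(wrapped)      # 16 * 128 = 2048

# Assume one other KV group with a 128-token prefix block size.
lcm_buggy = math.lcm(128, buggy)
lcm_fixed = math.lcm(128, fixed)
```

Under these assumptions the unpatched helper yields an LCM of 128 while the unwrapped one yields 2048, which is the mismatch the comment attributes to `self.lcm_block_size`.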
We should unpack UniformTypeKVCacheSpecs first, matching the logic used in model_runner_v1.py.
@staticmethod
def _prefix_block_size(spec: KVCacheSpec) -> int:
    from vllm.v1.kv_cache_interface import UniformTypeKVCacheSpecs

    if isinstance(spec, UniformTypeKVCacheSpecs):
        spec = next(iter(spec.kv_cache_specs.values()))
    compress_ratio = max(getattr(spec, "compress_ratio", 1), 1)
    storage_block_size = getattr(spec, "storage_block_size", None)
    if compress_ratio > 1 and storage_block_size is not None:
        return storage_block_size * compress_ratio
    return spec.block_size

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import itertools
import os
The Pull Request title and description do not adhere to the repository's Pull Request Summary Style Guide.
Please update them to match the required formats:
Suggested PR Title:
[v0.20.2rc][Attention][BugFix] Port DSv4 SWA prefix-cache retention to vllm-ascend
Suggested PR Summary:
### What this PR does / why we need it?
This PR ports the DeepSeek-V4 SWA prefix-cache retention mechanism to `vllm-ascend` on the `releases/v0.20.2rc` branch. It addresses the near-zero prefix cache hit rate at 16K+ contexts by implementing:
1. **Sparse checkpoint retention**: Only caching blocks at aligned boundaries plus the latest replay tail, reducing cached blocks per request from ~128 to ~1-2.
2. **Cached-vs-uncached free-queue split**: Routing uncached scratch blocks to the front of the free queue (reused first) and cached blocks to the back (surviving scratch churn).
### Does this PR introduce _any_ user-facing change?
No, when `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` is unset. When set, it optimizes prefix cache retention behavior without changing block contents or attention computation.
### How was this patch tested?
Tested on Ascend A3 (8 cards, TP8) with DeepSeek-V4-Flash-w8a8-mtp.
- Verified prefix cache hit rate of 49.67% on aisbench prefix-cache workload.
- Verified accuracy on GPQA Diamond (198 questions, 0-shot CoT) with 89.90% accuracy (no regression).
References
- The Pull Request Summary Style Guide requires the PR title to follow the format '[Branch][Module][Action] Pull Request Title' and the summary to follow a specific H3-based markdown template.
c50e497 to 46913c7
aafece2 to b321c85
…ention + partial hit
Lifts the near-zero prefix-cache hit rate on DSv4 long-context
workloads served on v0.20.2rc by combining five mechanisms.
1. Configurable DSv4 compressor block size (32 / 64 / 128)
- New lookup table DSV4_BLOCK_SIZES driven by --block-size CLI flag.
- Synchronously updates: MLA block_size, SWA block_size, C4 state
cache, C128 state cache, page_size_padded.
- utils.refresh_block_size caches the resolved size in
_ENIGINE_CORE_BLOCK_SIZE so multi-process EngineCore does not
overwrite it. Defaults to 32 when unset.
2. AscendHybridKVCacheCoordinator interface alignment
- __init__ accepts and threads local_kv_retention_interval,
scheduler_block_size and eagle_attn_layer_names kwargs through to
super(), so v0.22+ vllm with PR #43447 can bind directly without
TypeError.
3. vLLM prefix-cache core mechanism backport (monkey-patch)
- New patch_prefix_cache_core.py installs at import time:
FreeKVCacheBlockQueue.prepend_n
BlockPool.free_blocks(prepend=)
BlockPool._maybe_evict_cached_block hook
SlidingWindowManager.remove_skipped_blocks (cached / uncached split)
SlidingWindowManager._cache_block_mask (sparse checkpoint)
SlidingWindowManager.free (cached / uncached split)
KVCacheManager.__init__ (scheduler_block_size plumbing)
KVCacheManager.take_copy_block_ids (partial-hit copy queue)
Scheduler.__init__ (scheduler_block_size derivation, lcm of group sizes)
Scheduler copy-blocks dispatch
- Guarded by _source_contains() so vllm that already ships the
mechanism is not double-patched.
4. Scheduler block size alignment
- scheduler_block_size = lcm(block sizes across all kv-cache groups)
is threaded from Scheduler -> KVCacheManager -> Coordinator.
- Fixes admission gates and prefix-hit length checks that mis-aligned
when different KV groups use different block sizes (e.g. compressor
at 32 + main MLA at 128).
5. DSv4 compressor partial prefix-cache hits
- New ComputedBlockList class plus _hash_range / _insert_partial_cache
/ get_partial_cached_block helpers in single_type_kv_cache_manager.
- CompressAttentionManager learns _num_partial_hit_blocks /
_cache_partial_block_boundaries / take_copy_block_ids, and registers
each request's partial-prefix boundaries exactly once (avoids the
per-decode-step re-hash that previously stalled the scheduler loop).
- Block-pool eviction cleans up partial cache entries for evicted
blocks.
- model_runner_v1 gains _copy_prefix_cache_blocks /
_prefix_cache_data_ptrs / _copy_prefix_cache_tensor to perform
the (src_block_id, dst_block_id, num_tokens) copies on-device.
- Effect: when a new request shares only part of a 16K-token
compressor block with a cached request, the shared prefix is
reused via an in-cache memcpy instead of being recomputed.
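The cached / uncached free-queue split from item 3 can be pictured with a small sketch. This is not the vLLM implementation: `FreeBlockQueue`, `free_blocks` and the integer block ids are hypothetical, and a `collections.deque` stands in for the real free-block queue. It only illustrates the ordering idea: uncached scratch blocks go to the front (reused first), cached blocks go to the back (evicted last).

```python
from collections import deque


class FreeBlockQueue:
    """Toy free queue: allocation pops from the left (front)."""

    def __init__(self, block_ids=()):
        self._queue = deque(block_ids)

    def append_n(self, block_ids):
        # Back of the queue: these blocks are reused last.
        self._queue.extend(block_ids)

    def prepend_n(self, block_ids):
        # Front of the queue: these blocks are reused first.
        # deque.extendleft reverses its input, so reverse beforehand
        # to keep the caller's order.
        self._queue.extendleft(reversed(list(block_ids)))

    def popleft_n(self, n):
        return [self._queue.popleft() for _ in range(n)]

    def snapshot(self):
        return list(self._queue)


def free_blocks(queue, block_ids, cached_ids):
    """Route freed blocks by whether they still hold a cached prefix."""
    cached = [b for b in block_ids if b in cached_ids]
    uncached = [b for b in block_ids if b not in cached_ids]
    queue.prepend_n(uncached)
    queue.append_n(cached)


queue = FreeBlockQueue([10, 11])          # blocks that were already free
free_blocks(queue, [1, 2, 3, 4], cached_ids={1, 2})

next_allocated = queue.popleft_n(2)       # scratch blocks 3 and 4 go first
remaining = queue.snapshot()              # cached blocks 1 and 2 sit at the back
```

In this toy model a burst of scratch allocations consumes blocks 3 and 4, then 10 and 11, before touching the cached blocks 1 and 2.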
Activation:
vllm serve --block-size 32 ...
# 32 / 64 / 128, defaults to 32
# optional upstream-aligned env:
VLLM_PREFIX_CACHE_RETENTION_INTERVAL=auto
Measured performance (Ascend A3 / TP8, DeepSeek-V4-Flash-w8a8-mtp,
input 8192 x output 1024 x concurrency 32 x repeat 0.9; baseline
367b8e6 vs this change):
prefix cache hit rate 0.00% -> 81.54%
TTFT avg 14593 ms -> 6099 ms (2.39x faster)
TTFT P90 24798 ms -> 7019 ms (3.53x faster)
TPOT avg 36.9 ms -> 27.9 ms (1.32x faster, NOT degraded)
TPOT SLO_P90 46.0 ms -> 30.0 ms (1.53x faster)
E2E time 53.1 s -> 37.0 s (1.44x faster)
QPS 0.6023 -> 0.8658 (+43.75%)
output throughput 617 tok/s -> 887 tok/s (+43.74%)
prefill throughput 562 tok/s -> 1344 tok/s (2.39x)
Second workload where the baseline already hit 44.53%: this change
pushes the hit rate to 77.70% (+33.17 pp) with TTFT 1.66x faster and
E2E throughput +16.65%, TPOT still not degraded (35.4 -> 33.7 ms).
Note: at output_len=1024 the TPOT improves rather than regresses, so the
old "more hits -> worse TPOT" shape no longer applies.
Net effect on DSv4 long-context workloads: lifts prefix_cache_hit
from ~0% (every replay paying cold-prefill cost) into a usable range,
without changing the attention computation or KV layout outside the
chosen block_size.
Co-authored-by: wangzhao-11a <73340653+wangzhao-11a@users.noreply.github.com>
Signed-off-by: liuchenbing <chenliumail@163.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Lifts DSv4 SWA prefix-cache hit rate on vllm-ascend releases/v0.20.2rc from near-zero to ~50% by allowing the DSv4 compressor's KV-cache block_size to be configured per-deployment instead of being hard-pinned to 128 tokens.
Symptom
Root cause
The DSv4 compressor's KV-cache block size is hard-coded to 128 tokens in two places:
- vllm_ascend/models/layer/attention/layer.py::DSAAttention.get_kv_cache_spec() returns MLAAttentionSpec(block_size=128, ...).
- vllm_ascend/patch/worker/patch_deepseek_compressor.py::AscendDeepseekV4IndexerCache.get_kv_cache_spec() returns AscendMLAAttentionSpec(block_size=128, ...).
Prompts whose token count is not an exact multiple of 128 leave their tail block uncached. Under the standard prefix-matching loop, the moment one tail block is uncached the matcher stops returning more blocks past that point, so the entire trailing region of the prompt also misses even when the body would otherwise have hit. Concurrent long-context workloads aggravate this because every request has its own un-aligned tail.
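The alignment effect is easy to see with a little arithmetic. The sketch below assumes the simplified model described above, in which only whole blocks are cacheable; the helper names and the 8300-token prompt length are made up for illustration.

```python
def cacheable_prefix_tokens(prompt_len: int, block_size: int) -> int:
    """Tokens covered by whole blocks (the part that can be cached)."""
    return (prompt_len // block_size) * block_size


def uncached_tail_tokens(prompt_len: int, block_size: int) -> int:
    """Tokens left over in the partial tail block."""
    return prompt_len - cacheable_prefix_tokens(prompt_len, block_size)


prompt_len = 8300  # hypothetical prompt, not a multiple of 128

tail_at_128 = uncached_tail_tokens(prompt_len, 128)  # 8300 - 64 * 128
tail_at_32 = uncached_tail_tokens(prompt_len, 32)    # 8300 - 259 * 32
```

For this prompt the uncached tail shrinks from 108 tokens at a block size of 128 to 12 tokens at a block size of 32.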
Fix
Make the DSv4 compressor's KV-cache block_size configurable through a new env switch (default 128, so legacy behaviour is preserved bit-for-bit). At block_size = 32 the alignment requirement shrinks by 4×, so 4× more prompt prefixes can match exactly.
Single env switch: VLLM_ASCEND_DSV4_COMPRESSED_KV_BLOCK_SIZE (one of 32, 64, 128; default 128).
What changed
- patch_kv_cache_interface.py: new get_dsv4_compressed_kv_block_size(compress_ratio) resolver. Reads the env, validates the value is in {32, 64, 128}, short-circuits to default 128 when compress_ratio != 128. Relaxes the DSV4 C128 block_size % compress_ratio == 0 check (which previously rejected block_size < compress_ratio).
- models/layer/attention/layer.py: DSAAttention.get_kv_cache_spec() calls the new resolver instead of hard-coding 128.
- patch/worker/patch_deepseek_compressor.py: AscendDeepseekV4IndexerCache.get_kv_cache_spec() calls the same resolver so the indexer cache stays consistent with the DSA attention cache spec.
- patch_kv_cache_utils.py: KV-cache page size is now aggregated across all KV groups (not only the first full-MLA group), and the KVCacheTensor allocator skips tensors whose shared_by is empty. Required because at block_size = 32 different KV groups can have different page sizes.
- core/single_type_kv_cache_manager.py: CompressAttentionManager tracks _prefix_block_size separately from the allocation block_size. Prefix hashing keeps using the original token range so cache entries stay comparable across deployments with different compressed block sizes.
- worker/model_runner_v1.py: kv_cache_raw_tensors layer-name mismatch now raises a structured error with missing / extra / tensor summary instead of bare assert. Diagnostic-only; no inference math change.
- patch_kv_cache_coordinator.py + worker/block_table.py: small _prefix_block_size plumbing and debug-print updates to keep the multi-KV-group page-size logic legible.
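For readers who want a picture of the resolver, here is a minimal sketch of the behaviour described in the first bullet. It is not the code in this PR: the function name `resolve_compressed_kv_block_size`, the `environ` parameter, and the choice to raise `ValueError` on an invalid value are assumptions. Only the env variable name, the allowed set {32, 64, 128}, the default of 128, and the short-circuit for `compress_ratio != 128` come from the description.

```python
import os
from typing import Mapping, Optional

ENV_NAME = "VLLM_ASCEND_DSV4_COMPRESSED_KV_BLOCK_SIZE"
DEFAULT_BLOCK_SIZE = 128
ALLOWED_BLOCK_SIZES = (32, 64, 128)


def resolve_compressed_kv_block_size(
    compress_ratio: int,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Hypothetical resolver: env override applies only to the C128 compressor."""
    env = os.environ if environ is None else environ

    # Short-circuit: anything other than compress_ratio == 128 keeps the default.
    if compress_ratio != 128:
        return DEFAULT_BLOCK_SIZE

    raw = env.get(ENV_NAME)
    if raw is None or raw.strip() == "":
        return DEFAULT_BLOCK_SIZE  # unset: legacy behaviour

    try:
        value = int(raw)
    except ValueError:
        # Assumption: invalid input is rejected rather than silently ignored.
        raise ValueError(f"{ENV_NAME} must be an integer, got {raw!r}") from None

    if value not in ALLOWED_BLOCK_SIZES:
        raise ValueError(
            f"{ENV_NAME} must be one of {ALLOWED_BLOCK_SIZES}, got {value}"
        )
    return value
```

Passing `environ` explicitly keeps the sketch testable without mutating the process environment, for example `resolve_compressed_kv_block_size(128, {ENV_NAME: "32"})`.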
NPU production measurements
Validated on Ascend A3 (TP8), DeepSeek-V4-Flash-w8a8-mtp, served on vllm v0.20.2. Workload input 8192 x output 1024 x concurrency 32 x repeat 0.9, baseline 367b8e62 vs this PR.
Prefix cache hit rate + serving performance
Metric                   Baseline        This PR
prefix cache hit rate    0.00%           81.54%
TTFT avg                 14593.2 ms      6098.9 ms
TTFT P90                 24798.3 ms      7018.8 ms
TPOT avg                 36.9 ms         27.9 ms
TPOT SLO_P90             46.0 ms         30.0 ms
E2E time                 53.13 s         36.96 s
QPS                      0.6023          0.8658
output throughput        616.8 tok/s     886.6 tok/s
(unlabeled)              5553.6 tok/s    7982.9 tok/s
prefill throughput       561.6 tok/s     1343.8 tok/s
Key point: at output_len=1024 the TPOT improves rather than regresses, so the old "more hits -> worse TPOT" shape no longer applies.
Second workload (baseline already partially hitting)
Where the baseline already hit 44.53% on the same workload, this PR pushes it to 77.70% (+33.17 pp), still with no TPOT degradation:
Metric                   Baseline         This PR
prefix cache hit rate    44.53%           77.70%
TTFT                     14298.8 ms       8623.7 ms
TPOT                     35.4 ms          33.7 ms
E2E throughput           20945.7 tok/s    24432.4 tok/s
(unlabeled)              2291.9 tok/s     3800.2 tok/s
Accuracy (regression check)
GPQA Diamond (198 questions, temperature=0, 0-shot CoT):
GPQA_diamond.accuracy    89.8989898989899    89.90    198 / 198    0
GPQA is a non-repeating workload (does not consume prefix-cache entries); it is the precision regression check. Accuracy matches the production baseline -- no precision regression from the smaller KV-cache block size.
Service environment used during validation
Ascend A3 (8 cards, TP8)
quay.io/ascend/vllm-ascend:nightly-releases-v0.20.2rc-a3
DeepSeek-V4-Flash-w8a8-mtp
--max-model-len 135000
--max-num-batched-tokens 8192
--tensor-parallel-size 8
--enable-expert-parallel
Expected impact
repeat_rate=0.9, c32, 8K) 0% -> 81.54%
block_size = 32 floor; no caching possible either way
Does this PR introduce any user-facing change?
No, when VLLM_ASCEND_DSV4_COMPRESSED_KV_BLOCK_SIZE is unset or set to 128. The resolver returns the legacy 128, the cache spec is bit-for-bit identical, and the rest of the modified files take their no-op default branches.
When set to 32 or 64, the only effective change is the page layout of the DSv4 compressor KV cache. Block contents, the cache lookup math, and the attention compute itself are untouched. A previously cacheable replay either still hits (its tail block survives) or misses (its tail block landed in a smaller bucket that was already evicted); on a miss the request recomputes the same tokens it would have recomputed under any cache miss.
References
bc150f50299199599673614f80d12a196f377655)