[BugFix][v0.20.2rc] Port DSv4 SWA prefix-cache retention to vllm-ascend#10193

Open
FutureSkyFly wants to merge 1 commit into vllm-project:releases/v0.20.2rc from FutureSkyFly:v4_prefix

Conversation

FutureSkyFly (Contributor) commented Jun 8, 2026

What this PR does / why we need it?

Lifts DSv4 SWA prefix-cache hit rate on vllm-ascend releases/v0.20.2rc
from near-zero to ~50% by allowing the DSv4 compressor's KV-cache
block_size to be configured per-deployment instead of being hard-pinned
to 128 tokens.

Symptom

--max-model-len 16384+, --enable-prefix-caching
8K-input × concurrency 32 × repeat 0.9 sequential replay:
  prefix_cache_hit:        ~0%
  TTFT avg:                ~16400 ms (= cold prefill)

Root cause

The DSv4 compressor's KV-cache block size is hard-coded to 128 tokens
in two places:

  • vllm_ascend/models/layer/attention/layer.py::DSAAttention.get_kv_cache_spec()
    returns MLAAttentionSpec(block_size=128, ...).
  • vllm_ascend/patch/worker/patch_deepseek_compressor.py::AscendDeepseekV4IndexerCache.get_kv_cache_spec()
    returns AscendMLAAttentionSpec(block_size=128, ...).

Prompts whose token count is not an exact multiple of 128 leave their
tail block uncached. Under the standard prefix-matching loop, the moment
one tail block is uncached the matcher stops returning more blocks past
that point, so the entire trailing region of the prompt also misses
even when the body would otherwise have hit. Concurrent long-context
workloads aggravate this because every request has its own un-aligned
tail.
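The "stop at the first miss" behaviour described above can be illustrated with a toy matcher. This is a minimal sketch, not the vLLM implementation: `longest_prefix_hit` and the string block hashes are hypothetical names chosen for illustration.

```python
from typing import Sequence


def longest_prefix_hit(block_hashes: Sequence[str], cached: set[str]) -> int:
    """Count leading blocks that hit the cache, stopping at the first miss.

    Hypothetical helper: it only mirrors the matching shape described in
    this PR, where blocks after the first uncached block are never returned.
    """
    hits = 0
    for block_hash in block_hashes:
        if block_hash not in cached:
            break  # every later block is treated as a miss, cached or not
        hits += 1
    return hits


# b0 and b1 are cached, b2 is not; b3 is cached but sits behind the miss.
cached_blocks = {"b0", "b1", "b3"}
print(longest_prefix_hit(["b0", "b1", "b2", "b3"], cached_blocks))  # 2
```

Under this shape a single uncached block caps the hit length, which is why an unaligned tail matters more than its own token count suggests.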

Fix

Make the DSv4 compressor's KV-cache block_size configurable through a
new env switch (default 128, so legacy behaviour is preserved
bit-for-bit). At block_size = 32 the alignment granularity shrinks by
4×, so block boundaries occur 4× as often along a prompt and an
unaligned tail covers at most 31 tokens instead of 127.

Single env switch:

VLLM_ASCEND_DSV4_COMPRESSED_KV_BLOCK_SIZE=32 vllm serve ...
# valid: 32 / 64 / 128
# unset -> legacy 128 behaviour
# only effective when compress_ratio == 128
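A resolver for this switch might look roughly like the sketch below. The function name, the allowed values, the default, and the `compress_ratio` short-circuit come from this PR's description. The error type, the error messages, and the handling of an empty string are assumptions.

```python
import os

ENV_NAME = "VLLM_ASCEND_DSV4_COMPRESSED_KV_BLOCK_SIZE"
DEFAULT_BLOCK_SIZE = 128
VALID_BLOCK_SIZES = (32, 64, 128)


def get_dsv4_compressed_kv_block_size(compress_ratio: int) -> int:
    """Resolve the compressed KV-cache block size from the environment."""
    # The switch only takes effect for the C128 compressor.
    if compress_ratio != 128:
        return DEFAULT_BLOCK_SIZE

    raw = os.environ.get(ENV_NAME)
    # Assumption: an unset or empty variable means "legacy behaviour".
    if raw is None or raw.strip() == "":
        return DEFAULT_BLOCK_SIZE

    try:
        value = int(raw)
    except ValueError:
        # Assumption: a non-integer value is rejected with ValueError.
        raise ValueError(f"{ENV_NAME}={raw!r} is not an integer") from None

    if value not in VALID_BLOCK_SIZES:
        raise ValueError(
            f"{ENV_NAME}={value} is invalid; expected one of {VALID_BLOCK_SIZES}"
        )
    return value
```

Failing loudly on an invalid value, rather than silently falling back to 128, is a design choice of this sketch; the PR only states that the value is validated.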

What changed

  • patch_kv_cache_interface.py — new get_dsv4_compressed_kv_block_size(compress_ratio)
    resolver. Reads the env, validates the value is in {32, 64, 128},
    short-circuits to default 128 when compress_ratio != 128.
    Relaxes the DSV4 C128 block_size % compress_ratio == 0 check (which
    previously rejected block_size < compress_ratio).
  • models/layer/attention/layer.py
    DSAAttention.get_kv_cache_spec() calls the new resolver instead of
    hard-coding 128.
  • patch/worker/patch_deepseek_compressor.py
    AscendDeepseekV4IndexerCache.get_kv_cache_spec() calls the same
    resolver so the indexer cache stays consistent with the DSA attention
    cache spec.
  • patch_kv_cache_utils.py — KV-cache page size is now aggregated
    across all KV groups (not only the first full-MLA group), and the
    KVCacheTensor allocator skips tensors whose shared_by is empty.
    Required because at block_size = 32 different KV groups can have
    different page sizes.
  • core/single_type_kv_cache_manager.py
    CompressAttentionManager tracks _prefix_block_size separately from
    the allocation block_size. Prefix hashing keeps using the original
    token range so cache entries stay comparable across deployments with
    different compressed block sizes.
  • worker/model_runner_v1.py
    kv_cache_raw_tensors layer-name mismatch now raises a structured
    error with missing / extra / tensor summary instead of bare assert.
    Diagnostic-only; no inference math change.
  • patch_kv_cache_coordinator.py + worker/block_table.py
    small _prefix_block_size plumbing and debug-print updates to keep
    the multi-KV-group page-size logic legible.
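To illustrate the diagnostic change listed for model_runner_v1.py, here is a minimal sketch of a structured layer-name check that replaces a bare assert. `check_layer_names`, its parameters, and the message format are hypothetical; the PR only says the error reports missing names, extra names, and a tensor summary.

```python
from typing import Iterable


def check_layer_names(expected: Iterable[str], actual: Iterable[str]) -> None:
    """Raise a descriptive error when two sets of layer names differ.

    Hypothetical stand-in for the kv_cache_raw_tensors check: `expected`
    are the layer names the runner needs, `actual` are the names that own
    an allocated tensor.
    """
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)
    extra = sorted(actual_set - expected_set)
    if missing or extra:
        raise ValueError(
            "KV-cache layer-name mismatch: "
            f"missing={missing}, extra={extra}, "
            f"tensors={len(actual_set)}"
        )


try:
    check_layer_names(["layer.0", "layer.1"], ["layer.0", "layer.2"])
except ValueError as err:
    print(err)
```

Compared with `assert expected == actual`, the message names the offending layers directly, which is the stated purpose of the change.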

NPU production measurements

Validated on Ascend A3 (TP8), DeepSeek-V4-Flash-w8a8-mtp, served on
vllm v0.20.2. Workload input 8192 x output 1024 x concurrency 32 x repeat 0.9, baseline 367b8e62 vs this PR.

Prefix cache hit rate + serving performance

| Metric | baseline | this PR | Gain |
| --- | --- | --- | --- |
| prefix cache hit rate | 0.00% | 81.54% | hits |
| TTFT avg | 14593.2 ms | 6098.9 ms | 2.39x faster |
| TTFT P90 | 24798.3 ms | 7018.8 ms | 3.53x faster |
| TPOT avg | 36.9 ms | 27.9 ms | 1.32x faster (not degraded) |
| TPOT SLO_P90 | 46.0 ms | 30.0 ms | 1.53x faster |
| E2E time | 53.13 s | 36.96 s | 1.44x faster |
| QPS | 0.6023 | 0.8658 | +43.75% |
| output throughput | 616.8 tok/s | 886.6 tok/s | +43.74% |
| E2E throughput | 5553.6 tok/s | 7982.9 tok/s | +43.74% |
| prefill token throughput | 561.6 tok/s | 1343.8 tok/s | 2.39x |

Key point: at output_len=1024 the TPOT improves rather than
regresses, so the old "more hits -> worse TPOT" shape no longer applies.
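For readers who want to re-derive the Gain column, the figures follow from simple ratios of the reported values. The helper names below are illustrative only; the inputs are copied from the first workload's measurements.

```python
def speedup(baseline: float, new: float) -> float:
    """Latency speedup: how many times smaller the new value is."""
    return baseline / new


def percent_gain(baseline: float, new: float) -> float:
    """Throughput gain in percent relative to the baseline."""
    return (new / baseline - 1.0) * 100.0


# Values taken from the first workload's measurements.
print(f"TTFT avg: {speedup(14593.2, 6098.9):.2f}x faster")         # 2.39x
print(f"TTFT P90: {speedup(24798.3, 7018.8):.2f}x faster")         # 3.53x
print(f"output throughput: {percent_gain(616.8, 886.6):+.2f}%")    # +43.74%
```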

Second workload (baseline already partially hitting)

Where the baseline already hit 44.53% on the same workload, this PR
pushes it to 77.70% (+33.17 pp), still with no TPOT degradation:

| Metric | baseline | this PR | Gain |
| --- | --- | --- | --- |
| prefix cache hit rate | 44.53% | 77.70% | +33.17 pp |
| TTFT avg | 14298.8 ms | 8623.7 ms | 1.66x faster |
| TPOT avg | 35.4 ms | 33.7 ms | not degraded |
| E2E throughput | 20945.7 tok/s | 24432.4 tok/s | +16.65% |
| prefill token throughput | 2291.9 tok/s | 3800.2 tok/s | +65.81% |

Accuracy (regression check)

GPQA Diamond (198 questions, temperature=0, 0-shot CoT):

| item | value |
| --- | --- |
| GPQA_diamond.accuracy | 89.8989898989899 |
| summary | 89.90 |
| completed | 198 / 198 |
| failed | 0 |

GPQA is a non-repeating workload (does not consume prefix-cache entries);
it is the precision regression check. Accuracy matches the production
baseline -- no precision regression from the smaller KV-cache block size.

Service environment used during validation

| item | value |
| --- | --- |
| platform | Ascend A3 (8 cards, TP8) |
| image | quay.io/ascend/vllm-ascend:nightly-releases-v0.20.2rc-a3 |
| model weights | DeepSeek-V4-Flash-w8a8-mtp |
| --max-model-len | 135000 |
| --max-num-batched-tokens | 8192 |
| --tensor-parallel-size | 8 |
| --enable-expert-parallel | yes |

Expected impact

| Scenario | hit-rate change | Reason |
| --- | --- | --- |
| Prompts already 128-aligned | unchanged | No new tail to recover |
| Prompts with 1-127 token tail past the last 128-boundary | improves | The tail rounds to a 32-block boundary instead of being thrown away |
| Repeat-prompt traffic (repeat_rate=0.9, c32, 8K) | 0% -> 81.54% | Validated above |
| Short context (< 32 tokens) | unchanged | Below block_size = 32 floor; no caching possible either way |
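The scenarios in this table reduce to rounding the prompt length down to a block boundary. The sketch below assumes, as the description does, that block size is counted directly in prompt tokens; `cacheable_tokens` and the sample prompt lengths are illustrative and do not model the compressor's internal layout.

```python
def cacheable_tokens(prompt_len: int, block_size: int) -> int:
    """Tokens covered by full blocks; the remainder is the uncached tail."""
    return (prompt_len // block_size) * block_size


for prompt_len in (8192, 8300, 20):
    legacy = cacheable_tokens(prompt_len, 128)
    smaller = cacheable_tokens(prompt_len, 32)
    print(f"len={prompt_len}: block 128 -> {legacy}, block 32 -> {smaller}, "
          f"recovered={smaller - legacy}")
# len=8192: block 128 -> 8192, block 32 -> 8192, recovered=0
# len=8300: block 128 -> 8192, block 32 -> 8288, recovered=96
# len=20: block 128 -> 0, block 32 -> 0, recovered=0
```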

Does this PR introduce any user-facing change?

No when VLLM_ASCEND_DSV4_COMPRESSED_KV_BLOCK_SIZE is unset or set to
128. The resolver returns the legacy 128, the cache spec is
bit-for-bit identical, and the rest of the modified files take their
no-op default branches.

When set to 32 or 64, the only effective change is the page-layout
of the DSv4 compressor KV cache. Block contents, the cache lookup
math, and the attention compute itself are untouched. A previously
cacheable replay either still hits (its tail block survives) or misses
(its tail block landed in a smaller bucket that was already evicted);
on miss the request recomputes the same tokens it would have
recomputed under any cache miss.

References

  • vLLM version: v0.20.2 (commit bc150f50299199599673614f80d12a196f377655)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical performance degradation in DSv4 models where long-context requests suffer from poor prefix cache hit rates. The issue stems from the vLLM v0.20.2 base lacking an upstream retention mechanism, causing cached blocks to be evicted by concurrent scratch block allocations. By porting the necessary logic and adapting it for the vllm-ascend platform through platform-specific monkey-patches, this change ensures that tail blocks are preserved, thereby improving cache efficiency. The solution is configurable and maintains bit-for-bit parity with legacy behavior when the new retention interval is unset.

Highlights

  • Prefix Cache Retention: Ported the DSv4 sliding-window prefix-cache retention mechanism to vllm-ascend to prevent premature eviction of cached blocks during scratch allocations.
  • Non-Intrusive Implementation: Implemented the fix using import-time monkey-patches to ensure compatibility with the pinned vLLM v0.20.2 core without requiring direct modifications.
  • Configuration: Introduced the VLLM_PREFIX_CACHE_RETENTION_INTERVAL environment variable to allow users to control retention behavior, with an 'auto' mode for optimal defaults.
  • Performance Optimization: Resolved the near-zero prefix cache hit rate issue for 16K+ context replays, significantly reducing replay latency.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the KV cache management and coordinator logic in vllm-ascend to support DeepSeek-V4 compressed MLA with dynamic storage block sizes and prefix caching. Key changes include introducing storage_block_size and _prefix_block_size helpers, dynamically resolving block sizes from environment variables, updating memory usage and page size calculations, and adding extensive debug logging. The code review feedback highlights a critical bug in AscendHybridKVCacheCoordinator._prefix_block_size where UniformTypeKVCacheSpecs is not unpacked, leading to incorrect block size alignment and cache misses. Additionally, the reviewer noted that the PR title and description should be updated to conform to the repository's Pull Request Summary Style Guide.

Comment on lines +113 to +119
    @staticmethod
    def _prefix_block_size(spec: KVCacheSpec) -> int:
        compress_ratio = max(getattr(spec, "compress_ratio", 1), 1)
        storage_block_size = getattr(spec, "storage_block_size", None)
        if compress_ratio > 1 and storage_block_size is not None:
            return storage_block_size * compress_ratio
        return spec.block_size

critical

In AscendHybridKVCacheCoordinator._prefix_block_size, if spec is an instance of UniformTypeKVCacheSpecs, it does not have compress_ratio or storage_block_size as direct attributes. This causes getattr(spec, "compress_ratio", 1) to return 1 and fall back to spec.block_size (e.g., 128), instead of computing the correct prefix block size (e.g., 2048 for C128).

This will lead to an incorrect self.lcm_block_size and cause alignment issues or cache misses in find_longest_cache_hit.

We should unpack UniformTypeKVCacheSpecs first, matching the logic used in model_runner_v1.py.

    @staticmethod
    def _prefix_block_size(spec: KVCacheSpec) -> int:
        from vllm.v1.kv_cache_interface import UniformTypeKVCacheSpecs
        if isinstance(spec, UniformTypeKVCacheSpecs):
            spec = next(iter(spec.kv_cache_specs.values()))
        compress_ratio = max(getattr(spec, "compress_ratio", 1), 1)
        storage_block_size = getattr(spec, "storage_block_size", None)
        if compress_ratio > 1 and storage_block_size is not None:
            return storage_block_size * compress_ratio
        return spec.block_size
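The effect the reviewer describes can be reproduced with stand-in classes. This is a sketch only: `FakeCompressedSpec` and `FakeUniformSpecs` are hypothetical substitutes for the real vLLM spec types, and `storage_block_size=16` is an assumed value chosen so that the result matches the 2048 figure quoted in the comment.

```python
from dataclasses import dataclass
from math import lcm


@dataclass
class FakeCompressedSpec:  # hypothetical stand-in for a compressed MLA spec
    block_size: int
    compress_ratio: int
    storage_block_size: int


@dataclass
class FakeUniformSpecs:  # hypothetical stand-in for UniformTypeKVCacheSpecs
    block_size: int
    kv_cache_specs: dict


def prefix_block_size_original(spec) -> int:
    """Logic as submitted: reads attributes directly off `spec`."""
    compress_ratio = max(getattr(spec, "compress_ratio", 1), 1)
    storage_block_size = getattr(spec, "storage_block_size", None)
    if compress_ratio > 1 and storage_block_size is not None:
        return storage_block_size * compress_ratio
    return spec.block_size


def prefix_block_size_unpacked(spec) -> int:
    """Logic as suggested: unwrap the container before reading attributes."""
    if isinstance(spec, FakeUniformSpecs):
        spec = next(iter(spec.kv_cache_specs.values()))
    return prefix_block_size_original(spec)


inner = FakeCompressedSpec(block_size=128, compress_ratio=128, storage_block_size=16)
wrapped = FakeUniformSpecs(block_size=128, kv_cache_specs={"layer.0": inner})

print(prefix_block_size_original(wrapped))   # 128
print(prefix_block_size_unpacked(wrapped))   # 2048

# Assuming a second KV group with a 128-token prefix block, the combined
# alignment differs accordingly.
print(lcm(128, prefix_block_size_original(wrapped)))   # 128
print(lcm(128, prefix_block_size_unpacked(wrapped)))   # 2048
```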

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import itertools
import os

high

The Pull Request title and description do not adhere to the repository's Pull Request Summary Style Guide.

Please update them to match the required formats:

Suggested PR Title:

[v0.20.2rc][Attention][BugFix] Port DSv4 SWA prefix-cache retention to vllm-ascend

Suggested PR Summary:

### What this PR does / why we need it?

This PR ports the DeepSeek-V4 SWA prefix-cache retention mechanism to `vllm-ascend` on the `releases/v0.20.2rc` branch. It addresses the near-zero prefix cache hit rate at 16K+ contexts by implementing:
1. **Sparse checkpoint retention**: Only caching blocks at aligned boundaries plus the latest replay tail, reducing cached blocks per request from ~128 to ~1-2.
2. **Cached-vs-uncached free-queue split**: Routing uncached scratch blocks to the front of the free queue (reused first) and cached blocks to the back (surviving scratch churn).

### Does this PR introduce _any_ user-facing change?

No, when `VLLM_PREFIX_CACHE_RETENTION_INTERVAL` is unset. When set, it optimizes prefix cache retention behavior without changing block contents or attention computation.

### How was this patch tested?

Tested on Ascend A3 (8 cards, TP8) with DeepSeek-V4-Flash-w8a8-mtp.
- Verified prefix cache hit rate of 49.67% on aisbench prefix-cache workload.
- Verified accuracy on GPQA Diamond (198 questions, 0-shot CoT) with 89.90% accuracy (no regression).
References
  1. The Pull Request Summary Style Guide requires the PR title to follow the format '[Branch][Module][Action] Pull Request Title' and the summary to follow a specific H3-based markdown template. (link)

@FutureSkyFly FutureSkyFly force-pushed the v4_prefix branch 2 times, most recently from c50e497 to 46913c7 Compare June 8, 2026 13:36
@jhonzy0928

With prefill/decode (PD) disaggregation (8 machines, large EP), prefill completes but decoding then loops indefinitely, so inference never finishes.
[image: pd]

…ention + partial hit

Lifts the near-zero prefix-cache hit rate on DSv4 long-context
workloads served on v0.20.2rc by combining five mechanisms.

1. Configurable DSv4 compressor block size (32 / 64 / 128)
   - New lookup table DSV4_BLOCK_SIZES driven by --block-size CLI flag.
   - Synchronously updates: MLA block_size, SWA block_size, C4 state
     cache, C128 state cache, page_size_padded.
   - utils.refresh_block_size caches the resolved size in
     _ENIGINE_CORE_BLOCK_SIZE so multi-process EngineCore does not
     overwrite it. Defaults to 32 when unset.

2. AscendHybridKVCacheCoordinator interface alignment
   - __init__ accepts and threads local_kv_retention_interval,
     scheduler_block_size and eagle_attn_layer_names kwargs through to
     super(), so v0.22+ vllm with PR #43447 can bind directly without
     TypeError.

3. vLLM prefix-cache core mechanism backport (monkey-patch)
   - New patch_prefix_cache_core.py installs at import time:
       FreeKVCacheBlockQueue.prepend_n
       BlockPool.free_blocks(prepend=)
       BlockPool._maybe_evict_cached_block hook
       SlidingWindowManager.remove_skipped_blocks (cached / uncached split)
       SlidingWindowManager._cache_block_mask (sparse checkpoint)
       SlidingWindowManager.free (cached / uncached split)
       KVCacheManager.__init__ (scheduler_block_size plumbing)
       KVCacheManager.take_copy_block_ids (partial-hit copy queue)
       Scheduler.__init__ (scheduler_block_size derivation, lcm of group sizes)
       Scheduler copy-blocks dispatch
   - Guarded by _source_contains() so vllm that already ships the
     mechanism is not double-patched.

4. Scheduler block size alignment
   - scheduler_block_size = lcm(block sizes across all kv-cache groups)
     is threaded from Scheduler -> KVCacheManager -> Coordinator.
   - Fixes admission gates and prefix-hit length checks that mis-aligned
     when different KV groups use different block sizes (e.g. compressor
     at 32 + main MLA at 128).

5. DSv4 compressor partial prefix-cache hits
   - New ComputedBlockList class plus _hash_range / _insert_partial_cache
     / get_partial_cached_block helpers in single_type_kv_cache_manager.
   - CompressAttentionManager learns _num_partial_hit_blocks /
     _cache_partial_block_boundaries / take_copy_block_ids, and registers
     each request's partial-prefix boundaries exactly once (avoids the
     per-decode-step re-hash that previously stalled the scheduler loop).
   - Block-pool eviction cleans up partial cache entries for evicted
     blocks.
   - model_runner_v1 gains _copy_prefix_cache_blocks /
     _prefix_cache_data_ptrs / _copy_prefix_cache_tensor to perform
     the (src_block_id, dst_block_id, num_tokens) copies on-device.
   - Effect: when a new request shares only part of a 16K-token
     compressor block with a cached request, the shared prefix is
     reused via an in-cache memcpy instead of being recomputed.
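The cached / uncached free-queue split from mechanism 3 can be pictured with a toy queue. This is an illustrative sketch built on `collections.deque`, not vLLM's FreeKVCacheBlockQueue. The class and method names are hypothetical and only mirror the ordering idea: uncached scratch blocks go to the front so they are reused first, cached blocks go to the back so they survive longer.

```python
from collections import deque
from typing import Iterable


class ToyFreeQueue:
    """Toy free list: allocation always pops from the front."""

    def __init__(self, block_ids: Iterable[int]) -> None:
        self._queue = deque(block_ids)

    def free(self, block_ids: Iterable[int], cached: bool) -> None:
        ids = list(block_ids)
        if cached:
            # Cached blocks go to the back: evicted and reused last.
            self._queue.extend(ids)
        else:
            # Uncached scratch blocks go to the front, keeping their order.
            self._queue.extendleft(reversed(ids))

    def allocate(self) -> int:
        return self._queue.popleft()

    def snapshot(self) -> list:
        return list(self._queue)


queue = ToyFreeQueue([0, 1])
queue.free([10, 11], cached=True)    # blocks holding reusable prefixes
queue.free([20, 21], cached=False)   # scratch blocks with no cached content
print(queue.snapshot())              # [20, 21, 0, 1, 10, 11]
print(queue.allocate())              # 20
```

With this ordering, scratch churn cycles through the front of the queue and reaches the cached blocks only after every other free block has been consumed.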

Activation:
   vllm serve --block-size 32 ...
   # 32 / 64 / 128, defaults to 32
   # optional upstream-aligned env:
   VLLM_PREFIX_CACHE_RETENTION_INTERVAL=auto

Measured performance (Ascend A3 / TP8, DeepSeek-V4-Flash-w8a8-mtp,
input 8192 x output 1024 x concurrency 32 x repeat 0.9; baseline
367b8e6 vs this change):

  prefix cache hit rate    0.00%   -> 81.54%
  TTFT avg                 14593 ms -> 6099 ms   (2.39x faster)
  TTFT P90                 24798 ms -> 7019 ms   (3.53x faster)
  TPOT avg                 36.9 ms  -> 27.9 ms   (1.32x faster, NOT degraded)
  TPOT SLO_P90             46.0 ms  -> 30.0 ms   (1.53x faster)
  E2E time                 53.1 s   -> 37.0 s    (1.44x faster)
  QPS                      0.6023   -> 0.8658    (+43.75%)
  output throughput        617 tok/s -> 887 tok/s (+43.74%)
  prefill throughput       562 tok/s -> 1344 tok/s (2.39x)

Second workload where the baseline already hit 44.53%: this change
pushes the hit rate to 77.70% (+33.17 pp) with TTFT 1.66x faster and
E2E throughput +16.65%, TPOT still not degraded (35.4 -> 33.7 ms).

Note: at output_len=1024 the TPOT improves rather than regresses, so the
old "more hits -> worse TPOT" shape no longer applies.

Net effect on DSv4 long-context workloads: lifts prefix_cache_hit
from ~0% (every replay paying cold-prefill cost) into a usable range,
without changing the attention computation or KV layout outside the
chosen block_size.

Co-authored-by: wangzhao-11a <73340653+wangzhao-11a@users.noreply.github.com>
Signed-off-by: liuchenbing <chenliumail@163.com>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
